Causal Evidence That LLMs Use Confidence Signals to Control Behavior

Causal Evidence that Language Models use Confidence to Drive Behavior

Mar 23, 2026 • Dharshan Kumaran, Nathaniel Daw, Simon Osindero +2

TL;DR Highlight

A four-phase experiment provides causal evidence that major LLMs such as GPT-4o and Gemma 3 27B actually use an internal confidence signal to decide whether to answer.

Who Should Read

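AI engineers who want to reduce LLM hallucinations or bring abstention (declining to answer when uncertain) into production. ML researchers and system architects who want to understand LLM metacognition and design more reliable AI systems.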

Core Mechanics

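Even without explicit instruction, GPT-4o applies an implicit threshold: it abstains once its internal confidence falls to roughly 77% or below.
Confidence predicts abstention with a standardized effect size (|βstd|=0.99) roughly 10 times larger than RAG scores, sentence embeddings, or question difficulty, making it by far the dominant predictor.
Injecting high- and low-confidence vectors into Gemma 3 27B via activation steering (directly manipulating internal neuron activations) moved the abstention rate from 66.5% down to 7.0%, a swing of 59.5 percentage points. This is direct evidence that confidence causally drives behavior.
Mediation analysis (a decomposition of causal pathways) shows that 67.1% of the steering effect is transmitted through confidence redistribution (probability mass shifting from abstaining to answering) and 26.2% through a change in decision policy.
In Phase 4, instructing the model with an explicit confidence threshold (0 to 100%) shifts the abstention rate accordingly, while the prior confidence measured in Phase 1 remains a strong predictor. This suggests that Stage 1 (confidence formation) and Stage 2 (threshold policy) operate independently.
GPT-4o weights its internal confidence 1.8 times more heavily than the instructed threshold, and abstention criteria differ widely across models (DeepSeek 82%, GPT-4o 56.6%, Qwen 80B 43.8%, Gemma 3 27B 27.2%).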
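
The two-stage picture above (a confidence estimate, then a threshold policy applied to it) can be sketched in a few lines of Python. This is a minimal illustration, not the paper's model: the function names, the logistic form, and the weight scale are assumptions. Only the roughly 77% implicit threshold and the 1.8x weighting ratio come from the summary.

```python
import math


def hard_abstain(confidence: float, threshold: float = 0.77) -> bool:
    """Stage 2 as a hard rule: abstain when confidence is at or below the threshold.

    The 0.77 default mirrors the implicit GPT-4o threshold reported above.
    """
    return confidence <= threshold


def soft_abstain_prob(confidence: float, threshold: float,
                      w_threshold: float = 4.0, ratio: float = 1.8) -> float:
    """Stage 2 as a soft (logistic) rule. Hypothetical functional form.

    Confidence is weighted `ratio` times more heavily than the instructed
    threshold. w_threshold = 4.0 is an arbitrary illustrative scale.
    """
    w_confidence = ratio * w_threshold
    logit = w_threshold * threshold - w_confidence * confidence
    return 1.0 / (1.0 + math.exp(-logit))


print(hard_abstain(0.60))  # True: below the implicit threshold
print(hard_abstain(0.95))  # False
print(round(soft_abstain_prob(0.3, 0.8), 3))  # low confidence, strict threshold
print(round(soft_abstain_prob(0.9, 0.8), 3))  # high confidence, same threshold
```

In the soft version, raising the instructed threshold or lowering confidence both increase the abstention probability, but a unit change in confidence moves the logit 1.8 times as far.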

Evidence

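With activation steering, the abstention rate was 66.5% at maximum low-confidence steering (-2) and 7.0% at maximum high-confidence steering (+2), a difference of 59.5 percentage points (r=-0.99, p<0.001).
In the Phase 2 logistic regression, confidence had a standardized effect size of |βstd|=0.99, roughly 9 to 10 times larger than RAG score (0.102), difficulty (0.110), or embedding (0.106).
In Phase 4, the confidence+threshold model lowered AIC by 1953 relative to the threshold-only model and raised pseudo-R² from 0.11 to 0.24 (χ²(1)=1955, p<0.001).
In the mediation analysis, the indirect path through confidence redistribution explained 67.1% of the total steering effect (indirect effect a1×b1=-0.55, 95% CI [-0.65, -0.47]).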
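
The figures above are internally consistent, which a short arithmetic check makes visible. The snippet uses only numbers quoted in this section; the variable names are illustrative.

```python
# Standardized effect sizes reported for the Phase 2 logistic regression.
beta_confidence = 0.99
beta_others = {"RAG score": 0.102, "difficulty": 0.110, "embedding": 0.106}

# Ratio of the confidence effect to each competing predictor.
ratios = {name: beta_confidence / beta for name, beta in beta_others.items()}
for name, ratio in ratios.items():
    print(f"confidence vs {name}: {ratio:.1f}x")

# Likelihood-ratio test with one extra parameter (confidence):
# AIC = 2k - 2 ln L, so adding one parameter lowers AIC by chi2 - 2.
chi2, extra_params = 1955, 1
aic_drop = chi2 - 2 * extra_params
print(aic_drop)  # 1953

# Steering swing in percentage points between the two extremes.
steering_swing = 66.5 - 7.0
print(steering_swing)  # 59.5
```

The ratios land between 9 and 10, and the AIC drop of 1953 is exactly what a χ²(1) of 1955 implies.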

How to Apply

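To make a production LLM's answers more reliable, add an instruction to the prompt along the lines of 'abstain if your confidence is below T%', as in Phase 4. Raising the threshold by 10 to 20% improves accuracy by roughly 1.1% at the cost of more abstentions, and that tradeoff can be tuned.
If a model does not follow the abstention instruction well (for example, Gemma 3 27B with an abstention rate below 5%), a prompt paraphrasing strategy is effective: generate 20 prompts with the same meaning and pick the version with the highest abstention rate.
With an open-source model that exposes internal activations (e.g., Gemma 3 27B), you can build a steering vector by contrasting the residual stream on high- versus low-confidence trials and inject it at layers 30 to 40 during inference, which lets you adjust the abstention rate at the software level.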
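
The steering-vector step can be sketched with plain Python lists. This assumes the contrast is a difference of mean activations, which is a common way to build steering vectors but is not confirmed by the summary as the paper's exact procedure. The function names and the toy 3-dimensional vectors are hypothetical, and hooking into real model layers is not shown.

```python
from typing import List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized activation vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_vector(high_conf: List[Vector], low_conf: List[Vector]) -> Vector:
    """Difference of mean activations: high-confidence minus low-confidence trials."""
    mean_high, mean_low = mean_vector(high_conf), mean_vector(low_conf)
    return [h - l for h, l in zip(mean_high, mean_low)]


def apply_steering(hidden: Vector, direction: Vector, coefficient: float) -> Vector:
    """Add a scaled steering direction to one residual-stream vector."""
    return [x + coefficient * d for x, d in zip(hidden, direction)]


# Toy 3-dimensional "activations"; real residual streams are far wider.
high = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
low = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]
direction = steering_vector(high, low)
print(direction)  # [2.0, -2.0, 0.0]

# Positive coefficients push toward high confidence, negative toward low.
for coefficient in (-2, 0, 2):
    print(coefficient, apply_steering([0.5, 0.5, 0.5], direction, coefficient))
```

The coefficients -2 and +2 mirror the steering extremes quoted in the Evidence section. In a real setup the addition would happen inside the forward pass at the chosen layers.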

Code Example

# Phase 4 approach: add a confidence-threshold instruction to the prompt
prompt_template = """
You will be given a 4-way multiple choice question, with options 1-4.
First, internally estimate the probability (0-100) that your answer is correct.
Then:
- If your confidence is {threshold}% or MORE, output only the number of your answer.
- If your confidence is LESS than {threshold}%, output '5' to indicate you want to abstain.
Output only a single number.

Question: {question}
1) {option1}
2) {option2}
3) {option3}
4) {option4}

Answer:"""

# Sweeping threshold from 0 to 100 controls the accuracy/coverage tradeoff
# High threshold (e.g., 80%) → accuracy↑, coverage↓
# Low threshold (e.g., 30%) → accuracy↓, coverage↑
for threshold in range(0, 101, 10):
    prompt = prompt_template.format(
        threshold=threshold,
        question="Which of these people shared the Nobel Prize in Physics in 2024?",
        option1="Yoshua Bengio",
        option2="Yann LeCun",
        option3="John Hopfield",
        option4="Andrew Ng"
    )
    # response = llm.generate(prompt)  # hypothetical client call, not a real API
    # A response of '5' means abstention; 1-4 is the chosen answer
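
Once replies come back, they need to be parsed and turned into abstention, coverage, and accuracy figures. The helper below is a sketch under the assumption that the model returns a single digit as the prompt requests. `parse_response` and `summarize` are hypothetical names, and the sample replies are made up for illustration.

```python
from typing import List, Optional, Tuple

ABSTAIN = 5  # the prompt above uses '5' to signal abstention
VALID = {"1", "2", "3", "4", "5"}


def parse_response(text: str) -> Optional[int]:
    """Return 1-5 if the reply is exactly one valid digit, else None."""
    stripped = text.strip()
    return int(stripped) if stripped in VALID else None


def summarize(replies: List[str], correct: List[int]) -> Tuple[float, float, float]:
    """Return (abstention rate, coverage, accuracy on answered questions)."""
    parsed = [parse_response(r) for r in replies]
    abstained = sum(1 for p in parsed if p == ABSTAIN)
    answered = [(p, c) for p, c in zip(parsed, correct) if p in (1, 2, 3, 4)]
    hits = sum(1 for p, c in answered if p == c)
    total = len(replies)
    accuracy = hits / len(answered) if answered else 0.0
    return abstained / total, len(answered) / total, accuracy


# Made-up replies for five questions, with their correct options.
replies = ["3", "5", "1", " 2\n", "5"]
correct = [3, 1, 2, 2, 4]
abstention_rate, coverage, accuracy = summarize(replies, correct)
print(abstention_rate, coverage, round(accuracy, 3))
```

Running `summarize` once per threshold gives the accuracy/coverage curve. The same abstention rate could also be used to rank paraphrased prompt variants, as described in How to Apply.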

Terminology

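activation steering: A technique that changes model behavior by directly adding to or subtracting from internal neuron activations during inference. It intervenes in the model's "train of thought" itself, so behavior can be steered without changing the prompt.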

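abstention: The behavior of a model choosing "I don't know" or declining to answer instead of giving an answer. It is used when staying silent is safer than giving a wrong answer.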

metacognition: The ability to recognize and evaluate one's own thinking, as in judging for yourself how well you actually know the answer to a problem.

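calibration: A measure of whether a model that says it is 80% confident is actually right 80% of the time. It is the same idea as checking whether it really rains on 70% of the days a weather forecast gives a 70% chance of rain.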
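
One common way to put a number on calibration is expected calibration error (ECE), sketched below. ECE is a standard metric, but the summary does not say which calibration metric the paper uses, so treat the choice as an assumption. The function name, the bin count, and the sample data are illustrative.

```python
from typing import List


def expected_calibration_error(confidences: List[float], correct: List[bool],
                               n_bins: int = 10) -> float:
    """Weighted average gap between mean confidence and accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, hit))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, h in members if h) / len(members)
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece


# Says 80% and is right 4 times out of 5: well calibrated.
well_calibrated = expected_calibration_error([0.8] * 5, [True, True, True, True, False])
# Says 90% but is right only half the time: overconfident.
overconfident = expected_calibration_error([0.9] * 4, [True, False, False, True])
print(round(well_calibrated, 3), round(overconfident, 3))
```

A value near zero means stated confidence tracks accuracy. The overconfident example has a gap of 0.4 between the 90% claim and the 50% hit rate.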

logprob: The logarithm of the probability the model assigns to each token. It is the basic numerical signal of how confident the model is.
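
For a multiple-choice question, logprobs over the option tokens can be renormalized into a confidence score. The sketch below assumes confidence is taken as the probability of the most likely option after a softmax over the four options. Whether the paper defines it exactly this way is an assumption, and the logprob values are made up.

```python
import math
from typing import Dict


def option_probabilities(logprobs: Dict[str, float]) -> Dict[str, float]:
    """Renormalize log-probabilities over the answer options (softmax)."""
    peak = max(logprobs.values())  # subtract the max for numerical stability
    weights = {opt: math.exp(lp - peak) for opt, lp in logprobs.items()}
    total = sum(weights.values())
    return {opt: w / total for opt, w in weights.items()}


def confidence(logprobs: Dict[str, float]) -> float:
    """Confidence as the probability of the most likely option."""
    return max(option_probabilities(logprobs).values())


# Made-up logprobs for the tokens '1' to '4'.
example = {"1": -0.2, "2": -2.5, "3": -3.0, "4": -4.0}
print({opt: round(p, 3) for opt, p in option_probabilities(example).items()})
print(round(confidence(example), 3))
```

Renormalizing over only the option tokens discards probability mass the model put on other tokens, which is usually what you want when the answer format is fixed.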

mediation analysis: A statistical method that decomposes the effect of A on B to test whether it actually runs through a path A→C→B. It is like asking whether stress (A) harms health (B) because of sleep loss (C).
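
The A→C→B decomposition can be illustrated with the classic product-of-coefficients approach for linear models. This is a toy sketch: the paper's analysis concerns abstention, so it presumably uses a different model family, and the data below are constructed by hand so that the true paths are known.

```python
from typing import List, Tuple


def _cov(u: List[float], v: List[float]) -> float:
    """Population covariance of two equally long sequences."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v)) / len(u)


def mediation(x: List[float], m: List[float], y: List[float]) -> Tuple[float, float, float]:
    """Return (indirect effect a*b, direct effect c', total effect c) from OLS fits."""
    a = _cov(x, m) / _cov(x, x)  # path A -> C
    denom = _cov(x, x) * _cov(m, m) - _cov(x, m) ** 2
    b = (_cov(m, y) * _cov(x, x) - _cov(x, y) * _cov(x, m)) / denom  # C -> B given A
    direct = (_cov(x, y) * _cov(m, m) - _cov(m, y) * _cov(x, m)) / denom  # A -> B given C
    total = _cov(x, y) / _cov(x, x)  # A -> B ignoring C
    return a * b, direct, total


# Hand-built data: mediator = 2*x + noise, outcome = 3*mediator + 1*x.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
noise = [1.0, -1.0, 0.0, 1.0, -1.0]
m = [2 * xi + e for xi, e in zip(x, noise)]
y = [3 * mi + xi for mi, xi in zip(m, x)]

indirect, direct, total = mediation(x, m, y)
print(round(indirect, 3), round(direct, 3), round(total, 3))
print(round(indirect / total, 3))  # share of the total effect that is mediated
```

For linear models the indirect and direct effects sum exactly to the total effect, which is what makes a "percent mediated" figure well defined.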

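temperature scaling: A post-processing technique that calibrates a model's confidence scores so they match actual accuracy. If a model says it is 90% sure but is right only 70% of the time, the scale is adjusted so that stated confidence matches the real probability.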
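
A minimal sketch of temperature scaling follows: logits are divided by a temperature T before the softmax, and T is chosen to minimize negative log-likelihood on held-out data. A coarse grid search stands in here for the gradient-based fit that is normally used, and the logits, labels, and candidate temperatures are made up for illustration.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Softmax of logits divided by a temperature (T > 1 flattens, T < 1 sharpens)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(z - peak) for z in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def negative_log_likelihood(logit_rows, labels, temperature: float) -> float:
    """Mean NLL of the true labels under temperature-scaled probabilities."""
    losses = [-math.log(softmax(row, temperature)[label])
              for row, label in zip(logit_rows, labels)]
    return sum(losses) / len(losses)


def fit_temperature(logit_rows, labels, candidates) -> float:
    """Pick the candidate temperature with the lowest validation NLL (grid search)."""
    return min(candidates, key=lambda t: negative_log_likelihood(logit_rows, labels, t))


# An overconfident toy model: always strongly prefers option 0, right half the time.
logit_rows = [[4.0, 0.0, 0.0, 0.0]] * 4
labels = [0, 0, 1, 2]
best_t = fit_temperature(logit_rows, labels, [0.5, 1.0, 2.0, 4.0])
print(best_t)
print(round(max(softmax(logit_rows[0], 1.0)), 3), round(max(softmax(logit_rows[0], best_t)), 3))
```

Because dividing by T does not change which option has the highest score, temperature scaling adjusts confidence without changing the model's answers.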

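residual stream: The internal vector pathway in a Transformer model where information accumulates as it passes through each layer. Layers add to or modify this stream, and the final output is produced from it.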

Original Abstract

Metacognition -- the ability to assess one's own cognitive performance -- is documented across species, with internal confidence estimates serving as a key signal for adaptive behavior. While confidence can be extracted from Large Language Model (LLM) outputs, whether models actively use these signals to regulate behavior remains a fundamental question. We investigate this through a four-phase abstention paradigm. Phase 1 established internal confidence estimates in the absence of an abstention option. Phase 2 revealed that LLMs apply implicit thresholds to these estimates when deciding to answer or abstain. Confidence emerged as the dominant predictor of behavior, with effect sizes an order of magnitude larger than knowledge retrieval accessibility (RAG scores) or surface-level semantic features. Phase 3 provided causal evidence through activation steering: manipulating internal confidence signals correspondingly shifted abstention rates. Finally, Phase 4 demonstrated that models can systematically vary abstention policies based on instructed thresholds. Our findings indicate that abstention arises from the joint operation of internal confidence representations and threshold-based policies, mirroring the two-stage metacognitive control found in biological systems. This capacity is essential as LLMs transition into autonomous agents that must recognize their own uncertainty to decide when to act or seek help.