Linking Perception, Confidence and Accuracy in MLLMs

Mar 12, 2026 • Yuetian Du, Yucheng Wang, Rongyu Zhang +5

TL;DR Highlight

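The authors find that multimodal LLMs stay confident even when the input image is degraded, fix this flaw with RL, and propose a Test-Time Scaling framework that exploits the calibrated confidence.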

Who Should Read

ML engineers trying to reduce hallucination in multimodal AI services, and researchers who want a vision model's confidence score to track its actual accuracy.

Core Mechanics

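Existing MLLMs (multimodal LLMs) barely lower their confidence as more and more noise is added to the image, even though accuracy drops sharply. This is the 'confidence miscalibration' problem.
CDRL (Confidence-Driven Reinforcement Learning): trains on original-noise image pairs with a reward designed to lower confidence when the answer is wrong and raise it when the answer is right. Built on Qwen2.5-VL-7B with full-parameter fine-tuning.
CA-TTS (Confidence-Aware Test-Time Scaling): at inference, an Expert Model (Gemini-2.5-Pro) dynamically schedules three modules (Self-Consistency, Self-Reflection, Self-Check) according to the confidence signal.
The Self-Check module applies VCD (Visual Contrastive Decoding) to an original-noise image pair to verify visual grounding, so self-verification happens at the image level rather than the text level.
Confidence calibrated by CDRL works as a 'free lunch' for test-time scaling: as the number of samples grows, the slope of the performance gain is 2.2x steeper than Majority Voting and 3.1x steeper than DeepConf.
Average improvement of 8.8% across four benchmarks (Math-Vista, Math-Vision, MMStar, MMMU); on Math-Vision at N=32, 48.44% vs 34.41% for the baseline.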
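
The reward idea in the CDRL item above can be sketched as follows. This is a minimal illustration, not the paper's reward: the summary does not give the actual formula, so the additive form, the weight `lam`, the assumption that confidence lies in [0, 1], and the helper `pair_reward` are all hypothetical.

```python
def confidence_reward(is_correct: bool, confidence: float, lam: float = 0.5) -> float:
    """Hypothetical reward: an accuracy term plus a signed confidence term."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    accuracy_term = 1.0 if is_correct else 0.0
    # Confidence is rewarded on correct answers and penalized on wrong ones.
    confidence_term = confidence if is_correct else -confidence
    return accuracy_term + lam * confidence_term


def pair_reward(clean: tuple[bool, float], noised: tuple[bool, float], lam: float = 0.5) -> float:
    """Average the reward over the clean and noised views of one image (assumed pairing)."""
    return 0.5 * (confidence_reward(*clean, lam=lam) + confidence_reward(*noised, lam=lam))


# A model that stays confident on the noised image while answering it wrongly
# earns less than one whose confidence drops along with its accuracy.
calibrated = pair_reward(clean=(True, 0.9), noised=(False, 0.2))
miscalibrated = pair_reward(clean=(True, 0.9), noised=(False, 0.9))
```

Under this sketch, the only way to raise the reward on a wrong answer is to lower the stated confidence, which is the behavior the summary attributes to CDRL.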

Evidence

Average improvement of 8.8% across the four benchmarks: Math-Vista 79.5%, Math-Vision 42.4%, MMStar 71.3%, MMMU 66.3% (Math-Vista +0.7% and MMMU +0.7% over VL-Rethinker).
Test-time scaling slope: CA-TTS β=3.65 vs Majority Voting β=1.64 vs DeepConf β=1.19; at N=32 CA-TTS reaches 45%+ while the baseline plateaus at about 35%.
After CDRL training, the confidence drop under visual perturbation becomes 4-8x larger: Noised -0.32→-1.39, Occlusion -0.24→-1.13, Viewpoint +0.09→-1.29.
CA-TTS adds 0.91x inference time over Majority Voting (11.03s vs 5.76s) for an 8.4% gain in average accuracy; applied to Qwen3-VL-2B-Thinking, it raises Overall Avg from 71.09 to 74.41.
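
The ratios quoted in this summary follow directly from the raw figures in the list above. The short check below only redoes that arithmetic; the variable names are illustrative and no number is introduced beyond those already listed.

```python
# Slopes (beta) of accuracy gain versus sample count, as quoted above.
slopes = {"CA-TTS": 3.65, "Majority Voting": 1.64, "DeepConf": 1.19}
slope_vs_mv = slopes["CA-TTS"] / slopes["Majority Voting"]
slope_vs_deepconf = slopes["CA-TTS"] / slopes["DeepConf"]

# Extra inference time of CA-TTS relative to Majority Voting (seconds).
extra_time = 11.03 / 5.76 - 1.0

# Confidence drop before and after CDRL training, per perturbation type.
drops = {"Noised": (-0.32, -1.39), "Occlusion": (-0.24, -1.13)}
drop_ratios = {name: after / before for name, (before, after) in drops.items()}

print(f"slope vs Majority Voting: {slope_vs_mv:.2f}x")
print(f"slope vs DeepConf: {slope_vs_deepconf:.2f}x")
print(f"extra inference time: {extra_time:.2f}x")
for name, ratio in drop_ratios.items():
    print(f"{name} confidence-drop ratio: {ratio:.2f}x")
```

The slope ratios round to the 2.2x and 3.1x quoted earlier, and the time overhead to 0.91x. The Noised and Occlusion ratios come out near 4.3x and 4.7x, the low end of the reported 4-8x range. Viewpoint is left out because its baseline confidence rose (+0.09), so a ratio is not meaningful there.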

How to Apply

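When replacing an existing Majority Voting pipeline with CA-TTS: configure Gemini-2.5-Pro or GPT-5 as the Expert Model in three roles (Planner, Voter, Critic), and add threshold-based routing so that Self-Reflection is triggered only for low-confidence responses.
When fine-tuning a vision model with the CDRL approach: for each training image, build a pair by adding noise to the key regions found with a CLIP attention map, then add a term to the existing RL reward that rewards confidence on correct answers and penalizes confidence on wrong ones.
For confidence filtering in a production RAG/Agent pipeline: compute the mean negative log-probability (NMLP) of the model's output tokens as the confidence score, and trigger re-sampling or human review only for low-confidence responses, i.e. those with a high NMLP.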
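
The confidence filter in the last item can be sketched as below. This is a minimal sketch under stated assumptions: the function names and the threshold `max_nmlp=0.5` are hypothetical and would need tuning on held-out data, and the token log-probabilities are assumed to come from whatever inference API is in use.

```python
import math
from statistics import fmean


def nmlp(token_logprobs: list[float]) -> float:
    """Negative mean log-probability of the output tokens (lower = more confident)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return -fmean(token_logprobs)


def route(token_logprobs: list[float], max_nmlp: float = 0.5) -> str:
    """Accept confident responses; send the rest to re-sampling or human review."""
    return "accept" if nmlp(token_logprobs) <= max_nmlp else "review"


confident = [-0.05, -0.10, -0.02]   # tokens the model was nearly sure about
uncertain = [-1.20, -2.00, -0.90]   # tokens with low probability

print(route(confident), route(uncertain))
# exp(-NMLP) is the geometric mean of the token probabilities, which may be
# easier to read than the raw score.
print(round(math.exp(-nmlp(confident)), 3))
```

Because NMLP runs opposite to confidence, the comparison direction is the usual source of bugs here: a response is flagged when its score is above the threshold, not below it.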

Code Example

# Example of the core CA-TTS logic: Self-Consistency with confidence-weighted voting
import numpy as np

def compute_confidence(logprobs: list[float]) -> float:
    """Compute NMLP-based confidence (lower means more certain)."""
    return float(-np.mean(logprobs))  # Negative Mean Log-Probability

def confidence_weighted_voting(samples: list[dict]) -> dict:
    """
    samples: [{'answer': 'A', 'logprobs': [...], 'confidence': float}, ...]
    """
    vote_dict = {}
    for s in samples:
        ans = s['answer']
        conf = s['confidence']
        # Low NMLP = high certainty, so use its inverse as the vote weight
        weight = 1.0 / (conf + 1e-8)
        vote_dict[ans] = vote_dict.get(ans, 0) + weight
    return vote_dict

# Critic Expert Prompt (Self-Reflection stage)
CRITIC_PROMPT = """
Given the following information:
Image: {image}
Question: {question}
Model Answer: {model_answer}
Model Confidence: {confidence}

Please generate a self-reflection critique.
Critique: Based on this question, your answer is "{model_answer}", 
<fill in concise critique here>
"""

# Voter Expert Prompt (Self-Consistency stage)
VOTER_PROMPT = """
Image: {image}
Question: {question}
Candidate options: {options_list}

Generate normalized confidence (probability) for each option.
Sum must equal 1. Output ONLY the array:
[p_1, p_2, ..., p_n]
"""

# Full CA-TTS flow (pseudocode: base_model and expert_model are assumed interfaces, not a real API)
def ca_tts(image, question, base_model, expert_model, n_samples=8):
    # 1. Generate n samples and compute the confidence of each
    samples = []
    for _ in range(n_samples):
        output, logprobs = base_model.generate(image, question)
        conf = compute_confidence(logprobs)
        samples.append({'answer': output, 'logprobs': logprobs, 'confidence': conf})

    # 2. Self-Consistency: confidence-weighted voting
    vote_dict = confidence_weighted_voting(samples)

    # 3. External verification by the Expert Voter
    candidates = sorted(set(s['answer'] for s in samples))  # sorted for a deterministic order
    expert_probs = expert_model.vote(image, question, candidates, VOTER_PROMPT)
    tau1 = 0.5  # weight of the expert's vote
    for k, p in zip(candidates, expert_probs):
        vote_dict[k] = vote_dict.get(k, 0) + tau1 * p

    # 4. Self-Reflection: revise the least confident response
    low_conf_sample = max(samples, key=lambda x: x['confidence'])  # highest NMLP = least certain
    critique = expert_model.critique(image, question, low_conf_sample, CRITIC_PROMPT)
    reflected_answer, _ = base_model.generate(image, question, context=critique)
    tau2 = 0.5  # weight of the reflected answer's vote
    vote_dict[reflected_answer] = vote_dict.get(reflected_answer, 0) + tau2

    # 5. Return the answer with the highest total vote weight
    return max(vote_dict, key=vote_dict.get)

Terminology

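MLLM: A multimodal LLM that processes text and images together. Models such as GPT-4V and Gemini that look at an image and answer questions about it.

confidence miscalibration: The model claims to be sure of an answer that is actually wrong. Like guessing confidently on an exam question you do not really know.

GRPO: Group Relative Policy Optimization. An RL training method that generates several candidate answers at once, compares them against each other, and reinforces the better ones.

Test-Time Scaling: Improving performance by spending more compute at inference time, without retraining the model. Similar to taking extra exam time to review answers more carefully.

VCD: Visual Contrastive Decoding. A technique that uses the difference between the probabilities produced from the original image and from a noised image to pick answers that are actually grounded in the image.

Self-Consistency: Generating several answers to the same question and taking the most frequent one as the final answer. Like choosing the answer by majority vote among several people.

NMLP: Negative Mean Log-Probability. The negated average log-probability of the tokens the model generated, used as a measure of how sure the model is of each token. A lower value means higher certainty.

Hallucination: An LLM plausibly generating content that contradicts the image or the facts. The AI confidently asserting that something exists when it does not.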
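
The VCD entry above can be made concrete with the commonly used contrastive form, in which next-token logits from the original image are pushed away from those of the noised image. The sketch below assumes that standard formulation; the summary does not say whether CA-TTS's Self-Check uses exactly this form, and `alpha`, `beta`, and the toy logits are illustrative values.

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def vcd_distribution(logits_clean, logits_noised, alpha: float = 1.0, beta: float = 0.1) -> np.ndarray:
    """Contrast logits from the clean image against logits from the noised image."""
    logits_clean = np.asarray(logits_clean, dtype=float)
    logits_noised = np.asarray(logits_noised, dtype=float)
    p_clean = softmax(logits_clean)
    # Tokens whose score survives without the image (language prior) are pushed down.
    contrast = (1.0 + alpha) * logits_clean - alpha * logits_noised
    # Plausibility constraint: drop tokens the clean pass already found unlikely.
    keep = p_clean >= beta * p_clean.max()
    return softmax(np.where(keep, contrast, -np.inf))


# Token 0 depends on the image (its logit falls when the image is noised);
# token 1 is driven by the language prior; token 2 is implausible either way.
clean = [2.0, 1.9, -3.0]
noised = [0.0, 1.9, -3.0]
p_vcd = vcd_distribution(clean, noised)
print(p_vcd.round(3))
```

In this toy case the image-grounded token gains probability mass over the prior-driven one, which is the effect the glossary entry describes.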

Related Resources

CA-TTS GitHub Repository

Original Abstract

Recent advances in Multi-modal Large Language Models (MLLMs) have predominantly focused on enhancing visual perception to improve accuracy. However, a critical question remains unexplored: Do models know when they do not know? Through a probing experiment, we reveal a severe confidence miscalibration problem in MLLMs. To address this, we propose Confidence-Driven Reinforcement Learning (CDRL), which uses original-noise image pairs and a novel confidence-based reward to enhance perceptual sensitivity and robustly calibrate the model's confidence. Beyond training benefits, calibrated confidence enables more effective test-time scaling as a free lunch. We further propose Confidence-Aware Test-Time Scaling (CA-TTS), which dynamically coordinates Self-Consistency, Self-Reflection, and Visual Self-Check modules guided by confidence signals. An Expert Model acts in multiple roles (e.g., Planner, Critic, Voter) to schedule these modules and provide external verification. Our integrated framework establishes new state-of-the-art results with consistent 8.8% gains across four benchmarks. More ablation studies demonstrate the effectiveness of each module and scaling superiority.