Clustered Self-Assessment: A Simple and Effective Method for Quantifying LLM Uncertainty

TL;DR Highlight

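A technique in which an LLM groups several of its own sampled answers by meaning, turns the groups into a multiple-choice question, and grades itself, producing a numeric answer to "how confident are you in this answer?"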

Who Should Read

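Backend and ML engineers who want to show users how reliable an LLM response is, especially developers building hallucination-detection pipelines in healthcare, legal, or research domains.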

Core Mechanics

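The LLM generates several answers to the same question. An NLI model (a model that judges the semantic relationship between sentences) clusters answers that mean the same thing, the clusters become the options of a multiple-choice question, and the model itself picks the correct option.
The token probability the model assigns when choosing an option is used directly as the confidence score. This number tracks actual accuracy well, so people can interpret it intuitively.
Why clustering is central: if answers with the same meaning were split into separate options, probability would be spread across them and the confidence measurement would break. Clustering prevents this.
A 'None of the above' option is always appended last to handle the case where every listed option is wrong.
The method needs no training, yet its scores can also serve as the training signal for a learned probe (a classifier that predicts uncertainty from the LLM's internal hidden states).
Limitations: it cannot be used with closed-source models that do not expose logits, and it depends on an external NLI model.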
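The probability-splitting problem that clustering prevents can be illustrated with a toy calculation. This is a minimal sketch, not the method itself: the method clusters before the question is asked, so duplicates never appear as options. The function name `merge_option_probs` and all probabilities below are made up for illustration.

```python
from collections import defaultdict

def merge_option_probs(option_probs, cluster_of):
    """Sum the probabilities of options that belong to the same semantic cluster."""
    merged = defaultdict(float)
    for option, prob in option_probs.items():
        merged[cluster_of[option]] += prob
    return dict(merged)

# Hypothetical probabilities a model might assign if paraphrases were
# listed as separate options (illustrative numbers, not measured data).
unclustered = {"Paris": 0.45, "It's Paris": 0.40, "Rome": 0.15}
cluster_of = {"Paris": "paris", "It's Paris": "paris", "Rome": "rome"}

merged = merge_option_probs(unclustered, cluster_of)
print("confidence read from the first option alone:", unclustered["Paris"])
print("probability mass behind the same meaning:", round(merged["paris"], 2))
```

In this toy case the first option alone would understate the support for "Paris", because a paraphrase absorbs part of the probability.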

Evidence

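On TriviaQA with Qwen2.5-32B, the method reaches an AUROC of 0.940 versus 0.883 for the runner-up, Probability. That is a gain of 0.057, or about 6.5% in relative terms. It ranks first among all 14 baselines compared.
Sample efficiency: with only 2 additional samples, AUROC is 0.933 on TriviaQA (TQA), higher than SAR (0.884) and Semantic Entropy (0.868), which use 16 samples.
Calibration: on TQA with Qwen2.5-32B, the Brier Score is 0.0843, clearly better than P(True) (0.1172) and Probability (0.2267).
Ablation: on NQ with Qwen2.5-32B, removing clustering drops AUROC from 0.850 to 0.741, and removing sampling (which makes the method equivalent to P(True)) drops it from 0.850 to 0.785, so both components are needed.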
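For readers who want to run the same kind of evaluation on their own confidence scores, the two metrics above can be computed as follows. This is a minimal sketch using the standard definitions of AUROC and Brier score; the toy confidences and correctness flags are invented for illustration and are not results from the paper.

```python
def auroc(confidences, correct):
    """Probability that a randomly chosen correct answer receives a higher
    confidence than a randomly chosen incorrect one (ties count as 0.5)."""
    pos = [c for c, y in zip(confidences, correct) if y]
    neg = [c for c, y in zip(confidences, correct) if not y]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def brier_score(confidences, correct):
    """Mean squared gap between confidence and the 0/1 outcome (lower is better)."""
    return sum((c - float(y)) ** 2 for c, y in zip(confidences, correct)) / len(confidences)

# Toy data (illustrative only): confidence per answer and whether it was correct.
confidences = [0.9, 0.4, 0.3, 0.6]
correct = [True, True, False, False]

print("AUROC:", auroc(confidences, correct))
print("Brier:", round(brier_score(confidences, correct), 4))
```

The pairwise AUROC loop is quadratic in the number of answers, which is fine for a spot check; a library implementation is the better choice for full datasets.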

How to Apply

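After the LLM API returns its response, sample the same question 2 to 8 more times at temperature=0.5, use deberta-large-mnli to judge entailment between answers and group them into clusters, then query the model again with an MCQ prompt. This works directly with open-source models that expose logits (the Qwen and Gemma families).
In a RAG pipeline, when scoring the reliability of answers grounded in retrieved results, you can apply a threshold to this confidence score (for example, mark anything below 0.5 as 'uncertain') before showing it to users, or route low-confidence answers to a human review queue.
Probe training scenario: use the confidence scores from this method as soft labels to train a logistic regression model that maps LLM hidden states to predicted uncertainty. Afterwards, confidence can be predicted in a single forward pass with no additional sampling.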
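The probe-training scenario can be sketched as a logistic regression fitted against soft targets. This is a minimal pure-Python illustration under stated assumptions: the two-dimensional `features` stand in for real hidden states, the `soft_labels` stand in for confidence scores produced by the method, and the names `train_probe` and `predict` and the hyperparameters are hypothetical, not taken from the paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    """Predicted confidence for one feature vector."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def train_probe(features, soft_labels, lr=0.5, epochs=2000):
    """Fit logistic regression by gradient descent on cross-entropy
    against soft targets in [0, 1]."""
    dim = len(features[0])
    n = len(features)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, t in zip(features, soft_labels):
            err = predict(w, b, x) - t  # gradient of the loss w.r.t. the logit
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Toy stand-ins (illustrative only): tiny "hidden states" and the
# confidence scores that would serve as soft labels.
features = [[1.0, 0.0], [0.8, 0.2], [0.2, 0.8], [0.0, 1.0]]
soft_labels = [0.9, 0.8, 0.3, 0.1]

w, b = train_probe(features, soft_labels)
for x, t in zip(features, soft_labels):
    print(x, "target:", t, "predicted:", round(predict(w, b, x), 3))
```

Training on soft labels rather than hard 0/1 labels lets the probe imitate the graded confidence values instead of only a correct/incorrect decision.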

Code Example

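# Core flow of Clustered Self-Assessment (sketch)
from transformers import pipeline

# 1. Load the NLI model (the checkpoint named in "How to Apply";
#    its labels are CONTRADICTION / NEUTRAL / ENTAILMENT)
nli = pipeline("text-classification", model="microsoft/deberta-large-mnli")

def nli_label(premise, hypothesis):
    """Return the upper-cased NLI label for a (premise, hypothesis) pair."""
    out = nli({"text": premise, "text_pair": hypothesis})
    if isinstance(out, list):  # some transformers versions wrap the result in a list
        out = out[0]
    return out["label"].upper()

def cluster_answers(answers):
    """Group answers that mean the same thing, using NLI."""
    clusters = []
    for ans in answers:
        placed = False
        for cluster in clusters:
            rep = cluster[0]  # the first member represents the cluster
            # Merge only on bidirectional entailment: each answer must entail the other
            if nli_label(rep, ans) == "ENTAILMENT" and nli_label(ans, rep) == "ENTAILMENT":
                cluster.append(ans)
                placed = True
                break
        if not placed:
            clusters.append([ans])
    return clusters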

def build_mcq(question, clusters):
    """Build a multiple-choice question from each cluster's representative answer."""
    choices = [c[0] for c in clusters]
    labels = [chr(65 + i) for i in range(len(choices))]  # A, B, C...
    last_label = chr(65 + len(choices))
    
    prompt = f"""Task:
Select the one correct answer to the question from the choices provided.
If none of the provided choices is correct, select the final choice ({last_label}) None of the above.

Question:
{question}

Choices:
"""
    for label, choice in zip(labels, choices):
        prompt += f"({label}) {choice}\n"
    prompt += f"({last_label}) None of the above\n\nAnswer:\nThe answer is ("
    return prompt, labels

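# 2. Usage example
question = "Where is the Eiffel Tower?"
# Greedy answer first, followed by 2 temperature-sampled answers
sampled_answers = ["Paris", "It's Paris", "Rome"]

clusters = cluster_answers(sampled_answers)
# Intended grouping (actual output depends on the NLI model's predictions):
# [["Paris", "It's Paris"], ["Rome"]]

mcq_prompt, labels = build_mcq(question, clusters)
print(mcq_prompt)
# With the grouping above, the choices are: (A) Paris  (B) Rome  (C) None of the above

# 3. Query the LLM with the MCQ prompt and use the token probability of the first
#    option (A = the cluster containing the greedy answer) as the confidence score
# confidence_score = logit_prob[token_A]  # read directly from the model's logits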
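Step 3 is left as a comment above. The part that turns raw logits into a token probability can be sketched in pure Python as follows. This is a minimal sketch under assumptions: with a Hugging Face causal LM the next-token logits would typically come from `model(**inputs).logits[0, -1]`, and the token id of the option letter depends on the tokenizer. The five-entry vocabulary and the id chosen for "A" below are hypothetical.

```python
import math

def token_probability(logits, token_id):
    """Softmax probability of one token, computed stably from raw logits."""
    m = max(logits)  # subtract the maximum to avoid overflow in exp()
    exps = [math.exp(x - m) for x in logits]
    return exps[token_id] / sum(exps)

# Hypothetical next-token logits over a 5-token vocabulary (illustrative only).
next_token_logits = [3.2, 1.1, 0.4, -0.5, -2.0]
token_A = 0  # assumed id of the token "A"; look it up with the real tokenizer

confidence_score = token_probability(next_token_logits, token_A)
print("confidence in option A:", round(confidence_score, 3))
```

The probability is taken over the full vocabulary, matching the summary's description that the token probability is used as is. Whether the authors additionally renormalize over the option letters is not stated here.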

Terminology

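AUROC: A score for how well the model separates correct from incorrect answers. 1.0 is perfect; 0.5 is no better than random guessing.

NLI: Natural Language Inference. A model that classifies whether two sentences stand in a relation of entailment (one follows from the other), contradiction, or neutrality.

Semantic Entropy: A technique that measures uncertainty from the semantic diversity of the answers an LLM gives when asked the same question several times. If the answer changes every time, the model is considered uncertain.

Brier Score: A score for how close predicted probabilities are to actual outcomes. Lower is better. It works on the same principle as checking how well a weather forecast's "70% chance of rain" matches what actually happens.

Probe: A small classifier that takes hidden states from an LLM's intermediate layers (the internal vectors produced while the model generates an answer) as input and predicts a specific property.

Logit: The raw score an LLM assigns to each token when predicting the next token. Passing logits through softmax converts them into probabilities between 0 and 1.

MCQ: Multiple Choice Question. In this paper it is used to have the LLM itself pick which of its own answers is correct.

Calibration: The degree to which a model that says it is "70% confident" is actually right 70% of the time. With good calibration, the confidence score can be trusted.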

Related Resources

GitHub - Clustered Self-Assessment code

Original Abstract

Large language models (LLMs) demonstrate remarkable performance across diverse tasks, but they often generate responses that appear plausible while being factually incorrect. This problem is compounded by the lack of explicit uncertainty estimates, which makes it difficult for users to judge the reliability of model outputs. Existing uncertainty quantification methods typically rely on indirect signals, such as entropy across sampled generations. These signals can be difficult to interpret and do not fully leverage the model's ability to assess its own uncertainty. We propose a simple yet effective self-assessment method for uncertainty quantification in LLMs. Our approach groups sampled generations into semantically distinct clusters, converts them into answer options in a structured multiple-choice question, and uses the probability assigned by the LLM to each option as a confidence estimate. Experiments across multiple models and datasets show that our method consistently outperforms baseline approaches. Notably, it achieves competitive performance with as few as two additional samples, demonstrating both its effectiveness and efficiency.