Reinterpreting Hallucination Detection as Out-of-Distribution Detection: A Geometric Perspective

From Out-of-Distribution Detection to Hallucination Detection: A Geometric View

Feb 6, 2026 • Litian Liu, Reza Pourreza, Yubin Jian +2

TL;DR Highlight

Detects whether an LLM is hallucinating from a single sample, with no additional training, by applying OOD detection directly to hallucination detection.

Who Should Read

ML engineers who need to detect hallucinations in real time in an LLM serving pipeline, and in particular developers who want to improve model reliability on reasoning tasks.

Core Mechanics

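Treating an LLM's next-token prediction as a classification problem lets OOD detection techniques carry over, with adjustments for LLM structure: a hallucination is structurally analogous to an OOD sample.
NCI (Neural Collapse-inspired): measures how close the penultimate-layer embedding (the feature fed into the final layer) is to the weight vector. The farther away it is, the more uncertain the prediction and the more likely a hallucination.
fDBD (decision-boundary based): measures how far the embedding is from the decision boundary. The closer it is, the more uncertain the prediction; hallucinated responses sit nearer the boundary.
Training statistics (the mean feature) can be estimated mathematically from the model weights alone, with no training data. For a zero-bias head, approximating the mean by the origin (0) is sufficient.
Because the vocabulary has hundreds of thousands of entries, computing over all of it is inefficient. Restricting the computation to the top-k candidate tokens improves both accuracy and speed.
Outperforms existing methods that need multiple sampled generations, such as SelfCheckGPT and Semantic Entropy, on reasoning tasks. Consistency-based comparison is at a disadvantage there because valid reasoning paths vary.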
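
The fDBD bullet can be written out as a formula. The block below is read off the code example in this summary rather than quoted from the paper, so treat it as a sketch: the symbols ($z_t$ for the step-$t$ feature, $w_c$ for the language-head row of token $c$, $\hat{c}$ for the top-scoring token, $\mu$ for the training feature mean, $T$ for the number of steps) are our notation, and a zero-bias head is assumed.

```latex
% Distance from the feature z to the decision boundary between the
% predicted token \hat{c} and an alternative token c (zero-bias head):
D(z, c) = \frac{(w_{\hat{c}} - w_c)^{\top} z}{\lVert w_{\hat{c}} - w_c \rVert_2}

% Per-step score: average over the top-k alternatives, normalized by the
% distance to the training feature mean (\mu \approx 0 for a zero-bias head):
\mathrm{fDBD}(z) = \frac{1}{k} \sum_{c \in \mathrm{top}\text{-}k,\; c \neq \hat{c}}
    \frac{D(z, c)}{\lVert z - \mu \rVert_2}

% Sequence-level score: mean over the T generation steps
% (lower values suggest hallucination):
S = \frac{1}{T} \sum_{t=1}^{T} \mathrm{fDBD}(z_t)
```

Because $\hat{c}$ is the argmax, each numerator is a non-negative logit gap, so no absolute value is needed.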

Evidence

Llama-3.2-3B on CSQA: fDBD AUROC 69.24, the best among single-sample methods, vs. Perplexity 63.23 and SelfCheckGPT NLI 64.18
Qwen-2.5-7B on CSQA: fDBD AUROC 72.47 vs. 68.01 for the best baseline, P(True), a gap of about 4.5 points
Qwen-3-32B, a larger 32B-parameter model, on GSM8K: fDBD AUROC 80.60 vs. Perplexity 76.86, so the approach extends to larger models
Across stochastic decoding at temperatures from 0.2 to 1.0, NCI and fDBD consistently hold an AUROC 3 to 6 points above Perplexity
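
To make the AUROC figures above concrete, here is a minimal sketch of how such a number could be computed from detector scores and correctness labels. The function name `auroc` and the toy data are hypothetical, and the pairwise formulation is a standard definition chosen for readability, not the paper's evaluation code. The result is on a 0 to 1 scale, so a reported 69.24 would correspond to 0.6924.

```python
def auroc(scores, is_correct):
    """Probability that a random correct response outscores a random
    hallucinated one, counting ties as half a win."""
    pos = [s for s, y in zip(scores, is_correct) if y]
    neg = [s for s, y in zip(scores, is_correct) if not y]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical detector scores (higher = more likely correct) and labels.
scores = [0.9, 0.8, 0.4, 0.35, 0.1]
is_correct = [True, True, False, True, False]
print(auroc(scores, is_correct))
```

The double loop is quadratic in the number of examples, which is fine for a sanity check; a rank-based implementation would be preferable for large evaluation sets.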

How to Apply

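During the model's forward pass, extract the penultimate-layer hidden state (the feature just before the final layer) and compute its distance to the language-head weights to get a per-token uncertainty score. Averaging the per-step scores gives a sequence-level hallucination score.
When the vocabulary is large, compute distances over only the top k = 100 to 1000 tokens. AUROC is barely sensitive to k (within 0.1 points for a ±20% change), so a rough setting is enough.
For models with a zero-bias language head (such as the Llama family), approximating the training feature mean by the origin (the zero vector) actually performs better than an empirical estimate, so no separate calibration data is needed.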
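
The first step above (score each token by how its feature relates to the language-head weights, then average over steps) can be illustrated with a deliberately simplified proximity score. This is not the paper's exact NCI formula: it assumes a zero-bias head with the feature mean at the origin, and uses plain cosine similarity between each feature and the weight vector of its top-scoring token. The names `cosine` and `proximity_score` and the toy 2-D data are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def proximity_score(features, head_weights):
    """Mean cosine similarity between each step's feature and the weight
    vector of that step's argmax token (higher = more confident)."""
    step_scores = []
    for z in features:
        logits = [sum(w_i * z_i for w_i, z_i in zip(w, z)) for w in head_weights]
        c_hat = max(range(len(logits)), key=logits.__getitem__)
        step_scores.append(cosine(z, head_weights[c_hat]))
    return sum(step_scores) / len(step_scores)


# Toy 2-D language head with three "tokens" (assumed data).
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
confident = [[2.0, 0.1], [0.1, 3.0]]  # features aligned with one weight vector
uncertain = [[1.0, 0.9], [0.9, 1.0]]  # features sitting between two weight vectors
print(proximity_score(confident, W) > proximity_score(uncertain, W))
```

Features that point along a single weight vector score close to 1, while features lying between two weight vectors score lower, which is the geometric intuition behind treating distance from the weights as uncertainty.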

Code Example

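import torch


@torch.no_grad()
def compute_fDBD_score(hidden_states, lm_head_weight, k=1000):
    """
    fDBD-based hallucination detection score (higher = less likely hallucinated).

    hidden_states: [seq_len, hidden_dim], the features fed into the language head
    lm_head_weight: [vocab_size, hidden_dim]
    Assumes a zero-bias language head, so the training feature mean is
    approximated by the origin and each step is normalized by ||z||.
    """
    if hidden_states.shape[0] == 0:
        raise ValueError("hidden_states must contain at least one step")
    # topk(k + 1) fails if k + 1 exceeds the vocabulary size.
    k = min(k, lm_head_weight.shape[0] - 1)
    seq_scores = []

    for z in hidden_states:  # one feature vector per generation step
        logits = lm_head_weight @ z  # [vocab_size]
        c_hat = logits.argmax()
        w_hat = lm_head_weight[c_hat]

        # Top-k alternatives: take k + 1 candidates, then drop c_hat.
        _, topk_idx = logits.topk(k + 1)
        alt_indices = topk_idx[topk_idx != c_hat][:k]

        # Distance from z to the decision boundary between c_hat and each alternative.
        logit_diff = logits[c_hat] - logits[alt_indices]
        w_diff_norm = (w_hat - lm_head_weight[alt_indices]).norm(dim=-1)
        distances = logit_diff / (w_diff_norm + 1e-8)

        # Average the distances and normalize by the feature norm.
        step_score = distances.mean() / (z.norm() + 1e-8)
        seq_scores.append(step_score.item())

    return sum(seq_scores) / len(seq_scores)  # a low score suggests hallucination

# Usage example (assumes a Hugging Face causal LM)
# outputs = model(input_ids, output_hidden_states=True)
# In Hugging Face Llama- and Qwen-style models, hidden_states[-1] is the
# final-norm output that lm_head consumes, i.e. the "penultimate" feature;
# hidden_states[-2] would be one transformer block earlier and unnormalized.
# hidden = outputs.hidden_states[-1]
# weight = model.lm_head.weight
# hidden[0] covers every position, prompt included; slice it to the
# generated span if only the response should be scored.
# score = compute_fDBD_score(hidden[0], weight, k=1000)
# threshold = 0.5  # placeholder; the score scale is model-dependent, so tune on a validation set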
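
The usage comment above leaves the threshold to be tuned on a validation set. One way that could look is sketched below, assuming a labeled validation set of fDBD scores and choosing the cutoff that maximizes TPR minus FPR (Youden's J). That criterion is our choice for illustration, not something the source specifies, and `pick_threshold` and the sample values are hypothetical.

```python
def pick_threshold(scores, is_hallucination):
    """Pick the cutoff that maximizes TPR - FPR when every response with
    score <= cutoff is flagged as a suspected hallucination."""
    n_pos = sum(is_hallucination)
    n_neg = len(is_hallucination) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need both hallucinated and correct examples")
    best_cutoff, best_j = None, float("-inf")
    for cutoff in sorted(set(scores)):
        flagged = [s <= cutoff for s in scores]
        tpr = sum(f and y for f, y in zip(flagged, is_hallucination)) / n_pos
        fpr = sum(f and not y for f, y in zip(flagged, is_hallucination)) / n_neg
        if tpr - fpr > best_j:
            best_cutoff, best_j = cutoff, tpr - fpr
    return best_cutoff, best_j


# Hypothetical validation scores (low = suspicious) and ground-truth labels.
val_scores = [0.10, 0.15, 0.30, 0.35, 0.40, 0.50]
val_labels = [True, True, False, False, True, False]
cutoff, j_stat = pick_threshold(val_scores, val_labels)
print(cutoff, j_stat)
```

A deployment that cares more about missed hallucinations than false alarms might instead fix a target recall and take the smallest cutoff that reaches it.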

Terminology

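OOD (Out-of-Distribution): Data of a kind the model did not see during training; OOD detection is the technique for catching such inputs. It is like a classifier trained only on dogs and cats recognizing a bird as "something I don't know".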

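Hallucination: The phenomenon of an LLM confidently generating content that is not true as if it were fact. Rather than saying "I don't know", the model produces a plausible-sounding falsehood.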

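AUROC: A threshold-free metric of overall detection performance (higher is better). 0.5 is chance level and 1.0 is perfect detection; the figures under Evidence report it on a 0 to 100 scale.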

Penultimate layer: The layer immediately before the model's final output layer. The vector (embedding) it produces is a compressed representation of the model's "thinking".

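Decision Boundary: The boundary along which a classifier separates "this is A" from "that is B". The closer an embedding is to the boundary, the less certain the classification, which is where hallucinations are more likely.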

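NCI (Neural Collapse Inspired): An OOD detector that exploits Neural Collapse, the phenomenon in which each class's embeddings in a well-trained network converge toward the direction of that class's weight vector. The farther an embedding is from that direction, the more uncertain the prediction.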

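Greedy Decoding vs Stochastic Decoding: Greedy decoding picks only the highest-probability token at each step (deterministic); stochastic decoding samples randomly from the probability distribution, controlled by temperature. The method detects hallucinations under both.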

Original Abstract

Detecting hallucinations in large language models is a critical open problem with significant implications for safety and reliability. While existing hallucination detection methods achieve strong performance in question-answering tasks, they remain less effective on tasks requiring reasoning. In this work, we revisit hallucination detection through the lens of out-of-distribution (OOD) detection, a well-studied problem in areas like computer vision. Treating next-token prediction in language models as a classification task allows us to apply OOD techniques, provided appropriate modifications are made to account for the structural differences in large language models. We show that OOD-based approaches yield training-free, single-sample-based detectors, achieving strong accuracy in hallucination detection for reasoning tasks. Overall, our work suggests that reframing hallucination detection as OOD detection provides a promising and scalable pathway toward language model safety.