LLM Delusion: A More Dangerous Form of Hallucination, Wrong with High Confidence

Delusions of Large Language Models

Mar 9, 2025 • Hongshen Xu, Zi Yang, Zichen Zhu +7

TL;DR Highlight

LLM 'delusion', where a model gives a wrong answer with high confidence, is much harder to catch than an ordinary hallucination and resists both fine-tuning and self-reflection.

Who Should Read

AI engineers and product builders who deal with hallucination in LLM-based services, especially developers trying to improve reliability with RAG or multi-agent systems.

Core Mechanics

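Defines 'delusion', a new concept distinct from hallucination: a wrong answer that the model itself holds with high confidence. Wrong answers are classified as delusions using the average confidence on correct answers as the threshold.
With Qwen2.5-7B on TriviaQA, 20% to 78% of wrong answers are classified as delusions. Scaling the model up to 72B does not substantially reduce the share of delusions among wrong answers.
Even when prompted to refuse with 'I don't know', the model refuses delusions far less often than ordinary hallucinations; it believes its delusions more strongly.
Even after SFT (supervised fine-tuning) that trains the model to refuse what it does not know, the refusal rate for delusions stays below that for hallucinations, so fine-tuning alone is unlikely to fix them.
Given a self-reflection prompt (re-examine the previous answer), the model sticks to delusions more often than to hallucinations.
Training-data noise creates delusions: the more often the same wrong answer is repeated in the training data (higher noise intensity), the more sharply the delusion ratio rises.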
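
The threshold rule in the first point above can be sketched as follows. This is a minimal illustration, not the paper's code: the `(confidence, is_correct)` record format and the function name `classify_wrong_answers` are hypothetical, and treating only confidence strictly above the threshold as a delusion is an assumption.

```python
from statistics import mean

def classify_wrong_answers(records):
    """Split wrong answers into delusions and ordinary hallucinations.

    records: list of (confidence, is_correct) pairs (hypothetical format).
    The threshold is the mean confidence over the correct answers.
    """
    correct_conf = [conf for conf, ok in records if ok]
    if not correct_conf:
        raise ValueError("need at least one correct answer to set the threshold")
    threshold = mean(correct_conf)
    # Assumption: strictly above the threshold counts as a delusion.
    delusions = [conf for conf, ok in records if not ok and conf > threshold]
    hallucinations = [conf for conf, ok in records if not ok and conf <= threshold]
    return threshold, delusions, hallucinations

# Toy data: two correct answers and three wrong ones.
records = [(0.9, True), (0.7, True), (0.95, False), (0.4, False), (0.85, False)]
threshold, delusions, hallucinations = classify_wrong_answers(records)
delusion_share = len(delusions) / (len(delusions) + len(hallucinations))
print(f"threshold={threshold:.2f}, delusion share of wrong answers={delusion_share:.2f}")
```

How the confidence value itself is obtained depends on the uncertainty estimator in use; the sketch takes it as given.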

Evidence

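With RAG, the delusion ratio drops from 7.3% to 2.6% (−64.7%) for Llama-3.1-8B and from 8.3% to 2.7% (−67.8%) for Qwen2.5-7B.
Multi-agent voting (requiring unanimity across 3 models) cuts Mistral-7B's delusion ratio from 14.6% to 1.3% (−91.3%).
Removing samples similar to delusion-inducing ones from the training data and then retraining yields a reported 71.3% reduction in delusion ratio (Table 2: 9.9% → 3.8%).
Even with SFT refusal data at 90%, Llama-3.1-8B refuses 99.1% of hallucinations but only 96.2% of delusions, so the gap persists.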

How to Apply

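If you already run a RAG pipeline, it also helps reduce delusions. Grounding answers in about 20 retrieved passages can cut delusions by roughly 65%.
When adding multi-agent voting for important answers, trusting an answer only when at least 2 of 3 models agree can reduce delusions by more than 80%.
When managing fine-tuning data quality, check whether the same wrong answer recurs across samples and remove similar samples with cosine similarity > 0.9; this helps prevent delusions.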
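
The data-cleaning step in the last point can be sketched as below. It assumes each training sample already has an embedding from some sentence encoder (not shown, and not specified by the source) and that a set of delusion-inducing samples has been flagged; the helper names and the toy 2-dimensional vectors are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)

def drop_similar_to_flagged(samples, embeddings, flagged_embeddings, threshold=0.9):
    """Keep only samples whose similarity to every flagged embedding is <= threshold."""
    kept = []
    for sample, emb in zip(samples, embeddings):
        too_similar = any(
            cosine_similarity(emb, flagged) > threshold
            for flagged in flagged_embeddings
        )
        if not too_similar:
            kept.append(sample)
    return kept

# Toy example: one flagged (delusion-inducing) embedding, three training samples.
flagged = [[1.0, 0.0]]
samples = ["a", "b", "c"]
embeddings = [[0.99, 0.05], [0.0, 1.0], [0.7, 0.7]]
cleaned = drop_similar_to_flagged(samples, embeddings, flagged)
print(cleaned)
```

The pairwise loop is quadratic in the worst case; for large datasets an approximate nearest-neighbor index would be the usual substitute.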

Code Example

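# Example: filtering delusions with multi-agent voting
from collections import Counter

import anthropic

def get_answer(client, model, question):
    # Ask one model for a short answer to the question.
    response = client.messages.create(
        model=model,
        max_tokens=128,
        messages=[{"role": "user", "content": f"Answer concisely: {question}"}]
    )
    return response.content[0].text.strip()

def normalize(answer):
    # Keep only lowercase letters and digits so that answers differing
    # in case, spacing, or punctuation still count as the same vote.
    return "".join(ch for ch in answer.lower() if ch.isalnum())

def multi_agent_vote(question, threshold=2):
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    models = [
        "claude-opus-4-6",
        "claude-sonnet-4-6",
        "claude-haiku-4-5-20251001"
    ]
    answers = [get_answer(client, m, question) for m in models]

    # Count votes on normalized answers; raw strings rarely match exactly.
    counts = Counter(normalize(a) for a in answers)
    top_key, top_count = counts.most_common(1)[0]

    if top_count >= threshold:
        # Return the first original answer that matches the winning vote.
        top_answer = next(a for a in answers if normalize(a) == top_key)
        return top_answer, True   # enough models agree: trust the answer
    return None, False            # no agreement: possible delusion, refuse

question = "Who wrote A Song of Ice and Fire?"
answer, trusted = multi_agent_vote(question, threshold=2)
print(f"Answer: {answer}, Trusted: {trusted}")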

Terminology

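Hallucination: An LLM producing plausible-sounding content that is factually wrong; the model confidently states an answer it 'made up'.

Delusion: A concept newly defined in this paper. The subset of hallucinations in which the model has especially high confidence; a hallucination the model sticks to even when told it is wrong.

Logits: The raw scores a model computes for each candidate token when producing output. A higher value can be read as the model being more confident in that token.

SFT: Short for Supervised Fine-Tuning. Training the model to imitate example answers, much like working through solved problems at school.

Confidence Calibration: Aligning a model's confidence with its actual accuracy. A well-calibrated model that reports 90% confidence should be right about 90% of the time.

RAG: Retrieval-Augmented Generation. The model retrieves external documents and consults them before answering, so that it does not rely solely on its internal knowledge.

Uncertainty Estimation: Techniques for quantifying how sure a model is of its own answer. High uncertainty signals that the model does not really know.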
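
To make the Logits and Uncertainty Estimation entries concrete, here is one simple way logits could be turned into a confidence score: apply softmax at each step and average the probability of the token that was actually chosen. This is only an illustrative estimator and is not claimed to be the one used in the paper; the function names and toy inputs are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sequence_confidence(step_logits, chosen_ids):
    """Mean probability assigned to the chosen token at each generation step."""
    probs = [softmax(logits)[idx] for logits, idx in zip(step_logits, chosen_ids)]
    return sum(probs) / len(probs)

# Toy example: two generation steps over a 3-token vocabulary.
step_logits = [[4.0, 1.0, 0.0], [0.5, 0.5, 0.5]]
chosen_ids = [0, 2]
confidence = sequence_confidence(step_logits, chosen_ids)
print(f"confidence={confidence:.3f}")
```

A delusion in the sense above would be a wrong answer for which a score like this is still high.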

Original Abstract

Large Language Models often generate factually incorrect but plausible outputs, known as hallucinations. We identify a more insidious phenomenon, LLM delusion, defined as high belief hallucinations, incorrect outputs with abnormally high confidence, making them harder to detect and mitigate. Unlike ordinary hallucinations, delusions persist with low uncertainty, posing significant challenges to model reliability. Through empirical analysis across different model families and sizes on several Question Answering tasks, we show that delusions are prevalent and distinct from hallucinations. LLMs exhibit lower honesty with delusions, which are harder to override via finetuning or self reflection. We link delusion formation with training dynamics and dataset noise and explore mitigation strategies such as retrieval augmented generation and multi agent debating to mitigate delusions. By systematically investigating the nature, prevalence, and mitigation of LLM delusions, our study provides insights into the underlying causes of this phenomenon and outlines future directions for improving model reliability.