Answering the Wrong Question: Reasoning Trace Inversion for LLM Abstention

TL;DR Highlight

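Trace Inversion reconstructs, from a model's reasoning trace, the question the model actually answered and compares it with the original question, which improves the accuracy of LLM abstention decisions.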

Who Should Read

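Backend and ML engineers who need to filter out hallucinations or inappropriate answers in LLM-based services, especially developers deploying reasoning models such as DeepSeek-R1 or the GPT o-series to production.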

Core Mechanics

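Existing abstention methods decide whether to refuse based on model confidence, but this works especially poorly for reasoning models (models that use CoT), because high-confidence hallucinations are frequent.
A new framework, 'Query Misalignment': a hallucination is reinterpreted as an answer to a different question rather than a wrong answer. In this view, the model receives the user query q, internally transforms it into q*, and answers q*.
TRACE INVERSION works in three steps: ① generate the model's reasoning trace → ② reconstruct q*, the question the model actually answered, from the trace alone → ③ compare the original query q with q* and flag the response for abstention when the similarity is low.
Similarity is measured by an ensemble majority vote over three modules: sentence-embedding cosine similarity (SE), LLM-based judgment (TrInv-LLM), and a grounding check using IBM's Granite-Guardian-3.3-8b (GROUND).
The best module differs by domain: SE is strongest on math problems (84.2%), TrInv-LLM on reading comprehension (73.3%), and GROUND on bias/safety (75.2%).
Adding a CoT prompt to existing baselines lowers abstention performance by 2.6% on average, which is evidence that existing methods are ill-suited to reasoning models.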
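
The three-module majority vote described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `majority_abstain` and the dictionary layout are hypothetical, and each module is assumed to have already produced a boolean "abstain" vote.

```python
from typing import Mapping


def majority_abstain(votes: Mapping[str, bool]) -> bool:
    """Return True when strictly more than half of the modules vote to abstain."""
    if not votes:
        raise ValueError("at least one module vote is required")
    return sum(votes.values()) > len(votes) / 2


# Hypothetical votes for one query: True means "q and q* look misaligned".
example_votes = {"SE": True, "TrInv-LLM": False, "GROUND": True}
decision = majority_abstain(example_votes)  # two of three modules vote to abstain
```

With an odd number of modules a strict majority never ties, which is one practical reason to keep all three modules in the ensemble.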

Evidence

TRACE INVERSION was evaluated on 4 models (phi-4, Qwen2.5-32B, DeepSeek-R1-Distill-Qwen-32B, gpt-oss-120b) × 9 datasets = 36 settings, and it beat the competitive baselines in 33 of them.
Average Abstain Accuracy improves by 8.7% over competing methods, with +11.6% on phi-4 and +9.5% on Qwen2.5-32B.
The advantage is clearer on datasets that include unanswerable questions: baselines show a performance gap of 13-20% or more between answerable and unanswerable questions depending on the domain, whereas the gap for TRACE INVERSION is only 3-6%.
Overall Abstain Accuracy is 0.733 with DeepSeek-R1-Distill-Qwen-32B and 0.762 with gpt-oss-120b, well ahead of the strongest baselines (0.604 and 0.648, respectively).
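
To make the metric concrete, here is a small sketch of Abstain Accuracy and the answerable/unanswerable gap. It assumes the informal definition given in the Terminology section (the share of correct abstain/answer decisions); the paper's exact formula may differ. The function names and the toy labels are hypothetical and are not results from the paper.

```python
from typing import Sequence


def abstain_accuracy(should_abstain: Sequence[bool], did_abstain: Sequence[bool]) -> float:
    """Fraction of examples where the abstain/answer decision matches the label."""
    if len(should_abstain) != len(did_abstain) or not should_abstain:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(s == d for s, d in zip(should_abstain, did_abstain))
    return correct / len(should_abstain)


def answerable_gap(should_abstain: Sequence[bool], did_abstain: Sequence[bool],
                   answerable: Sequence[bool]) -> float:
    """Absolute difference in Abstain Accuracy between answerable and unanswerable questions."""
    def subset_accuracy(flag: bool) -> float:
        idx = [i for i, a in enumerate(answerable) if a == flag]
        return abstain_accuracy([should_abstain[i] for i in idx],
                                [did_abstain[i] for i in idx])
    return abs(subset_accuracy(True) - subset_accuracy(False))


# Toy data (made up for illustration only).
should = [False, False, True, True, True, False]
did = [False, True, True, True, True, False]
is_answerable = [True, True, False, False, False, True]

overall = abstain_accuracy(should, did)
gap = answerable_gap(should, did, is_answerable)
```

A small gap means the method is about equally reliable on both kinds of question, which is the property the comparison above highlights.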

How to Apply

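In a QA pipeline built on a reasoning model (DeepSeek-R1, the GPT o-series, etc.): after the model responds, extract the reasoning trace, reconstruct q* with the 'Query Reconstruction Prompt', compute the sentence-embedding cosine similarity between q* and the original question, and block the answer when the score falls below a threshold. This can serve as a guardrail.
If cost is a concern, start with a single module instead of the ensemble: for a math/knowledge QA service, the SE module alone (sentence transformer all-MiniLM-L6-v2) can give competitive performance (84.2%); for a bias/safety-related service, a guardrail model such as Granite-Guardian can be used.
If existing confidence-based filters (token probabilities, asking the model how confident it is) misbehave with reasoning models, consider replacing them with TRACE INVERSION. It is especially effective for services with many unanswerable questions, questions with false premises, or subjective questions.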
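
The guardrail above depends on a similarity threshold, and the source does not say how to choose one. One possible approach, sketched below as an assumption rather than the paper's procedure, is to sweep candidate thresholds on a small labeled validation set and keep the one whose `score < threshold` decisions agree most often with the labels. The name `pick_threshold` and the toy scores are hypothetical.

```python
from typing import Sequence, Tuple


def pick_threshold(scores: Sequence[float], should_abstain: Sequence[bool]) -> Tuple[float, float]:
    """Return (threshold, accuracy) maximizing agreement of `score < threshold` with the labels."""
    if len(scores) != len(should_abstain) or not scores:
        raise ValueError("inputs must be non-empty and of equal length")
    # Candidates: every observed score, plus one value above the maximum so that
    # "abstain on everything" is also considered.
    candidates = sorted(set(scores)) + [max(scores) + 1e-6]
    best_t, best_acc = candidates[0], -1.0
    for t in candidates:
        acc = sum((s < t) == y for s, y in zip(scores, should_abstain)) / len(scores)
        if acc > best_acc:  # strict comparison keeps the lowest threshold on ties
            best_t, best_acc = t, acc
    return best_t, best_acc


# Toy validation set: cosine similarity between q and q*, and whether abstaining was correct.
val_scores = [0.95, 0.91, 0.62, 0.40, 0.88, 0.55]
val_labels = [False, False, True, True, False, True]
threshold, accuracy = pick_threshold(val_scores, val_labels)
```

Because the best threshold is likely to vary with the domain and the embedding model, it seems safer to fit it per service than to reuse a single global value.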

Code Example

# Example implementation of the core TRACE INVERSION logic
from sentence_transformers import SentenceTransformer, util

model_embed = SentenceTransformer('all-MiniLM-L6-v2')

# Step 1: generate a reasoning trace with the LLM
# (`llm` is assumed to be any client object exposing generate(prompt: str) -> str)
def get_reasoning_trace(llm, query):
    prompt = f"Let's think step by step.\n\nQuestion: {query}\n\nReasoning:"
    return llm.generate(prompt)

# Step 2: reconstruct the original question from the trace alone
QUERY_RECONSTRUCTION_PROMPT = """
You are a puzzle solver. Given the following reasoning trace,
reconstruct the initial question by interpreting the steps in the reasoning trace.
Do not answer the question.

Reasoning Trace:
{reasoning_trace}

Reconstructed query:
"""

def reconstruct_query(llm, reasoning_trace):
    prompt = QUERY_RECONSTRUCTION_PROMPT.format(reasoning_trace=reasoning_trace)
    return llm.generate(prompt)

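# Step 3-A: measure distance with sentence-embedding similarity (SE module)
# The default threshold should be tuned on labeled validation data for your domain.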
def se_similarity(original_query, reconstructed_query, threshold=0.85):
    emb_q = model_embed.encode(original_query, convert_to_tensor=True)
    emb_q_star = model_embed.encode(reconstructed_query, convert_to_tensor=True)
    score = util.cos_sim(emb_q, emb_q_star).item()
    should_abstain = score < threshold
    return score, should_abstain

# Step 3-B: LLM judgment module (TrInv-LLM module)
TRINV_LLM_PROMPT = """
Do the following two prompts convey the same framing, intent, and context?
Prompt 1: {q1}
Prompt 2: {q2}
Select YES or NO:
Final answer:
"""

def trinv_llm_check(llm, original_query, reconstructed_query):
    prompt = TRINV_LLM_PROMPT.format(q1=original_query, q2=reconstructed_query)
    response = llm.generate(prompt)
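    # Match whole-word verdicts only: a plain substring test would also fire on "NOT" or "KNOW"
    words = [w.strip('.,:;!?*"\'()') for w in response.upper().split()]
    verdicts = [w for w in words if w in ('YES', 'NO')]
    # Use the last verdict so any reasoning before the final answer is ignored
    should_abstain = bool(verdicts) and verdicts[-1] == 'NO'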
    return should_abstain

# Ensemble: majority vote
def trace_inversion(llm, query, threshold=0.85):
    trace = get_reasoning_trace(llm, query)
    q_star = reconstruct_query(llm, trace)
    
    _, se_abstain = se_similarity(query, q_star, threshold)
    llm_abstain = trinv_llm_check(llm, query, q_star)
    # The GROUND module requires a call to Granite-Guardian (omitted here)
    
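    votes = [se_abstain, llm_abstain]  # three votes once GROUND is included
    # Strict majority: with only these two votes, both modules must agree before abstaining
    should_abstain = sum(votes) > len(votes) / 2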
    return should_abstain, q_star, trace
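
The example above obtains a trace by prompting for step-by-step reasoning. When a reasoning model emits its trace natively, the trace has to be separated from the final answer first. The sketch below assumes the model wraps its reasoning in `<think>...</think>` tags, as DeepSeek-R1-style models commonly do; other models expose reasoning differently, and the helper name `split_trace_and_answer` is hypothetical.

```python
import re
from typing import Tuple

# Assumption: reasoning is delimited by <think>...</think> in the raw output.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_trace_and_answer(raw_output: str) -> Tuple[str, str]:
    """Split raw model output into (reasoning_trace, final_answer)."""
    match = THINK_BLOCK.search(raw_output)
    if match is None:
        # No tagged reasoning found: treat the whole output as the trace.
        return raw_output.strip(), ""
    trace = match.group(1).strip()
    answer = raw_output[match.end():].strip()
    return trace, answer


raw = "<think>\nStep 1: add 2 and 3.\n</think>\nThe answer is 5."
trace, answer = split_trace_and_answer(raw)
```

Only `trace` would then be passed to the query-reconstruction step, so that q* is inferred from the reasoning alone and not from the final answer.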

Terminology

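Abstention: The ability of an LLM to decline, with "I don't know" or "I can't answer", when a question is one it cannot or should not answer. It is a safeguard against producing wrong information by trying to answer every question.

Reasoning Trace: The thought process an LLM writes out before answering, in the form 'Step 1: ..., Step 2: ...'. Also called Chain-of-Thought (CoT); it is similar to a person writing out their working on an exam paper.

Hallucination: The phenomenon where an LLM confidently generates information that does not exist as if it were real. The model 'makes things up' and states false facts with conviction.

Abstain Accuracy: A metric for how well an LLM makes abstention decisions: the proportion of cases where it answered when it should have answered and abstained when it should have abstained.

Calibration: Aligning a model's stated confidence with its actual accuracy. For example, in a well-calibrated model, answers given with 70% confidence are correct about 70% of the time.

Sentence Embedding: A technique that compresses the meaning of a sentence into a numeric vector. Sentences with similar meaning lie close together in vector space, so cosine similarity can quantify differences in meaning.

Granite-Guardian: A guardrail model built by IBM (Granite-Guardian-3.3-8b). It acts as a safety filter that detects, among other things, whether one text is grounded in another.

CoT (Chain-of-Thought): A prompting technique that has the model output intermediate reasoning steps in order before the final answer. Prompting with something like 'think step by step' tends to improve performance on complex problems.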

Related Resources

Official TRACE INVERSION code repository

Original Abstract

For Large Language Models (LLMs) to be reliably deployed, models must effectively know when not to answer: abstain. Reasoning models, in particular, have gained attention for impressive performance on complex tasks. However, reasoning models have been shown to have worse abstention abilities. Taking the vulnerabilities of reasoning models into account, we propose our Query Misalignment Framework. Hallucinations resulting in failed abstention can be reinterpreted as LLMs answering the wrong question (rather than answering a question incorrectly). Based on this framework, we develop a new class of state-of-the-art abstention methods called Trace Inversion. First, we generate the reasoning trace of a model. Based on only the trace, we then reconstruct the most likely query that the model responded to. Finally, we compare the initial query with the reconstructed query. Low similarity score between the initial query and reconstructed query suggests that the model likely answered the question incorrectly and is flagged to abstain. Extensive experiments demonstrate that Trace Inversion effectively boosts abstention performance in four frontier LLMs across nine abstention QA datasets, beating competitive baselines in 33 out of 36 settings.