NAACL: Noise-AwAre Verbal Confidence Calibration for LLMs in RAG Systems
TL;DR Highlight
When noisy retrieval results leak into a RAG pipeline, LLMs become confidently wrong; this paper fixes that with LoRA fine-tuning on just ~2K examples
Who Should Read
Backend and ML engineers running RAG pipelines in production who want to stop models from confidently giving wrong answers. Especially relevant for teams in domains like healthcare, law, and finance, where model reliability matters.
Core Mechanics
- All models tested in RAG settings (Llama-3.1-8B, Qwen2.5-7B, and the DeepSeek-R1-Distill family) had an average ECE (a confidence-accuracy mismatch metric; lower is better) exceeding 0.4, far above the 0.25 baseline threshold
- When counterfactual passages mix in, models maintain high confidence even in wrong answers — 'overconfidence' persists despite conflicting evidence
- Even irrelevant passages boost confidence — more text means stronger baseless certainty
- Proposes 3 NAACL Rules: (1) use internal knowledge over external when evidence conflicts, (2) ignore irrelevant passages, (3) fall back to parametric knowledge when no valid passages exist
- LoRA SFT on 2K HotpotQA examples teaches the model to judge passage quality and adjust confidence accordingly — no external teacher model needed
- Simple 'Label-only SFT' (just training on answers + confidence labels) is worse than NAACL — confidence improvement comes from noise-aware reasoning, not label fitting
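The three rules above can be sketched as a small routing function. This is an illustrative sketch, not the paper's API: `route_evidence`, the label vocabulary, and the `contradictory` flag are hypothetical names standing in for judgments the model itself makes during noise-aware reasoning.

```python
# Hypothetical sketch of the 3 NAACL Rules as a routing decision.
# `labels` holds per-passage relevance judgments; `contradictory` flags
# whether the highly-relevant passages disagree with each other.

def route_evidence(labels: list[str], contradictory: bool) -> str:
    """Decide whether the answer should rely on retrieved context
    or on the model's own parametric knowledge."""
    highly = [l for l in labels if l == "highly_relevant"]
    if len(highly) >= 2 and contradictory:
        return "parametric"   # Rule 1: conflicting evidence -> internal knowledge
    if len(highly) >= 1:
        return "context"      # Rule 2: valid passage(s) exist, irrelevant ones ignored
    return "parametric"       # Rule 3: no valid passage -> fall back

print(route_evidence(["highly_relevant", "irrelevant", "relevant"], contradictory=False))  # context
```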
Evidence
- After NAACL: in-domain ECE improved by 10.9% on average, out-of-distribution ECE improved by 8.0% (across 4 benchmarks)
- Llama-3.1-8B: ECE dropped from 0.377 to 0.266 (an absolute reduction of ~0.11 vs CoT), AUROC improved from 0.591 to 0.751
- Adding counterfactual noise: Llama-3.1-8B ECE surged 31.6% vs Gold-only, DeepSeek-R1-Distill-Llama-8B surged 35.1%
- Trained with k=3 passages but tested with k=5 still showed 8% ECE improvement over Vanilla — confirming generalization
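For readers who want to reproduce these numbers on their own pipeline, ECE can be computed from (confidence, correctness) pairs with standard equal-width binning. This is a generic sketch of the metric, not the paper's exact evaluation code:

```python
import numpy as np

def expected_calibration_error(confs, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin."""
    confs = np.asarray(confs, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confs > lo) & (confs <= hi)   # half-open bins (lo, hi]
        if mask.any():
            gap = abs(confs[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap          # weight by bin population
    return ece
```

A perfectly calibrated model (90% confidence, 90% accuracy) scores ~0; a model that is 90% confident but always wrong scores ~0.9.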
How to Apply
- Immediately applicable prompt approach: use the paper's noise-aware prompt (Figure 8) as-is — classify passages as 'Highly Relevant / Relevant / Irrelevant' and follow 3 rules for ECE improvement over CoT without fine-tuning
- For fine-tuning: generate counterfactual/consistent/irrelevant scenario passages on 2K HotpotQA samples, then LoRA SFT — use the public code (https://github.com/HKUST-KnowComp/NAACL) with LLaMA-Factory
- When exposing RAG answer confidence to users, having the model explicitly judge passages before outputting confidence makes it interpretable — traceable 'why this confidence level' rather than just a number
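The data-construction step in the fine-tuning bullet above might look roughly like the sketch below. The function name, scenario names, and passage pools are hypothetical; the actual pipeline lives in the linked repo:

```python
import random

# Hedged sketch of assembling the three noise scenarios for SFT data
# (counterfactual / consistent / irrelevant). The gold, counterfactual,
# and irrelevant passage pools are assumed to be prepared separately,
# e.g. from HotpotQA samples.

def build_scenario(kind, gold, counterfactual, irrelevant, k=3, seed=0):
    rng = random.Random(seed)
    if kind == "consistent":
        pool = gold + irrelevant        # gold evidence, padded with distractors
    elif kind == "counterfactual":
        pool = counterfactual + gold    # conflicting evidence mixed in
    elif kind == "irrelevant":
        pool = list(irrelevant)         # no usable evidence at all
    else:
        raise ValueError(f"unknown scenario: {kind}")
    pool = pool[:k]                     # keep k passages, as in retrieval
    rng.shuffle(pool)                   # hide any positional signal
    return pool
```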
Code Example
# Noise-Aware Prompt (ready to use, no fine-tuning required)
SYSTEM_PROMPT = """
You will be asked a question with 3 retrieved passages.
Classify each passage:
- Highly Relevant: directly states or strongly indicates an answer
- Relevant: shares topic/keywords but lacks specific answer info
- Irrelevant: no shared topic or keywords
Rules:
1. If multiple passages are Highly Relevant AND contradictory:
→ Use your own knowledge, report corresponding confidence
2. If exactly one passage is Highly Relevant:
→ Answer based on that passage, report corresponding confidence
3. If no passage is Highly Relevant:
→ Use your own knowledge, report corresponding confidence
Think step by step, classify each passage, then output:
Final Answer: [answer]
Confidence: [0%-100%]
"""
# Usage example
user_message = f"""
Question: {question}
Retrieved Passages:
[P1] {passage1}
[P2] {passage2}
[P3] {passage3}
"""
# Confidence parsing
import re
def parse_confidence(response: str) -> float:
    match = re.search(r'Confidence:\s*(\d+)%', response)
    return int(match.group(1)) / 100 if match else 0.5
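A quick sanity check for the parser above (the function is repeated so the snippet runs on its own):

```python
import re

def parse_confidence(response: str) -> float:
    match = re.search(r'Confidence:\s*(\d+)%', response)
    return int(match.group(1)) / 100 if match else 0.5

sample = "Final Answer: Paris\nConfidence: 85%"
print(parse_confidence(sample))            # 0.85
print(parse_confidence("no match here"))   # falls back to 0.5
```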
Original Abstract
Accurately assessing model confidence is essential for deploying large language models (LLMs) in mission-critical factual domains. While retrieval-augmented generation (RAG) is widely adopted to improve grounding, confidence calibration in RAG settings remains poorly understood. We conduct a systematic study across four benchmarks, revealing that LLMs exhibit poor calibration performance due to noisy retrieved contexts. Specifically, contradictory or irrelevant evidence tends to inflate the model's false certainty, leading to severe overconfidence. To address this, we propose NAACL Rules (Noise-AwAre Confidence CaLibration Rules) to provide a principled foundation for resolving overconfidence under noise. We further design NAACL, a noise-aware calibration framework that synthesizes supervision from about 2K HotpotQA examples guided by these rules. By performing supervised fine-tuning (SFT) with this data, NAACL equips models with intrinsic noise awareness without relying on stronger teacher models. Empirical results show that NAACL yields substantial gains, improving ECE scores by 10.9% in-domain and 8.0% out-of-domain. By bridging the gap between retrieval noise and verbal calibration, NAACL paves the way for both accurate and epistemically reliable LLMs.