NAACL: Noise-AwAre Verbal Confidence Calibration for LLMs in RAG Systems
TL;DR Highlight
When noisy retrieval results leak into a RAG pipeline, LLMs become confidently wrong; this paper fixes that with LoRA fine-tuning on just ~2K examples
Who Should Read
Backend and ML engineers running RAG pipelines in production who want to stop models from confidently giving wrong answers. Especially relevant for teams in domains like healthcare, law, and finance, where model reliability matters.
Core Mechanics
- All models tested in RAG settings (Llama-3.1-8B, Qwen2.5-7B, and the DeepSeek-R1-Distill family) had an average ECE (a confidence-accuracy mismatch metric; lower is better) exceeding 0.4, far above the 0.25 baseline threshold
- When counterfactual passages mix in, models maintain high confidence even in wrong answers — 'overconfidence' persists despite conflicting evidence
- Even irrelevant passages boost confidence — more text means stronger baseless certainty
- Proposes 3 NAACL Rules: (1) use internal knowledge over external when evidence conflicts, (2) ignore irrelevant passages, (3) fall back to parametric knowledge when no valid passages exist
- LoRA SFT on 2K HotpotQA examples teaches the model to judge passage quality and adjust confidence accordingly — no external teacher model needed
- Simple 'Label-only SFT' (just training on answers + confidence labels) is worse than NAACL — confidence improvement comes from noise-aware reasoning, not label fitting
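The three rules above can be sketched as a small routing function. This is an illustrative sketch, not the paper's API: `route_evidence`, the label vocabulary, and the `contradictory` flag are hypothetical names standing in for judgments the model itself makes during noise-aware reasoning.

```python
# Hypothetical sketch of the 3 NAACL Rules as a routing decision.
# `labels` holds per-passage relevance judgments; `contradictory` flags
# whether the highly-relevant passages disagree with each other.

def route_evidence(labels: list[str], contradictory: bool) -> str:
    """Decide whether the answer should rely on retrieved context
    or on the model's own parametric knowledge."""
    highly = [l for l in labels if l == "highly_relevant"]
    if len(highly) >= 2 and contradictory:
        return "parametric"   # Rule 1: conflicting evidence -> internal knowledge
    if len(highly) >= 1:
        return "context"      # Rule 2: valid passage(s) exist, irrelevant ones ignored
    return "parametric"       # Rule 3: no valid passage -> fall back

print(route_evidence(["highly_relevant", "irrelevant", "relevant"], contradictory=False))  # context
```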
Evidence
- After NAACL: in-domain ECE improved by 10.9% on average, out-of-distribution ECE improved by 8.0% (across 4 benchmarks)
- Llama-3.1-8B: ECE dropped from 0.377 to 0.266 (an absolute reduction of ~0.11 vs CoT), AUROC improved from 0.591 to 0.751
- Adding counterfactual noise: Llama-3.1-8B ECE surged 31.6% vs Gold-only, DeepSeek-R1-Distill-Llama-8B surged 35.1%
- Trained with k=3 passages but tested with k=5 still showed 8% ECE improvement over Vanilla — confirming generalization
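For readers who want to reproduce these numbers on their own pipeline, ECE can be computed from (confidence, correctness) pairs with standard equal-width binning. This is a generic sketch of the metric, not the paper's exact evaluation code:

```python
import numpy as np

def expected_calibration_error(confs, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin."""
    confs = np.asarray(confs, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confs > lo) & (confs <= hi)   # half-open bins (lo, hi]
        if mask.any():
            gap = abs(confs[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap          # weight by bin population
    return ece
```

A perfectly calibrated model (90% confidence, 90% accuracy) scores ~0; a model that is 90% confident but always wrong scores ~0.9.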
How to Apply
- Immediately applicable prompt approach: use the paper's noise-aware prompt (Figure 8) as-is — classify passages as 'Highly Relevant / Relevant / Irrelevant' and follow 3 rules for ECE improvement over CoT without fine-tuning
- For fine-tuning: generate counterfactual/consistent/irrelevant scenario passages on 2K HotpotQA samples, then LoRA SFT — use the public code (https://github.com/HKUST-KnowComp/NAACL) with LLaMA-Factory
- When exposing RAG answer confidence to users, having the model explicitly judge passages before outputting confidence makes it interpretable — traceable 'why this confidence level' rather than just a number
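The data-construction step in the fine-tuning bullet above might look roughly like the sketch below. The function name, scenario names, and passage pools are hypothetical; the actual pipeline lives in the linked repo:

```python
import random

# Hedged sketch of assembling the three noise scenarios for SFT data
# (counterfactual / consistent / irrelevant). The gold, counterfactual,
# and irrelevant passage pools are assumed to be prepared separately,
# e.g. from HotpotQA samples.

def build_scenario(kind, gold, counterfactual, irrelevant, k=3, seed=0):
    rng = random.Random(seed)
    if kind == "consistent":
        pool = gold + irrelevant        # gold evidence, padded with distractors
    elif kind == "counterfactual":
        pool = counterfactual + gold    # conflicting evidence mixed in
    elif kind == "irrelevant":
        pool = list(irrelevant)         # no usable evidence at all
    else:
        raise ValueError(f"unknown scenario: {kind}")
    pool = pool[:k]                     # keep k passages, as in retrieval
    rng.shuffle(pool)                   # hide any positional signal
    return pool
```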
Code Example
# Noise-Aware Prompt (ready to use, no fine-tuning required)
SYSTEM_PROMPT = """
You will be asked a question with 3 retrieved passages.
Classify each passage:
- Highly Relevant: directly states or strongly indicates an answer
- Relevant: shares topic/keywords but lacks specific answer info
- Irrelevant: no shared topic or keywords
Rules:
1. If multiple passages are Highly Relevant AND contradictory:
→ Use your own knowledge, report corresponding confidence
2. If exactly one passage is Highly Relevant:
→ Answer based on that passage, report corresponding confidence
3. If no passage is Highly Relevant:
→ Use your own knowledge, report corresponding confidence
Think step by step, classify each passage, then output:
Final Answer: [answer]
Confidence: [0%-100%]
"""
# Usage example
user_message = f"""
Question: {question}
Retrieved Passages:
[P1] {passage1}
[P2] {passage2}
[P3] {passage3}
"""
# Confidence parsing
import re
def parse_confidence(response: str) -> float:
    match = re.search(r'Confidence:\s*(\d+)%', response)
    return int(match.group(1)) / 100 if match else 0.5
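A quick sanity check for the parser above (the function is repeated so the snippet runs on its own):

```python
import re

def parse_confidence(response: str) -> float:
    match = re.search(r'Confidence:\s*(\d+)%', response)
    return int(match.group(1)) / 100 if match else 0.5

sample = "Final Answer: Paris\nConfidence: 85%"
print(parse_confidence(sample))            # 0.85
print(parse_confidence("no match here"))   # falls back to 0.5
```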
Original Abstract
Accurately assessing model confidence is essential for deploying large language models (LLMs) in mission-critical factual domains. While retrieval-augmented generation (RAG) is widely adopted to improve grounding, confidence calibration in RAG settings remains poorly understood. We conduct a systematic study across four benchmarks, revealing that LLMs exhibit poor calibration performance due to noisy retrieved contexts. Specifically, contradictory or irrelevant evidence tends to inflate the model's false certainty, leading to severe overconfidence. To address this, we propose NAACL Rules (Noise-AwAre Confidence CaLibration Rules) to provide a principled foundation for resolving overconfidence under noise. We further design NAACL, a noise-aware calibration framework that synthesizes supervision from about 2K HotpotQA examples guided by these rules. By performing supervised fine-tuning (SFT) with this data, NAACL equips models with intrinsic noise awareness without relying on stronger teacher models. Empirical results show that NAACL yields substantial gains, improving ECE scores by 10.9% in-domain and 8.0% out-of-domain. By bridging the gap between retrieval noise and verbal calibration, NAACL paves the way for both accurate and epistemically reliable LLMs.