ESG-Bench: Benchmarking Long-Context ESG Reports for Hallucination Mitigation
TL;DR Highlight
A benchmark dataset for systematically evaluating and reducing LLM hallucinations when analyzing ESG reports.
Who Should Read
Financial AI practitioners building ESG analysis tools, and researchers studying domain-specific hallucination in LLMs applied to compliance and sustainability reporting.
Core Mechanics
- ESG report analysis is a high-stakes domain where LLM hallucinations can cause real financial and compliance risks
- Introduced a benchmark dataset of ESG reports with ground-truth facts and paired hallucination test cases
- Evaluated major LLMs on ESG factual accuracy and found significant hallucination rates on domain-specific claims
- Identified common hallucination patterns specific to ESG: metric fabrication, regulatory misquoting, timeline errors
- The benchmark enables systematic comparison of hallucination mitigation strategies for ESG tasks
- Results show RAG approaches significantly reduce but don't eliminate hallucination in ESG contexts
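The benchmark's core idea, QA pairs grounded in report context with support labels, can be sketched as follows. This is a hypothetical schema and a deliberately crude exact-match support check for illustration; ESG-Bench's actual format and judging procedure may differ.

```python
from dataclasses import dataclass


@dataclass
class BenchItem:
    context: str      # excerpt from a real ESG report
    question: str
    gold_answer: str  # human-annotated ground-truth answer


def is_supported(model_answer: str, item: BenchItem) -> bool:
    """Crude support check: the gold answer string must appear in the model output."""
    return item.gold_answer.lower() in model_answer.lower()


item = BenchItem(
    context="Company X reduced carbon emissions by 30% in 2023.",
    question="By how much did Company X reduce carbon emissions in 2023?",
    gold_answer="30%",
)
print(is_supported("Emissions fell by 30% in 2023.", item))  # True
print(is_supported("Emissions fell by 45% in 2023.", item))  # False -> flagged as hallucinated
```

A real harness would use fuzzier matching (normalization, entailment, or human labels), but the structure, context plus question plus verifiable gold answer, is what makes hallucination measurable.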
Evidence
- Major LLMs show 20-40% hallucination rates on ESG-specific factual questions
- RAG with ESG document grounding reduces hallucination rates by 40-60%
- Common hallucination types identified through systematic error analysis of major models
- Human expert validation confirms benchmark quality and real-world relevance
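To make the reported ranges concrete, here is the arithmetic for one illustrative point inside them (these are not the paper's exact per-model numbers): a model hallucinating on 30% of questions, with RAG cutting that rate by half, still hallucinates on 15% of questions.

```python
# Illustrative numbers drawn from within the reported ranges, not exact results.
baseline_rate = 0.30           # within the reported 20-40% hallucination range
rag_relative_reduction = 0.50  # within the reported 40-60% relative reduction

residual = baseline_rate * (1 - rag_relative_reduction)
print(f"Residual hallucination rate with RAG: {residual:.0%}")  # 15%
```

The reduction is relative, not absolute, which is why RAG reduces but does not eliminate hallucination.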
How to Apply
- Evaluate your ESG analysis LLM pipeline on this benchmark before deployment
- Check model outputs for the identified hallucination patterns (metric fabrication, regulatory misquoting, timeline errors) in your output validation pipeline
- For ESG applications, always use RAG with source document grounding — the benchmark shows ungrounded LLMs are unreliable for factual ESG claims
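One cheap validation step for the metric-fabrication pattern is to flag any number in the model's answer that never appears in the source document. This is a minimal sketch of the idea, not a method from the paper; `ungrounded_numbers` and its regex are my own simplification, and a production check would normalize units and number formats.

```python
import re


def ungrounded_numbers(answer: str, source: str) -> list[str]:
    """Flag numeric tokens in the answer that do not occur verbatim in the source.

    A hit suggests a possibly fabricated metric; absence of hits is NOT proof
    the answer is correct (the model can misuse a number that is present).
    """
    nums = re.findall(r"\d+(?:\.\d+)?%?", answer)
    return [n for n in nums if n not in source]


source = "Company X reduced carbon emissions by 30% in 2023."
print(ungrounded_numbers("Emissions fell 30% in 2023.", source))  # []
print(ungrounded_numbers("Emissions fell 45% in 2022.", source))  # ['45%', '2022']
```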
Code Example
# 4-step CoT prompt template example
system_prompt = """You are a factual QA assistant. Answer only based on the provided context.
If the answer is not in the context, respond with 'Not provided.'"""

def build_cot_prompt(context: str, question: str) -> str:
    """Assemble a chain-of-thought QA prompt that forces stepwise grounding in the context."""
    return f"""Context:
{context}

Question: {question}

Please answer step by step:
1. Identify the key topic or entity mentioned in the question: [TOPIC]
2. Search the context for sentences or paragraphs relevant to that topic: [RELEVANT_PASSAGES]
3. Determine if the context provides an answer to the question: [ANSWERABLE: yes/no]
4. Based on your reasoning, the correct answer should be: [ANSWER]"""

# Usage example
prompt = build_cot_prompt(
    context="Company X reduced carbon emissions by 30% in 2023...",
    question="What was Company X's carbon emission reduction target?",
)
print(prompt)
Original Abstract
As corporate responsibility increasingly incorporates environmental, social, and governance (ESG) criteria, ESG reporting is becoming a legal requirement in many regions and a key channel for documenting sustainability practices and assessing firms' long-term and ethical performance. However, the length and complexity of ESG disclosures make them difficult to interpret and automate the analysis reliably. To support scalable and trustworthy analysis, this paper introduces ESG-Bench, a benchmark dataset for ESG report understanding and hallucination mitigation in large language models (LLMs). ESG-Bench contains human-annotated question-answer (QA) pairs grounded in real-world ESG report contexts, with fine-grained labels indicating whether model outputs are factually supported or hallucinated. Framing ESG report analysis as a QA task with verifiability constraints enables systematic evaluation of LLMs' ability to extract and reason over ESG content and provides a new use case: mitigating hallucinations in socially sensitive, compliance-critical settings. We design task-specific Chain-of-Thought (CoT) prompting strategies and fine-tune multiple state-of-the-art LLMs on ESG-Bench using CoT-annotated rationales. Our experiments show that these CoT-based methods substantially outperform standard prompting and direct fine-tuning in reducing hallucinations, and that the gains transfer to existing QA benchmarks beyond the ESG domain.