ESG-Bench: Benchmarking Long-Context ESG Reports for Hallucination Mitigation
TL;DR Highlight
A benchmark dataset for systematically evaluating and reducing LLM hallucinations when analyzing ESG reports.
Who Should Read
Financial AI practitioners building ESG analysis tools, and researchers studying domain-specific hallucination in LLMs applied to compliance and sustainability reporting.
Core Mechanics
- ESG report analysis is a high-stakes domain where LLM hallucinations can cause real financial and compliance risks
- Introduced a benchmark dataset of ESG reports with ground-truth facts and paired hallucination test cases
- Evaluated major LLMs on ESG factual accuracy and found significant hallucination rates on domain-specific claims
- Identified common hallucination patterns specific to ESG: metric fabrication, regulatory misquoting, timeline errors
- The benchmark enables systematic comparison of hallucination mitigation strategies for ESG tasks
- Results show RAG approaches significantly reduce but don't eliminate hallucination in ESG contexts
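The benchmark's core idea, QA pairs grounded in report context with support labels, can be sketched as follows. This is a hypothetical schema and a deliberately crude exact-match support check for illustration; ESG-Bench's actual format and judging procedure may differ.

```python
from dataclasses import dataclass


@dataclass
class BenchItem:
    context: str      # excerpt from a real ESG report
    question: str
    gold_answer: str  # human-annotated ground-truth answer


def is_supported(model_answer: str, item: BenchItem) -> bool:
    """Crude support check: the gold answer string must appear in the model output."""
    return item.gold_answer.lower() in model_answer.lower()


item = BenchItem(
    context="Company X reduced carbon emissions by 30% in 2023.",
    question="By how much did Company X reduce carbon emissions in 2023?",
    gold_answer="30%",
)
print(is_supported("Emissions fell by 30% in 2023.", item))  # True
print(is_supported("Emissions fell by 45% in 2023.", item))  # False -> flagged as hallucinated
```

A real harness would use fuzzier matching (normalization, entailment, or human labels), but the structure, context plus question plus verifiable gold answer, is what makes hallucination measurable.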
Evidence
- Major LLMs show 20-40% hallucination rates on ESG-specific factual questions
- RAG with ESG document grounding reduces hallucination rates by 40-60%
- Common hallucination types identified through systematic error analysis of major models
- Human expert validation confirms benchmark quality and real-world relevance
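To make the reported ranges concrete, here is the arithmetic for one illustrative point inside them (these are not the paper's exact per-model numbers): a model hallucinating on 30% of questions, with RAG cutting that rate by half, still hallucinates on 15% of questions.

```python
# Illustrative numbers drawn from within the reported ranges, not exact results.
baseline_rate = 0.30           # within the reported 20-40% hallucination range
rag_relative_reduction = 0.50  # within the reported 40-60% relative reduction

residual = baseline_rate * (1 - rag_relative_reduction)
print(f"Residual hallucination rate with RAG: {residual:.0%}")  # 15%
```

The reduction is relative, not absolute, which is why RAG reduces but does not eliminate hallucination.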
How to Apply
- Evaluate your ESG analysis LLM pipeline on this benchmark before deployment
- Check model outputs for the identified hallucination patterns (metric fabrication, regulatory misquoting, timeline errors) in your output validation pipeline
- For ESG applications, always use RAG with source document grounding — the benchmark shows ungrounded LLMs are unreliable for factual ESG claims
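One cheap validation step for the metric-fabrication pattern is to flag any number in the model's answer that never appears in the source document. This is a minimal sketch of the idea, not a method from the paper; `ungrounded_numbers` and its regex are my own simplification, and a production check would normalize units and number formats.

```python
import re


def ungrounded_numbers(answer: str, source: str) -> list[str]:
    """Flag numeric tokens in the answer that do not occur verbatim in the source.

    A hit suggests a possibly fabricated metric; absence of hits is NOT proof
    the answer is correct (the model can misuse a number that is present).
    """
    nums = re.findall(r"\d+(?:\.\d+)?%?", answer)
    return [n for n in nums if n not in source]


source = "Company X reduced carbon emissions by 30% in 2023."
print(ungrounded_numbers("Emissions fell 30% in 2023.", source))  # []
print(ungrounded_numbers("Emissions fell 45% in 2022.", source))  # ['45%', '2022']
```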
Code Example
# 4-step CoT prompt template example
system_prompt = """You are a factual QA assistant. Answer only based on the provided context.
If the answer is not in the context, respond with 'Not provided.'"""

def build_cot_prompt(context: str, question: str) -> str:
    """Assemble a chain-of-thought QA prompt that forces stepwise grounding in the context."""
    return f"""Context:
{context}

Question: {question}

Please answer step by step:
1. Identify the key topic or entity mentioned in the question: [TOPIC]
2. Search the context for sentences or paragraphs relevant to that topic: [RELEVANT_PASSAGES]
3. Determine if the context provides an answer to the question: [ANSWERABLE: yes/no]
4. Based on your reasoning, the correct answer should be: [ANSWER]"""

# Usage example
prompt = build_cot_prompt(
    context="Company X reduced carbon emissions by 30% in 2023...",
    question="What was Company X's carbon emission reduction target?",
)
print(prompt)
Original Abstract
As corporate responsibility increasingly incorporates environmental, social, and governance (ESG) criteria, ESG reporting is becoming a legal requirement in many regions and a key channel for documenting sustainability practices and assessing firms' long-term and ethical performance. However, the length and complexity of ESG disclosures make them difficult to interpret and automate the analysis reliably. To support scalable and trustworthy analysis, this paper introduces ESG-Bench, a benchmark dataset for ESG report understanding and hallucination mitigation in large language models (LLMs). ESG-Bench contains human-annotated question-answer (QA) pairs grounded in real-world ESG report contexts, with fine-grained labels indicating whether model outputs are factually supported or hallucinated. Framing ESG report analysis as a QA task with verifiability constraints enables systematic evaluation of LLMs' ability to extract and reason over ESG content and provides a new use case: mitigating hallucinations in socially sensitive, compliance-critical settings. We design task-specific Chain-of-Thought (CoT) prompting strategies and fine-tune multiple state-of-the-art LLMs on ESG-Bench using CoT-annotated rationales. Our experiments show that these CoT-based methods substantially outperform standard prompting and direct fine-tuning in reducing hallucinations, and that the gains transfer to existing QA benchmarks beyond the ESG domain.