ESG-Bench: 긴 ESG 보고서에서 Hallucination 완화를 위한 벤치마크

ESG-Bench: Benchmarking Long-Context ESG Reports for Hallucination Mitigation

Mar 13, 2026•Siqi Sun, Ben Peng Wu, Mali Jin +3•View PDF

TL;DR Highlight

ESG 보고서 분석 시 LLM이 사실을 꾸며내는 문제를 체계적으로 평가하고 줄이기 위한 벤치마크 데이터셋.

Who Should Read

ESG 보고서 자동 분석 시스템을 구축하거나, LLM 기반 컴플라이언스 도구에서 hallucination 문제를 해결하려는 백엔드/ML 엔지니어. 금융·지속가능성 분야에서 LLM 신뢰성을 높여야 하는 개발자.

Core Mechanics

ESG 보고서 QA용 벤치마크 ESG-Bench 공개 — 94개 실제 보고서에서 추출한 270개 QA 쌍, 각각 hallucination 레이블(Correct/Hallucination/Incomplete/Not Found) 부여
Hallucination을 두 종류로 분류: 없는 정보를 추가하는 'Additive'와 답이 있는데 못 찾는 'Omissive' — 실무에서 두 케이스 대응 전략이 달라야 함
4-step CoT(Chain-of-Thought) 파인튜닝이 가장 효과적 — 토픽 식별 → 관련 문장 검색 → 답변 가능 여부 판단 → 최종 답변 순으로 구조화
평가 모델: Llama-3.2-3B-Instruct, Gemma-2-2B-it, Mistral-7B-Instruct-v0.3 — 모두 CoT 파인튜닝 후 WoA(답 없는 케이스) 정확도가 크게 향상
GPT-4o의 groundedness(근거 기반) 판단을 프록시 레이블로 활용 가능 — 사람 어노테이션과 80.4% 일치, 대규모 레이블 생성 비용 절감 가능
ESG 도메인 밖(BioASQ, HaluEval)에도 성능 향상 전이 — 도메인 특화 CoT 전략이 일반 long-context QA에도 유효

Evidence

4-step CoT 파인튜닝 후 LLaMA ESG-Bench Balanced Accuracy 96.00% (기본 파인튜닝 SFT 90.67% 대비 향상), WoA Accuracy 99.37%
Mistral 4-step CoT BioASQ Balanced Accuracy 98.25%, WoA Accuracy 99.50% (파인튜닝 없이 93.50%에서 개선)
Gemma 4-step CoT BioASQ Balanced Accuracy 99.50%, WA Accuracy 100% 달성
어노테이터 간 Cohen's Kappa 일치율: Group 3은 86.67%(near-perfect), Group 1, 2는 각각 68.89%, 73.33%(substantial agreement)

How to Apply

답변 생성 시 4-step CoT 프롬프트 적용: '①질문 핵심 토픽 식별 → ②문서에서 관련 단락 검색 → ③답변 가능 여부 판단 → ④최종 답변 생성' 순서로 구조화하면 hallucination이 줄어듦
LLM이 답을 모를 때 'Not provided'를 정확히 반환하게 하려면 WoA(Without Answer) 케이스를 명시적으로 학습 데이터에 포함시키고 SFT 후 CoT 파인튜닝을 단계적으로 적용
대규모 레이블링 비용이 부담된다면 GPT-4o에게 '이 답변이 본문에 근거하나요? yes/no'를 묻는 방식으로 프록시 레이블을 생성해 파인튜닝 데이터로 활용 가능

Code Example

snippet

# 4-step CoT 프롬프트 템플릿 예시
system_prompt = """You are a factual QA assistant. Answer only based on the provided context.
If the answer is not in the context, respond with 'Not provided.'"""

def build_cot_prompt(context: str, question: str) -> str:
    return f"""Context:
{context}

Question: {question}

Please answer step by step:
1. Identify the key topic or entity mentioned in the question: [TOPIC]
2. Search the context for sentences or paragraphs relevant to that topic: [RELEVANT_PASSAGES]
3. Determine if the context provides an answer to the question: [ANSWERABLE: yes/no]
4. Based on your reasoning, the correct answer should be: [ANSWER]"""

# 사용 예시
prompt = build_cot_prompt(
    context="Company X reduced carbon emissions by 30% in 2023...",
    question="What was Company X's carbon emission reduction target?"
)
print(prompt)

Terminology

HallucinationLLM이 문서에 없는 내용을 마치 사실인 것처럼 만들어내는 현상. 없는 인용을 추가하거나 수치를 임의로 생성하는 것이 대표적 예.

CoT (Chain-of-Thought)모델이 최종 답변 전에 중간 추론 단계를 거치도록 유도하는 기법. 수학 문제 풀 때 '풀이 과정'을 쓰게 하는 것과 비슷.

SFT (Supervised Fine-tuning)정답 예시를 보여주고 따라하게 하는 학습법. 모범답안을 반복적으로 학습시켜 모델 행동을 교정함.

WA/WoA (With Answer / Without Answer)문서에 답이 있는 케이스(WA)와 없는 케이스(WoA)를 구분하는 평가 기준. WoA에서 모델이 'Not provided'를 정확히 반환해야 hallucination이 줄어듦.

Cohen's Kappa두 사람이 같은 항목을 얼마나 일관되게 평가했는지 측정하는 통계 지표. 단순 일치율과 달리 우연히 맞춘 경우를 보정함.

Groundedness모델의 답변이 주어진 문서나 컨텍스트에 실제로 근거하고 있는 정도. 근거 없는 답변이 많을수록 신뢰하기 어려운 모델.

ESG (Environmental, Social, Governance)기업의 환경(탄소 배출 등), 사회(노동 환경 등), 지배구조(투명성 등)를 평가하는 비재무적 지표. 최근 EU 규정 등으로 공시 의무화 추세.

Related Resources

Original Abstract (Expand)

As corporate responsibility increasingly incorporates environmental, social, and governance (ESG) criteria, ESG reporting is becoming a legal requirement in many regions and a key channel for documenting sustainability practices and assessing firms' long-term and ethical performance. However, the length and complexity of ESG disclosures make them difficult to interpret and automate the analysis reliably. To support scalable and trustworthy analysis, this paper introduces ESG-Bench, a benchmark dataset for ESG report understanding and hallucination mitigation in large language models (LLMs). ESG-Bench contains human-annotated question-answer (QA) pairs grounded in real-world ESG report contexts, with fine-grained labels indicating whether model outputs are factually supported or hallucinated. Framing ESG report analysis as a QA task with verifiability constraints enables systematic evaluation of LLMs' ability to extract and reason over ESG content and provides a new use case: mitigating hallucinations in socially sensitive, compliance-critical settings. We design task-specific Chain-of-Thought (CoT) prompting strategies and fine-tune multiple state-of-the-art LLMs on ESG-Bench using CoT-annotated rationales. Our experiments show that these CoT-based methods substantially outperform standard prompting and direct fine-tuning in reducing hallucinations, and that the gains transfer to existing QA benchmarks beyond the ESG domain.