Mitigating Prompt-Induced Hallucinations in Large Language Models via Structured Reasoning
TL;DR Highlight
A method that embeds Knowledge Graph traversal code directly into Chain-of-Thought prompts, reducing hallucinations in GPT-4 and LLaMA 3.3 and improving HIT@1 by more than 15 percentage points.
Who Should Read
AI backend developers dealing with hallucination issues in LLM-based QA systems or knowledge retrieval pipelines — especially those who already have a Knowledge Graph or are designing structured reasoning pipelines.
Core Mechanics
- Adds a code module to the knowledge distillation chain-style model (KDCM) to structure Knowledge Graph traversal, blocking the error propagation that occurs when the model reasons in natural language alone
- Embedding code within Chain-of-Thought prompts explicitly injects external knowledge, making the model less reliant on internal guesswork
- The processing flow has three stages: decompose the prompt into subtasks → constrain intermediate reasoning steps with code and the Knowledge Graph → generate the final answer from the verified intermediate results
- Validated on 5 public datasets — WebQSP, CWQ, GSM8K, MWP, and Dr.SPIDER — using GPT-4 and LLaMA 3.3
- Approximately 7–8%p higher HIT@1 performance compared to conventional RAG or Self-Check approaches, with especially large gains on multi-step math problems
- Performance is maintained even with ambiguous or incomplete prompts — structural reasoning acts as a buffer against noise
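The three-stage flow above can be sketched with a toy in-memory graph. Everything here is illustrative, not from the paper: the entity names, the dict-based KG, and the hardcoded decomposition stand in for what would be LLM calls and a real graph store.

```python
# Minimal sketch of the 3-stage flow: decompose -> constrain with KG -> answer.
# Toy Knowledge Graph: (subject, relation) -> object
KG = {
    ("Paris", "capital_of"): "France",
    ("France", "continent"): "Europe",
}

def decompose(question):
    # Stage 1: split the question into subtasks.
    # Hardcoded for this demo; in practice an LLM call would produce these.
    return [("Paris", "capital_of"), ("France", "continent")]

def constrain_with_kg(subtasks):
    # Stage 2: resolve each subtask against the KG instead of letting the
    # model guess; unresolvable steps are flagged rather than hallucinated.
    results = []
    for subj, rel in subtasks:
        obj = KG.get((subj, rel))
        results.append((subj, rel, obj if obj is not None else "UNRESOLVED"))
    return results

def final_answer(verified):
    # Stage 3: generate the answer only from verified intermediate facts.
    facts = "; ".join(f"{s} {r} {o}" for s, r, o in verified)
    return f"Verified facts: {facts}"

steps = constrain_with_kg(decompose("Which continent is Paris's country in?"))
print(final_answer(steps))
```

The key design point is that stage 2 never passes an unverified guess forward; anything the KG cannot resolve is surfaced as `UNRESOLVED` instead of being filled in by the model.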
Evidence
- Adding the code module on top of KDCM improved WebQSP HIT@1 from 82.36% to 99.33% (+16.97%p) and CWQ from 81.36% to 97.86% (+16.50%p)
- Across 5 datasets: average HIT@1 98.40%, HIT@3 96.83%, HIT@5 95.51% — compared to RAG baselines of 90.23%, 90.28%, and 90.18% respectively
- In generalization tests, the proposed method achieved HIT@1 99.18% vs RAG 90.36% and Self-Check 90.28%
- Overall improvements reported in the paper: HIT@1 +15.64%, HIT@3 +13.38%, HIT@5 +13.28% (relative to baseline KDCM)
How to Apply
- Before sending a question to the LLM, query relevant entities and relationships from a Knowledge Graph (e.g., Virtuoso, Neo4j) using code, and insert the results as structured context at the beginning of the CoT prompt — similar to how RAG appends document chunks
- Add a prompt stage that explicitly decomposes complex questions into subtasks, then design the pipeline as a chaining structure that validates each intermediate conclusion against the Knowledge Graph before passing it to the next step
- Even without Knowledge Graph infrastructure, the idea can be partially applied — use a 'Code-first CoT' pattern where verifiable facts are computed with code (Python/SQL) before generating an answer, and those results are included in the prompt
Code Example
# Code-first CoT pattern example (partial application without a Knowledge Graph)
# Question: "As of 2024, how many times larger is South Korea's population than France's?"
# Step 1: Compute verifiable facts with code first
facts = {
    "korea_population": 51_700_000,   # approximate 2024 figure
    "france_population": 68_000_000,  # approximate 2024 figure
}
ratio = facts["korea_population"] / facts["france_population"]
# Step 2: Include the computed results in the CoT prompt
prompt = f"""
[Structured Facts - Verified]
- South Korea population: {facts['korea_population']:,}
- France population: {facts['france_population']:,}
- Computed ratio: {ratio:.3f}
[Reasoning Instructions]
Based on the verified figures above, answer step by step.
1. Interpret the ratio: ...
2. Final answer: ...
"""
# Step 3: Call the LLM (`llm` is a placeholder for your client object)
response = llm.complete(prompt)
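For teams that do have a Knowledge Graph, the first "How to Apply" step can be sketched as below. The fetch function and its records are mocked; in practice they would come from a Cypher query via the official neo4j Python driver (indicated in the comment), and the entity/relation names are illustrative.

```python
# Sketch of the KG-backed variant: query entities and relations first,
# then prepend them as structured context to the CoT prompt.

def fetch_kg_context(entity):
    # Mocked results. A real implementation might use the neo4j driver, e.g.:
    #   with driver.session() as session:
    #       session.run("MATCH (e {name: $n})-[r]->(t) "
    #                   "RETURN type(r), t.name", n=entity)
    return [("capital_of", "France"), ("population", "2,100,000")]

def build_prompt(question, entity):
    triples = fetch_kg_context(entity)
    context = "\n".join(f"- {entity} --{rel}--> {obj}" for rel, obj in triples)
    return (
        "[Structured Facts - from Knowledge Graph]\n"
        f"{context}\n"
        "[Reasoning Instructions]\n"
        "Answer step by step using only the verified facts above.\n"
        f"Question: {question}"
    )

prompt = build_prompt("What country is Paris the capital of?", "Paris")
print(prompt)
```

As with RAG document chunks, the structured facts go at the beginning of the prompt so the model conditions on verified context before it starts reasoning.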
Original Abstract
To address hallucination issues in large language models (LLMs), this paper proposes a method for mitigating prompt-induced hallucinations. Building on a knowledge distillation chain-style model, we introduce a code module to guide knowledge-graph exploration and incorporate code as part of the chain-of-thought prompt, forming an external knowledge input that provides more accurate and structured information to the model. Based on this design, we develop an improved knowledge distillation chain-style model and leverage it to analyze and constrain the reasoning process of LLMs, thereby improving inference accuracy. We empirically evaluate the proposed approach using GPT-4 and LLaMA-3.3 on multiple public datasets. Experimental results demonstrate that incorporating code modules significantly enhances the model's ability to capture contextual information and effectively mitigates prompt-induced hallucinations. Specifically, HIT@1, HIT@3, and HIT@5 improve by 15.64%, 13.38%, and 13.28%, respectively. Moreover, the proposed method achieves HIT@1, HIT@3, and HIT@5 scores exceeding 95% across several evaluation settings. These results indicate that the proposed approach substantially reduces hallucination behavior while improving the accuracy and verifiability of large language models.