Mitigating Prompt-Induced Hallucinations in Large Language Models via Structured Reasoning
TL;DR Highlight
A method that embeds Knowledge Graph traversal code directly into Chain-of-Thought prompts, reducing hallucinations in GPT-4 and LLaMA 3.3 and improving HIT@1 by more than 15 percentage points.
Who Should Read
AI backend developers dealing with hallucination issues in LLM-based QA systems or knowledge retrieval pipelines — especially those who already have a Knowledge Graph or are designing structured reasoning pipelines.
Core Mechanics
- Adds a code module to the knowledge distillation chain-style model (KDCM) to structure Knowledge Graph traversal, blocking the error propagation that occurs when the model reasons in natural language alone
- Embedding code within Chain-of-Thought prompts explicitly injects external knowledge, making the model less reliant on internal guesswork
- The processing flow has three stages: decompose the prompt into subtasks → constrain intermediate reasoning steps with code and the Knowledge Graph → generate the final answer from the verified intermediate results
- Validated on 5 public datasets — WebQSP, CWQ, GSM8K, MWP, and Dr.SPIDER — using GPT-4 and LLaMA 3.3
- Approximately 7–8%p higher HIT@1 performance compared to conventional RAG or Self-Check approaches, with especially large gains on multi-step math problems
- Performance is maintained even with ambiguous or incomplete prompts — structural reasoning acts as a buffer against noise
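The three-stage flow above can be sketched with a toy in-memory graph. Everything here is illustrative, not from the paper: the entity names, the dict-based KG, and the hardcoded decomposition stand in for what would be LLM calls and a real graph store.

```python
# Minimal sketch of the 3-stage flow: decompose -> constrain with KG -> answer.
# Toy Knowledge Graph: (subject, relation) -> object
KG = {
    ("Paris", "capital_of"): "France",
    ("France", "continent"): "Europe",
}

def decompose(question):
    # Stage 1: split the question into subtasks.
    # Hardcoded for this demo; in practice an LLM call would produce these.
    return [("Paris", "capital_of"), ("France", "continent")]

def constrain_with_kg(subtasks):
    # Stage 2: resolve each subtask against the KG instead of letting the
    # model guess; unresolvable steps are flagged rather than hallucinated.
    results = []
    for subj, rel in subtasks:
        obj = KG.get((subj, rel))
        results.append((subj, rel, obj if obj is not None else "UNRESOLVED"))
    return results

def final_answer(verified):
    # Stage 3: generate the answer only from verified intermediate facts.
    facts = "; ".join(f"{s} {r} {o}" for s, r, o in verified)
    return f"Verified facts: {facts}"

steps = constrain_with_kg(decompose("Which continent is Paris's country in?"))
print(final_answer(steps))
```

The key design point is that stage 2 never passes an unverified guess forward; anything the KG cannot resolve is surfaced as `UNRESOLVED` instead of being filled in by the model.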
Evidence
- Adding the code module on top of KDCM improved WebQSP HIT@1 from 82.36% to 99.33% (+16.97%p) and CWQ from 81.36% to 97.86% (+16.50%p)
- Across 5 datasets: average HIT@1 98.40%, HIT@3 96.83%, HIT@5 95.51% — compared to RAG baselines of 90.23%, 90.28%, and 90.18% respectively
- In generalization tests, the proposed method achieved HIT@1 99.18% vs RAG 90.36% and Self-Check 90.28%
- Overall improvements reported in the paper: HIT@1 +15.64%, HIT@3 +13.38%, HIT@5 +13.28% (relative to baseline KDCM)
How to Apply
- Before sending a question to the LLM, query relevant entities and relationships from a Knowledge Graph (e.g., Virtuoso, Neo4j) using code, and insert the results as structured context at the beginning of the CoT prompt — similar to how RAG appends document chunks
- Add a prompt stage that explicitly decomposes complex questions into subtasks, then design the pipeline as a chaining structure that validates each intermediate conclusion against the Knowledge Graph before passing it to the next step
- Even without Knowledge Graph infrastructure, the idea can be partially applied — use a 'Code-first CoT' pattern where verifiable facts are computed with code (Python/SQL) before generating an answer, and those results are included in the prompt
Code Example
# Code-first CoT pattern example (partial application without a Knowledge Graph)
# Question: "As of 2024, how many times larger is South Korea's population than France's?"
# Step 1: Compute verifiable facts with code first
facts = {
    "korea_population": 51_700_000,   # approximate 2024 figure
    "france_population": 68_000_000,  # approximate 2024 figure
}
ratio = facts["korea_population"] / facts["france_population"]
# Step 2: Include the computed results in the CoT prompt
prompt = f"""
[Structured Facts - Verified]
- South Korea population: {facts['korea_population']:,}
- France population: {facts['france_population']:,}
- Computed ratio: {ratio:.3f}
[Reasoning Instructions]
Based on the verified figures above, answer step by step.
1. Interpret the ratio: ...
2. Final answer: ...
"""
# Step 3: Call the LLM (`llm` is a placeholder for your client object)
response = llm.complete(prompt)
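For teams that do have a Knowledge Graph, the first "How to Apply" step can be sketched as below. The fetch function and its records are mocked; in practice they would come from a Cypher query via the official neo4j Python driver (indicated in the comment), and the entity/relation names are illustrative.

```python
# Sketch of the KG-backed variant: query entities and relations first,
# then prepend them as structured context to the CoT prompt.

def fetch_kg_context(entity):
    # Mocked results. A real implementation might use the neo4j driver, e.g.:
    #   with driver.session() as session:
    #       session.run("MATCH (e {name: $n})-[r]->(t) "
    #                   "RETURN type(r), t.name", n=entity)
    return [("capital_of", "France"), ("population", "2,100,000")]

def build_prompt(question, entity):
    triples = fetch_kg_context(entity)
    context = "\n".join(f"- {entity} --{rel}--> {obj}" for rel, obj in triples)
    return (
        "[Structured Facts - from Knowledge Graph]\n"
        f"{context}\n"
        "[Reasoning Instructions]\n"
        "Answer step by step using only the verified facts above.\n"
        f"Question: {question}"
    )

prompt = build_prompt("What country is Paris the capital of?", "Paris")
print(prompt)
```

As with RAG document chunks, the structured facts go at the beginning of the prompt so the model conditions on verified context before it starts reasoning.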
Original Abstract
To address hallucination issues in large language models (LLMs), this paper proposes a method for mitigating prompt-induced hallucinations. Building on a knowledge distillation chain-style model, we introduce a code module to guide knowledge-graph exploration and incorporate code as part of the chain-of-thought prompt, forming an external knowledge input that provides more accurate and structured information to the model. Based on this design, we develop an improved knowledge distillation chain-style model and leverage it to analyze and constrain the reasoning process of LLMs, thereby improving inference accuracy. We empirically evaluate the proposed approach using GPT-4 and LLaMA-3.3 on multiple public datasets. Experimental results demonstrate that incorporating code modules significantly enhances the model's ability to capture contextual information and effectively mitigates prompt-induced hallucinations. Specifically, HIT@1, HIT@3, and HIT@5 improve by 15.64%, 13.38%, and 13.28%, respectively. Moreover, the proposed method achieves HIT@1, HIT@3, and HIT@5 scores exceeding 95% across several evaluation settings. These results indicate that the proposed approach substantially reduces hallucination behavior while improving the accuracy and verifiability of large language models.