StructGPT: A General Framework for Large Language Model to Reason over Structured Data
TL;DR Highlight
A framework that attaches dedicated interfaces so LLMs can directly read and reason over Knowledge Graphs, tables, and databases.
Who Should Read
Backend/ML developers wanting to build QA systems by connecting Knowledge Graphs or relational DBs to LLMs. Anyone wondering how to get structured data into LLM prompts.
Core Mechanics
- Solves the problem of LLMs struggling to directly understand structured data (KG, tables, DBs) with a 'dedicated interface + iterative read-and-reason' loop
- Treats structured data as a black box, designing data access functions per type (e.g., Extract_Neighbor_Relations, Extract_Columns) as API-like interfaces
- Each iteration: call interface → convert result to text (linearization) → LLM generates — no need to dump the entire dataset at once
- Experiments with both ChatGPT (gpt-3.5-turbo) and Davinci-003 (text-davinci-003) in zero-shot / few-shot settings
- Achieves near full-data supervised model performance without fine-tuning
- Error analysis: the KGQA bottleneck is relation selection failure (74% of errors); the Text-to-SQL bottleneck is reasoning errors (63%)
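The invoke → linearize → generate loop described above can be sketched as follows (a minimal sketch: the interface stubs and the keyword-matching stand-in for the LLM are illustrative, not the paper's code):

```python
def linearize(records):
    """Turn structured records into flat text for the prompt."""
    return '; '.join(str(r) for r in records)

def irr_loop(question, interfaces, llm):
    """Iterative Reading-then-Reasoning: per step, invoke one data-access
    interface, linearize its result, and let the LLM reason over the text."""
    context = ''
    answer = None
    for invoke in interfaces:            # one dedicated interface per iteration
        evidence = invoke(context)       # read: fetch only the relevant slice
        context = linearize(evidence)    # linearize: structured -> text
        answer = llm(question, context)  # generate: reason over the text
    return answer

# Toy run: two stub interfaces (candidate relations, then triples) and a
# keyword-matching function standing in for a real LLM call.
steps = [
    lambda ctx: ['capital_of', 'population'],
    lambda ctx: [('France', 'capital_of', 'Paris')],
]
stub_llm = lambda q, ctx: 'Paris' if 'Paris' in ctx else ctx
answer = irr_loop('What is the capital of France?', steps, stub_llm)  # 'Paris'
```

The key point is that each iteration sends the LLM only the linearized output of one interface call, never the full dataset.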
Evidence
- KGQA WebQSP: ChatGPT zero-shot 61.2 → StructGPT 72.6 (+11.4% Hits@1)
- TableQA TabFact: ChatGPT zero-shot 82.9 → StructGPT few-shot 87.6 (+4.7% accuracy)
- Text-to-SQL Spider: ChatGPT zero-shot 70.1 → StructGPT few-shot 77.8 (+7.7% execution accuracy)
- ChatGPT August version: WebQSP 62.1→75.3, WTQ 41.1→50.4, Spider 75.2→77.1 — consistent improvement regardless of version
How to Apply
- For DB-connected QA: chain LLM calls as Extract_Table&Column_Name → select relevant tables → Extract_Tables_Information → generate SQL. No need to cram the entire schema into the prompt at once.
- For Knowledge Graph search systems: first extract only neighbor relations for an entity and pass to LLM for filtering, then extract triples via selected relations — 2-step calls reduce LLM context waste.
- For table QA: make the LLM progressively narrow scope as column names → relevant column contents → sub-table, enabling long table processing without context overflow.
Code Example
# StructGPT-style KG reasoning chain example (pseudo-code)
def structgpt_kgqa(question, topic_entity, kg):
    # Step 1: Extract neighbor relations of the topic entity
    relations = kg.extract_neighbor_relations(topic_entity)
    linearized = ', '.join(relations)
    prompt1 = f"""
The candidate relations of '{topic_entity}': {linearized}.
The question is: {question}
Provide only one relevant relation that's present in the candidates.
Relevant relation:"""
    selected_relation = llm_generate(prompt1).strip()
    # Step 2: Extract triples via the selected relation
    triples = kg.extract_triples(topic_entity, [selected_relation])
    linearized_triples = '; '.join(f'({e}, {r}, {t})' for e, r, t in triples)
    prompt2 = f"""
The triples are: {linearized_triples}
Based on these triples, answer the question: {question}
Provide only one answer entity.
Answer:"""
    answer = llm_generate(prompt2).strip()
    return answer
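The table QA chain from "How to Apply" (column names → relevant columns → sub-table) can be sketched the same way. This is a runnable sketch under assumed interfaces: the table is a list of row dicts and `llm_generate` is injected as a parameter; neither matches the paper's actual code.

```python
def structgpt_tableqa(question, table, llm_generate):
    """Table QA chain: narrow from column names to a linearized sub-table."""
    # Step 1: Extract column names and let the LLM pick the relevant ones
    columns = list(table[0].keys())  # table: list of row dicts (assumption)
    prompt1 = (f"The table columns are: {', '.join(columns)}.\n"
               f"Question: {question}\n"
               "List the relevant columns, comma-separated:")
    selected = [c.strip() for c in llm_generate(prompt1).split(',')
                if c.strip() in columns]
    # Step 2: Build and linearize a sub-table with only the selected columns
    sub_table = [{c: row[c] for c in selected} for row in table]
    linearized = '\n'.join(', '.join(f'{c}: {row[c]}' for c in selected)
                           for row in sub_table)
    # Step 3: Answer from the linearized sub-table only
    prompt2 = (f"Sub-table:\n{linearized}\n"
               f"Answer the question: {question}\nAnswer:")
    return llm_generate(prompt2).strip()

# Toy demo: a two-reply stub stands in for the LLM (picks 'city', answers 'Paris').
_replies = iter(['city', 'Paris'])
demo_answer = structgpt_tableqa(
    'Which city has population 2.1M?',
    [{'city': 'Paris', 'population': '2.1M'},
     {'city': 'Lyon', 'population': '0.5M'}],
    lambda prompt: next(_replies),
)
```

Only the selected columns reach the second prompt, which is what keeps long tables inside the context window.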
# Text-to-SQL chain example (pseudo-code)
def structgpt_text2sql(question, db):
    # Step 1: Extract all table/column names
    schema_summary = db.extract_table_column_names()
    prompt1 = f"""
Here are the SQLite tables: {schema_summary}
Question: {question}
Which tables do you need to complete the SQLite SQL query?
Tables:"""
    selected_tables = llm_generate(prompt1).strip().split(', ')
    # Step 2: Extract detailed information (including foreign keys) for the selected tables
    table_info = db.extract_tables_information(selected_tables)
    prompt2 = f"""
### SQLite tables with their properties:
{table_info}
### Question: {question}
### Complete SQLite SQL query only with no explanation:
SELECT"""
    sql = 'SELECT ' + llm_generate(prompt2).strip()
    return sql
Terminology
Related Resources
Original Abstract
In this paper, we study how to improve the zero-shot reasoning ability of large language models (LLMs) over structured data in a unified way. Inspired by the study on tool augmentation for LLMs, we develop an Iterative Reading-then-Reasoning (IRR) approach for solving question answering tasks based on structured data, called StructGPT. In our approach, we construct the specialized function to collect relevant evidence from structured data (i.e., reading), and let LLMs concentrate the reasoning task based on the collected information (i.e., reasoning). Specially, we propose an invoking-linearization-generation procedure to support LLMs in reasoning on the structured data with the help of the external interfaces. By iterating this procedure with provided interfaces, our approach can gradually approach the target answer to a given query. Extensive experiments conducted on three types of structured data demonstrate the effectiveness of our approach, which can significantly boost the performance of ChatGPT and achieve comparable performance against the full-data supervised-tuning baselines. Our codes and data are publicly available at https://github.com/RUCAIBox/StructGPT.