StructGPT: A General Framework for Large Language Model to Reason over Structured Data
TL;DR Highlight
A framework that attaches dedicated interfaces so LLMs can directly read and reason over Knowledge Graphs, tables, and databases.
Who Should Read
Backend/ML developers wanting to build QA systems by connecting Knowledge Graphs or relational DBs to LLMs. Anyone wondering how to get structured data into LLM prompts.
Core Mechanics
- Solves the problem of LLMs struggling to directly understand structured data (KG, tables, DBs) with a 'dedicated interface + iterative read-and-reason' loop
- Treats structured data as a black box, designing data access functions per type (e.g., Extract_Neighbor_Relations, Extract_Columns) as API-like interfaces
- Each iteration: call interface → convert result to text (linearization) → LLM generates — no need to dump the entire dataset at once
- Experiments with both ChatGPT (gpt-3.5-turbo) and Davinci-003 (text-davinci-003) in zero-shot / few-shot settings
- Achieves near full-data supervised model performance without fine-tuning
- Error analysis: the KGQA bottleneck is relation selection failure (74% of errors); the Text-to-SQL bottleneck is reasoning errors (63%)
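The invoke → linearize → generate loop described above can be sketched as follows (a minimal sketch: the interface stubs and the keyword-matching stand-in for the LLM are illustrative, not the paper's code):

```python
def linearize(records):
    """Turn structured records into flat text for the prompt."""
    return '; '.join(str(r) for r in records)

def irr_loop(question, interfaces, llm):
    """Iterative Reading-then-Reasoning: per step, invoke one data-access
    interface, linearize its result, and let the LLM reason over the text."""
    context = ''
    answer = None
    for invoke in interfaces:            # one dedicated interface per iteration
        evidence = invoke(context)       # read: fetch only the relevant slice
        context = linearize(evidence)    # linearize: structured -> text
        answer = llm(question, context)  # generate: reason over the text
    return answer

# Toy run: two stub interfaces (candidate relations, then triples) and a
# keyword-matching function standing in for a real LLM call.
steps = [
    lambda ctx: ['capital_of', 'population'],
    lambda ctx: [('France', 'capital_of', 'Paris')],
]
stub_llm = lambda q, ctx: 'Paris' if 'Paris' in ctx else ctx
answer = irr_loop('What is the capital of France?', steps, stub_llm)  # 'Paris'
```

The key point is that each iteration sends the LLM only the linearized output of one interface call, never the full dataset.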
Evidence
- KGQA WebQSP: ChatGPT zero-shot 61.2 → StructGPT 72.6 (+11.4% Hits@1)
- TableQA TabFact: ChatGPT zero-shot 82.9 → StructGPT few-shot 87.6 (+4.7% accuracy)
- Text-to-SQL Spider: ChatGPT zero-shot 70.1 → StructGPT few-shot 77.8 (+7.7% execution accuracy)
- ChatGPT August version: WebQSP 62.1→75.3, WTQ 41.1→50.4, Spider 75.2→77.1 — consistent improvement regardless of version
How to Apply
- For DB-connected QA: chain LLM calls as Extract_Table&Column_Name → select relevant tables → Extract_Tables_Information → generate SQL. No need to cram the entire schema into the prompt at once.
- For Knowledge Graph search systems: first extract only neighbor relations for an entity and pass to LLM for filtering, then extract triples via selected relations — 2-step calls reduce LLM context waste.
- For table QA: make the LLM progressively narrow scope as column names → relevant column contents → sub-table, enabling long table processing without context overflow.
Code Example
# StructGPT-style KG reasoning chain example (pseudo-code)
def structgpt_kgqa(question, topic_entity, kg):
    # Step 1: Extract neighbor relations of the topic entity
    relations = kg.extract_neighbor_relations(topic_entity)
    linearized = ', '.join(relations)
    prompt1 = f"""
The candidate relations of '{topic_entity}': {linearized}.
The question is: {question}
Provide only one relevant relation that's present in the candidates.
Relevant relation:"""
    selected_relation = llm_generate(prompt1).strip()
    # Step 2: Extract triples via the selected relation
    triples = kg.extract_triples(topic_entity, [selected_relation])
    linearized_triples = '; '.join(f'({e}, {r}, {t})' for e, r, t in triples)
    prompt2 = f"""
The triples are: {linearized_triples}
Based on these triples, answer the question: {question}
Provide only one answer entity.
Answer:"""
    answer = llm_generate(prompt2).strip()
    return answer
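The table QA chain from "How to Apply" (column names → relevant columns → sub-table) can be sketched the same way. This is a runnable sketch under assumed interfaces: the table is a list of row dicts and `llm_generate` is injected as a parameter; neither matches the paper's actual code.

```python
def structgpt_tableqa(question, table, llm_generate):
    """Table QA chain: narrow from column names to a linearized sub-table."""
    # Step 1: Extract column names and let the LLM pick the relevant ones
    columns = list(table[0].keys())  # table: list of row dicts (assumption)
    prompt1 = (f"The table columns are: {', '.join(columns)}.\n"
               f"Question: {question}\n"
               "List the relevant columns, comma-separated:")
    selected = [c.strip() for c in llm_generate(prompt1).split(',')
                if c.strip() in columns]
    # Step 2: Build and linearize a sub-table with only the selected columns
    sub_table = [{c: row[c] for c in selected} for row in table]
    linearized = '\n'.join(', '.join(f'{c}: {row[c]}' for c in selected)
                           for row in sub_table)
    # Step 3: Answer from the linearized sub-table only
    prompt2 = (f"Sub-table:\n{linearized}\n"
               f"Answer the question: {question}\nAnswer:")
    return llm_generate(prompt2).strip()

# Toy demo: a two-reply stub stands in for the LLM (picks 'city', answers 'Paris').
_replies = iter(['city', 'Paris'])
demo_answer = structgpt_tableqa(
    'Which city has population 2.1M?',
    [{'city': 'Paris', 'population': '2.1M'},
     {'city': 'Lyon', 'population': '0.5M'}],
    lambda prompt: next(_replies),
)
```

Only the selected columns reach the second prompt, which is what keeps long tables inside the context window.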
# Text-to-SQL chain example (pseudo-code)
def structgpt_text2sql(question, db):
    # Step 1: Extract all table/column names
    schema_summary = db.extract_table_column_names()
    prompt1 = f"""
Here are the SQLite tables: {schema_summary}
Question: {question}
Which tables do you need to complete the SQLite SQL query?
Tables:"""
    selected_tables = llm_generate(prompt1).strip().split(', ')
    # Step 2: Extract detailed information (including foreign keys) for the selected tables
    table_info = db.extract_tables_information(selected_tables)
    prompt2 = f"""
### SQLite tables with their properties:
{table_info}
### Question: {question}
### Complete SQLite SQL query only with no explanation:
SELECT"""
    sql = 'SELECT ' + llm_generate(prompt2).strip()
    return sql
Terminology
Related Resources
Original Abstract
In this paper, we study how to improve the zero-shot reasoning ability of large language models (LLMs) over structured data in a unified way. Inspired by the study on tool augmentation for LLMs, we develop an Iterative Reading-then-Reasoning (IRR) approach for solving question answering tasks based on structured data, called StructGPT. In our approach, we construct the specialized function to collect relevant evidence from structured data (i.e., reading), and let LLMs concentrate the reasoning task based on the collected information (i.e., reasoning). Specially, we propose an invoking-linearization-generation procedure to support LLMs in reasoning on the structured data with the help of the external interfaces. By iterating this procedure with provided interfaces, our approach can gradually approach the target answer to a given query. Extensive experiments conducted on three types of structured data demonstrate the effectiveness of our approach, which can significantly boost the performance of ChatGPT and achieve comparable performance against the full-data supervised-tuning baselines. Our codes and data are publicly available at https://github.com/RUCAIBox/StructGPT.