A Survey on Open Information Extraction from Rule-based Model to Large Language Model
TL;DR Highlight
A survey covering the evolution of OpenIE — extracting relation triples from unstructured text — from 2007 to 2024.
Who Should Read
NLP engineers building Knowledge Graph construction or information extraction pipelines. Developers designing systems to extract structured data from text using LLMs.
Core Mechanics
- OpenIE automatically extracts (entity1, relation, entity2) triples from text without predefined categories — critical for QA, search engines, and Knowledge Graph completion
- Task settings fall into three categories: ORTE (direct triple extraction), ORSE (extracting the relation span given a pair of entities), and ORC (clustering instances by relation) — each with different trade-offs depending on the use case
- ChatGPT/GPT-4 zero-shot performance is impressive but still below supervised SOTA (especially hallucination issues on long-tail relations)
- Few-shot ICL (In-Context Learning) and instruction tuning can improve LLM performance on OpenIE — GPT-3.5-TURBO few-shot achieves CaRB F1 52.1
- LLMs are blurring the line between OpenIE and standard IE — the same model can handle both tasks with different schema designs
- Code-LLM approaches (CodeIE, GoLLIE) are emerging as more effective than natural language schemas for structured output generation
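To make the Code-LLM idea concrete, here is a minimal sketch of what a code-style prompt might look like in the spirit of CodeIE/GoLLIE. The class name, field names, and prompt wording are illustrative assumptions, not the papers' exact formats — the point is that the schema is expressed as typed Python rather than natural language.

```python
from dataclasses import dataclass

# Illustrative schema in the CodeIE/GoLLIE spirit: extraction is framed as
# completing typed Python objects instead of producing free-form text.
@dataclass
class Triple:
    subject: str   # head entity span, copied verbatim from the text
    relation: str  # open relation phrase, also drawn from the text
    object: str    # tail entity span

def build_code_prompt(text: str) -> str:
    """Render the schema plus the input sentence as a code-completion prompt."""
    schema = (
        "@dataclass\n"
        "class Triple:\n"
        "    subject: str\n"
        "    relation: str\n"
        "    object: str\n"
    )
    return (
        f"{schema}\n"
        f'text = "{text}"\n'
        "# Complete the list of extracted triples:\n"
        "triples: list[Triple] = ["
    )

prompt = build_code_prompt("Marie Curie discovered polonium.")
```

A code LLM completing this prompt tends to emit syntactically valid `Triple(...)` literals, which are easier to parse than free-form prose.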
Evidence
- Representative scores: the best pre-neural model OPENIE4 reaches F1 51.6 and the neural MacroIE F1 54.8 on CaRB, while OIE@OIA reaches F1 71.6 on OIE16
- GPT-3.5-TURBO with few-shot ICL achieves F1 67.9 on Re-OIE16 and F1 52.1 on CaRB — exceeding the rule-based ClausIE on Re-OIE16 (F1 64.2)
- On ORC task, CaPL (semi-supervised) achieves FewRel ARI 79.4, B3 81.9 — huge gap vs unsupervised PromptORE (ARI 43.4)
- The LSOIE dataset is claimed to be 20x larger than the second-largest human-annotated OpenIE dataset (56,662 wiki + 97,550 sci tuples)
How to Apply
- When building LLM-based structured information extraction systems, try Code-LLM approaches (GoLLIE, CodeIE style) — define schemas as Python classes instead of natural language for more stable output format
- If LLM hallucination is a problem in your OpenIE pipeline, add a traditional sequence labeling model as a supervisory signal or attach an uncertainty quantification module (Ling et al. 2023 style) to validate confidence
- If the goal is automated Knowledge Graph construction, use LLMs as annotators to synthesize training data (LLMaAA pattern), then fine-tune a small specialized model — cost-efficient approach
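As a first line of defense against hallucination (before reaching for a full uncertainty-quantification module), a cheap check is to verify that extracted arguments actually appear in the source text. This is a simplistic sketch of that idea and not Ling et al.'s method: it only flags triples whose subject or object span is absent from the input, on the assumption that OpenIE arguments should be extractive.

```python
def faithfulness_check(text: str, triples: list[tuple[str, str, str]]) -> list[dict]:
    """Flag triples whose subject or object never appears in the source text.

    A crude guard against LLM hallucination: in extractive OpenIE the
    arguments should be spans of the input, so a missing span is suspicious.
    """
    lowered = text.lower()
    report = []
    for subj, rel, obj in triples:
        grounded = subj.lower() in lowered and obj.lower() in lowered
        report.append({"triple": (subj, rel, obj), "grounded": grounded})
    return report

report = faithfulness_check(
    "Elon Musk founded SpaceX in 2002.",
    [
        ("Elon Musk", "founded", "SpaceX"),
        ("Elon Musk", "born in", "South Africa"),  # hallucinated: not in text
    ],
)
```

Exact substring matching is deliberately strict; in practice you would normalize whitespace or use fuzzy/span-level matching before discarding a triple.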
Code Example
# OpenIE Triple Extraction with GPT-3.5-TURBO (few-shot ICL approach)
from openai import OpenAI
client = OpenAI()
few_shot_examples = """
Text: "Barack Obama was born in Hawaii and served as the 44th president of the United States."
Triples:
- (Barack Obama, was born in, Hawaii)
- (Barack Obama, served as, 44th president of the United States)
Text: "Apple was founded by Steve Jobs in 1976 in Cupertino."
Triples:
- (Apple, was founded by, Steve Jobs)
- (Apple, was founded in, 1976)
- (Apple, was founded in, Cupertino)
"""
def extract_triples(text: str) -> str:
    prompt = f"""Extract all relational triples (subject, relation, object) from the given text.
Output each triple on a new line in the format: (subject, relation, object)

Examples:
{few_shot_examples}
Now extract triples from this text:
Text: "{text}"
Triples:"""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are an Open Information Extraction system. Extract factual relational triples from text."},
            {"role": "user", "content": prompt},
        ],
        temperature=0.0,  # deterministic decoding for reproducible extractions
    )
    return response.choices[0].message.content
# Usage example
text = "Elon Musk founded SpaceX in 2002 and acquired Twitter in 2022."
result = extract_triples(text)
print(result)
# Example output (actual model output may vary):
# (Elon Musk, founded, SpaceX)
# (SpaceX, was founded in, 2002)
# (Elon Musk, acquired, Twitter)
# (Twitter, was acquired in, 2022)
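Since the model returns free text, downstream use (e.g. Knowledge Graph ingestion) needs a parsing step. Here is a small regex-based parser for the `(subject, relation, object)` line format used above; the pattern is an assumption about the output shape and may need adjusting if the model deviates from it.

```python
import re

# Matches "(subject, relation, object)" at the end of a line; the first two
# fields may not contain commas, the object captures the rest.
TRIPLE_RE = re.compile(r"\(([^,]+),\s*([^,]+),\s*(.+?)\)\s*$")

def parse_triples(raw: str) -> list[tuple[str, str, str]]:
    """Parse '(subject, relation, object)' lines from model output."""
    triples = []
    for line in raw.splitlines():
        m = TRIPLE_RE.search(line.strip())
        if m:
            triples.append(tuple(part.strip() for part in m.groups()))
    return triples

sample = "- (Elon Musk, founded, SpaceX)\n- (Elon Musk, acquired, Twitter)"
parsed = parse_triples(sample)
```

Lines that do not match the pattern are silently dropped, which doubles as a cheap filter against malformed generations; log them instead if you need to audit model failures.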
Original Abstract
Open Information Extraction (OpenIE) represents a crucial NLP task aimed at deriving structured information from unstructured text, unrestricted by relation type or domain. This survey paper provides an overview of OpenIE technologies spanning from 2007 to 2024, emphasizing a chronological perspective absent in prior surveys. It examines the evolution of task settings in OpenIE to align with the advances in recent technologies. The paper categorizes OpenIE approaches into rule-based, neural, and pre-trained large language models, discussing each within a chronological framework. Additionally, it highlights prevalent datasets and evaluation metrics currently in use. Building on this extensive review, the paper outlines potential future directions in terms of datasets, information sources, output formats, methodologies, and evaluation metrics.