A Survey on Open Information Extraction from Rule-based Model to Large Language Model
TL;DR Highlight
A survey covering the evolution of OpenIE — extracting relation triples from unstructured text — from 2007 to 2024.
Who Should Read
NLP engineers building Knowledge Graph construction or information extraction pipelines. Developers designing systems to extract structured data from text using LLMs.
Core Mechanics
- OpenIE automatically extracts (entity1, relation, entity2) triples from text without predefined categories — critical for QA, search engines, and Knowledge Graph completion
- Task settings fall into three categories: ORTE (direct triple extraction), ORSE (extracting the relation span given a pair of entities), and ORC (clustering instances by relation) — each with different trade-offs depending on the use case
- ChatGPT/GPT-4 zero-shot performance is impressive but still below supervised SOTA (especially hallucination issues on long-tail relations)
- Few-shot ICL (In-Context Learning) and instruction tuning can improve LLM performance on OpenIE — GPT-3.5-TURBO few-shot achieves CaRB F1 52.1
- LLMs are blurring the line between OpenIE and standard IE — the same model can handle both tasks with different schema designs
- Code-LLM approaches (CodeIE, GoLLIE) are emerging as more effective than natural language schemas for structured output generation
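To make the Code-LLM idea concrete, here is a minimal sketch of what a code-style prompt might look like in the spirit of CodeIE/GoLLIE. The class name, field names, and prompt wording are illustrative assumptions, not the papers' exact formats — the point is that the schema is expressed as typed Python rather than natural language.

```python
from dataclasses import dataclass

# Illustrative schema in the CodeIE/GoLLIE spirit: extraction is framed as
# completing typed Python objects instead of producing free-form text.
@dataclass
class Triple:
    subject: str   # head entity span, copied verbatim from the text
    relation: str  # open relation phrase, also drawn from the text
    object: str    # tail entity span

def build_code_prompt(text: str) -> str:
    """Render the schema plus the input sentence as a code-completion prompt."""
    schema = (
        "@dataclass\n"
        "class Triple:\n"
        "    subject: str\n"
        "    relation: str\n"
        "    object: str\n"
    )
    return (
        f"{schema}\n"
        f'text = "{text}"\n'
        "# Complete the list of extracted triples:\n"
        "triples: list[Triple] = ["
    )

prompt = build_code_prompt("Marie Curie discovered polonium.")
```

A code LLM completing this prompt tends to emit syntactically valid `Triple(...)` literals, which are easier to parse than free-form prose.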
Evidence
- Representative scores: the best pre-neural model OPENIE4 reaches F1 51.6 and the neural MacroIE F1 54.8 on CaRB, while OIE@OIA reaches F1 71.6 on OIE16
- GPT-3.5-TURBO with few-shot ICL achieves F1 67.9 on Re-OIE16 and F1 52.1 on CaRB — exceeding the rule-based ClausIE on Re-OIE16 (F1 64.2)
- On ORC task, CaPL (semi-supervised) achieves FewRel ARI 79.4, B3 81.9 — huge gap vs unsupervised PromptORE (ARI 43.4)
- The LSOIE dataset is claimed to be 20x larger than the second-largest human-annotated OpenIE dataset (56,662 wiki + 97,550 sci tuples)
How to Apply
- When building LLM-based structured information extraction systems, try Code-LLM approaches (GoLLIE, CodeIE style) — define schemas as Python classes instead of natural language for more stable output format
- If LLM hallucination is a problem in your OpenIE pipeline, add a traditional sequence labeling model as a supervisory signal or attach an uncertainty quantification module (Ling et al. 2023 style) to validate confidence
- If the goal is automated Knowledge Graph construction, use LLMs as annotators to synthesize training data (LLMaAA pattern), then fine-tune a small specialized model — cost-efficient approach
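As a first line of defense against hallucination (before reaching for a full uncertainty-quantification module), a cheap check is to verify that extracted arguments actually appear in the source text. This is a simplistic sketch of that idea and not Ling et al.'s method: it only flags triples whose subject or object span is absent from the input, on the assumption that OpenIE arguments should be extractive.

```python
def faithfulness_check(text: str, triples: list[tuple[str, str, str]]) -> list[dict]:
    """Flag triples whose subject or object never appears in the source text.

    A crude guard against LLM hallucination: in extractive OpenIE the
    arguments should be spans of the input, so a missing span is suspicious.
    """
    lowered = text.lower()
    report = []
    for subj, rel, obj in triples:
        grounded = subj.lower() in lowered and obj.lower() in lowered
        report.append({"triple": (subj, rel, obj), "grounded": grounded})
    return report

report = faithfulness_check(
    "Elon Musk founded SpaceX in 2002.",
    [
        ("Elon Musk", "founded", "SpaceX"),
        ("Elon Musk", "born in", "South Africa"),  # hallucinated: not in text
    ],
)
```

Exact substring matching is deliberately strict; in practice you would normalize whitespace or use fuzzy/span-level matching before discarding a triple.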
Code Example
# OpenIE Triple Extraction with GPT-3.5-TURBO (few-shot ICL approach)
from openai import OpenAI
client = OpenAI()
few_shot_examples = """
Text: "Barack Obama was born in Hawaii and served as the 44th president of the United States."
Triples:
- (Barack Obama, was born in, Hawaii)
- (Barack Obama, served as, 44th president of the United States)
Text: "Apple was founded by Steve Jobs in 1976 in Cupertino."
Triples:
- (Apple, was founded by, Steve Jobs)
- (Apple, was founded in, 1976)
- (Apple, was founded in, Cupertino)
"""
def extract_triples(text: str) -> str:
    prompt = f"""Extract all relational triples (subject, relation, object) from the given text.
Output each triple on a new line in the format: (subject, relation, object)

Examples:
{few_shot_examples}
Now extract triples from this text:
Text: "{text}"
Triples:"""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are an Open Information Extraction system. Extract factual relational triples from text."},
            {"role": "user", "content": prompt},
        ],
        temperature=0.0,  # deterministic decoding for reproducible extractions
    )
    return response.choices[0].message.content
# Usage example
text = "Elon Musk founded SpaceX in 2002 and acquired Twitter in 2022."
result = extract_triples(text)
print(result)
# Example output (actual model output may vary):
# (Elon Musk, founded, SpaceX)
# (SpaceX, was founded in, 2002)
# (Elon Musk, acquired, Twitter)
# (Twitter, was acquired in, 2022)
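Since the model returns free text, downstream use (e.g. Knowledge Graph ingestion) needs a parsing step. Here is a small regex-based parser for the `(subject, relation, object)` line format used above; the pattern is an assumption about the output shape and may need adjusting if the model deviates from it.

```python
import re

# Matches "(subject, relation, object)" at the end of a line; the first two
# fields may not contain commas, the object captures the rest.
TRIPLE_RE = re.compile(r"\(([^,]+),\s*([^,]+),\s*(.+?)\)\s*$")

def parse_triples(raw: str) -> list[tuple[str, str, str]]:
    """Parse '(subject, relation, object)' lines from model output."""
    triples = []
    for line in raw.splitlines():
        m = TRIPLE_RE.search(line.strip())
        if m:
            triples.append(tuple(part.strip() for part in m.groups()))
    return triples

sample = "- (Elon Musk, founded, SpaceX)\n- (Elon Musk, acquired, Twitter)"
parsed = parse_triples(sample)
```

Lines that do not match the pattern are silently dropped, which doubles as a cheap filter against malformed generations; log them instead if you need to audit model failures.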
Original Abstract
Open Information Extraction (OpenIE) represents a crucial NLP task aimed at deriving structured information from unstructured text, unrestricted by relation type or domain. This survey paper provides an overview of OpenIE technologies spanning from 2007 to 2024, emphasizing a chronological perspective absent in prior surveys. It examines the evolution of task settings in OpenIE to align with the advances in recent technologies. The paper categorizes OpenIE approaches into rule-based, neural, and pre-trained large language models, discussing each within a chronological framework. Additionally, it highlights prevalent datasets and evaluation metrics currently in use. Building on this extensive review, the paper outlines potential future directions in terms of datasets, information sources, output formats, methodologies, and evaluation metrics.