From Rule-based Models to Large Language Models: A Survey of Open Information Extraction

A Survey on Open Information Extraction from Rule-based Model to Large Language Model

Aug 18, 2022 • Pai Liu, Wenyang Gao, Wen Dong +3

TL;DR Highlight

A survey that traces, at a glance, the evolution from 2007 to 2024 of OpenIE techniques, which extract relational triples from unstructured text.

Who Should Read

NLP engineers building Knowledge Graphs or information extraction pipelines. Developers designing systems that use LLMs to pull structured data out of text.

Core Mechanics

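OpenIE is the task of automatically extracting (entity1, relation, entity2) triples from text without predefined categories; it is a core component of QA, search engines, and Knowledge Graph completion.
Task settings fall into three categories: ORTE (direct triple extraction), ORSE (relation span extraction when the entities are given), and ORC (relation clustering). Each has different strengths and weaknesses, so the choice depends on the situation.
The zero-shot performance of LLMs such as ChatGPT and GPT-4 is impressive but still below supervised SOTA models (hallucination is a particular problem on long-tail relations).
Few-shot ICL (In-Context Learning) and instruction tuning can raise LLM performance on OpenIE: GPT-3.5-TURBO with few-shot prompting reaches CaRB F1 52.1.
With the arrival of LLMs, the boundary between OpenIE and standard IE is blurring: the same model can handle both tasks when only the schema design changes.
Code-LLM-based approaches (CodeIE, GoLLIE) are an emerging trend, reported to be more effective than natural-language schemas at generating structured output.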
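
To make the triple format and the ORSE setting concrete, here is a minimal sketch in Python. The `Triple` type and the `relation_span_between` helper are hypothetical names introduced only for illustration, and the heuristic (take whatever text sits between the two given entities) is a deliberately naive stand-in, not a method from the survey.

```python
from typing import NamedTuple, Optional


class Triple(NamedTuple):
    """One OpenIE extraction: (entity1, relation, entity2)."""
    subject: str
    relation: str
    object: str


def relation_span_between(text: str, entity1: str, entity2: str) -> Optional[Triple]:
    """Toy ORSE-style heuristic (assumption, not the survey's method).

    The entities are given; the relation is taken to be the text span
    between them. Returns None if an entity is missing, if entity2 does
    not occur after entity1, or if the span between them is empty.
    """
    start = text.find(entity1)
    if start == -1:
        return None
    end_of_entity1 = start + len(entity1)
    start_of_entity2 = text.find(entity2, end_of_entity1)
    if start_of_entity2 == -1:
        return None
    relation = text[end_of_entity1:start_of_entity2].strip(" ,")
    if not relation:
        return None
    return Triple(entity1, relation, entity2)


triple = relation_span_between(
    "Barack Obama was born in Hawaii.", "Barack Obama", "Hawaii"
)
print(triple)
```

In the ORTE setting the entities would not be given and the system would have to find all three parts itself; in ORC the output would be groups of similar relation expressions rather than spans.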

Evidence

On the CaRB benchmark, the best pre-neural model, OPENIE4, scores F1 51.6, while the neural model MacroIE scores F1 54.8; OIE@OIA reaches F1 71.6 on OIE16.
GPT-3.5-TURBO with few-shot ICL scores F1 67.9 on Re-OIE16 and F1 52.1 on CaRB, surpassing some traditional rule-based models (ClausIE: F1 64.2 on Re-OIE16).
On the ORC task, CaPL (semi-supervised) achieves the best results on FewRel with ARI 79.4 and B3 81.9, far ahead of the unsupervised PromptORE (ARI 43.4).
The LSOIE dataset is claimed to be 20 times larger than the next-largest human-annotated OpenIE dataset (wiki 56,662 + sci 97,550 tuples).
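
As a rough guide to how the F1 and B3 figures above are computed, the following sketch implements two simplified metrics. Both function names are hypothetical. The triple scorer uses exact matching after lowercasing, which is an assumption made for brevity: benchmarks such as CaRB use softer token-level matching, so this will not reproduce the reported numbers. Scores are returned as fractions in [0, 1], whereas the figures above are percentages.

```python
from collections import Counter
from typing import Hashable, Iterable, Sequence, Tuple

TripleT = Tuple[str, str, str]


def _normalize(triple: TripleT) -> TripleT:
    """Lowercase and trim each part so trivial differences do not count as errors."""
    return tuple(part.strip().lower() for part in triple)


def exact_match_f1(predicted: Iterable[TripleT], gold: Iterable[TripleT]) -> Tuple[float, float, float]:
    """Precision, recall, F1 over sets of triples, with exact matching (simplification)."""
    pred = {_normalize(t) for t in predicted}
    ref = {_normalize(t) for t in gold}
    matched = len(pred & ref)
    precision = matched / len(pred) if pred else 0.0
    recall = matched / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def b_cubed(clusters: Sequence[Hashable], labels: Sequence[Hashable]) -> Tuple[float, float, float]:
    """B-cubed precision, recall, F1 for a clustering against gold relation labels.

    clusters[i] is the predicted cluster id of item i; labels[i] is its gold label.
    """
    if len(clusters) != len(labels):
        raise ValueError("clusters and labels must have the same length")
    n = len(clusters)
    if n == 0:
        return 0.0, 0.0, 0.0
    cluster_sizes = Counter(clusters)
    label_sizes = Counter(labels)
    pair_sizes = Counter(zip(clusters, labels))
    # Per item: how many items sharing its cluster (or label) also share its label (or cluster).
    precision = sum(pair_sizes[(c, l)] / cluster_sizes[c] for c, l in zip(clusters, labels)) / n
    recall = sum(pair_sizes[(c, l)] / label_sizes[l] for c, l in zip(clusters, labels)) / n
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = [("Elon Musk", "founded", "SpaceX"), ("Elon Musk", "acquired", "Twitter")]
predicted = [("Elon Musk", "founded", "SpaceX"), ("Elon Musk", "bought", "Twitter")]
print(exact_match_f1(predicted, gold))
print(b_cubed([0, 0, 1, 1], ["a", "a", "a", "b"]))
```

Note how the exact-match scorer counts "bought" versus "acquired" as a complete miss; that strictness is the main reason OpenIE benchmarks prefer token-level matching.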

How to Apply

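If you are building a structured information extraction system with an LLM, try the Code-LLM approach (as in GoLLIE and CodeIE) of defining the schema as Python classes instead of in natural language; it tends to make the output format more stable.
If LLM hallucination is a problem in your OpenIE pipeline, you can add a traditional sequence labeling model as a supervisory signal, or attach an uncertainty quantification module (the Ling et al. 2023 approach) to verify confidence.
If the goal is automatic Knowledge Graph construction, using an LLM as an annotator to synthesize training data (the LLMaAA pattern) and then fine-tuning a small specialized model is cost-efficient.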
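
The first suggestion above, defining the schema as Python classes, could look roughly like the sketch below. It is written in the spirit of GoLLIE and CodeIE but is not their actual prompt format or code: the `Founded` and `Acquired` classes, `build_prompt`, and `parse_result` are all hypothetical names, and the model call is replaced by a hard-coded completion so the example runs offline.

```python
import ast
from dataclasses import dataclass, fields
from typing import List


@dataclass
class Founded:
    """A person or group founded an organization."""
    founder: str
    organization: str


@dataclass
class Acquired:
    """A buyer acquired a company."""
    buyer: str
    company: str


SCHEMA = {cls.__name__: cls for cls in (Founded, Acquired)}


def render_class(cls) -> str:
    """Render a schema dataclass as source-like text to show to the model."""
    lines = ["@dataclass", f"class {cls.__name__}:", f'    """{cls.__doc__}"""']
    for f in fields(cls):
        type_name = getattr(f.type, "__name__", str(f.type))
        lines.append(f"    {f.name}: {type_name}")
    return "\n".join(lines)


def build_prompt(text: str) -> str:
    """Code-style prompt: schema classes, the input text, then an open list to complete."""
    schema_text = "\n\n".join(render_class(cls) for cls in SCHEMA.values())
    return f"{schema_text}\n\ntext = {text!r}\nresult = ["


def parse_result(completion: str) -> List[object]:
    """Parse the model's continuation of 'result = [' without executing it.

    Only calls to known schema classes with string-literal keyword arguments
    are kept; anything else is dropped. Raises SyntaxError if the completion
    is not valid Python.
    """
    tree = ast.parse("[" + completion.strip(), mode="eval")
    if not isinstance(tree.body, ast.List):
        raise ValueError("expected a list of schema instances")
    instances = []
    for node in tree.body.elts:
        if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)):
            continue
        cls = SCHEMA.get(node.func.id)
        if cls is None or node.args:
            continue
        kwargs = {
            kw.arg: kw.value.value
            for kw in node.keywords
            if kw.arg is not None and isinstance(kw.value, ast.Constant)
        }
        if len(kwargs) != len(node.keywords):
            continue
        try:
            instances.append(cls(**kwargs))
        except TypeError:  # wrong or missing field names
            continue
    return instances


# Stand-in for a model completion (assumption: no API call is made here).
completion = (
    'Founded(founder="Elon Musk", organization="SpaceX"), '
    'Acquired(buyer="Elon Musk", company="Twitter"), '
    'Married(a="x", b="y")]'
)
print(build_prompt("Elon Musk founded SpaceX in 2002 and acquired Twitter in 2022."))
print(parse_result(completion))
```

Parsing with `ast` rather than `eval` means the completion is never executed, and off-schema items such as `Married(...)` are silently rejected, which is one practical source of the format stability mentioned above.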

Code Example

# Extract OpenIE triples with GPT-3.5-TURBO (few-shot ICL)
from openai import OpenAI

client = OpenAI()

few_shot_examples = """
Text: "Barack Obama was born in Hawaii and served as the 44th president of the United States."
Triples:
- (Barack Obama, was born in, Hawaii)
- (Barack Obama, served as, 44th president of the United States)

Text: "Apple was founded by Steve Jobs in 1976 in Cupertino."
Triples:
- (Apple, was founded by, Steve Jobs)
- (Apple, was founded in, 1976)
- (Apple, was founded in, Cupertino)
"""

def extract_triples(text: str) -> str:
    prompt = f"""Extract all relational triples (subject, relation, object) from the given text.
    Output each triple on a new line in the format: (subject, relation, object)
    
    Examples:
    {few_shot_examples}
    
    Now extract triples from this text:
    Text: "{text}"
    Triples:"""
    
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are an Open Information Extraction system. Extract factual relational triples from text."},
            {"role": "user", "content": prompt}
        ],
        temperature=0.0
    )
    return response.choices[0].message.content

# Usage example
text = "Elon Musk founded SpaceX in 2002 and acquired Twitter in 2022."
result = extract_triples(text)
print(result)
# Possible output (actual model output may vary):
# (Elon Musk, founded, SpaceX)
# (SpaceX, was founded in, 2002)
# (Elon Musk, acquired, Twitter)
# (Twitter, was acquired in, 2022)
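
Because `extract_triples` returns raw text, a downstream pipeline usually needs to parse it and screen out ungrounded triples. The sketch below shows one way to do that; `parse_triples` and `is_grounded` are hypothetical helpers, and the grounding test is a plain substring check on subject and object. That check is a simple heuristic chosen for illustration, not the uncertainty quantification method of Ling et al. 2023, and it would wrongly reject arguments the model has paraphrased.

```python
import re
from typing import List, Tuple

TripleT = Tuple[str, str, str]

# Matches lines such as "(a, b, c)" or "- (a, b, c)".
TRIPLE_LINE = re.compile(r"^\s*(?:-\s*)?\((.+)\)\s*$")


def parse_triples(output: str) -> List[TripleT]:
    """Parse '(subject, relation, object)' lines; other lines are ignored.

    Splits on the first two commas only, so commas are allowed in the
    object but not in the subject or relation (a known limitation).
    """
    triples = []
    for line in output.splitlines():
        match = TRIPLE_LINE.match(line)
        if not match:
            continue
        parts = [part.strip() for part in match.group(1).split(",", 2)]
        if len(parts) == 3 and all(parts):
            triples.append((parts[0], parts[1], parts[2]))
    return triples


def is_grounded(triple: TripleT, source: str) -> bool:
    """True if both subject and object appear verbatim (case-insensitive) in the source."""
    subject, _relation, obj = triple
    lowered = source.lower()
    return subject.lower() in lowered and obj.lower() in lowered


source = "Elon Musk founded SpaceX in 2002 and acquired Twitter in 2022."
# Hand-written stand-in for model output, including one hallucinated object.
raw_output = """(Elon Musk, founded, SpaceX)
- (SpaceX, was founded in, 2002)
(Elon Musk, acquired, Tesla)
Here are the triples I found."""

parsed = parse_triples(raw_output)
grounded = [t for t in parsed if is_grounded(t, source)]
print(grounded)
```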

Terminology

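OpenIE: A technique that extracts relations from text automatically, without fixing the relation types in advance. Where standard IE is restricted, as in "find only employment relations", OpenIE takes the approach of "find every relation".

Knowledge Graph: A database that stores entities and the relations between them as a graph; a set of triples such as "Obama → birthplace → Hawaii".

Sequence Labeling: An approach that assigns a tag to each word in a sentence, for example labeling "Obama[subject] was born in[relation] Hawaii[object]".

ICL (In-Context Learning): The ability of an LLM to follow a pattern from a few examples placed in the prompt, without additional training. The model weights stay unchanged; the examples alone drive the behavior.

Hallucination: The phenomenon where an LLM plausibly makes up content that is not in the actual input. In triple extraction, this means producing relations that are absent from the source text.

Instruction Tuning: Additional training on (instruction, answer) pairs so that a model follows natural-language instructions well. One of the reasons ChatGPT follows instructions well.

ORC (Open Relation Clustering): An approach that groups similar relation expressions together without explicit relation labels, for example putting "created by", "founded by", and "established by" in the same cluster.

UIE (Universal Information Extraction): An approach that aims to handle diverse information extraction tasks such as NER, relation extraction, and event extraction in a single unified framework.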
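
The Sequence Labeling entry above can be illustrated with a small decoder that turns per-token tags into a triple. The tag names (`SUBJ`, `REL`, `OBJ`, `O`) and the `decode_tags` function are assumptions made for this sketch; real OpenIE taggers typically use BIO-style tag sets and can emit several extractions per sentence.

```python
from typing import Optional, Sequence, Tuple


def decode_tags(tokens: Sequence[str], tags: Sequence[str]) -> Optional[Tuple[str, str, str]]:
    """Collect tokens by tag and join them into (subject, relation, object).

    Assumes one extraction per sentence and the simplified tag set
    SUBJ / REL / OBJ / O. Returns None if any of the three parts is empty.
    """
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    spans = {"SUBJ": [], "REL": [], "OBJ": []}
    for token, tag in zip(tokens, tags):
        if tag in spans:  # tokens tagged "O" (outside) are skipped
            spans[tag].append(token)
    if not all(spans.values()):
        return None
    return (" ".join(spans["SUBJ"]), " ".join(spans["REL"]), " ".join(spans["OBJ"]))


tokens = ["Obama", "was", "born", "in", "Hawaii", "."]
tags = ["SUBJ", "REL", "REL", "REL", "OBJ", "O"]
decoded = decode_tags(tokens, tags)
print(decoded)
```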

Original Abstract

Open Information Extraction (OpenIE) represents a crucial NLP task aimed at deriving structured information from unstructured text, unrestricted by relation type or domain. This survey paper provides an overview of OpenIE technologies spanning from 2007 to 2024, emphasizing a chronological perspective absent in prior surveys. It examines the evolution of task settings in OpenIE to align with the advances in recent technologies. The paper categorizes OpenIE approaches into rule-based, neural, and pre-trained large language models, discussing each within a chronological framework. Additionally, it highlights prevalent datasets and evaluation metrics currently in use. Building on this extensive review, the paper outlines potential future directions in terms of datasets, information sources, output formats, methodologies, and evaluation metrics.