SLOT: A General-Purpose Post-Processing Layer That Converts LLM Output into Structured Formats

SLOT: Structuring the Output of Large Language Models

May 6, 2025 • D. Wang, Zhengyuan Shen, Soumya Mishra +3

TL;DR Highlight

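A lightweight post-processing model that converts the output of any LLM into JSON. A fine-tuned Mistral-7B with constrained decoding beats Claude-3.5-Sonnet on schema accuracy by about 25 percentage points (99.5% vs. 74.7%).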

Who Should Read

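Backend and ML engineers whose function-calling, agent, or information-extraction pipelines hit parsing errors because LLM output does not follow the JSON schema. Especially relevant for platform teams serving several LLMs at once.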

Core Mechanics

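Attach a separate lightweight model (SLOT) as a post-processing layer that converts raw LLM output into JSON. Because the upstream LLM's weights are untouched, it can be applied to any model.
Fine-tuning Mistral-7B with LoRA (a technique that trains only a small number of parameters) reaches 98.2% schema accuracy and 92.9% content similarity, which is +23.5 and +19.0 percentage points over Claude-3.5-Sonnet, respectively.
Even a very small model such as Llama-3.2-1B, once trained as a SLOT model, reaches schema accuracy on par with Claude-3.5-Haiku (88.9%) and higher content similarity (81.7%).
SFT combined with XGrammar (constrained decoding) performs best: 99.5% schema accuracy and 94.0% content similarity on Mistral-7B, close to perfect.
Training uses 126K synthetic examples generated with Claude-3.5-Sonnet, which improves schema accuracy by more than 26 percentage points over training on public datasets alone.
The paper also proposes its own evaluation metrics, measuring along two axes: schema accuracy (key and type match) and Sentence-BERT-based content similarity (soft F1).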
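
As a rough illustration of the schema-accuracy axis, the sketch below checks that an output parses as JSON, contains the required keys, and that each value matches its declared type. It is a simplified stand-in, not the paper's evaluation code: the helper names `matches_schema` and `schema_accuracy` are hypothetical, and only a small subset of JSON Schema (`type`, `properties`, `required`, `items`) is handled.

```python
import json

# Map JSON Schema type names to Python types (subset only; an assumption of this sketch).
_TYPES = {
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
    "object": dict,
    "array": list,
}


def matches_schema(value, schema):
    """Return True if `value` matches the supported subset of `schema`."""
    expected = schema.get("type")
    if expected is not None:
        py_type = _TYPES.get(expected)
        if py_type is None or not isinstance(value, py_type):
            return False
        # bool is a subclass of int in Python, so reject it for numeric types.
        if expected in ("number", "integer") and isinstance(value, bool):
            return False
    if isinstance(value, dict):
        props = schema.get("properties", {})
        if any(key not in value for key in schema.get("required", [])):
            return False
        return all(
            matches_schema(value[key], sub) for key, sub in props.items() if key in value
        )
    if isinstance(value, list) and "items" in schema:
        return all(matches_schema(item, schema["items"]) for item in value)
    return True


def schema_accuracy(raw_outputs, schema):
    """Fraction of raw model outputs that parse as JSON and match `schema`."""
    if not raw_outputs:
        return 0.0
    ok = 0
    for raw in raw_outputs:
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # unparseable output counts as a schema failure
        ok += matches_schema(parsed, schema)
    return ok / len(raw_outputs)
```

A production evaluation would more likely rely on a full JSON Schema validator; the hand-rolled check here only keeps the example dependency-free.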

Evidence

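Mistral-7B + SFT + XGrammar: 99.5% schema accuracy and 94.0% content similarity, vs. 74.7% / 73.9% for Claude-3.5-Sonnet
Llama-3.2-1B with SFT alone reaches 88.9% schema accuracy, on par with Claude-3.5-Haiku (89.0%), and its 81.7% content similarity exceeds Sonnet's 73.9%
On the GitHub Issues dataset (the most complex nested structure), Mistral-7B's schema accuracy improves from 0% before fine-tuning to 93.1% after SFT
Training on synthetic data alone yields 26.5 percentage points higher schema accuracy than training on public data alone (Llama-3.2-1B: 89.6% vs. 63.1%)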

How to Apply

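Add SLOT as a post-processing layer in an existing LLM API pipeline: pass the LLM's response text and the JSON schema to SLOT and it returns structured JSON. There is no need to swap or retrain the upstream LLM.
Where resources are constrained, such as on-premises or edge deployments, and reliable structured output is needed without GPT-4o, fine-tune Llama-3.2-1B or Mistral-7B in the SLOT style and deploy it.
When building your own data synthesis pipeline: sample across five dimensions (industry vertical, JSON complexity, text type, and others), have Claude generate (input_text, json_schema, gold) triples, and apply two-stage validation in which an LLM validator filters out hallucinations.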
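
The synthesis recipe in the last item can be sketched as a sampling step followed by a filter. Everything concrete here is an assumption: this summary names only three of the five dimensions, so the dimension values, the `generate` and `validate` callables (which would wrap LLM calls), and the function names are hypothetical placeholders rather than the paper's pipeline. Validation is reduced to a single callable for brevity.

```python
import random

# Hypothetical values: the source names these three dimensions but not their
# contents, and the remaining two dimensions are not listed, so they are omitted.
DIMENSIONS = {
    "industry_vertical": ["finance", "healthcare", "retail"],
    "json_complexity": ["flat", "nested", "nested_with_arrays"],
    "text_type": ["email", "report", "chat_log"],
}


def sample_spec(rng):
    """Draw one value per dimension to describe a synthetic example."""
    return {name: rng.choice(values) for name, values in DIMENSIONS.items()}


def synthesize(n, generate, validate, seed=0):
    """Generate n candidate triples and keep only those the validator accepts.

    generate(spec) -> dict with keys "input_text", "json_schema", "gold"
    validate(triple) -> bool, e.g. an LLM judge that flags hallucinated values
    """
    rng = random.Random(seed)
    kept = []
    for _ in range(n):
        triple = generate(sample_spec(rng))
        if validate(triple):
            kept.append(triple)
    return kept
```

Passing the generator and validator in as callables keeps the sampling logic testable with stubs before any model calls are wired in.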

Code Example

# SLOT-style direct prompting (for trying the idea without fine-tuning)
import json

prompt = """
Convert the following text into JSON format according to the specified schema.
Ensure that both keys and values are strings, even for numerical values.

Text: {input_text}

Provide your response in the following JSON format: {json_schema}

Please output ONLY the JSON structure and extract the attributes only present in the schema.
Output:
"""

# Example usage
input_text = "Apple reported Q3 revenue of $89.5B, up 5% YoY. iPhone sales drove growth."
json_schema = {
    "type": "object",
    "properties": {
        "company": {"type": "string"},
        "quarter": {"type": "string"},
        "revenue_billion": {"type": "string"},
        "yoy_growth": {"type": "string"},
        "growth_driver": {"type": "string"}
    },
    "required": ["company", "quarter", "revenue_billion", "yoy_growth", "growth_driver"]
}

formatted_prompt = prompt.format(
    input_text=input_text,
    json_schema=json.dumps(json_schema)
)
# Sending this prompt to any LLM approximates SLOT's basic behavior without fine-tuning
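
To use the prompt above as a post-processing step, the model's reply still has to be parsed and checked. The sketch below is one minimal way to do that, assuming a caller-supplied `call_model` function (hypothetical: any function that takes a prompt string and returns the model's text). The template is a shortened version of the prompt above, the check covers required keys only, and none of this is the paper's implementation.

```python
import json

PROMPT_TEMPLATE = (
    "Convert the following text into JSON format according to the specified schema.\n"
    "Text: {input_text}\n"
    "Provide your response in the following JSON format: {json_schema}\n"
    "Please output ONLY the JSON structure.\n"
    "Output:\n"
)


def structure_output(input_text, json_schema, call_model, max_attempts=2):
    """Ask a model for JSON and return the parsed dict, or None on failure."""
    prompt = PROMPT_TEMPLATE.format(
        input_text=input_text, json_schema=json.dumps(json_schema)
    )
    required = json_schema.get("required", [])
    for _ in range(max_attempts):
        reply = call_model(prompt)
        # Keep only the outermost {...} span in case the model adds extra text.
        start, end = reply.find("{"), reply.rfind("}")
        if start == -1 or end <= start:
            continue
        try:
            parsed = json.loads(reply[start : end + 1])
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict) and all(key in parsed for key in required):
            return parsed
    return None
```

Returning `None` instead of raising leaves it to the caller to decide whether to fall back, log, or fail the request.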

Terminology

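SFT: Supervised fine-tuning, in which the model is shown reference answers and trained to imitate them. Comparable to learning at school by following worked solutions.

LoRA: A lightweight fine-tuning technique that adds a capability by inserting small adapter matrices instead of retraining the whole model. It uses far less GPU memory, so even a 7B model can be trained on a consumer GPU.

constrained decoding: A method that forces the LLM, as it generates tokens, to choose only tokens that satisfy the grammar rules. Similar to an autocomplete that suggests only grammatically valid words.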
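
A toy sketch of the idea: at each step, tokens the grammar does not allow are masked out before the next token is chosen. The `allowed_next` function, the `score_fn` stand-in for model logits, and the tiny vocabulary are all hypothetical; a real engine such as XGrammar works on actual tokenizer vocabularies and its API is not shown here.

```python
def constrained_greedy_decode(score_fn, allowed_next, max_steps=10, eos="<eos>"):
    """Greedy decoding restricted to tokens permitted by `allowed_next`.

    score_fn(prefix) -> dict mapping token -> score (stand-in for model logits)
    allowed_next(prefix) -> set of tokens the grammar permits after `prefix`
    """
    prefix = []
    for _ in range(max_steps):
        scores = score_fn(prefix)
        allowed = allowed_next(prefix)
        candidates = {tok: s for tok, s in scores.items() if tok in allowed}
        if not candidates:
            break  # the grammar permits nothing the model can emit
        best = max(candidates, key=candidates.get)
        if best == eos:
            break
        prefix.append(best)
    return prefix


# Toy grammar: the output must be exactly  { "key" : "value" }
SEQUENCE = ["{", '"key"', ":", '"value"', "}", "<eos>"]


def allowed_next(prefix):
    return {SEQUENCE[len(prefix)]} if len(prefix) < len(SEQUENCE) else set()


def score_fn(prefix):
    # An unconstrained model that always prefers chatting over emitting JSON.
    scores = {tok: 0.1 for tok in SEQUENCE}
    scores["Sure!"] = 0.9
    return scores
```

Without the mask this toy model would emit "Sure!" at every step; with it, the only reachable output is the grammar-conformant token sequence.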

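JSON Schema: A specification that declaratively defines the structure of JSON data (key names, value types, required fields, and so on). Commonly used to validate API inputs and outputs.

Sentence-BERT: A model that converts sentences into semantic vectors. The semantic similarity of two sentences can then be computed as cosine similarity.

XGrammar: A constrained-decoding engine built on a byte-level pushdown automaton. It handles complex nested JSON schemas quickly.

soft-precision / soft-recall: Precision and recall computed from semantic similarity rather than exact match. A prediction that is not identical to the gold value still earns partial credit if its meaning is similar.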
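
A minimal sketch of soft precision, recall, and F1 over predicted and gold values. The paper uses Sentence-BERT similarity; to keep this runnable without extra dependencies, the default `char_similarity` below is a character-level ratio from `difflib`, which is a swapped-in placeholder and not semantic. The best-match aggregation is also an assumption about how the soft scores are combined, and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def char_similarity(a, b):
    """Placeholder similarity in [0, 1]; swap in embedding cosine similarity."""
    return SequenceMatcher(None, a, b).ratio()


def soft_f1(predicted, gold, similarity=char_similarity):
    """Soft precision, recall, and F1 between two lists of strings.

    Each predicted value is credited with its best similarity to any gold
    value (precision), and each gold value with its best similarity to any
    predicted value (recall).
    """
    if not predicted or not gold:
        return 0.0, 0.0, 0.0
    precision = sum(max(similarity(p, g) for g in gold) for p in predicted) / len(predicted)
    recall = sum(max(similarity(g, p) for p in predicted) for g in gold) / len(gold)
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

Because `similarity` is a parameter, an embedding-based scorer can replace the placeholder without changing the aggregation code.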

Original Abstract

Structured outputs are essential for large language models (LLMs) in critical applications like agents and information extraction. Despite their capabilities, LLMs often generate outputs that deviate from predefined schemas, significantly hampering reliable application development. We present SLOT (Structured LLM Output Transformer), a model-agnostic approach that transforms unstructured LLM outputs into precise structured formats. While existing solutions predominantly rely on constrained decoding techniques or are tightly coupled with specific models, SLOT employs a fine-tuned lightweight language model as a post-processing layer, achieving flexibility across various LLMs and schema specifications. We introduce a systematic pipeline for data curation and synthesis alongside a formal evaluation methodology that quantifies both schema accuracy and content fidelity. Our results demonstrate that fine-tuned Mistral-7B model with constrained decoding achieves near perfect schema accuracy (99.5%) and content similarity (94.0%), outperforming Claude-3.5-Sonnet by substantial margins (+25 and +20 percentage points, respectively). Notably, even compact models like Llama-3.2-1B can match or exceed the structured output capabilities of much larger proprietary models when equipped with SLOT, enabling reliable structured generation in resource-constrained environments.