Verifiable Format Control for LLM Output Formatting

Verifiable Format Control for Large Language Model Generations

Feb 6, 2025 • Zhaoyang Wang, Jinqi Jiang, Huichi Zhou +4

TL;DR Highlight

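A study that addresses the failure of small 7B LLMs to follow format instructions such as JSON, using a dataset built on Python verification functions together with progressive training.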

Who Should Read

Backend developers who need to parse LLM API responses as JSON or another specific format, and ML engineers who want to fine-tune small open-source LLMs and deploy them to production.

Core Mechanics

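Unlike GPT-4, 7B-class open-source LLMs (Mistral, LLaMA-2/3) struggle with fine-grained format instructions such as JSON; performance drops sharply once 2 to 3 constraints are combined.
Judging format compliance with GPT-4o gives 70% accuracy at $2.38 per 200 samples, whereas a Python function gives 100% accuracy at $0 and runs over 100x faster.
VFF (Verifiable Format Following) dataset: 60 meta constraints, each automatically checkable with a Python function that returns a bool, combined with 52K questions.
Self-improvement: responses generated by the model itself are scored with the Python functions and reused as SFT and DPO (preference learning) data, with no external LLM API calls.
Progressive training in the order level-1 (1 constraint) → level-2 → level-3 (3 constraints) lets LLaMA-3-8B overtake GPT-4-turbo at level-3 (38.36% vs 35.31%).
GPT-4o is also inconsistent as a judge: for the same question, its format verdicts differ at rates of up to 25% to 52%.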
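
The self-improvement step above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: `build_training_data`, the two toy verifiers, and the hard-coded candidate responses are hypothetical stand-ins for sampling from the model being trained.

```python
import json
from itertools import product


def is_json(response: str) -> bool:
    """Toy verifier: True if the response parses as JSON."""
    try:
        json.loads(response)
    except (TypeError, ValueError):
        return False
    return True


def within_words(response: str, limit: int = 5) -> bool:
    """Toy verifier: True if the response has at most `limit` words."""
    return len(response.split()) <= limit


def build_training_data(prompt, candidates, verifiers):
    """Score sampled responses with Python verifiers (logical AND) and
    split them into SFT examples and DPO preference pairs."""
    passed = [r for r in candidates if all(v(r) for v in verifiers)]
    failed = [r for r in candidates if r not in passed]
    sft = [{"prompt": prompt, "response": r} for r in passed]
    dpo = [
        {"prompt": prompt, "chosen": good, "rejected": bad}
        for good, bad in product(passed, failed)
    ]
    return sft, dpo


# Hypothetical samples; in practice these would come from the model itself.
prompt = "Name the capital of France. Answer in JSON, in at most 5 words."
candidates = [
    '{"answer": "Paris"}',
    "The capital of France is Paris.",
    '{"answer": "The capital city of France is Paris"}',
]
sft_data, dpo_pairs = build_training_data(prompt, candidates, [is_json, within_words])
print(len(sft_data), len(dpo_pairs))
```

Only the first candidate satisfies both constraints, so it becomes the single SFT example and the "chosen" side of two preference pairs. No external judge is involved at any point, which is the property the summary attributes to the method.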

Evidence

LLaMA-3-8B accuracy on VFF level-3: 15.81% for the base model → 38.36% after training (above GPT-4-turbo's 35.31%)
Python-based vs GPT-4o-based judging: accuracy 100% vs 70%, processing time 0.52 s vs 205.10 s, cost $0 vs $2.383 per 200 samples
GPT-4o format-judging consistency test: with the same question repeated 50 times at temperature 1.0, a 48% inconsistency rate and 10.15 verdict flips
Progressive training with SFT+DPO is consistently better than DPO-only: 38.36% vs 17.95% at level-3
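
The ratios implied by these figures can be worked out directly. The snippet below only does arithmetic on the numbers reported above; the variable names are illustrative.

```python
# Figures as reported in the Evidence section above.
python_seconds, gpt4o_seconds = 0.52, 205.10
gpt4o_cost_usd, n_samples = 2.383, 200
base_acc, trained_acc, gpt4_turbo_acc = 15.81, 38.36, 35.31

speedup = gpt4o_seconds / python_seconds
cost_per_sample = gpt4o_cost_usd / n_samples
gain_points = trained_acc - base_acc
margin_points = trained_acc - gpt4_turbo_acc

print(f"speedup: {speedup:.0f}x")
print(f"GPT-4o cost per sample: ${cost_per_sample:.4f}")
print(f"level-3 gain: {gain_points:.2f} points")
print(f"margin over GPT-4-turbo: {margin_points:.2f} points")
```

On these numbers the Python judge is roughly 394x faster, GPT-4o judging costs a little over one cent per sample, and training adds 22.55 percentage points at level-3, which puts the trained model 3.05 points ahead of GPT-4-turbo.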

How to Apply

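To fine-tune a small LLM specifically for JSON responses: filter the VFF dataset (huggingface.co/datasets/jinqij/VFF) down to the samples with 'Limited Structure' constraints and use them as SFT data.
If you have your own format constraints: define a meta constraint as 'constraint text + candidate variable values + Python verification function', then score the model's generated responses automatically in Python to produce as many DPO pairs as you need.
If you already use LLaMA-Factory: apply the paper's training settings as-is (LoRA rank=64, α=128, lr=5e-6, AdamW, cosine scheduler, 8 epochs) and train with SFT first, then DPO. The source describes this as SFTTrainer → DPOTrainer, which are TRL class names; in LLaMA-Factory the corresponding training stages are sft and dpo.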
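
The 'constraint text + candidate values + Python verification function' format for custom constraints might look like the sketch below. The `MetaConstraint` class, the `instantiate` helper, and the template wording are hypothetical and chosen for illustration; the VFF dataset's actual schema may differ.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class MetaConstraint:
    """Constraint template plus candidate values and a Python verifier."""
    template: str                                  # text with a VAR1 placeholder
    candidates: Sequence[str]                      # values that may fill VAR1
    verify: Callable[[str, List[str]], bool]       # (response, variables) -> bool


def verify_word_limit(response_text: str, variables: List[str]) -> bool:
    """True if the response has at most variables[0] words."""
    return len(response_text.split()) <= int(variables[0])


def instantiate(meta: MetaConstraint, rng: random.Random):
    """Pick a candidate value and return (instruction text, variables)."""
    value = rng.choice(list(meta.candidates))
    return meta.template.replace("VAR1", value), [value]


word_limit = MetaConstraint(
    template="Answer in at most VAR1 words.",
    candidates=["30", "50", "100"],
    verify=verify_word_limit,
)

instruction, variables = instantiate(word_limit, random.Random(0))
print(instruction)
print(word_limit.verify("Paris is the capital of France.", variables))
```

Keeping the verifier next to the template means every instantiated instruction carries its own pass/fail check, so new constraints can be added without touching the scoring loop.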

Code Example

# Example Python verification functions: automatic checks for format compliance
import json

# Note: `vars` and `type` shadow Python built-ins; the names are kept from the original snippet.
def verify_json_format(response_text, vars, type=0):
    """Return True if response_text parses as JSON."""
    try:
        json.loads(response_text)
    except (TypeError, ValueError):  # non-string input or malformed JSON
        return False
    return True

# Word-limit verification function
def verify_word_limit(response_text, vars, type=0):
    word_limit = int(vars[0])  # vars[0] = 30, 50, 100, etc.
    word_count = len(response_text.split())
    meets_criteria = word_count <= word_limit
    if type == 0:
        # Binary verdict
        return meets_criteria
    # Graded score: 1 when within the limit, lower the further the response runs over it
    if meets_criteria:
        return 1
    return 1 - (word_count - word_limit) / word_limit

# Combine several constraints with AND (level-c verification)
def verify_all_constraints(response, constraint_fns, vars_list):
    # Every constraint must pass for the indicator I to be 1
    return all(fn(response, v) for fn, v in zip(constraint_fns, vars_list))

# Usage example
response = '{"answer": "Paris"}'
print(verify_json_format(response, []))  # True

response_long = "This is a very long response with many many words"
print(verify_word_limit(response_long, [5]))  # False
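
One caveat when a check like verify_json_format sits behind an API: json.loads accepts any valid JSON value, including a bare number or string, so a response such as 42 passes. If downstream code expects an object with specific fields, a stricter verifier may be needed. The sketch below is an assumption about what such a check could look like and is not taken from the paper; `verify_json_object` and `required_keys` are hypothetical names.

```python
import json


def verify_json_object(response_text, required_keys=()):
    """True only if the response is a JSON object containing every key
    in `required_keys` (hypothetical helper, not from the paper)."""
    try:
        parsed = json.loads(response_text)
    except (TypeError, ValueError):
        return False
    return isinstance(parsed, dict) and all(k in parsed for k in required_keys)


print(verify_json_object('{"answer": "Paris"}', ["answer"]))  # True
print(verify_json_object("42"))                               # False: valid JSON, but not an object
print(verify_json_object('{"city": "Paris"}', ["answer"]))    # False: required key missing
```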

Terminology

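DPO (Direct Preference Optimization): A technique that trains preferences by showing the model pairs of the form 'this answer is better than that one'. It compares a preferred answer with a rejected one and pushes the model toward the better of the two.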

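SFT (Supervised Fine-Tuning): Supervised learning in which the model is shown correct examples directly and learns to imitate them, much like copying out a model answer at school.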

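LoRA (Low-Rank Adaptation): A fine-tuning technique that adds and trains only small adapter layers instead of retraining the whole model. The original weights stay frozen and only the auxiliary modules are updated, which greatly reduces compute.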

Meta constraint: A constraint template before concrete values are filled in. Example: in 'Respond in the VAR1 language', VAR1 is replaced with 'English', 'Spanish', and so on to produce an actual constraint.

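self-improvement: A paradigm in which a model generates data itself and is then trained on that data. The model produces good and bad answers on its own, without human annotators or GPT-4, and uses them as training data.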

IFEval: A benchmark that uses Python code to automatically evaluate how well an LLM follows format instructions. It consists of about 500 test samples.

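constrained decoding: A decoding technique that forces an LLM to obey a grammar or schema while generating text. With a JSON schema defined in advance, tokens that would violate the format are blocked from being generated at all.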
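
To make the DPO entry above concrete, the standard DPO objective for a single preference pair can be computed from four sequence log-probabilities. This is a sketch of the general formula, not of the paper's training code; `beta` and the example log-probabilities are illustrative values.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Arguments are sequence log-probabilities under the policy being
    trained and under the frozen reference model.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)), written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))


# Policy identical to the reference: the margin is 0 and the loss is log(2).
loss_at_reference = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy has shifted probability toward the chosen response: the loss is lower.
loss_after_shift = dpo_loss(-8.0, -14.0, -10.0, -12.0)
print(loss_at_reference, loss_after_shift)
```

The loss falls as the policy raises the chosen response's likelihood relative to the rejected one, measured against the reference model. That is the "push toward the better answer" described in the glossary entry.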

Related Resources

VFF dataset (HuggingFace): huggingface.co/datasets/jinqij/VFF

Original Abstract

Recent Large Language Models (LLMs) have demonstrated satisfying general instruction following ability. However, small LLMs with about 7B parameters still struggle fine-grained format following (e.g., JSON format), which seriously hinder the advancements of their applications. Most existing methods focus on benchmarking general instruction following while overlook how to improve the specific format following ability for small LLMs. Besides, these methods often rely on evaluations based on advanced LLMs (e.g., GPT-4), which can introduce the intrinsic bias of LLMs and be costly due to the API calls. In this paper, we first curate a fully verifiable format following dataset VFF. In contrast to existing works often adopting external LLMs for instruction-following validations, every sample of VFF can be easily validated with a Python function. Further, we propose to leverage this verifiable feature to synthesize massive data for progressively training small LLMs, in order to improve their format following abilities. Experimental results highlight the prevalent limitations in the format following capabilities of 7B level open-source LLMs and demonstrate the effectiveness of our method in enhancing this essential ability.