Beyond Final Answers: CRYSTAL Benchmark for Transparent Multimodal Reasoning Evaluation

Mar 13, 2026 • Wayner Barrios, SouYoung Jin

TL;DR Highlight

Introduces CRYSTAL, a benchmark that checks step by step whether a multimodal AI model's reasoning is flawed even when its final answer is correct.

Who Should Read

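ML engineers and AI researchers who evaluate multimodal LLMs (models that process images and text) or want to improve their reasoning quality. Especially relevant for developers who want to address models that reach the correct answer by luck.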

Core Mechanics

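Existing benchmarks score only the final answer, so they cannot tell a lucky guess from genuine understanding → CRYSTAL verifies the intermediate reasoning step by step
6,372 problems with an average of 11.6 verifiable reference reasoning steps each, built from five sources including MathVision, ScienceQA, and RealWorldQA
Evaluating 20 MLLMs shows that cherry-picking is near-universal: 19 of 20 models have high precision but low recall (even GPT-5 covers only 47.9% of the reference steps)
Scaling up model size does not raise accuracy and reasoning quality together: Gemma3-4B (F1 0.618) shows higher reasoning quality than InternVL3.5-38B (F1 0.612)
No competitive model places more than 60% of its matched reasoning steps in the correct order (GPT-5-mini LIS = 0.560, i.e., 44% are out of order)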
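Causal Process Reward (CPR) couples answer correctness and reasoning multiplicatively; combined with a curriculum (CPR-Curriculum), it raises Match F1 by 32% under GRPO training, where additive reward strategies fail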
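
The ordering check behind the LIS figure above can be sketched as follows. This is a minimal illustration, assuming the LIS ratio is the length of the longest increasing subsequence of matched reference indices divided by the number of matched steps. The function names and the convention of returning 0.0 for an empty match list are hypothetical and not taken from the paper.

```python
from bisect import bisect_left


def lis_length(seq):
    """Length of the longest strictly increasing subsequence (patience sorting)."""
    tails = []  # tails[k] = smallest tail of an increasing subsequence of length k + 1
    for x in seq:
        k = bisect_left(tails, x)
        if k == len(tails):
            tails.append(x)
        else:
            tails[k] = x
    return len(tails)


def lis_ratio(matched_ref_indices):
    """Fraction of matched steps that appear in reference order.

    matched_ref_indices: for each matched predicted step, in the order the
    model produced it, the index of the reference step it was matched to.
    Returning 0.0 when nothing matched is an assumption of this sketch.
    """
    if not matched_ref_indices:
        return 0.0
    return lis_length(matched_ref_indices) / len(matched_ref_indices)


# Six matched steps, two of which are swapped relative to the reference order.
example = [0, 2, 1, 3, 5, 4]
print(f"LIS ratio: {lis_ratio(example):.3f}")
```

A ratio of 1.0 means every matched step follows the reference order; a fully reversed chain of n steps scores 1/n.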

Evidence

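GPT-5-mini ranks first among the 20 models with Match F1 0.773, yet its accuracy (55.59%) is lower than GPT-5's (57.99%) → confirms the gap between accuracy and reasoning quality
With CPR-Curriculum on Qwen2.5-VL-3B, Match F1 rises from 0.480 to 0.633 (+32%) and accuracy from 39.85% to 47.52% (+7.67pp) at the same time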
With CPR-Curriculum on InternVL3.5-4B, Match F1 rises from 0.432 to 0.833 (+93%) and recall from 0.325 to 0.811 (about 2.5x), with accuracy up 8.15pp
Embedding-based step matching agrees with human judgment 84% of the time (Cohen's κ = 0.534); for pairs below the threshold there were zero mismatches (100% agreement)
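
To make the agreement figures above concrete, here is a minimal sketch of how raw agreement and Cohen's κ are computed from two sets of match/no-match labels. The label lists are made-up toy data, not the paper's annotations, and the function names are hypothetical.

```python
def percent_agreement(labels_a, labels_b):
    """Share of items on which the two raters give the same label."""
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement corrected for the agreement expected by chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = percent_agreement(labels_a, labels_b)
    categories = set(labels_a) | set(labels_b)
    # Chance agreement: product of each rater's marginal rate, summed over labels.
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:  # both raters used one identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy data: 1 = "steps match", 0 = "steps do not match".
metric_labels = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
human_labels = [1, 1, 0, 0, 0, 1, 0, 1, 1, 1]
print(percent_agreement(metric_labels, human_labels))
print(cohens_kappa(metric_labels, human_labels))
```

κ is lower than raw agreement because part of the agreement would occur by chance when both raters label most pairs the same way.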

How to Apply

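When evaluating a model, do not look only at final-answer accuracy: have the model output a reasoning_steps array in JSON and compare it with the reference steps by semantic similarity to surface hidden reasoning errors
For RL-based training such as GRPO, design the reward in the CPR style (the full step bonus is paid only when the answer is correct; for a wrong answer it is scaled down by a factor of 0.3) rather than as a simple sum (accuracy + reasoning), which keeps the model from skipping reasoning
If curriculum learning is needed, apply the CPR-Curriculum strategy: train on accuracy alone in Phase 1, then in Phase 2 start from easy problems with few steps and raise the difficulty gradually, which improves reasoning quality stably without training collapse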
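
The two-phase curriculum described above could be organized roughly as follows. This is a sketch under stated assumptions: the number of stages, the use of reference-step count as the difficulty measure, and the names build_curriculum and reward_for_phase are all hypothetical; the paper's actual schedule may differ. The reward weights mirror the defaults in the Code Example section.

```python
def build_curriculum(problems, num_stages=3):
    """Split problems into stages ordered by number of reference steps (fewest first)."""
    if not problems:
        return []
    ordered = sorted(problems, key=lambda p: len(p["reference_steps"]))
    stage_size = -(-len(ordered) // num_stages)  # ceiling division
    return [ordered[i:i + stage_size] for i in range(0, len(ordered), stage_size)]


def reward_for_phase(phase, answer_correct, step_f1,
                     aw=0.65, sw=0.35, lambda_penalty=0.3):
    """Phase 1 rewards accuracy only; Phase 2 switches to the CPR-style reward."""
    if phase == 1:
        return 1.0 if answer_correct else 0.0
    if answer_correct:
        return aw + sw * step_f1
    return sw * step_f1 * lambda_penalty


# Toy problems with 5, 2, 9, 3, 7, and 1 reference steps.
problems = [{"id": i, "reference_steps": ["step"] * n}
            for i, n in enumerate([5, 2, 9, 3, 7, 1])]
for stage_no, stage in enumerate(build_curriculum(problems), start=1):
    print(stage_no, [len(p["reference_steps"]) for p in stage])
```

In a training loop, Phase 2 would iterate over the stages in order, so the policy sees short reasoning chains before long ones.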

Code Example

# CRYSTAL evaluation prompt (ask the model to respond in this format)
system_prompt = """
You are a vision-language model. Analyze the provided image(s) and user text silently.
Return ONLY a valid JSON object with this schema:
{"reasoning_steps": [], "answer": ""}

Rules for "reasoning_steps":
- Include enough steps to make the answer evident without filler.
- Write single-clause sentences, each adding a new, directly checkable fact.
- No multi-sentence items. No internal monologue.

Rules for "answer":
- Ground strictly in visible content and given text.
- Multiple-choice: return only the best LETTER (e.g., "B").
- Numeric: include units.
"""

# Example Match F1 computation (uses sentence-transformers)
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('all-distilroberta-v1')
THRESHOLD = 0.35

def compute_match_f1(predicted_steps, reference_steps):
    if not predicted_steps and not reference_steps:
        return 1.0
    if not predicted_steps or not reference_steps:
        return 0.0
    
    pred_emb = model.encode(predicted_steps)
    ref_emb = model.encode(reference_steps)
    sim_matrix = cosine_similarity(pred_emb, ref_emb)
    
    # Greedy 1:1 matching: accept the highest-similarity pairs first
    matched_pred, matched_ref = set(), set()
    pairs = [(sim_matrix[i,j], i, j) 
             for i in range(len(predicted_steps)) 
             for j in range(len(reference_steps)) 
             if sim_matrix[i,j] >= THRESHOLD]
    pairs.sort(reverse=True)
    
    for score, i, j in pairs:
        if i not in matched_pred and j not in matched_ref:
            matched_pred.add(i)
            matched_ref.add(j)
    
    tp = len(matched_pred)
    precision = tp / max(len(predicted_steps), 1)
    recall = tp / max(len(reference_steps), 1)
    
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

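# CPR (Causal Process Reward): the step bonus is paid in full only when the
# answer is correct; a wrong answer keeps just lambda_penalty of it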
def compute_cpr(answer_correct, f1_step, aw=0.65, sw=0.35, lambda_penalty=0.3):
    if answer_correct:
        return aw * 1.0 + sw * f1_step
    else:
        return sw * f1_step * lambda_penalty
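
To connect the prompt and the reward above, the following sketch parses a model response in the requested JSON schema and contrasts an additive reward with the CPR formula. The helper names parse_response and additive_reward are hypothetical, the additive baseline's weights are assumed to match the CPR defaults, and the sample response is invented for illustration.

```python
import json


def parse_response(raw):
    """Return (reasoning_steps, answer); fall back to ([], "") on malformed output."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return [], ""
    if not isinstance(data, dict):
        return [], ""
    steps = data.get("reasoning_steps", [])
    if not isinstance(steps, list):
        steps = []
    # Keep only non-empty string steps.
    steps = [s.strip() for s in steps if isinstance(s, str) and s.strip()]
    return steps, str(data.get("answer", "")).strip()


def additive_reward(answer_correct, f1_step, aw=0.65, sw=0.35):
    """Assumed additive baseline: step credit is paid regardless of correctness."""
    return aw * float(answer_correct) + sw * f1_step


def cpr_reward(answer_correct, f1_step, aw=0.65, sw=0.35, lambda_penalty=0.3):
    """Same formula as compute_cpr above, repeated so this block runs on its own."""
    if answer_correct:
        return aw + sw * f1_step
    return sw * f1_step * lambda_penalty


raw = '{"reasoning_steps": ["The bar for 2021 is the tallest.", "Its value is 40."], "answer": "B"}'
steps, answer = parse_response(raw)
print(steps, answer)

# A wrong answer with well-matched steps earns far less under CPR.
print(additive_reward(False, 0.9), cpr_reward(False, 0.9))
```

With a wrong answer, the additive form still pays the full step credit, whereas the CPR form keeps only 30% of it, which is the coupling between correctness and reasoning that the reward is meant to enforce.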

Terminology

MLLM: Multimodal Large Language Model. A large language model that understands images as well as text, such as GPT-4o or Gemini, which can look at a photo and answer questions about it.

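Match F1: A score measuring how much the model's generated reasoning steps overlap with the reference reasoning steps. It drops both when the model says too little (low recall) and when it says irrelevant things (low precision).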

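Cherry-picking: When a model states only the steps it is sure of and omits the intermediate process. Like presenting only the good results in a talk, the model outputs the reasoning it is confident in and skips uncertain steps.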

GRPO: Group Relative Policy Optimization. A reinforcement learning method that trains a model using relative comparisons within a group of sampled outputs, without a separate value model. Proposed by DeepSeek.

LIS ratio: Longest Increasing Subsequence ratio. A metric for whether reasoning steps appear in the correct order, similar to checking that a recipe follows 'prep the ingredients → stir-fry → plate'.

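CPR: Causal Process Reward. A multiplicative reward design that pays the full reasoning-step bonus only when the answer is correct. Unlike simply adding scores, a wrong answer forfeits most of the reasoning bonus (it is scaled by 0.3), which discourages the model from skipping reasoning.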

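Delphi method: A methodology in which several experts give opinions independently and iterate until they converge on a consensus. In this paper, four AI models each generate reasoning steps, and clustering turns them into consensus reference steps.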

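semantic clustering: A technique that groups sentences with similar meaning. For example, 'The puppy is running' and 'The dog is running' fall into the same cluster, which is used for deduplication.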
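
The "relative comparison within a group" in the GRPO entry above can be illustrated with the commonly used group-normalized advantage: each sampled response's reward minus the group mean, divided by the group standard deviation. This is a minimal sketch; the use of the population standard deviation and the eps value are assumptions, and real implementations differ in these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four responses to one prompt, scored by some reward function such as CPR.
rewards = [1.0, 0.0, 1.0, 0.0]
print(group_relative_advantages(rewards))
```

Responses that beat their group's average receive a positive advantage and are reinforced; if every response in the group earns the same reward, all advantages are zero and the group contributes no learning signal.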

Original Abstract

We introduce **CRYSTAL** (**C**lear **R**easoning via **Y**ielded **S**teps, **T**raceability and **L**ogic), a diagnostic benchmark with 6,372 instances that evaluates multimodal reasoning through verifiable intermediate steps. We propose two complementary metrics: *Match F1*, which scores step-level precision and recall via semantic similarity matching, and *Ordered Match F1*, which further penalizes disordered reasoning chains. References are constructed through a Delphi-inspired pipeline where four independent MLLMs generate trajectories, aggregated via semantic clustering and validated through human quality gates. Evaluation of 20 MLLMs, including commercial frontier systems not used during benchmark construction, reveals systematic failures invisible to accuracy: universal cherry-picking (precision far exceeds recall), non-monotonic scaling trade-offs, and disordered reasoning where no competitive model preserves more than 60% of matched steps in correct order. Beyond evaluation, we propose the **Causal Process Reward (CPR)**, a multiplicative reward that couples answer correctness with step-level alignment, and **CPR-Curriculum**, which progressively increases reasoning difficulty during training. CPR-Curriculum achieves +32% Match F1 via GRPO where additive reward strategies fail, improving reasoning without manual step annotation.