CHiL(L)Grader: A Calibrated Human-in-the-Loop Framework for Automated Short-Answer Grading

CHiL(L)Grader: Calibrated Human-in-the-Loop Short-Answer Grading

Mar 12, 2026 • Pranav Raikote, Korbinian Randl, Ioanna Miliou +2

TL;DR Highlight

A human-in-the-loop grading system that grades automatically only when the LLM is confident and hands uncertain answers to a teacher.

Who Should Read

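Developers building LLM-based automated grading or evaluation pipelines, especially EdTech backend developers designing workflows that trigger human review based on the confidence of AI predictions.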

Core Mechanics

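LLMs suffer from overconfidence, reporting confidence close to 0.99 even on wrong predictions. Temperature scaling (a post-hoc technique that recalibrates the model's output probabilities) reduces ECE by up to 65%, yielding confidence scores that can be trusted.
Predictions whose confidence meets the threshold τ are graded automatically and the rest are routed to a teacher. On the DAMI dataset, the 35% most confident predictions reach QWK 0.882 (expert level).
The gap of +0.347 QWK between auto-accepted predictions (QWK 0.882) and rejected predictions (QWK 0.535) shows that the confidence gate does separate good predictions from bad ones.
LoRA fine-tuning is repeated on teacher corrections plus a replay buffer (past data that is retrained to prevent forgetting), so the model adapts to new question types while retaining its existing performance.
Removing the replay buffer causes catastrophic forgetting (loss of existing knowledge when training on new data) and QWK collapses to 0.025, confirming that the replay buffer is essential.
Among base models, Qwen-2.5-7B-Instruct showed almost no grade bias (mean +0.03) and was the only deployable candidate; Llama-3.1-8B was ruled out because it over-graded by 1.87 points on average.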
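
The replay-buffer step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the helper names `build_training_set` and `update_replay_buffer`, the `replay_ratio` and `max_size` parameters, and the record format are all assumptions introduced here. The source only states that teacher corrections are combined with a replay buffer before each LoRA fine-tuning round.

```python
import random

def build_training_set(corrections, replay_buffer, replay_ratio=0.5, seed=0):
    """Mix new teacher corrections with a random sample of past examples.

    replay_ratio is a hypothetical knob: the number of replayed examples
    per new correction (0.5 means one old example for every two new ones).
    """
    rng = random.Random(seed)
    n_replay = min(len(replay_buffer), round(len(corrections) * replay_ratio))
    replayed = rng.sample(replay_buffer, n_replay)
    mixed = list(corrections) + replayed
    rng.shuffle(mixed)  # avoid feeding all new examples first
    return mixed

def update_replay_buffer(replay_buffer, corrections, max_size=1000, seed=0):
    """Add the new corrections, then cap the buffer by random eviction."""
    rng = random.Random(seed)
    merged = list(replay_buffer) + list(corrections)
    if len(merged) > max_size:
        merged = rng.sample(merged, max_size)
    return merged

# Toy records (made up): 10 past examples, 4 fresh teacher corrections
old = [{"id": f"old-{i}", "grade": i % 3} for i in range(10)]
new = [{"id": f"new-{i}", "grade": 2} for i in range(4)]
train = build_training_set(new, old, replay_ratio=0.5)
```

The mixed list `train` would then be passed to whatever LoRA training routine the pipeline uses. Sampling uniformly from the buffer is only one possible policy; the source does not say how replay examples are selected.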

Evidence

On DAMI, after one HiL cycle, QWK rises from 0.458 to 0.882 (+0.424), with 35.1% of all responses graded automatically.
Temperature scaling reduces ECE by 65% on DAMI (0.270 → 0.094) and by 53% on EngSAF (0.097 → 0.046).
QWK gap between auto-accepted and rejected predictions: +0.347 (0.882 vs 0.535), with a 3.1x difference in MAE.
CHiL(L)Grader reaches QWK 0.882 against a zero-shot QWK of 0.289, roughly a 3x improvement, and is also well above few-shot k=5 (0.603) and RAG (0.443).
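
The headline percentages follow directly from the raw figures listed above. The short check below recomputes them; the helper `relative_reduction` is just an illustrative name, and every number is copied from the Evidence list.

```python
def relative_reduction(before, after):
    """Fraction by which a metric dropped, relative to its starting value."""
    return (before - after) / before

ece_dami = relative_reduction(0.270, 0.094)     # ECE reduction on DAMI
ece_engsaf = relative_reduction(0.097, 0.046)   # ECE reduction on EngSAF
qwk_gain = 0.882 - 0.458                        # gain after one HiL cycle
qwk_gap = 0.882 - 0.535                         # accepted vs rejected
zero_shot_ratio = 0.882 / 0.289                 # CHiL(L)Grader vs zero-shot

print(f"DAMI ECE reduction:   {ece_dami:.1%}")
print(f"EngSAF ECE reduction: {ece_engsaf:.1%}")
print(f"QWK gain: {qwk_gain:.3f}, QWK gap: {qwk_gap:.3f}")
print(f"Ratio to zero-shot: {zero_shot_ratio:.2f}x")
```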

How to Apply

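Add temperature scaling to an existing LLM grading or classification pipeline to obtain calibrated confidence, then implement a routing layer that sets a threshold τ and sends low-confidence cases to a human review queue.
Combine data corrected by teachers or operators with a replay buffer (which includes a portion of past training data) and periodically fine-tune only the LoRA adapter. This gives a continual learning loop that adapts to new patterns without retraining the full model.
Treat the confidence threshold τ as a knob that can be adjusted without retraining: lower τ to raise the automation rate, raise τ to improve accuracy (EngSAF example: 90% coverage at τ=0.5; 44.6% coverage at τ=0.8, but with QWK 0.840).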
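
One way to choose τ is to sweep candidate thresholds over a held-out set and read off the coverage and quality at each. The sketch below is an assumption-laden illustration: `sweep_thresholds` is a hypothetical helper, the data is made up, and it uses exact-match accuracy on the accepted subset as a simple stand-in for the QWK the source reports.

```python
def sweep_thresholds(confidences, correct, thresholds):
    """Return (tau, coverage, accuracy_on_accepted) for each threshold.

    coverage is the fraction of answers that would be auto-graded;
    accuracy is None when no prediction clears the threshold.
    """
    rows = []
    n = len(confidences)
    for tau in thresholds:
        accepted = [ok for c, ok in zip(confidences, correct) if c >= tau]
        coverage = len(accepted) / n
        accuracy = sum(accepted) / len(accepted) if accepted else None
        rows.append((tau, coverage, accuracy))
    return rows

# Toy held-out data (made up for illustration): calibrated confidence
# and whether the predicted grade matched the teacher's grade (1 or 0)
confidences = [0.95, 0.90, 0.85, 0.70, 0.65, 0.55, 0.45, 0.40]
correct = [1, 1, 1, 1, 0, 1, 0, 0]

rows = sweep_thresholds(confidences, correct, [0.5, 0.8])
for tau, coverage, accuracy in rows:
    print(f"tau={tau}: coverage={coverage:.1%}, accuracy={accuracy:.3f}")
```

On this toy data, raising τ from 0.5 to 0.8 halves coverage while accuracy on the accepted subset rises, which is the same trade-off the EngSAF figures describe.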

Code Example

# CHiL(L)Grader core logic: temperature scaling + selective prediction
import torch
import torch.nn.functional as F
from scipy.optimize import minimize_scalar
import numpy as np

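def compute_ece(confidences, accuracies, n_bins=10):
    """Expected Calibration Error over equal-width confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    accuracies = np.asarray(accuracies, dtype=float)
    bins = np.linspace(0, 1, n_bins + 1)
    ece = 0.0
    for i in range(n_bins):
        # Bins are (lo, hi] so a confidence of exactly 1.0 lands in the last bin
        mask = (confidences > bins[i]) & (confidences <= bins[i+1])
        if mask.sum() == 0:
            continue
        bin_acc = accuracies[mask].mean()
        bin_conf = confidences[mask].mean()
        # Weight each bin's |accuracy - confidence| gap by its share of samples
        ece += (mask.sum() / len(confidences)) * abs(bin_acc - bin_conf)
    return float(ece)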

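def find_optimal_temperature(logits, labels, n_bins=10):
    """Search for the temperature T that minimizes ECE on a calibration set."""
    # Convert once, outside the objective, so each evaluation only rescales
    logits_t = torch.as_tensor(np.asarray(logits), dtype=torch.float32)
    labels = np.asarray(labels)

    def ece_loss(T):
        scaled_probs = F.softmax(logits_t / float(T), dim=-1).numpy()
        confidences = scaled_probs.max(axis=-1)
        predictions = scaled_probs.argmax(axis=-1)
        accuracies = (predictions == labels).astype(float)
        return compute_ece(confidences, accuracies, n_bins)

    # Bounded scalar search; widen the range if the result sits at a bound
    result = minimize_scalar(ece_loss, bounds=(0.1, 2.0), method='bounded')
    return result.x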

def selective_predict(logits, threshold_tau, optimal_T):
    """
    Selective prediction for a single answer, based on calibrated confidence.
    Returns: (grade, confidence, should_auto_accept)
    """
    logits_t = torch.as_tensor(logits, dtype=torch.float32)
    scaled_probs = F.softmax(logits_t / optimal_T, dim=-1)
    confidence = scaled_probs.max().item()
    predicted_grade = scaled_probs.argmax().item()
    # Auto-accept only when the calibrated confidence clears the threshold
    auto_accept = confidence >= threshold_tau
    return predicted_grade, confidence, auto_accept

# Prompt template (Basic)
GRADING_PROMPT = """System: You are an Automated Short Answer Grader (ASAG). 
Return ONLY a strict JSON with keys: "grade" (int), "max_grade" (int).

User:
Question: {question}
Answer: {answer}
Target Scale: 0 to {max_grade}"""

# Usage example
# T_star = find_optimal_temperature(cal_logits, cal_labels)
# grade, conf, accept = selective_predict(logits, threshold_tau=0.4, optimal_T=T_star)
# if accept:
#     final_grades.append(grade)
# else:
#     human_review_queue.append((question, answer, grade))  # add to the teacher review queue
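
To make the effect of the temperature concrete, the following sketch applies a softmax at several temperatures to one made-up logit vector. It uses only the standard library; `softmax_with_temperature` and the logits are illustrative assumptions, not taken from the paper.

```python
import math

def softmax_with_temperature(logits, T):
    """Softmax over logits / T, computed in a numerically stable way."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max before exponentiating
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for three candidate grades
logits = [4.0, 1.0, 0.0]

# Top-class probability (the "confidence") at each temperature
top = {T: max(softmax_with_temperature(logits, T)) for T in (0.5, 1.0, 2.0)}
for T, conf in top.items():
    print(f"T={T}: confidence={conf:.3f}")
```

The predicted grade (the argmax) is the same at every temperature; only the confidence attached to it changes. That is why temperature scaling can repair calibration without altering which grade the model picks.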

Terminology

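QWK: Short for Quadratic Weighted Kappa. A metric of how closely two graders (here, the AI and a human) agree. A value of 0.8 or higher is considered on par with a human grader.

ECE: Expected Calibration Error. Measures whether a model that says it is "80% confident" is in fact right 80% of the time. The closer to 0, the more accurate the confidence.

Temperature Scaling: A technique that calibrates confidence by dividing the model's logits by a single number (T) before the softmax. T>1 flattens the distribution and lowers the top probability, which counters overconfidence; T<1 sharpens it and raises the top probability.

Selective Prediction: An approach in which the model's prediction is accepted only when the model is confident, and uncertain cases are handed to a human. It lets you tune the trade-off between accuracy and automation rate.

Catastrophic Forgetting: The phenomenon in which a model abruptly forgets what it learned before when retrained on new data. In human terms, it resembles forgetting your native language while learning a new one.

LoRA: A technique that trains only small added adapter layers instead of retraining the whole model. It adapts the model to a specific task at a much lower cost.

Continual Learning: A training approach that updates the model incrementally whenever new data arrives. The model keeps adapting to new patterns without being retrained from scratch.

Human-in-the-Loop: A setup in which the AI does not decide everything on its own and asks for human judgment on uncertain cases. The AI and the human work together to produce more trustworthy results.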
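
The QWK metric from the terminology above can be computed directly from two lists of integer grades. This is a plain-Python sketch of the standard quadratic weighted kappa formula; the function name and the toy grades are assumptions for illustration. In practice a library routine such as scikit-learn's `cohen_kappa_score(..., weights="quadratic")` would typically be used instead.

```python
def quadratic_weighted_kappa(rater_a, rater_b, n_grades):
    """QWK for integer grades in 0..n_grades-1 (requires n_grades >= 2)."""
    n = len(rater_a)
    # Observed confusion matrix: observed[i][j] counts (a=i, b=j) pairs
    observed = [[0.0] * n_grades for _ in range(n_grades)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n_grades))
              for j in range(n_grades)]
    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected by chance
    for i in range(n_grades):
        for j in range(n_grades):
            weight = (i - j) ** 2 / (n_grades - 1) ** 2
            expected = hist_a[i] * hist_b[j] / n
            num += weight * observed[i][j]
            den += weight * expected
    if den == 0:
        # Both raters used one identical grade throughout: full agreement
        return 1.0
    return 1.0 - num / den

# Toy grades on a 0-2 scale (made up): one disagreement out of six
human = [0, 1, 2, 2, 1, 0]
model = [0, 1, 2, 1, 1, 0]
qwk = quadratic_weighted_kappa(human, model, n_grades=3)
print(f"QWK = {qwk:.3f}")
```

The quadratic weight penalizes a two-point grading error four times as heavily as a one-point error, which is why QWK suits ordinal grade scales better than plain accuracy.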

Original Abstract

Scaling educational assessment with large language models requires not just accuracy, but the ability to recognize when predictions are trustworthy. Instruction-tuned models tend to be overconfident, and their reliability deteriorates as curricula evolve, making fully autonomous deployment unsafe in high-stakes settings. We introduce CHiL(L)Grader, the first automated grading framework that incorporates calibrated confidence estimation into a human-in-the-loop workflow. Using post-hoc temperature scaling, confidence-based selective prediction, and continual learning, CHiL(L)Grader automates only high-confidence predictions while routing uncertain cases to human graders, and adapts to evolving rubrics and unseen questions. Across three short-answer grading datasets, CHiL(L)Grader automatically scores 35-65% of responses at expert-level quality (QWK >= 0.80). A QWK gap of 0.347 between accepted and rejected predictions confirms the effectiveness of the confidence-based routing. Each correction cycle strengthens the model's grading capability as it learns from teacher feedback. These results show that uncertainty quantification is key for reliable AI-assisted grading.