CHiL(L)Grader: Calibrated Human-in-the-Loop Short-Answer Grading
TL;DR Highlight
A Human-in-the-Loop grading system that auto-grades only when the LLM is confident, and routes uncertain answers to teachers.
Who Should Read
Devs building LLM-based auto-grading or evaluation pipelines. Especially EdTech backend devs designing workflows that trigger human review based on AI prediction confidence.
Core Mechanics
- LLMs are overconfident, often assigning ~0.99 confidence even to wrong predictions. Temperature scaling (a post-hoc technique that calibrates model output probabilities) reduces ECE by up to 65%, producing reliable confidence scores
- A confidence threshold (τ) gates routing: predictions above τ are auto-graded, the rest go to teachers — on the DAMI dataset, the top 35% of predictions reach QWK 0.882 (expert level)
- A +0.347 QWK gap is confirmed between auto-accepted predictions (QWK 0.882) and rejected predictions (QWK 0.535) — the confidence gate effectively separates good predictions from bad ones
- Iterative LoRA fine-tuning with teacher correction data + replay buffer (prevents forgetting by replaying past data) → adapts to new question types while maintaining existing performance
- Removing replay buffer causes catastrophic forgetting: QWK collapses to 0.025 — replay buffer is essential
- With Qwen-2.5-7B-Instruct as the base model, grade bias was nearly absent (average +0.03), making it the only deployable candidate; Llama-3.1-8B was eliminated for over-grading by 1.87 points on average
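The replay-buffer step above can be sketched as a simple data-mixing routine. This is a hypothetical helper, not code from the paper; the sampling ratio `replay_ratio` and the function name are assumptions for illustration:

```python
import random

def build_finetune_batch(new_corrections, past_data, replay_ratio=0.5, seed=0):
    """Mix teacher corrections with replayed past examples.

    replay_ratio is a hypothetical knob: the fraction of the final batch
    drawn from past training data to guard against catastrophic forgetting.
    """
    rng = random.Random(seed)
    # Number of replay examples needed so they make up replay_ratio of the batch
    n_replay = int(len(new_corrections) * replay_ratio / (1 - replay_ratio))
    replay = rng.sample(past_data, min(n_replay, len(past_data)))
    batch = list(new_corrections) + replay
    rng.shuffle(batch)
    return batch
```

The mixed batch would then feed a LoRA fine-tuning step (e.g. via a library such as peft), leaving the base model weights frozen.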
Evidence
- After 1 HiL cycle on DAMI dataset: QWK 0.458 → 0.882 (+0.424 improvement), 35.1% auto-graded
- Temperature scaling reduces ECE 65% (DAMI: 0.270 → 0.094), EngSAF 53% (0.097 → 0.046)
- QWK gap of +0.347 between auto-accepted and rejected predictions (0.882 vs 0.535); a 3.1x difference in MAE
- Zero-shot QWK 0.289 vs CHiL(L)Grader QWK 0.882 (~3x improvement), far exceeding Few-shot k=5 (0.603) and RAG (0.443)
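QWK (quadratic weighted kappa), the agreement metric quoted throughout, can be computed from scratch. This sketch follows the standard definition with weights (i − j)²; it is not code from the paper:

```python
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred, n_classes):
    """QWK = 1 - (weighted observed disagreement / weighted expected
    disagreement), with quadratic weights (i - j)^2 / (K - 1)^2."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    O = np.zeros((n_classes, n_classes))           # observed confusion matrix
    for t, p in zip(y_true, y_pred):
        O[t, p] += 1
    i, j = np.indices((n_classes, n_classes))
    W = (i - j) ** 2 / (n_classes - 1) ** 2        # quadratic disagreement weights
    hist_true = O.sum(axis=1)
    hist_pred = O.sum(axis=0)
    E = np.outer(hist_true, hist_pred) / O.sum()   # expected matrix under independence
    return 1.0 - (W * O).sum() / (W * E).sum()
```

Perfect agreement yields 1.0; chance-level agreement yields 0.0, which is why QWK 0.80+ is treated as expert-level consistency.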
How to Apply
- Add temperature scaling to an existing LLM grading/classification pipeline to get calibrated confidence, set threshold τ, and implement a routing layer that sends low-confidence cases to a human review queue
- Combine teacher/operator-corrected data with a replay buffer (a sample of past training data) and periodically fine-tune only the LoRA adapters — this builds a continual-learning loop that adapts to new patterns without full model retraining
- Confidence threshold τ works as an adjustable knob without retraining — lower τ for higher automation rate, raise τ for higher accuracy (EngSAF example: τ=0.5 gives 90% coverage, τ=0.8 gives 44.6% coverage but QWK 0.840)
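The τ knob in the last bullet can be explored with a simple sweep over held-out predictions. The helper below is a sketch (not from the paper); it reports plain accuracy in the accepted subset, whereas the paper uses QWK:

```python
import numpy as np

def coverage_accuracy_sweep(confidences, correct, taus):
    """For each threshold tau, report (tau, coverage, accuracy):
    coverage = fraction auto-accepted, accuracy = quality within
    the accepted subset."""
    confidences = np.asarray(confidences)
    correct = np.asarray(correct, dtype=float)
    rows = []
    for tau in taus:
        accepted = confidences >= tau
        coverage = accepted.mean()
        acc = correct[accepted].mean() if accepted.any() else float("nan")
        rows.append((tau, coverage, acc))
    return rows
```

A reasonable policy is to pick the smallest τ whose accepted-subset quality meets the target (QWK >= 0.80 in the paper), maximizing automation at that quality bar.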
Code Example
# CHiL(L)Grader Core Logic: Temperature Scaling + Selective Prediction
import torch
import torch.nn.functional as F
from scipy.optimize import minimize_scalar
import numpy as np
def compute_ece(confidences, accuracies, n_bins=10):
    """Compute Expected Calibration Error (ECE)."""
    bins = np.linspace(0, 1, n_bins + 1)
    ece = 0.0
    for i in range(n_bins):
        mask = (confidences >= bins[i]) & (confidences < bins[i+1])
        if mask.sum() == 0:
            continue
        bin_acc = accuracies[mask].mean()
        bin_conf = confidences[mask].mean()
        ece += (mask.sum() / len(confidences)) * abs(bin_acc - bin_conf)
    return ece
def find_optimal_temperature(logits, labels, n_bins=10):
    """Search for the temperature T that minimizes ECE."""
    def ece_loss(T):
        scaled_probs = F.softmax(torch.tensor(logits) / T, dim=-1).numpy()
        confidences = scaled_probs.max(axis=-1)
        predictions = scaled_probs.argmax(axis=-1)
        accuracies = (predictions == labels).astype(float)
        return compute_ece(confidences, accuracies, n_bins)
    result = minimize_scalar(ece_loss, bounds=(0.1, 2.0), method='bounded')
    return result.x
def selective_predict(logits, threshold_tau, optimal_T):
    """
    Selective prediction based on calibrated confidence.
    Returns: (grade, confidence, should_auto_accept)
    """
    scaled_probs = F.softmax(torch.tensor(logits) / optimal_T, dim=-1)
    confidence = scaled_probs.max().item()
    predicted_grade = scaled_probs.argmax().item()
    auto_accept = confidence >= threshold_tau
    return predicted_grade, confidence, auto_accept
# Prompt template (Basic)
GRADING_PROMPT = """System: You are an Automated Short Answer Grader (ASAG).
Return ONLY a strict JSON with keys: "grade" (int), "max_grade" (int).
User:
Question: {question}
Answer: {answer}
Target Scale: 0 to {max_grade}"""
# Usage example
# T_star = find_optimal_temperature(cal_logits, cal_labels)
# grade, conf, accept = selective_predict(logits, threshold_tau=0.4, optimal_T=T_star)
# if accept:
#     final_grades.append(grade)
# else:
#     human_review_queue.append((question, answer, grade))  # route to teacher review queue
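The GRADING_PROMPT above asks the model for strict JSON, but in practice a chat model's reply may wrap it in extra text. A tolerant parser helps; this is a sketch (not part of the paper), and the fallback-to-None convention for routing failures to human review is an assumption:

```python
import json
import re

def parse_grade(reply, max_grade):
    """Extract the first JSON object from the model reply and return a
    clamped integer grade, or None if parsing fails (send to human review)."""
    match = re.search(r"\{.*?\}", reply, re.DOTALL)  # first {...} span, non-greedy
    if not match:
        return None
    try:
        obj = json.loads(match.group(0))
        grade = int(obj["grade"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return None
    return max(0, min(grade, max_grade))  # clamp to the target scale
```

Treating any unparseable or out-of-range reply as a human-review case keeps the calibrated-confidence gate as the only path to auto-acceptance.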
Original Abstract
Scaling educational assessment with large language models requires not just accuracy, but the ability to recognize when predictions are trustworthy. Instruction-tuned models tend to be overconfident, and their reliability deteriorates as curricula evolve, making fully autonomous deployment unsafe in high-stakes settings. We introduce CHiL(L)Grader, the first automated grading framework that incorporates calibrated confidence estimation into a human-in-the-loop workflow. Using post-hoc temperature scaling, confidence-based selective prediction, and continual learning, CHiL(L)Grader automates only high-confidence predictions while routing uncertain cases to human graders, and adapts to evolving rubrics and unseen questions. Across three short-answer grading datasets, CHiL(L)Grader automatically scores 35-65% of responses at expert-level quality (QWK >= 0.80). A QWK gap of 0.347 between accepted and rejected predictions confirms the effectiveness of the confidence-based routing. Each correction cycle strengthens the model's grading capability as it learns from teacher feedback. These results show that uncertainty quantification is key for reliable AI-assisted grading.