CHiL(L)Grader: Calibrated Human-in-the-Loop Short-Answer Grading
TL;DR Highlight
A Human-in-the-Loop grading system that auto-grades only when the LLM is confident, and routes uncertain answers to teachers.
Who Should Read
Devs building LLM-based auto-grading or evaluation pipelines. Especially EdTech backend devs designing workflows that trigger human review based on AI prediction confidence.
Core Mechanics
- LLMs are overconfident, often assigning ~0.99 confidence even to wrong predictions. Temperature scaling (a post-hoc technique that calibrates model output probabilities) reduces ECE by up to 65%, producing reliable confidence scores
- A confidence threshold (τ) gates routing: predictions above τ are auto-graded, the rest go to teachers — on the DAMI dataset, the top 35% of predictions reach QWK 0.882 (expert level)
- A +0.347 QWK gap is confirmed between auto-accepted predictions (QWK 0.882) and rejected predictions (QWK 0.535) — the confidence gate effectively separates good predictions from bad ones
- Iterative LoRA fine-tuning with teacher correction data + replay buffer (prevents forgetting by replaying past data) → adapts to new question types while maintaining existing performance
- Removing replay buffer causes catastrophic forgetting: QWK collapses to 0.025 — replay buffer is essential
- With Qwen-2.5-7B-Instruct as the base model, grade bias was nearly absent (average +0.03), making it the only deployable candidate; Llama-3.1-8B was eliminated for over-grading by 1.87 points on average
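The replay-buffer step above can be sketched as a simple data-mixing routine. This is a hypothetical helper, not code from the paper; the sampling ratio `replay_ratio` and the function name are assumptions for illustration:

```python
import random

def build_finetune_batch(new_corrections, past_data, replay_ratio=0.5, seed=0):
    """Mix teacher corrections with replayed past examples.

    replay_ratio is a hypothetical knob: the fraction of the final batch
    drawn from past training data to guard against catastrophic forgetting.
    """
    rng = random.Random(seed)
    # Number of replay examples needed so they make up replay_ratio of the batch
    n_replay = int(len(new_corrections) * replay_ratio / (1 - replay_ratio))
    replay = rng.sample(past_data, min(n_replay, len(past_data)))
    batch = list(new_corrections) + replay
    rng.shuffle(batch)
    return batch
```

The mixed batch would then feed a LoRA fine-tuning step (e.g. via a library such as peft), leaving the base model weights frozen.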
Evidence
- After 1 HiL cycle on DAMI dataset: QWK 0.458 → 0.882 (+0.424 improvement), 35.1% auto-graded
- Temperature scaling reduces ECE 65% (DAMI: 0.270 → 0.094), EngSAF 53% (0.097 → 0.046)
- QWK gap of +0.347 between auto-accepted and rejected predictions (0.882 vs 0.535); a 3.1x difference in MAE
- Zero-shot QWK 0.289 vs CHiL(L)Grader QWK 0.882 (~3x improvement), far exceeding Few-shot k=5 (0.603) and RAG (0.443)
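QWK (quadratic weighted kappa), the agreement metric quoted throughout, can be computed from scratch. This sketch follows the standard definition with weights (i − j)²; it is not code from the paper:

```python
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred, n_classes):
    """QWK = 1 - (weighted observed disagreement / weighted expected
    disagreement), with quadratic weights (i - j)^2 / (K - 1)^2."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    O = np.zeros((n_classes, n_classes))           # observed confusion matrix
    for t, p in zip(y_true, y_pred):
        O[t, p] += 1
    i, j = np.indices((n_classes, n_classes))
    W = (i - j) ** 2 / (n_classes - 1) ** 2        # quadratic disagreement weights
    hist_true = O.sum(axis=1)
    hist_pred = O.sum(axis=0)
    E = np.outer(hist_true, hist_pred) / O.sum()   # expected matrix under independence
    return 1.0 - (W * O).sum() / (W * E).sum()
```

Perfect agreement yields 1.0; chance-level agreement yields 0.0, which is why QWK 0.80+ is treated as expert-level consistency.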
How to Apply
- Add temperature scaling to an existing LLM grading/classification pipeline to get calibrated confidence, set threshold τ, and implement a routing layer that sends low-confidence cases to a human review queue
- Combine teacher/operator-corrected data with a replay buffer (a sample of past training data) and periodically fine-tune only the LoRA adapters — this builds a continual-learning loop that adapts to new patterns without full model retraining
- Confidence threshold τ works as an adjustable knob without retraining — lower τ for higher automation rate, raise τ for higher accuracy (EngSAF example: τ=0.5 gives 90% coverage, τ=0.8 gives 44.6% coverage but QWK 0.840)
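The τ knob in the last bullet can be explored with a simple sweep over held-out predictions. The helper below is a sketch (not from the paper); it reports plain accuracy in the accepted subset, whereas the paper uses QWK:

```python
import numpy as np

def coverage_accuracy_sweep(confidences, correct, taus):
    """For each threshold tau, report (tau, coverage, accuracy):
    coverage = fraction auto-accepted, accuracy = quality within
    the accepted subset."""
    confidences = np.asarray(confidences)
    correct = np.asarray(correct, dtype=float)
    rows = []
    for tau in taus:
        accepted = confidences >= tau
        coverage = accepted.mean()
        acc = correct[accepted].mean() if accepted.any() else float("nan")
        rows.append((tau, coverage, acc))
    return rows
```

A reasonable policy is to pick the smallest τ whose accepted-subset quality meets the target (QWK >= 0.80 in the paper), maximizing automation at that quality bar.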
Code Example
# CHiL(L)Grader Core Logic: Temperature Scaling + Selective Prediction
import torch
import torch.nn.functional as F
from scipy.optimize import minimize_scalar
import numpy as np
def compute_ece(confidences, accuracies, n_bins=10):
    """Compute Expected Calibration Error (ECE)."""
    bins = np.linspace(0, 1, n_bins + 1)
    ece = 0.0
    for i in range(n_bins):
        mask = (confidences >= bins[i]) & (confidences < bins[i+1])
        if mask.sum() == 0:
            continue
        bin_acc = accuracies[mask].mean()
        bin_conf = confidences[mask].mean()
        ece += (mask.sum() / len(confidences)) * abs(bin_acc - bin_conf)
    return ece
def find_optimal_temperature(logits, labels, n_bins=10):
    """Search for the temperature T that minimizes ECE."""
    def ece_loss(T):
        scaled_probs = F.softmax(torch.tensor(logits) / T, dim=-1).numpy()
        confidences = scaled_probs.max(axis=-1)
        predictions = scaled_probs.argmax(axis=-1)
        accuracies = (predictions == labels).astype(float)
        return compute_ece(confidences, accuracies, n_bins)
    result = minimize_scalar(ece_loss, bounds=(0.1, 2.0), method='bounded')
    return result.x
def selective_predict(logits, threshold_tau, optimal_T):
    """
    Selective prediction based on calibrated confidence.
    Returns: (grade, confidence, should_auto_accept)
    """
    scaled_probs = F.softmax(torch.tensor(logits) / optimal_T, dim=-1)
    confidence = scaled_probs.max().item()
    predicted_grade = scaled_probs.argmax().item()
    auto_accept = confidence >= threshold_tau
    return predicted_grade, confidence, auto_accept
# Prompt template (Basic)
GRADING_PROMPT = """System: You are an Automated Short Answer Grader (ASAG).
Return ONLY a strict JSON with keys: "grade" (int), "max_grade" (int).
User:
Question: {question}
Answer: {answer}
Target Scale: 0 to {max_grade}"""
# Usage example
# T_star = find_optimal_temperature(cal_logits, cal_labels)
# grade, conf, accept = selective_predict(logits, threshold_tau=0.4, optimal_T=T_star)
# if accept:
#     final_grades.append(grade)
# else:
#     human_review_queue.append((question, answer, grade))  # route to teacher review queue
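The GRADING_PROMPT above asks the model for strict JSON, but in practice a chat model's reply may wrap it in extra text. A tolerant parser helps; this is a sketch (not part of the paper), and the fallback-to-None convention for routing failures to human review is an assumption:

```python
import json
import re

def parse_grade(reply, max_grade):
    """Extract the first JSON object from the model reply and return a
    clamped integer grade, or None if parsing fails (send to human review)."""
    match = re.search(r"\{.*?\}", reply, re.DOTALL)  # first {...} span, non-greedy
    if not match:
        return None
    try:
        obj = json.loads(match.group(0))
        grade = int(obj["grade"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return None
    return max(0, min(grade, max_grade))  # clamp to the target scale
```

Treating any unparseable or out-of-range reply as a human-review case keeps the calibrated-confidence gate as the only path to auto-acceptance.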
Original Abstract
Scaling educational assessment with large language models requires not just accuracy, but the ability to recognize when predictions are trustworthy. Instruction-tuned models tend to be overconfident, and their reliability deteriorates as curricula evolve, making fully autonomous deployment unsafe in high-stakes settings. We introduce CHiL(L)Grader, the first automated grading framework that incorporates calibrated confidence estimation into a human-in-the-loop workflow. Using post-hoc temperature scaling, confidence-based selective prediction, and continual learning, CHiL(L)Grader automates only high-confidence predictions while routing uncertain cases to human graders, and adapts to evolving rubrics and unseen questions. Across three short-answer grading datasets, CHiL(L)Grader automatically scores 35-65% of responses at expert-level quality (QWK >= 0.80). A QWK gap of 0.347 between accepted and rejected predictions confirms the effectiveness of the confidence-based routing. Each correction cycle strengthens the model's grading capability as it learns from teacher feedback. These results show that uncertainty quantification is key for reliable AI-assisted grading.