FUSE: Ensembling Verifiers with Zero Labeled Data
TL;DR Highlight
FUSE automatically ensembles multiple LLM verification models without ground truth labels, achieving Best-of-N performance comparable to semi-supervised learning.
Who Should Read
ML engineers deploying test-time scaling or Best-of-N sampling in production, and developers seeking to improve response quality by combining multiple reward models without access to labeled data.
Core Mechanics
- FUSE automatically weights and ensembles scores from multiple verifiers (LLM judgment models, reward models, etc.) without ground truth labels, outperforming simple averaging or majority voting in accurately selecting the best response.
- The core idea is to automatically transform verifier scores to maximize satisfaction of Triplet Conditional Independence (TCI, the assumption that the outputs of any three verifiers are independent given the ground truth) and then estimate each verifier's accuracy using statistical moment techniques.
- Because real-world LLM verifiers often violate TCI, directly applying the existing Jaffe et al. (2015) algorithm performs worse than a naive ensemble in 7 out of 10 settings. FUSE resolves this with its score-transformation step.
- Pseudo-labels are created from the estimated verifier accuracies and used to train an ensemble function like logistic regression to select the final response, eliminating the need for the stronger independence assumption of Joint Conditional Independence (JCI).
- Thanks to its query-conditional mode, which operates independently for each query, FUSE outperforms semi-supervised learning-based WEAVER even in heterogeneous environments with mixed domains, with FUSE’s advantage increasing when labels are limited to a specific domain.
- FUSE remains effective even with state-of-the-art models such as Gemini 3 Pro and GPT-5 on the still-unsolved Humanity's Last Exam benchmark, and has potential applications in data filtering, benchmark auditing, and unsupervised model ranking.
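The moment-based accuracy estimation described above can be sketched as follows. This is an illustrative reconstruction of the classic triplet identity used in the Jaffe et al. line of work, not the paper's actual implementation; the function name is hypothetical, and scores are assumed to have already been binarized to ±1.

```python
import numpy as np

def triplet_accuracies(votes):
    """Estimate each verifier's correlation with the unseen true label.

    Under triplet conditional independence, E[s_i s_j] = mu_i * mu_j,
    where mu_i = E[s_i * y] is verifier i's correlation with the label y,
    so mu_i = sqrt(E[s_i s_j] * E[s_i s_k] / E[s_j s_k]).

    votes: (n_samples, 3) array with entries in {-1, +1}.
    Returns the three estimated correlations; for balanced classes,
    accuracy = (1 + mu) / 2.
    """
    s = votes.astype(float)
    second_moment = lambda a, b: float(np.mean(s[:, a] * s[:, b]))
    mus = []
    for i in range(3):
        j, k = [x for x in range(3) if x != i]
        mus.append(np.sqrt(second_moment(i, j) * second_moment(i, k)
                           / second_moment(j, k)))
    return np.array(mus)
```

On synthetic votes generated by flipping a hidden label with known probabilities, the recovered correlations match `1 - 2p` closely, which is the sanity check that makes the identity useful with zero labels.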
Evidence
- On GPQA Diamond (70B generator, Best-of-100), FUSE reached 64.4% vs. 64.1% for WEAVER using 5% of labels, matching semi-supervised performance without any labels; overall, FUSE beat the baseline in 27 of 40 comparisons.
- On Humanity's Last Exam (649 questions, Best-of-50), FUSE scored 54.3%, surpassing Pass@1 (52.1%), WEAVER (51.2%), and the naive ensemble (51.4%). This was the only benchmark where the naive ensemble performed worse than random selection.
- On IMO Shortlist (123 questions, Best-of-50), FUSE achieved 63.8%, outperforming WEAVER (62.1%), semi-supervised logistic regression (60.2%), and the oracle best single verifier (59.7%).
- On the Saad-Falcon et al. dataset (8B/70B, 10 settings), FUSE improved on the naive ensemble by +2.3 to +12.3 percentage points, and by up to +17.0 points (MMLU Pro, 70B) over majority vote.
How to Apply
- If you run a Best-of-N pipeline with multiple reward models, min-max normalize each model's scores to the range [-1, 1] and apply FUSE to estimate ensemble weights automatically, with no label collection.
- For mixed-domain query sets (e.g., math + coding + commonsense), use FUSE's query-conditional mode so verifier weights can vary by domain; this outperforms semi-supervised methods trained on a single labeled set.
- For synthetic data selection or RLHF data filtering, where high-quality responses must be chosen without ground-truth labels, use multiple LLM judges as verifiers, ensemble their scores with FUSE, and keep only the top responses as training data.
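The min-max normalization step mentioned above can be sketched as below; `normalize_pm1` is a hypothetical helper name, and normalization is done per verifier (per column) over the pooled responses.

```python
import numpy as np

def normalize_pm1(scores):
    """Min-max normalize each verifier's scores (columns) to [-1, 1].

    scores: (n_responses, n_verifiers) array of raw verifier scores.
    A small epsilon guards against division by zero on constant columns.
    """
    lo = scores.min(axis=0, keepdims=True)
    hi = scores.max(axis=0, keepdims=True)
    return 2.0 * (scores - lo) / np.maximum(hi - lo, 1e-12) - 1.0
```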
Code Example
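A minimal sketch of the selection step described under Core Mechanics, assuming per-verifier accuracies have already been estimated (e.g. via the triplet moments): combine binary verifier votes with a naive-Bayes posterior under conditional independence and a uniform prior, then pick the response most likely to be correct. Function names and the uniform-prior assumption are illustrative, not the paper's API; the full FUSE pipeline additionally transforms scores and trains an ensemble function such as logistic regression on the resulting pseudo-labels.

```python
import numpy as np

def posterior_correct(votes, acc):
    """P(response correct | verifier votes) under conditional independence.

    votes: (n_responses, n_verifiers) binary array (1 = verifier says correct).
    acc:   (n_verifiers,) assumed verifier accuracies in (0.5, 1).
    With a uniform prior, the posterior is a sigmoid of the summed
    log-likelihood ratios of the individual votes.
    """
    acc = np.asarray(acc, dtype=float)
    llr = (votes * np.log(acc / (1 - acc))
           + (1 - votes) * np.log((1 - acc) / acc))
    return 1.0 / (1.0 + np.exp(-llr.sum(axis=1)))

def best_of_n(votes, acc):
    """Index of the response the ensemble considers most likely correct."""
    return int(np.argmax(posterior_correct(votes, acc)))
```

Note how a single accurate verifier (0.9) can outvote two mediocre ones: a response endorsed only by the strong verifier still gets the highest posterior, which is exactly the behavior simple averaging or majority voting misses.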
Terminology
Related Papers
Can LLMs model real-world systems in TLA+?
A benchmark study that systematically verifies that LLMs writing TLA+ specifications pass syntax checks well but achieve only around 46% behavioral conformance with the real system, illustrating the practical limits of AI-based formal verification.
Natural Language Autoencoders: Turning Claude's Thoughts into Text
Anthropic released NLA, a technique that converts an LLM's internal numeric vectors (activations) into directly readable natural language, a new advance in interpretability research into what the AI is actually thinking.
ProgramBench: Can language models rebuild programs from scratch?
A new benchmark measuring whether LLMs can reimplement real software such as FFmpeg, SQLite, and a PHP interpreter from scratch using only documentation; even the best model passed 95%+ of tests on just 3% of tasks.
MOSAIC-Bench: Measuring Compositional Vulnerability Induction in Coding Agents
Split the work into three tickets and Claude/GPT will happily write security-vulnerable code 53-86% of the time.
Refusal in Language Models Is Mediated by a Single Direction
Open-source chat models encode safety as a single vector direction, and removing it disables safety fine-tuning.
Show HN: A new benchmark for testing LLMs for deterministic outputs
Structured Output Benchmark assesses LLM JSON handling across seven metrics, revealing performance beyond schema compliance.
Related Resources
Original Abstract
Verification of model outputs is rapidly emerging as a key primitive for both training and real-world deployment of large language models (LLMs). In practice, this often involves using imperfect LLM judges and reward models since ground truth acquisition can be time-consuming and expensive. We introduce Fully Unsupervised Score Ensembling (FUSE), a method for improving verification quality by ensembling verifiers without access to ground truth correctness labels. The key idea behind FUSE is to control conditional dependencies between verifiers in a manner that improves the unsupervised performance of a class of spectral algorithms from the ensembling literature. Despite requiring zero ground truth labels, FUSE typically matches or improves upon semi-supervised alternatives in test-time scaling experiments with diverse sets of generator models, verifiers, and benchmarks. In particular, we validate our method on both conventional academic benchmarks such as GPQA Diamond and on frontier, unsaturated benchmarks such as Humanity's Last Exam and IMO Shortlist questions.