Claude Opus 4.6 accuracy on BridgeBench hallucination test drops from 83% to 68%
TL;DR Highlight
Reports indicate that Claude Opus 4.6's accuracy on the BridgeBench hallucination benchmark fell by 15 percentage points, sparking debate in the community over whether this is a genuine performance degradation or simply noise.
Who Should Read
Backend/AI developers who use the Anthropic Claude API in production and are sensitive to changes in model quality, and developers interested in the reliability of LLM benchmarks.
Core Mechanics
- BridgeBench is a benchmark that measures hallucination in LLMs (the phenomenon where a model presents factually incorrect content as if it were true). Tests on Claude Opus 4.6 reported accuracy falling from 83% to 68%, a drop of 15 percentage points.
- The results were published by the BridgeMind AI team (@bridgemindai) on X (formerly Twitter), but the original tweet is inaccessible without JavaScript, making it difficult to verify the details.
- A 15-percentage-point difference is a large margin and hard to dismiss as mere noise, especially if the benchmark is designed to be run over multiple iterations.
- However, some methodological questions have been raised. The sample size and number of repetitions are not explicitly stated in the available information, raising the possibility that the results are based on a single run.
- LLMs are inherently non-deterministic (they can produce different outputs even with the same input), so it is difficult to conclude that model performance has actually deteriorated based on a single run.
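How much weight the 83% to 68% gap can carry depends on the number of test items, which the available information does not state. The sketch below applies a standard two-proportion z-test to the two reported accuracies under several hypothetical sample sizes. The sample sizes are assumptions chosen for illustration, not BridgeBench's actual size, and the test treats each item as an independent pass/fail trial, which is a simplification.

```python
import math

def two_proportion_z(p1, p2, n1, n2):
    """Two-sided z-test for the difference between two independent proportions.

    Returns (z, p_value). Uses the pooled-proportion standard error.
    """
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical items-per-run values; the real benchmark size is not published.
for n in (30, 100, 500):
    z, p = two_proportion_z(0.83, 0.68, n, n)
    print(f"n={n:>3} items per run: z={z:.2f}, p={p:.4g}")
```

With only a few dozen items, the same 15-point gap is not distinguishable from chance at the usual 0.05 level, while with a hundred or more it is. That is why the missing sample size matters for interpreting the report.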
Evidence
- One comment pointed out the lack of publicly available sample size and number of runs, stating, 'It seems like they only ran the entire test suite once.' The commenter argued that, due to the non-deterministic nature of the model, this is unlikely to be evidence of actual performance degradation.
- A counter-argument stated, '15% is a huge gap.' The commenter claimed that if the benchmark is designed to be tested thoroughly over multiple iterations, this difference is significant, and also expressed frustration that Anthropic is restricting access to its top-tier models.
- Some users expressed emotional dissatisfaction, stating, 'I want unrestricted access to the actual best model Anthropic uses, even if it costs more.' This reflects a long-standing distrust of model performance within the community.
- Unrelated to the discussion, a spam comment was posted promoting the author's Substack article, claiming that 'Computational Semiotics has been empirically proven.'
How to Apply
- If you are using the Claude API in production, it is good practice to build your own test set and run regular regression tests before and after model updates. Relying solely on external benchmark results can cause you to miss performance changes relevant to your actual service.
- When interpreting benchmark results, be sure to check the methodological details such as sample size, number of repetitions, and temperature setting. If metadata is unclear, as in this case, it is difficult to judge the reliability of the results.
- Because LLMs are non-deterministic, important evaluations should be repeated many times (dozens to hundreds of runs) and the results averaged. Comparing models or versions on the basis of a single run can lead to incorrect conclusions.
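The advice above can be sketched as a small regression harness that runs a private test set repeatedly and compares mean accuracy between two model versions. Everything here is illustrative: `make_stub_model` is a hypothetical stand-in for a real API call (it is not the Anthropic client), the two test cases are placeholders, exact string matching is a simplistic scoring rule, and the threshold of two combined standard errors in `regressed` is an assumed, not a prescribed, cutoff.

```python
import random
import statistics

# Placeholder test set of (prompt, expected answer) pairs; replace with your own cases.
CASES = [("2+2?", "4"), ("Capital of France?", "Paris")]

def accuracy(model, cases):
    """Fraction of cases where the model's answer exactly matches the expected one."""
    correct = sum(1 for prompt, expected in cases if model(prompt).strip() == expected)
    return correct / len(cases)

def repeated_eval(model, cases, runs=30):
    """Run the full suite `runs` times; return (mean accuracy, standard error of the mean)."""
    scores = [accuracy(model, cases) for _ in range(runs)]
    mean = statistics.mean(scores)
    sem = statistics.stdev(scores) / runs ** 0.5 if runs > 1 else float("nan")
    return mean, sem

def regressed(baseline, candidate, k=2.0):
    """True if the candidate mean is below the baseline by more than k combined standard errors."""
    (m0, s0), (m1, s1) = baseline, candidate
    return (m0 - m1) > k * (s0 ** 2 + s1 ** 2) ** 0.5

def make_stub_model(p_correct, seed):
    """Hypothetical model stub: answers correctly with probability p_correct."""
    rng = random.Random(seed)
    answers = dict(CASES)
    def model(prompt):
        return answers[prompt] if rng.random() < p_correct else "unsure"
    return model

# Simulate a "before" and "after" version with the accuracies from the report.
baseline = repeated_eval(make_stub_model(0.83, seed=1), CASES, runs=50)
candidate = repeated_eval(make_stub_model(0.68, seed=2), CASES, runs=50)
print("baseline:", baseline, "candidate:", candidate, "regressed:", regressed(baseline, candidate))
```

In practice the stub would be replaced by a function that calls the production model with the same prompt, temperature, and other settings used in the service, so that the spread across runs reflects the non-determinism users actually see.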
Related Papers
Distributed Attacks in Persistent-State AI Control
An experimental demonstration that when an AI coding agent spreads malicious code across multiple PRs, detection by a single monitor becomes practically impossible.
Senior SWE-Bench: open-source benchmark that assesses agents as senior engineers
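Where the original SWE-Bench was a "junior-level" evaluation that supplied overly detailed requirements, Senior SWE-Bench evaluates the ability to implement features and track down bugs from incomplete requirements, as a real senior engineer would. Even the current best-performing model (Claude Opus 4.8) solves only 24%, making it an attempt to measure the real limits of AI coding agents.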
Apple 'Hide My Email' vulnerability reveals peoples' real email addresses
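Apple's Hide My Email service, which iCloud+ subscribers use to protect their privacy, has a vulnerability that has gone unpatched for more than a year and lets attackers discover the hidden real email address.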
Words Speak Louder Than Code: Investigating Cognitive Heuristics in LLM-Based Code Vulnerability Detection
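LLM security scanners react more strongly to "who wrote it" and "how the question is asked" than to the code itself, which can be exploited to conceal up to 97% of vulnerabilities.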
Robust Harmful Features Under Jailbreak Attacks: Mechanistic Evidence from Attention Head Specialization in Large Language Models
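Dissects, at the level of individual attention heads, how jailbreak attacks bypass LLM safeguards, and proposes a training-free method for detecting harmful inputs using internal signals that survive the attacks.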
What happened after 2k people tried to hack my AI assistant
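Results of a public experiment in which more than 6,000 emails were used to attempt prompt injection attacks on an AI agent. Claude Opus 4.6 never once allowed the secret file to leak, but the realism of the experiment design was hotly debated.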