Not All Tokens Are Created Equal: Query-Efficient Jailbreak Fuzzing for LLMs
TL;DR Highlight
Finds that LLM refusal behavior is dominated by a sparse set of tokens; achieves a 90% attack success rate (ASR) with over 70% fewer queries, and 84% ASR on GPT-4o within a 25-query budget
Who Should Read
LLM service security auditors, AI red-team engineers, security researchers designing and validating safety filters
Core Mechanics
- Token contributions to refusal are highly skewed — a sparse subset drives most refusal behavior (Skewed Token Contribution); mutating all tokens uniformly wastes queries
- Refusal tendencies are consistent across models (Cross-Model Consistency) — an open-source surrogate model can reliably estimate refusal-sensitive regions of a black-box target
- Against 6 open-source LLMs: 90% ASR with over 70% fewer queries than the best baseline; on Gemma-7B, 18 queries vs. 62 for the best baseline
- Commercial APIs: 84% ASR on GPT-4o and 80.5% on Claude-3.5-Sonnet at a 25-query budget; outperforms PAIR, GPTFuzz, and TAP across all settings
- Defense resilience: Perplexity filter (<3pp degradation), LLaMA Guard (mitigated but not neutralized), SmoothLLM (reduced but still beats undefended baselines)
- ASR varies within ±3% across different surrogate and attack model choices, indicating strong generalizability
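The skewed-contribution idea above can be sketched with a leave-one-out ablation against a surrogate scorer. Everything below is an illustrative assumption, not the paper's implementation: `surrogate_refusal_score` is a toy stand-in for a real surrogate LLM's refusal probability, and the trigger weights are invented.

```python
def surrogate_refusal_score(tokens):
    """Toy stand-in for a surrogate model: a few trigger tokens
    dominate the refusal score, the rest contribute almost nothing."""
    triggers = {"bomb": 0.6, "hack": 0.3}
    return min(1.0, sum(triggers.get(t, 0.01) for t in tokens))

def token_contributions(tokens):
    """Leave-one-out ablation: drop each token and measure how much
    the surrogate's refusal score falls."""
    base = surrogate_refusal_score(tokens)
    contrib = {}
    for i, tok in enumerate(tokens):
        ablated = tokens[:i] + tokens[i + 1:]
        contrib[(i, tok)] = base - surrogate_refusal_score(ablated)
    return contrib

def sensitive_region(tokens, top_k=2):
    """Indices of the top-k refusal-driving tokens: mutate these first
    instead of spending queries on uniform mutation."""
    contrib = token_contributions(tokens)
    ranked = sorted(contrib, key=contrib.get, reverse=True)
    return [idx for idx, _ in ranked[:top_k]]

prompt = ["how", "to", "hack", "a", "bomb", "safely"]
print(sensitive_region(prompt))  # → [4, 2]: the two trigger tokens
```

Because the contribution distribution is skewed, a small `top_k` captures most of the refusal signal, which is what lets the attack spend its query budget only where it matters.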
Evidence
- 6 open-source LLMs (Gemma-7B/2-9B, LLaMA3-8B/3.2-3B, Qwen2.5-3B/7B) + 3 commercial APIs (GPT-3.5-Turbo, GPT-4o, Claude-3.5-Sonnet) evaluated
- HarmBench dataset (6 safety categories: chemical/biological hazards, illegal activities, misinformation, cybercrime, etc.) — unified protocol across all methods
How to Apply
- Apply the TriageFuzz approach to pre-launch red-teaming of LLM services, identifying vulnerable token patterns under a minimal query budget
- Design safety filters as hybrid layers (e.g., Perplexity + SmoothLLM) rather than single-layer defenses
- Leverage Reference Layer activations from a surrogate model for refusal circuit analysis — applicable to other security analysis tasks
Original Abstract
Large Language Models (LLMs) are widely deployed, yet are vulnerable to jailbreak prompts that elicit policy-violating outputs. Although prior studies have uncovered these risks, they typically treat all tokens as equally important during prompt mutation, overlooking the varying contributions of individual tokens to triggering model refusals. Consequently, these attacks introduce substantial redundant search under query-constrained scenarios, reducing attack efficiency and hindering comprehensive vulnerability assessment. In this work, we conduct a token-level analysis of refusal behavior and observe that token contributions are highly skewed rather than uniform. Moreover, we find strong cross-model consistency in refusal tendencies, enabling the use of a surrogate model to estimate token-level contributions to the target model's refusals. Motivated by these findings, we propose TriageFuzz, a token-aware jailbreak fuzzing framework that adapts the fuzz testing approach with a series of customized designs. TriageFuzz leverages a surrogate model to estimate the contribution of individual tokens to refusal behaviors, enabling the identification of sensitive regions within the prompt. Furthermore, it incorporates a refusal-guided evolutionary strategy that adaptively weights candidate prompts with a lightweight scorer to steer the evolution toward bypassing safety constraints. Extensive experiments on six open-source LLMs and three commercial APIs demonstrate that TriageFuzz achieves comparable attack success rates (ASR) with significantly reduced query costs. Notably, it attains a 90% ASR with over 70% fewer queries compared to baselines. Even under an extremely restrictive budget of 25 queries, TriageFuzz outperforms existing methods, improving ASR by 20-40%.
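The pipeline the abstract describes (surrogate-estimated token sensitivity feeding a refusal-guided, scorer-weighted evolution) can be illustrated with a toy loop. The scorer, synonym table, and all heuristics below are illustrative assumptions, not TriageFuzz's actual components:

```python
import random

# Toy synonym table used for mutations; illustrative only.
SYNONYMS = {"hack": ["probe", "audit"], "steal": ["borrow", "obtain"]}

def refusal_score(tokens):
    """Hypothetical lightweight scorer: trigger words drive refusal."""
    return sum(1.0 for t in tokens if t in SYNONYMS)

def most_sensitive(tokens):
    """Leave-one-out estimate of the most refusal-driving token index."""
    base = refusal_score(tokens)
    return max(range(len(tokens)),
               key=lambda i: base - refusal_score(tokens[:i] + tokens[i + 1:]))

def mutate(tokens, idx, rng):
    out = list(tokens)
    out[idx] = rng.choice(SYNONYMS.get(out[idx], [out[idx]]))
    return out

def triage_fuzz(tokens, budget=10, seed=0):
    """Refusal-guided evolution: sample parents weighted toward low
    refusal scores, then mutate only the most sensitive token."""
    rng = random.Random(seed)
    pool = [list(tokens)]
    for _ in range(budget):
        weights = [1.0 / (1.0 + refusal_score(t)) for t in pool]
        parent = rng.choices(pool, weights=weights)[0]
        child = mutate(parent, most_sensitive(parent), rng)
        if refusal_score(child) == 0:
            return child  # toy safety filter bypassed
        pool.append(child)
    return min(pool, key=refusal_score)  # best candidate within budget

print(triage_fuzz(["how", "to", "hack", "and", "steal", "data"]))
```

The two design choices mirror the abstract: mutation is concentrated on the sensitive region rather than spread uniformly, and parent selection is weighted by the lightweight scorer so low-refusal candidates dominate later generations.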