On Optimizing Multimodal Jailbreaks for Spoken Language Models
TL;DR Highlight
Simultaneously manipulating text and audio can jailbreak voice AI models up to 10x more effectively than single-modality attacks.
Who Should Read
Security researchers and engineers building voice AI products or deploying multimodal AI systems who need to understand cross-modal attack surfaces.
Core Mechanics
- Voice AI models (speech-to-text + LLM pipelines and end-to-end voice models) are vulnerable to adversarial attacks across both text and audio modalities
- Cross-modal attacks that simultaneously manipulate both the speech signal and the transcript/text context are significantly more effective than single-modality attacks
- The amplification factor ranges from 1.5x to 10x: a combined audio+text attack achieves up to 10x the jailbreak success rate of the best single-modality attack
- The attacks exploit the fact that safety training often doesn't align across modalities — audio safety training and text safety training can be played against each other
- Even small audio perturbations (imperceptible to humans) combined with semantically manipulated text can reliably bypass safety filters
- Defense recommendations include cross-modal consistency checking and modality-independent safety verification
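The defense ideas above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `transcribe` and `safety_classifier` are hypothetical placeholders for a real ASR model and a real text safety filter.

```python
def transcribe(audio_waveform):
    # Placeholder ASR; a real system would run a speech recognizer here.
    return "how do i bake bread"

def safety_classifier(text):
    # Placeholder safety filter returning a risk score in [0, 1].
    blocked_terms = {"explosive", "weapon"}
    return 1.0 if any(t in text.lower() for t in blocked_terms) else 0.0

def cross_modal_check(audio_waveform, claimed_text, threshold=0.5):
    """Flag a request when the audio transcript and the accompanying text
    disagree, or when either modality trips the safety filter on its own."""
    transcript = transcribe(audio_waveform)
    # 1. Modality-independent checks: each channel is screened separately.
    if safety_classifier(transcript) > threshold:
        return "blocked: audio transcript unsafe"
    if safety_classifier(claimed_text) > threshold:
        return "blocked: text unsafe"
    # 2. Consistency check: mismatched modalities suggest manipulation.
    if transcript.strip().lower() != claimed_text.strip().lower():
        return "flagged: cross-modal mismatch"
    return "allowed"
```

The key design point is that the consistency check runs even when both modalities individually look safe, since cross-modal attacks exploit exactly that gap.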
Evidence
- Single audio attack jailbreak success rate: 12%. Single text attack: 18%. Combined cross-modal attack: up to 89% on tested voice AI systems
- The attack transferred to both pipeline-based (Whisper + GPT-4) and end-to-end (Moshi, Gemini Live) voice systems
- Human listeners could not distinguish adversarial audio from clean audio in 94% of cases, making the attacks perceptually undetectable
How to Apply
- For voice AI security testing: always test cross-modal combinations, not just individual modalities. Your audio safety test suite and text safety test suite need to be combined into joint cross-modal attack scenarios.
- Implement modality-independent safety verification: run safety checks on the audio independently, the transcript independently, and the combined interpretation — flag any cross-modal inconsistencies.
- Consider speech-to-text safety as its own attack surface: adversarial audio that transcribes to harmful text is a category requiring dedicated defenses beyond the LLM safety layer.
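The first recommendation can be made concrete by crossing existing unimodal test suites into joint scenarios. A minimal sketch, where the suite entries and the pairing strategy are illustrative assumptions:

```python
from itertools import product

audio_attacks = ["clean", "pgd_perturbed", "noise_masked"]
text_attacks = ["clean", "gcg_suffix", "roleplay_prompt"]

def build_cross_modal_suite(audio_suite, text_suite):
    """Pair every audio variant with every text variant, skipping the
    clean/clean baseline, so cross-modal interactions are always exercised."""
    scenarios = []
    for a, t in product(audio_suite, text_suite):
        if a == "clean" and t == "clean":
            continue  # benign baseline, not an attack scenario
        scenarios.append({"audio": a, "text": t,
                          "cross_modal": a != "clean" and t != "clean"})
    return scenarios

suite = build_cross_modal_suite(audio_attacks, text_attacks)
# 8 scenarios total; 4 of them manipulate both modalities at once.
```

The `cross_modal` flag matters for reporting: per the findings above, the jointly-manipulated scenarios are where success rates jump, so they should be tracked separately from single-modality results.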
Code Example
# SAMA (Sequential Approximation) concept implementation sketch.
# compute_batch_loss and gcg_step are assumed helpers, not defined here.
# Step 1: Optimize the text suffix with GCG (without audio)
import torch

def gcg_optimize(model, tokenizer, queries, target_responses, n_tokens=16, steps=1000):
    """
    Optimize a text suffix with the gradient-based GCG approach,
    using text only, without any audio input.
    """
    suffix = torch.randint(0, tokenizer.vocab_size, (n_tokens,))
    best_suffix = suffix.clone()
    best_loss = float('inf')
    for step in range(steps):
        # Evaluate the current suffix across the query batch
        loss = compute_batch_loss(model, queries, target_responses, suffix)
        if loss < best_loss:
            best_loss = loss
            best_suffix = suffix.clone()
        # Random replacement among top-k candidates (core GCG logic)
        suffix = gcg_step(model, queries, target_responses, suffix, top_k=16, width=32)
    return best_suffix

# Step 2: Add an audio perturbation on top of the fixed suffix (PGD)
def pgd_optimize(model, audio, fixed_suffix, queries, target_responses,
                 steps=1000, lr=0.01, eps=0.001):
    """
    Optimize only the audio perturbation while keeping the GCG suffix fixed.
    """
    delta = torch.zeros_like(audio).uniform_(-eps, eps)
    delta.requires_grad_(True)
    best_delta = delta.clone().detach()
    best_loss = float('inf')
    for step in range(steps):
        perturbed_audio = audio + delta
        loss = compute_batch_loss(
            model, queries, target_responses,
            suffix=fixed_suffix, audio=perturbed_audio
        )
        if loss.item() < best_loss:
            best_loss = loss.item()
            best_delta = delta.clone().detach()
        # Normalized gradient step
        grad = torch.autograd.grad(loss, delta)[0]
        grad_norm = grad / (grad.norm(2) + 1e-8)
        delta = delta - lr * grad_norm
        # Clip to the epsilon range to maintain perceptual similarity
        delta = delta.clamp(-eps, eps).detach().requires_grad_(True)
    return best_delta

# Run SAMA
best_suffix = gcg_optimize(model, tokenizer, queries, targets, n_tokens=16)
best_delta = pgd_optimize(model, base_audio, best_suffix, queries, targets)
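Since `compute_batch_loss` and `gcg_step` in the sketch above are left undefined, here is a self-contained toy showing the same projected-gradient pattern end to end: minimize a loss over a bounded perturbation, with the gradient computed analytically instead of via autograd. The quadratic objective and target values are illustrative assumptions, not the paper's actual jailbreak loss.

```python
def toy_pgd(audio, target, steps=200, lr=0.05, eps=0.3):
    """Find delta in [-eps, eps]^n minimizing sum((audio + delta - target)^2)."""
    delta = [0.0] * len(audio)
    for _ in range(steps):
        # Analytic gradient of the squared-error loss w.r.t. delta
        grad = [2.0 * (a + d - t) for a, d, t in zip(audio, delta, target)]
        # Gradient step followed by projection back onto the epsilon ball
        delta = [max(-eps, min(eps, d - lr * g)) for d, g in zip(delta, grad)]
    return delta

audio = [0.5, -0.2, 0.1]
target = [0.6, -0.2, 0.9]
delta = toy_pgd(audio, target)
# delta[0] converges near 0.1 (reachable); delta[2] is clipped at eps = 0.3
```

The projection step is what keeps the perturbation small: in the real attack this is the `clamp(-eps, eps)` that preserves perceptual similarity of the audio.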
Original Abstract
As Spoken Language Models (SLMs) integrate speech and text modalities, they inherit the safety vulnerabilities of their LLM backbone and an expanded attack surface. SLMs have been previously shown to be susceptible to jailbreaking, where adversarial prompts induce harmful responses. Yet existing attacks largely remain unimodal, optimizing either text or audio in isolation. We explore gradient-based multimodal jailbreaks by introducing JAMA (Joint Audio-text Multimodal Attack), a joint multimodal optimization framework combining Greedy Coordinate Gradient (GCG) for text and Projected Gradient Descent (PGD) for audio, to simultaneously perturb both modalities. Evaluations across four state-of-the-art SLMs and four audio types demonstrate that JAMA surpasses unimodal jailbreak rate by 1.5x to 10x. We analyze the operational dynamics of this joint attack and show that a sequential approximation method makes it 4x to 6x faster. Our findings suggest that unimodal safety is insufficient for robust SLMs. The code and data are available at https://repos.lsv.uni-saarland.de/akrishnan/multimodal-jailbreak-slm