On Optimizing Multimodal Jailbreaks for Spoken Language Models
TL;DR Highlight
Simultaneously manipulating text and audio can jailbreak voice AI models up to 10x more effectively than single-modality attacks.
Who Should Read
Security researchers and engineers building voice AI products or deploying multimodal AI systems who need to understand cross-modal attack surfaces.
Core Mechanics
- Voice AI models (speech-to-text + LLM pipelines and end-to-end voice models) are vulnerable to adversarial attacks across both text and audio modalities
- Cross-modal attacks that simultaneously manipulate both the speech signal and the transcript/text context are significantly more effective than single-modality attacks
- The amplification factor ranges from 1.5x to 10x: a combined audio+text attack achieves up to 10x the jailbreak success rate of the best single-modality attack
- The attacks exploit the fact that safety training often doesn't align across modalities — audio safety training and text safety training can be played against each other
- Even small audio perturbations (imperceptible to humans) combined with semantically manipulated text can reliably bypass safety filters
- Defense recommendations include cross-modal consistency checking and modality-independent safety verification
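The defense ideas above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `transcribe` and `safety_classifier` are hypothetical placeholders for a real ASR model and a real text safety filter.

```python
def transcribe(audio_waveform):
    # Placeholder ASR; a real system would run a speech recognizer here.
    return "how do i bake bread"

def safety_classifier(text):
    # Placeholder safety filter returning a risk score in [0, 1].
    blocked_terms = {"explosive", "weapon"}
    return 1.0 if any(t in text.lower() for t in blocked_terms) else 0.0

def cross_modal_check(audio_waveform, claimed_text, threshold=0.5):
    """Flag a request when the audio transcript and the accompanying text
    disagree, or when either modality trips the safety filter on its own."""
    transcript = transcribe(audio_waveform)
    # 1. Modality-independent checks: each channel is screened separately.
    if safety_classifier(transcript) > threshold:
        return "blocked: audio transcript unsafe"
    if safety_classifier(claimed_text) > threshold:
        return "blocked: text unsafe"
    # 2. Consistency check: mismatched modalities suggest manipulation.
    if transcript.strip().lower() != claimed_text.strip().lower():
        return "flagged: cross-modal mismatch"
    return "allowed"
```

The key design point is that the consistency check runs even when both modalities individually look safe, since cross-modal attacks exploit exactly that gap.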
Evidence
- Single audio attack jailbreak success rate: 12%. Single text attack: 18%. Combined cross-modal attack: up to 89% on tested voice AI systems
- The attack transferred to both pipeline-based (Whisper + GPT-4) and end-to-end (Moshi, Gemini Live) voice systems
- Human listeners could not distinguish adversarial audio from clean audio in 94% of cases, making the attacks perceptually undetectable
How to Apply
- For voice AI security testing: always test cross-modal combinations, not just individual modalities. Your audio safety test suite and text safety test suite need to be combined into joint cross-modal attack scenarios.
- Implement modality-independent safety verification: run safety checks on the audio independently, the transcript independently, and the combined interpretation — flag any cross-modal inconsistencies.
- Consider speech-to-text safety as its own attack surface: adversarial audio that transcribes to harmful text is a category requiring dedicated defenses beyond the LLM safety layer.
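The first recommendation can be made concrete by crossing existing unimodal test suites into joint scenarios. A minimal sketch, where the suite entries and the pairing strategy are illustrative assumptions:

```python
from itertools import product

audio_attacks = ["clean", "pgd_perturbed", "noise_masked"]
text_attacks = ["clean", "gcg_suffix", "roleplay_prompt"]

def build_cross_modal_suite(audio_suite, text_suite):
    """Pair every audio variant with every text variant, skipping the
    clean/clean baseline, so cross-modal interactions are always exercised."""
    scenarios = []
    for a, t in product(audio_suite, text_suite):
        if a == "clean" and t == "clean":
            continue  # benign baseline, not an attack scenario
        scenarios.append({"audio": a, "text": t,
                          "cross_modal": a != "clean" and t != "clean"})
    return scenarios

suite = build_cross_modal_suite(audio_attacks, text_attacks)
# 8 scenarios total; 4 of them manipulate both modalities at once.
```

The `cross_modal` flag matters for reporting: per the findings above, the jointly-manipulated scenarios are where success rates jump, so they should be tracked separately from single-modality results.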
Code Example
# SAMA (Sequential Approximation) concept implementation sketch.
# compute_batch_loss and gcg_step are assumed helpers, not defined here.
# Step 1: Optimize the text suffix with GCG (without audio)
import torch

def gcg_optimize(model, tokenizer, queries, target_responses, n_tokens=16, steps=1000):
    """
    Optimize a text suffix with the gradient-based GCG approach,
    using text only, without any audio input.
    """
    suffix = torch.randint(0, tokenizer.vocab_size, (n_tokens,))
    best_suffix = suffix.clone()
    best_loss = float('inf')
    for step in range(steps):
        # Evaluate the current suffix across the query batch
        loss = compute_batch_loss(model, queries, target_responses, suffix)
        if loss < best_loss:
            best_loss = loss
            best_suffix = suffix.clone()
        # Random replacement among top-k candidates (core GCG logic)
        suffix = gcg_step(model, queries, target_responses, suffix, top_k=16, width=32)
    return best_suffix

# Step 2: Add an audio perturbation on top of the fixed suffix (PGD)
def pgd_optimize(model, audio, fixed_suffix, queries, target_responses,
                 steps=1000, lr=0.01, eps=0.001):
    """
    Optimize only the audio perturbation while keeping the GCG suffix fixed.
    """
    delta = torch.zeros_like(audio).uniform_(-eps, eps)
    delta.requires_grad_(True)
    best_delta = delta.clone().detach()
    best_loss = float('inf')
    for step in range(steps):
        perturbed_audio = audio + delta
        loss = compute_batch_loss(
            model, queries, target_responses,
            suffix=fixed_suffix, audio=perturbed_audio
        )
        if loss.item() < best_loss:
            best_loss = loss.item()
            best_delta = delta.clone().detach()
        # Normalized gradient step
        grad = torch.autograd.grad(loss, delta)[0]
        grad_norm = grad / (grad.norm(2) + 1e-8)
        delta = delta - lr * grad_norm
        # Clip to the epsilon range to maintain perceptual similarity
        delta = delta.clamp(-eps, eps).detach().requires_grad_(True)
    return best_delta

# Run SAMA
best_suffix = gcg_optimize(model, tokenizer, queries, targets, n_tokens=16)
best_delta = pgd_optimize(model, base_audio, best_suffix, queries, targets)
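Since `compute_batch_loss` and `gcg_step` in the sketch above are left undefined, here is a self-contained toy showing the same projected-gradient pattern end to end: minimize a loss over a bounded perturbation, with the gradient computed analytically instead of via autograd. The quadratic objective and target values are illustrative assumptions, not the paper's actual jailbreak loss.

```python
def toy_pgd(audio, target, steps=200, lr=0.05, eps=0.3):
    """Find delta in [-eps, eps]^n minimizing sum((audio + delta - target)^2)."""
    delta = [0.0] * len(audio)
    for _ in range(steps):
        # Analytic gradient of the squared-error loss w.r.t. delta
        grad = [2.0 * (a + d - t) for a, d, t in zip(audio, delta, target)]
        # Gradient step followed by projection back onto the epsilon ball
        delta = [max(-eps, min(eps, d - lr * g)) for d, g in zip(delta, grad)]
    return delta

audio = [0.5, -0.2, 0.1]
target = [0.6, -0.2, 0.9]
delta = toy_pgd(audio, target)
# delta[0] converges near 0.1 (reachable); delta[2] is clipped at eps = 0.3
```

The projection step is what keeps the perturbation small: in the real attack this is the `clamp(-eps, eps)` that preserves perceptual similarity of the audio.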
Original Abstract
As Spoken Language Models (SLMs) integrate speech and text modalities, they inherit the safety vulnerabilities of their LLM backbone and an expanded attack surface. SLMs have been previously shown to be susceptible to jailbreaking, where adversarial prompts induce harmful responses. Yet existing attacks largely remain unimodal, optimizing either text or audio in isolation. We explore gradient-based multimodal jailbreaks by introducing JAMA (Joint Audio-text Multimodal Attack), a joint multimodal optimization framework combining Greedy Coordinate Gradient (GCG) for text and Projected Gradient Descent (PGD) for audio, to simultaneously perturb both modalities. Evaluations across four state-of-the-art SLMs and four audio types demonstrate that JAMA surpasses unimodal jailbreak rate by 1.5x to 10x. We analyze the operational dynamics of this joint attack and show that a sequential approximation method makes it 4x to 6x faster. Our findings suggest that unimodal safety is insufficient for robust SLMs. The code and data are available at https://repos.lsv.uni-saarland.de/akrishnan/multimodal-jailbreak-slm