BenchOverflow: Measuring Excessive LLM Output (Overflow) with Plain-Text Prompts

BenchOverflow: Measuring Overflow in Large Language Models via Plain-Text Prompts

Jan 13, 2026 • Erin Feiglin, Nir Hutnik, Raz Lapid

TL;DR Highlight

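The paper identifies nine plain-text prompting patterns that make LLMs generate runaway numbers of tokens, and shows that a simple one-line defense cuts output length by more than half on most of the models tested.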

Who Should Read

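Backend and MLOps developers who serve an LLM API in production and worry about operating cost or latency, especially teams that expose an LLM through a public chatbot or in a multi-tenant environment.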

Core Mechanics

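Without any jailbreak, ordinary natural-language prompts alone can push output length up to the 5,000-token budget on all nine models tested, including GPT-5, Claude-Sonnet, and Gemini-2.5-Flash.
The two strongest patterns are explicit forced length (e.g., 'make 1,200 quiz questions') and tokenizer stress (emoji and Unicode combinations); CSR@5k, the share of generations that hit the 5,000-token cap, reaches as high as 69%.
Even when a model issues a refusal, the output often does not get shorter: in a self-contradictory pattern, the model says something like 'I can't do that, but here is a brief summary...' and then generates thousands of tokens.
Adding the single line 'Please provide a concise, precise response without unnecessary elaboration.' at the very start of the system prompt cuts average tokens by 92% for Gemini-2.5-Flash (647→51) and by 88% for Qwen-3-8B (1,301→152).
Gemma-2-9B-It refuses overflow strategies at a high rate and is therefore relatively robust, whereas GPT-5 tends to comply with long-output requests without refusing; the difference is attributed to alignment design philosophy.
When the same prompt is run four times, GPT-5, Claude-Sonnet, and the Gemma family produce nearly constant output lengths, while LLaMA-3.x and Gemini-2.5-Flash are unstable, with length varying widely from run to run.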
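
The run-to-run stability in the last point can be quantified in several ways. This digest does not say which statistic the paper uses, so the sketch below assumes the coefficient of variation (standard deviation divided by mean) over the repeated runs of one prompt. The function name `length_cv` and the token counts are hypothetical and purely illustrative; they are not results from the paper.

```python
from statistics import mean, pstdev


def length_cv(token_counts: list[int]) -> float:
    """Coefficient of variation of output lengths across repeated runs of one prompt."""
    avg = mean(token_counts)
    if avg == 0:
        return 0.0  # all runs were empty, so there is no spread to report
    return pstdev(token_counts) / avg


# Hypothetical token counts for four runs of the same prompt (illustrative only).
stable_runs = [4980, 5000, 4990, 5000]
unstable_runs = [300, 5000, 1200, 4100]

print(f"stable:   {length_cv(stable_runs):.3f}")
print(f"unstable: {length_cv(unstable_runs):.3f}")
```

A value near zero means the model's output length is predictable for that prompt; a large value means capacity planning has to budget for the worst run rather than the average.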

Evidence

Average tokens after adding the one-line conciseness reminder: GPT-5 1,933→1,365 (about 30%), Claude-Sonnet 310→112 (64%), Gemini-2.5-Flash 647→51 (92%), Qwen-3-8B 1,301→152 (88%), Gemma-3-4B-It 950→70 (93%).
CSR@5k (rate of hitting the 5,000-token cap) for the explicit forced length strategy: GPT-5 63%, Claude-Sonnet 69%, LLaMA-3.2-3B 39.8%; the benign baseline is close to 0%.
CSR@3k for tokenizer stress: GPT-5 75.5%, Qwen-3-8B 51.5%, Gemini-2.5-Flash 38.8%.
Cross-model correlation is high within the LLaMA family (69 to 71%), while GPT-5 and Claude-Sonnet correlate 51 to 54% with other families, which suggests that overflow patterns transfer along model lineage.
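
The CSR@k figures above, and the ECDF curves the paper reports alongside them, can be computed from a plain list of output lengths. The sketch below is one way to do that under an assumption: a generation counts as saturated when its length is at least the threshold `cap`, since a run cannot exceed a hard 5,000-token budget. The function names `csr_at` and `ecdf` and the sample lengths are hypothetical, not taken from the benchmark.

```python
from bisect import bisect_right
from typing import Callable


def csr_at(lengths: list[int], cap: int) -> float:
    """Fraction of generations whose length reaches `cap` tokens (assumed: >= cap)."""
    if not lengths:
        raise ValueError("need at least one generation")
    return sum(1 for n in lengths if n >= cap) / len(lengths)


def ecdf(lengths: list[int]) -> Callable[[float], float]:
    """Return F, where F(x) is the fraction of generations with length <= x."""
    if not lengths:
        raise ValueError("need at least one generation")
    ordered = sorted(lengths)

    def F(x: float) -> float:
        return bisect_right(ordered, x) / len(ordered)

    return F


# Hypothetical output lengths in tokens for eight generations (illustrative only).
lengths = [120, 450, 980, 3200, 5000, 5000, 700, 5000]

for cap in (1000, 3000, 5000):
    print(f"CSR@{cap}: {csr_at(lengths, cap):.3f}")

F = ecdf(lengths)
print(f"ECDF(1000): {F(1000):.3f}")
```

CSR summarizes the tail at a few fixed thresholds, while the ECDF shows the whole distribution, so the two are complementary when comparing models or defenses.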

How to Apply

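When attaching an LLM to a public API or chatbot, add 'Please provide a concise, precise response without unnecessary elaboration.' as the first line of the system prompt; with no other configuration, this reduces verbosity by 50 to 90% on most models.
In the user-prompt validation layer, filter phrases such as 'a list of 1,000 items', 'copy the full text', or 'repeat forever', or put a gateway in front of the model that enforces max_tokens, to lower the risk of overflow-based DoS.
If cost and latency stability matter when choosing a model, prefer one like Gemma-2-9B-It, which has a high refusal rate and a low out-of-budget rate; for models that tend to follow long-output requests, such as GPT-5, always set a max_tokens limit.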
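
The gateway idea in the second point can be sketched as a small, model-independent function that decides the token budget before any API call is made. Everything named here (`gate`, `GatewayDecision`, `DEFAULT_CAP`, `FLAGGED_CAP`, and the regex) is a hypothetical illustration, not something the paper prescribes. The sketch lowers the budget for suspicious prompts instead of rejecting them, which keeps false positives (years, IDs) from blocking legitimate users.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical budgets; tune them to your own cost and latency targets.
DEFAULT_CAP = 1000  # ceiling for ordinary requests
FLAGGED_CAP = 256   # tighter ceiling when the prompt asks for a huge count

# Matches large counts such as "1,200" or "5000" (also matches years, IDs, etc.).
LARGE_COUNT = re.compile(r"\d{1,3}(?:,\d{3})+|\d{4,}")


@dataclass(frozen=True)
class GatewayDecision:
    max_tokens: int  # value to forward to the model API
    flagged: bool    # True if the prompt looked like a forced-length request


def gate(prompt: str, requested_max_tokens: Optional[int] = None) -> GatewayDecision:
    """Clamp the caller's token budget and tighten it for suspicious prompts."""
    flagged = LARGE_COUNT.search(prompt) is not None
    ceiling = FLAGGED_CAP if flagged else DEFAULT_CAP
    if requested_max_tokens is None:
        return GatewayDecision(ceiling, flagged)
    # Never trust a client-supplied budget above the ceiling (or below 1).
    return GatewayDecision(max(1, min(requested_max_tokens, ceiling)), flagged)


print(gate("Write 1,200 unique quiz questions"))
print(gate("Summarize this paragraph", requested_max_tokens=5000))
```

Logging the `flagged` field alongside the actual completion length is a cheap way to check how well the pattern predicts real overflow in your own traffic.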

Code Example

# Example: prepend the conciseness reminder to the system prompt (OpenAI SDK)
import re

import openai

client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment

CONCISENESS_REMINDER = "Reminder: Please provide a concise, precise response without unnecessary elaboration."

def chat_with_overflow_defense(user_message: str, max_tokens: int = 1000) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        max_tokens=max_tokens,  # hard cap on generated tokens
        messages=[
            # The reminder goes first in the system prompt, as described under How to Apply.
            {"role": "system", "content": f"{CONCISENESS_REMINDER}\nYou are a helpful assistant."},
            {"role": "user", "content": user_message},
        ],
    )
    # content can be None, so fall back to an empty string
    return response.choices[0].message.content or ""

# Example patterns for overflow-inducing prompts (for detection/filtering)
OVERFLOW_PATTERNS = [
    r"\b(\d{1,3}(?:,\d{3})+|\d{3,})\s*(unique|different|distinct)",  # '1,200 unique items'
    r"\b(all|every|each)\b.*\b(integer|permutation|combination)",  # implicit enumeration
    r"\bwithout (stopping|end|limit)",  # infinite generation
    r"\bfull text\b|\bverbatim\b|\btranscribe\b",  # quote attack
]

def has_overflow_risk(prompt: str) -> bool:
    """Return True if the prompt matches any known overflow-inducing pattern."""
    return any(re.search(p, prompt, re.IGNORECASE) for p in OVERFLOW_PATTERNS)

Terminology

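Overflow: The phenomenon in which an LLM, with no hacking involved, pours out tokens without restraint in response to an ordinary request such as 'make a list of 1,000 items'. It is like opening a tap and finding no way to shut it off.

CSR (Cap-Saturation Rate): The fraction of all generations that hit a given token threshold (e.g., 1k, 3k, 5k). The higher it is, the more often the model runs into the token ceiling.

ECDF (Empirical Cumulative Distribution Function): The cumulative distribution obtained by sorting output lengths from shortest to longest and accumulating them. The further the curve is shifted to the right, the more long outputs there are.

Denial of Wallet: An attack on a cloud service in which the attacker sends excessive requests to drain the victim's money (credits). In the LLM context, it means triggering runaway token generation to exhaust the API budget.

RLHF (Reinforcement Learning from Human Feedback): A method that further trains a model on data in which people have judged 'this answer is better'. It steers the model toward more helpful and safer responses.

Tokenizer stress: A technique that uses input such as emoji or Unicode combining characters, which show few visible characters but yield many tokens, to induce the model to produce long output.

LLM-as-a-judge: Using one LLM as a judge to evaluate the output of another LLM. Instead of having people grade every response by hand, a model such as GPT is asked to decide, for example, whether a response is a refusal.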
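
To make the Tokenizer stress entry concrete, the sketch below shows how text that renders as a single visible glyph can expand into many underlying units. Exact token counts depend on each model's tokenizer, which is not available here, so the code reports Unicode code points and UTF-8 bytes as a rough proxy; treating bytes as a stand-in for tokens is an assumption that is most reasonable for byte-level tokenizers. The helper name `expansion` is hypothetical.

```python
def expansion(text: str) -> tuple[int, int]:
    """Return (Unicode code points, UTF-8 bytes) for `text`."""
    return len(text), len(text.encode("utf-8"))


# One "family" emoji: four emoji joined by three zero-width joiners (U+200D).
# Most renderers display it as a single glyph.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"

# The letter "e" with five stacked combining acute accents (U+0301):
# it occupies one character cell on screen.
stacked = "e" + "\u0301" * 5

for label, text in (("family emoji", family), ("stacked accents", stacked), ("plain 'e'", "e")):
    points, size = expansion(text)
    print(f"{label}: {points} code points, {size} UTF-8 bytes")
```

A length check based on visible characters would treat both samples as a single character, which is why input limits are better enforced in tokens or bytes.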

Original Abstract

We investigate a failure mode of large language models (LLMs) in which plain-text prompts elicit excessive outputs, a phenomenon we term Overflow. Unlike jailbreaks or prompt injection, Overflow arises under ordinary interaction settings and can lead to elevated serving cost, latency, and cross-user performance degradation, particularly when scaled across many requests. Beyond usability, the stakes are economic and environmental: unnecessary tokens increase per-request cost and energy consumption, compounding into substantial operational spend and carbon footprint at scale. Moreover, Overflow represents a practical vector for compute amplification and service degradation in shared environments. We introduce BenchOverflow, a model-agnostic benchmark of nine plain-text prompting strategies that amplify output volume without adversarial suffixes or policy circumvention. Using a standardized protocol with a fixed budget of 5000 new tokens, we evaluate nine open- and closed-source models and observe pronounced rightward shifts and heavy tails in length distributions. Cap-saturation rates (CSR@1k/3k/5k) and empirical cumulative distribution functions (ECDFs) quantify tail risk; within-prompt variance and cross-model correlations show that Overflow is broadly reproducible yet heterogeneous across families and attack vectors. A lightweight mitigation, a fixed conciseness reminder, attenuates right tails and lowers CSR for all strategies across the majority of models. Our findings position length control as a measurable reliability, cost, and sustainability concern rather than a stylistic quirk. By enabling standardized comparison of length-control robustness across models, BenchOverflow provides a practical basis for selecting deployments that minimize resource waste and operating expense, and for evaluating defenses that curb compute amplification without eroding task performance.