Dynamic Context Evolution for Scalable Synthetic Data Generation

TL;DR Highlight

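Dynamic Context Evolution (DCE) combines three mechanisms, VTS, Semantic Memory, and Adaptive Prompt Evolution, to remove cross-batch duplication and repetition when generating synthetic data at scale with an LLM. In the reported experiments the collapse rate falls to 0.0%, versus 5.6% for naive prompting.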

Who Should Read

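ML engineers and data pipeline developers who generate large volumes of synthetic training data through LLM APIs, especially those running into duplicate data when generation is repeated over hundreds to thousands of batches.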

Core Mechanics

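Prompting each batch independently causes "cross-batch mode collapse", where the LLM keeps regenerating the same concepts. In the education domain, 34% of the output in the last 50 batches duplicated output from the first 50 batches.
DCE combines three mechanisms: (1) VTS (Verbalized Tail Sampling), which discards ideas the model itself judges to be obvious; (2) Semantic Memory, which stores embeddings in ChromaDB and rejects ideas too similar to stored ones; (3) Adaptive Prompt Evolution, which rebuilds the prompt for every batch from the current memory state.
VTS filters out ideas that are obvious yet semantically new, while dedup filters out ideas that are semantically duplicated. 96.9% of the ideas VTS rejects would pass dedup, so the two filters act on nearly disjoint sets and need to be used together.
Adaptive Prompt Evolution cycles through four strategies in round-robin order: gap targeting (focus on under-generated categories), assumption inversion (flip the assumptions behind recent ideas), cross-industry stimulus (bring in the viewpoint of another industry), and constraint variation (impose extreme constraints).
Compared with GPT-5-mini, Claude Haiku 4.5 is far more repetitive (dedup rejection rate of 30.1% vs 5.7%). Applying DCE cuts Claude's rejection rate from 30.1% to 11.0%, a drop of about 19 percentage points. The same pipeline can be reused as-is, with no need to switch models.
Cost is roughly $0.50 per 1,000 candidates. No fine-tuning or custom architecture is needed; the framework runs on standard API calls alone.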
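
The round-robin strategy rotation and the VTS filter described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names (`pick_strategy`, `build_prompt`, `vts_filter`), the directive wording, and the candidate format (a dict with a self-reported probability `p`) are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# The four strategy names come from the summary above.
# The directive wording is hypothetical, written only for this sketch.
STRATEGIES: List[Tuple[str, str]] = [
    ("gap_targeting", "Focus on these under-generated categories: {gaps}."),
    ("assumption_inversion", "Invert a core assumption behind these recent ideas: {recent}."),
    ("cross_industry_stimulus", "Approach the task from the viewpoint of an unrelated industry."),
    ("constraint_variation", "Work under one extreme constraint of your choosing."),
]


def pick_strategy(batch_index: int) -> Tuple[str, str]:
    """Round-robin: batch 0 uses strategy 0, batch 4 wraps back to strategy 0."""
    return STRATEGIES[batch_index % len(STRATEGIES)]


def build_prompt(batch_index: int, task: str, recent: List[str],
                 gaps: List[str], tau: float = 0.10) -> str:
    """Rebuild the generation prompt for one batch from the current memory state."""
    name, template = pick_strategy(batch_index)
    directive = template.format(recent="; ".join(recent), gaps=", ".join(gaps))
    return "\n".join([
        f"Task: {task}",
        f"Strategy ({name}): {directive}",
        f"Label each idea with P, your estimate of how obvious it is, "
        f"and aim for ideas with P < {tau:.2f}.",
    ])


def vts_filter(candidates: List[Dict], tau: float = 0.10) -> List[Dict]:
    """Keep ideas whose self-reported P is below tau; P >= tau counts as obvious."""
    return [c for c in candidates if c["p"] < tau]


if __name__ == "__main__":
    print(build_prompt(1, "Propose sustainable packaging concepts",
                       recent=["refillable pouch"], gaps=["cold chain"]))
    print(vts_filter([{"idea": "paper bag", "p": 0.60},
                      {"idea": "seaweed film", "p": 0.04}]))
```

Keeping the rotation a pure function of the batch index makes a run reproducible and lets a failed batch be retried with the same strategy.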

Evidence

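Collapse rate: DCE 0.0 ± 0.0% vs naive 5.6 ± 2.0% (3 seeds, packaging domain). In the education domain, DCE brings the naive 34% collapse down to 0%.
HDBSCAN cluster count: DCE is consistent at 17 to 18 clusters per seed, while naive swings between 2 and 17 (seeds 42/123/456: DCE 18/18/17, naive 2/14/17).
Downstream classifier (DeBERTa-base) F1: in the unconstrained packaging setting, DCE 30.5% vs naive 15.2% (about 2x). In the education domain, relaxing the threshold to δ=0.90 yields 44.9% F1 in the comparison against naive.
VTS analysis: of the 852 ideas VTS rejected, 96.9% (826) were semantically novel enough to pass the dedup criterion. VTS therefore removes only the obvious ideas without destroying diversity.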
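
The collapse figures above compare late batches against early ones. The sketch below shows one way such a number could be computed. It assumes collapse is counted as the share of late ideas whose cosine similarity to any early idea reaches δ; the exact definition the authors use is not given in this summary, so `collapse_rate` and the toy 2D vectors are hypothetical. The last lines re-derive the 96.9% figure from the reported counts.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def collapse_rate(early: List[Sequence[float]], late: List[Sequence[float]],
                  delta: float = 0.85) -> float:
    """Share of late embeddings that are near-duplicates of some early embedding.

    Assumption: "near-duplicate" means cosine similarity >= delta.
    """
    if not late:
        return 0.0
    hits = sum(1 for v in late if any(cosine(v, e) >= delta for e in early))
    return hits / len(late)


# Toy embeddings: two early ideas, four late ideas (two of them near-duplicates).
early = [[1.0, 0.0], [0.0, 1.0]]
late = [[0.99, 0.05], [0.7, 0.7], [-1.0, 0.0], [0.1, 0.98]]
rate = collapse_rate(early, late)

# Arithmetic check on the reported VTS counts: 826 of 852 rejected ideas.
vts_novel_share = round(100 * 826 / 852, 1)

if __name__ == "__main__":
    print(f"collapse rate: {rate:.0%}")
    print(f"VTS rejections that would pass dedup: {vts_novel_share}%")
```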

How to Apply

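When generating training data through an LLM API over 100 or more batches: set up ChromaDB as the backend and, before each batch, inject the 10 most recent ideas, the dense clusters, and the unexplored categories into the prompt. Adding a dedup filter with a cosine similarity threshold of δ=0.85 brought collapse to 0% in the reported experiments.
When δ needs per-domain tuning: for domains with a naturally high duplication rate, such as education or Q&A, relax to δ=0.90 to retain enough training examples; for diverse domains such as packaging or creative writing, keep the default δ=0.85. If F1 is low, try relaxing δ first.
When using a model such as GPT-5-mini, whose API does not expose temperature or top-p: token-level diversity control is unavailable, so a concept-level approach such as DCE is effectively the only option. Use a VTS prompt that steers the model toward generating only ideas with P < 0.10.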
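
To make the effect of δ concrete, here is a small in-memory stand-in for the semantic memory. It is a sketch under stated assumptions: a real pipeline would use ChromaDB and an embedding model, whereas the `SemanticMemory` class, its method names, and the hand-written 2D vectors below are hypothetical and exist only to show how the threshold changes what gets rejected.

```python
import math
from typing import List, Sequence, Tuple


class SemanticMemory:
    """Toy dedup memory: rejects an idea if it is too similar to a stored one."""

    def __init__(self, delta: float = 0.85) -> None:
        self.delta = delta
        self.items: List[Tuple[str, List[float]]] = []  # (idea, unit vector)

    @staticmethod
    def _unit(vec: Sequence[float]) -> List[float]:
        norm = math.sqrt(sum(x * x for x in vec))
        if norm == 0:
            raise ValueError("embedding must be non-zero")
        return [x / norm for x in vec]

    def max_similarity(self, vec: Sequence[float]) -> float:
        """Highest cosine similarity to any stored idea (-1.0 if memory is empty)."""
        unit = self._unit(vec)
        return max((sum(a * b for a, b in zip(unit, stored))
                    for _, stored in self.items), default=-1.0)

    def try_add(self, idea: str, vec: Sequence[float]) -> bool:
        """Store the idea and return True, or return False if it is a near-duplicate."""
        if self.max_similarity(vec) >= self.delta:
            return False
        self.items.append((idea, self._unit(vec)))
        return True

    def recent(self, n: int = 10) -> List[str]:
        """The n most recently accepted ideas, for injection into the next prompt."""
        return [idea for idea, _ in self.items[-n:]]


# Two toy embeddings with cosine similarity of about 0.87.
first, second = [1.0, 0.0], [0.87, 0.493]

strict = SemanticMemory(delta=0.85)
strict_results = [strict.try_add("idea A", first), strict.try_add("idea B", second)]

relaxed = SemanticMemory(delta=0.90)
relaxed_results = [relaxed.try_add("idea A", first), relaxed.try_add("idea B", second)]

if __name__ == "__main__":
    print("delta=0.85:", strict_results, strict.recent())
    print("delta=0.90:", relaxed_results, relaxed.recent())
```

The same pair is rejected at δ=0.85 and kept at δ=0.90, which is the trade-off between duplicate suppression and training set size described above.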

Code Example

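A minimal end-to-end sketch of the batch loop, assuming the generator and the embedder are passed in as plain callables. Nothing here is taken from the authors' code: `run_dce`, `fake_generate`, `fake_embed`, the prompt wording, and the toy vectors are hypothetical stand-ins for an LLM API call and an embedding model, and the in-memory list stands in for ChromaDB.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Candidate = Tuple[str, float]  # (idea text, self-reported probability P)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def run_dce(n_batches: int,
            generate: Callable[[str], List[Candidate]],
            embed: Callable[[str], Sequence[float]],
            tau: float = 0.10,
            delta: float = 0.85) -> Tuple[List[str], Dict[str, int]]:
    """Run n_batches of generate -> VTS filter -> dedup, rebuilding the prompt each time."""
    accepted: List[Tuple[str, Sequence[float]]] = []
    stats = {"vts_rejected": 0, "dedup_rejected": 0}
    for batch in range(n_batches):
        # Prompt evolution (simplified): inject the 10 most recent accepted ideas.
        recent = [idea for idea, _ in accepted[-10:]]
        prompt = f"Batch {batch}. Do not repeat these ideas: {recent}"
        for idea, p in generate(prompt):
            if p >= tau:  # VTS: the model itself rated the idea as obvious
                stats["vts_rejected"] += 1
                continue
            vec = embed(idea)
            if any(cosine(vec, stored) >= delta for _, stored in accepted):
                stats["dedup_rejected"] += 1  # semantic near-duplicate
                continue
            accepted.append((idea, vec))
    return [idea for idea, _ in accepted], stats


def fake_generate(prompt: str) -> List[Candidate]:
    """Stub for an LLM call: always returns the same three candidates."""
    return [("refillable pouch", 0.40),
            ("mycelium foam tray", 0.05),
            ("mushroom-based foam tray", 0.04)]


def fake_embed(idea: str) -> Sequence[float]:
    """Stub for an embedding model: hand-written vectors for the three ideas."""
    table = {"refillable pouch": [1.0, 0.0, 0.0],
             "mycelium foam tray": [0.0, 1.0, 0.0],
             "mushroom-based foam tray": [0.0, 0.98, 0.2]}
    return table[idea]


ideas, stats = run_dce(2, fake_generate, fake_embed)

if __name__ == "__main__":
    print(ideas)
    print(stats)
```

Because the stub ignores the prompt, the second batch produces nothing new. This mirrors naive prompting, where only the filters stand between the model and repeated output.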

Terminology

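cross-batch mode collapse: The tendency of an LLM, when called independently many times, to drift toward repeating the same content. It is like asking for 200 exam questions and getting, toward the end, lightly reworded versions of earlier ones.

VTS: Verbalized Tail Sampling. A filter that has the model score how obvious each idea is and discards the obvious ones (P >= 0.10). In effect, the model censors its own output.

Semantic Memory: A mechanism that converts generated ideas into vectors (arrays of numbers), stores them in a database, and rejects new ideas that are too similar to existing ones. It catches pairs such as "smart water bottle" and "intelligent hydration container" that differ in wording but not in meaning.

HDBSCAN: A density-based algorithm that clusters data points automatically. Here it measures how many conceptual groups the generated ideas fall into; more clusters indicate more diverse ideas.

EDV: Effective Diversity Volume. A diversity score that multiplies how unexpected a batch's ideas are (depth) by how different they are from existing ideas (breadth). A batch scores high only when both conditions hold.

ChromaDB: An open-source vector database that stores text embeddings as numeric vectors and retrieves similar vectors quickly. DCE uses it as the semantic memory backend.

UMAP: A technique for compressing high-dimensional vectors to 2D for visualization. It projects the 1536-dimensional idea embeddings to 2D points so the distributions of early and late batches can be compared visually.

DeBERTa: A BERT-family pretrained language model from Microsoft, used here as a text classifier. The paper trains it on DCE-generated synthetic data and uses the resulting classification performance as a downstream check.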

Original Abstract

Large language models produce repetitive output when prompted independently across many batches, a phenomenon we term cross-batch mode collapse: the progressive loss of output diversity when a language model is prompted repeatedly without access to its prior generations. Practitioners have long mitigated this with ad hoc deduplication and seed rotation, but no principled framework exists. We introduce Dynamic Context Evolution (DCE), comprising three mechanisms: (1) verbalized tail sampling (the model labels each idea with a guess about how obvious it is, and obvious ideas are discarded), which filters high-probability candidates via model self-assessment; (2) semantic memory, which maintains a persistent embedding index to reject near-duplicates across batches; and (3) adaptive prompt evolution, which reconstructs the generation prompt each batch using memory state and rotating diversity strategies. In experiments across three domains (sustainable packaging concepts, educational exam questions, and creative writing prompts) and two model families (gpt-5-mini and claude-haiku-4-5), a component ablation across 2-3 random seeds per method shows that DCE achieves 0.0 +/- 0.0% collapse versus 5.6 +/- 2.0% for naive prompting, while producing 17-18 HDBSCAN clusters per seed versus naive's volatile 2-17, indicating reliably richer conceptual structure. These results are validated with an independent embedding model (all-MiniLM-L6-v2) and hold across sensitivity sweeps of the VTS threshold tau and dedup threshold delta. Deduplication and prompt evolution are individually insufficient but jointly effective, at approximately $0.50 per 1,000 candidates using only standard API calls, with no fine-tuning or custom architectures required.