PersonaTrace: Synthesizing Realistic Digital Footprints with LLM Agents

Mar 12, 2026 • Minjia Wang, Yunfeng Wang, Xiao Ma +9

TL;DR Highlight

An LLM agent framework that starts from a personal profile and automatically generates realistic digital records such as emails, messages, and calendar entries.

Who Should Read

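ML engineers who are considering synthetic data generation because training data for personalization services or digital assistants is scarce. Data pipeline owners who cannot easily use real user data because of regulations such as GDPR.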

Core Mechanics

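Three-stage agent pipeline: Persona Agent (profile generation) → Event Agent (event tree expansion) → Artifact Generator Agent (emails, messages, calendar entries, and so on), run in that order.
The Event Agent recursively expands a seed event such as 'attending an academic conference' into a tree of sub-events such as 'prepare poster → book flight → receive boarding pass', so the generated data stays contextually consistent.
Quality control through a Generator-Critic loop: three Critic Agents review each generated artifact for consistency, realism, and fluency, and the artifact is revised repeatedly until it passes.
Demographic distributions are sampled from 2022 US census data (American Community Survey), which guards against biases that drift away from reality.
Gemini-1.5-Pro (temperature 0.9) serves as the shared backbone, while each agent receives its own role-specific prompt and constraints.
Result in a real service: Recall@10 of an online search product improved by 7% absolute.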
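
The recursive expansion performed by the Event Agent can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Event` class, `expand_event`, `flatten`, and the `max_depth` limit are hypothetical names, and `canned_proposer` stands in for the LLM call that would propose sub-events.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Event:
    """One node in an event tree (hypothetical structure)."""
    description: str
    children: List["Event"] = field(default_factory=list)


def expand_event(event: Event,
                 propose_sub_events: Callable[[str], List[str]],
                 max_depth: int,
                 depth: int = 0) -> Event:
    """Recursively attach sub-events until max_depth is reached
    or the proposer returns nothing for a node."""
    if depth >= max_depth:
        return event
    for description in propose_sub_events(event.description):
        child = Event(description)
        event.children.append(
            expand_event(child, propose_sub_events, max_depth, depth + 1)
        )
    return event


def flatten(event: Event) -> List[str]:
    """Depth-first list of every description in the tree."""
    result = [event.description]
    for child in event.children:
        result.extend(flatten(child))
    return result


# Assumption: a canned lookup replaces the Event Agent's LLM call.
CANNED = {
    "Attend an academic conference": ["Prepare poster", "Book flight"],
    "Book flight": ["Receive boarding pass"],
}


def canned_proposer(description: str) -> List[str]:
    return CANNED.get(description, [])


tree = expand_event(Event("Attend an academic conference"),
                    canned_proposer, max_depth=3)
print(flatten(tree))
```

Capping the depth keeps the recursion bounded even if the proposer keeps returning new sub-events.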

Evidence

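Highest diversity among synthetic datasets: Pairwise Correlation 0.2093 (lower is better), Remote-Clique 0.7898, Entropy 2.8305; first among the compared synthetic datasets on all three.
LLM-As-Judge quality score of 4.79/5.0 Overall, ahead of FinePersonas-Email (4.39), Synthetic-Satellite-Emails (4.64), and every other synthetic or real dataset compared.
Fine-tuning Mistral-7B-v0.1 on PersonaTrace gives an email classification Accuracy of 0.6100 on Enron, ahead of FinePersonas-Email (0.5908), and a ROUGE of 0.4435 on the QA task, beating every synthetic baseline.
Agent ablation: compared with the version without agents, email classification Accuracy rises from 0.0063 to 0.2733 and QA BERTScore from 0.2880 to 0.4405, a large improvement.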
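
This summary does not define the diversity metrics, so the sketch below only illustrates the general idea behind such scores. It assumes each document is already represented as an embedding vector, and it computes mean pairwise cosine similarity (lower means more diverse), its complement as a mean pairwise distance, and Shannon entropy over category labels. The function names are hypothetical, and the paper's Pairwise Correlation, Remote-Clique, and Entropy may be defined differently.

```python
import math
from collections import Counter
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_similarity(vectors):
    """Average cosine similarity over all unordered pairs."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)


def mean_pairwise_distance(vectors):
    """Average cosine distance (1 - similarity) over all unordered pairs."""
    return 1.0 - mean_pairwise_similarity(vectors)


def shannon_entropy(labels):
    """Entropy (natural log) of the empirical label distribution."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# Toy embeddings and topic labels, purely illustrative.
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_topics = ["travel", "work", "travel", "family"]

print(mean_pairwise_similarity(toy_embeddings))
print(mean_pairwise_distance(toy_embeddings))
print(shannon_entropy(toy_topics))
```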

How to Apply

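When training data for a personalized AI assistant is scarce: feed a demographic distribution CSV into the Persona Agent prompt, then run the resulting profiles through the Event → Artifact pipeline in sequence to build a synthetic email/message dataset.
When fine-tuning an open-source model such as Mistral-7B for email classification or drafting: with LoRA (r=8, α=16, dropout=0.05, lr=5e-5), as few as 4,000 synthetic examples can reach competitive performance on real datasets.
Reuse the Generator-Critic loop pattern in other synthetic data pipelines: generate → score with Critic LLMs on three criteria (consistency, realism, fluency) → regenerate with the feedback, repeating up to 5 times.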
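
The Generator-Critic pattern in the last item can be sketched as a bounded loop. This is a minimal sketch under stated assumptions: `Review`, `generator_critic_loop`, and the toy generator and critics are hypothetical stand-ins for LLM calls, and only the three criteria and the cap of 5 rounds come from the text above.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Review:
    criterion: str
    passed: bool
    feedback: str


def generator_critic_loop(
    generate: Callable[[List[str]], str],
    critics: Sequence[Callable[[str], Review]],
    max_rounds: int = 5,
) -> Tuple[str, int, bool]:
    """Regenerate until every critic passes or max_rounds is used up.
    Returns (final draft, rounds used, whether all critics passed)."""
    feedback: List[str] = []
    draft = ""
    for round_index in range(1, max_rounds + 1):
        draft = generate(feedback)
        failed = [r for r in (critic(draft) for critic in critics) if not r.passed]
        if not failed:
            return draft, round_index, True
        feedback = [r.feedback for r in failed]
    return draft, max_rounds, False


# Toy stand-ins: a real pipeline would call an LLM in each function.
def toy_generate(feedback: List[str]) -> str:
    draft = "meeting moved to 3pm"
    if any("greeting" in note for note in feedback):
        draft = "Hi Sam, " + draft
    return draft


def consistency_critic(draft: str) -> Review:
    return Review("consistency", "3pm" in draft, "mention the agreed time")


def realism_critic(draft: str) -> Review:
    return Review("realism", draft.startswith("Hi "), "add a greeting")


def fluency_critic(draft: str) -> Review:
    return Review("fluency", len(draft.split()) >= 3, "write a full sentence")


result = generator_critic_loop(
    toy_generate, [consistency_critic, realism_critic, fluency_critic]
)
print(result)
```

Returning the pass flag alongside the draft lets a caller drop artifacts that never satisfied the critics instead of silently keeping them.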

Code Example

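
A minimal sketch of the first step described under How to Apply: sampling a profile from a demographic distribution CSV and turning it into a Persona Agent prompt. Everything here is an assumption made for illustration. The CSV layout, the attribute names, the prompt wording, and the function names are hypothetical, and the weights are made-up placeholders, not American Community Survey figures.

```python
import csv
import io
import random

# Placeholder weights for illustration only; NOT real census data.
DEMOGRAPHICS_CSV = """attribute,value,weight
age_group,18-29,0.2
age_group,30-49,0.35
age_group,50-64,0.25
age_group,65+,0.2
occupation,student,0.15
occupation,engineer,0.25
occupation,teacher,0.2
occupation,retired,0.4
"""


def load_distributions(csv_text):
    """Group rows into {attribute: ([values], [weights])}."""
    distributions = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        values, weights = distributions.setdefault(row["attribute"], ([], []))
        values.append(row["value"])
        weights.append(float(row["weight"]))
    return distributions


def sample_profile(distributions, rng):
    """Draw one value per attribute, proportionally to its weight."""
    return {
        attribute: rng.choices(values, weights=weights, k=1)[0]
        for attribute, (values, weights) in distributions.items()
    }


def build_persona_prompt(profile):
    """Hypothetical prompt template for a persona-generating agent."""
    facts = ", ".join(f"{key}: {value}" for key, value in profile.items())
    return f"Write a detailed, realistic user profile for a person with {facts}."


distributions = load_distributions(DEMOGRAPHICS_CSV)
profile = sample_profile(distributions, random.Random(0))
print(build_persona_prompt(profile))
```

Passing a seeded `random.Random` instance keeps the sampled profiles reproducible across runs.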

Terminology

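Digital Footprint: The traces a person leaves while using digital systems. It covers every record of digital activity, including emails, messages, calendar entries, and purchase history.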

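Event Forest: A collection of tree structures, each rooted at one large event (for example, attending a conference) with related sub-events (booking flights, preparing a poster, and so on) branching out from it. Many such trees together form a forest.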

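Generator-Critic Loop: An iterative process in which one AI generates content, another AI evaluates its quality and gives feedback, and the content is then revised. It resembles a person writing a draft and an editor reviewing it.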

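LoRA: A fine-tuning technique that trains only small auxiliary layers instead of retraining the whole model. The original model is frozen, and thin adapters are inserted to adapt it to a specific task.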
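
In the standard LoRA formulation, the frozen weight matrix $W_0$ receives a trainable low-rank update. The symbols below follow the usual notation for the technique and are not taken from this summary; with the r=8 and α=16 quoted under How to Apply, the scaling factor $\alpha / r$ would be 2.

```latex
W = W_0 + \frac{\alpha}{r}\, B A,
\qquad W_0 \in \mathbb{R}^{d \times k},\;
B \in \mathbb{R}^{d \times r},\;
A \in \mathbb{R}^{r \times k},\;
r \ll \min(d, k)
```

Only $A$ and $B$ are trained, so the number of trainable parameters per adapted matrix is $r(d + k)$ instead of $dk$.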

LLM-As-Judge: An evaluation method in which a large language model such as GPT-4 or Gemini scores text quality in place of human raters.

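MinHash LSH: An algorithm that quickly finds and removes near-identical content (duplicates) in large volumes of text. Instead of exact comparison, it detects near-duplicates approximately through hash values.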
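
The idea can be sketched in plain Python. This is a simplified illustration, not a production deduplicator and not the paper's setup: word 3-gram shingles, 64 seeded SHA-1 hashes, and 16 bands are arbitrary assumptions, and all function names are hypothetical.

```python
import hashlib


def shingles(text, size=3):
    """Set of overlapping word n-grams."""
    words = text.lower().split()
    if len(words) < size:
        return {" ".join(words)}
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def minhash_signature(items, num_hashes=64):
    """For each seeded hash function, keep the minimum hash over all items."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()[:8], "big"
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions approximates Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def lsh_band_keys(signature, bands=16):
    """Split the signature into bands; documents sharing any band key
    become near-duplicate candidates."""
    rows = len(signature) // bands
    return {(b, tuple(signature[b * rows:(b + 1) * rows])) for b in range(bands)}


doc_a = "please confirm the flight booking for the conference next week"
doc_b = "please confirm the flight booking for the conference next month"
doc_c = "lunch menu includes soup salad and fresh bread on friday"

sig_a, sig_b, sig_c = (minhash_signature(shingles(d)) for d in (doc_a, doc_b, doc_c))
print(estimated_jaccard(sig_a, sig_b))
print(bool(lsh_band_keys(sig_a) & lsh_band_keys(sig_c)))
```

Banding trades precision for speed: only documents that collide in at least one band need a full signature comparison.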

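ROUGE: An automatic metric that measures how much generated text overlaps with a reference at the n-gram level. Widely used to assess summarization and translation quality.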
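
A simplified ROUGE-N computation looks like the sketch below. It assumes lowercase whitespace tokenization with no stemming, so its scores will not match official ROUGE implementations exactly; `rouge_n` is a hypothetical helper name.

```python
from collections import Counter


def rouge_n(candidate, reference, n=1):
    """Clipped n-gram overlap reported as precision, recall, and F1."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge_n("the flight is booked", "the flight was booked today")
print(scores)
```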

BERTScore: A metric that measures text quality by BERT embedding similarity rather than plain word matching. Texts with the same meaning score high even when the words differ.

Original Abstract

Digital footprints (records of individuals' interactions with digital systems) are essential for studying behavior, developing personalized applications, and training machine learning models. However, research in this area is often hindered by the scarcity of diverse and accessible data. To address this limitation, we propose a novel method for synthesizing realistic digital footprints using large language model (LLM) agents. Starting from a structured user profile, our approach generates diverse and plausible sequences of user events, ultimately producing corresponding digital artifacts such as emails, messages, calendar entries, reminders, etc. Intrinsic evaluation results demonstrate that the generated dataset is more diverse and realistic than existing baselines. Moreover, models fine-tuned on our synthetic data outperform those trained on other synthetic datasets when evaluated on real-world out-of-distribution tasks.