PersonaTrace: Synthesizing Realistic Digital Footprints with LLM Agents
TL;DR Highlight
An LLM agent framework that starts from personal profiles and automatically generates realistic digital records like emails, messages, and calendar entries.
Who Should Read
ML engineers who lack training data for personalized services or digital assistants and are considering synthetic data generation. Data pipeline engineers in settings where real user data is hard to use because of GDPR and similar regulations.
Core Mechanics
- 3-stage agent pipeline: Persona Agent (profile generation) → Event Agent (event tree expansion) → Artifact Generator Agent (generates emails/messages/calendar/etc.)
- Event Agent recursively expands seed events like "attend academic conference" into sub-event trees like "poster preparation → flight booking → boarding pass receipt" to generate contextually consistent data
- Generator-Critic loop for quality control: three Critic Agents review each generated artifact against consistency, realism, and fluency criteria, and the generator revises until all three pass
- Demographic distributions are sampled from the 2022 American Community Survey (US Census Bureau) to prevent unrealistic biases
- Uses Gemini-1.5-Pro (temperature 0.9) as the common backbone with separate role prompts and constraints for each agent
- Real-service result: a 7% absolute improvement in Recall@10 on an online search product
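The Event Agent's recursive expansion can be sketched as follows. This is a minimal illustration, not the paper's implementation: `expand_event` stands in for an LLM call (here it returns canned sub-events), and all names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    description: str
    children: list["Event"] = field(default_factory=list)

def expand_event(description: str) -> list[str]:
    """Stub for an LLM call that proposes sub-events for a given event."""
    canned = {
        "attend academic conference": [
            "prepare poster", "book flight", "receive boarding pass",
        ],
        "book flight": ["compare fares", "receive booking confirmation email"],
    }
    return canned.get(description, [])

def build_event_tree(description: str, max_depth: int = 3) -> Event:
    """Recursively expand a seed event into a tree of sub-events."""
    node = Event(description)
    if max_depth > 0:
        for sub in expand_event(description):
            node.children.append(build_event_tree(sub, max_depth - 1))
    return node

def leaves(node: Event) -> list[str]:
    """Leaf events are the ones that would yield concrete artifacts."""
    if not node.children:
        return [node.description]
    return [d for child in node.children for d in leaves(child)]

tree = build_event_tree("attend academic conference")
print(leaves(tree))
```

In the full pipeline, each leaf event would then be handed to the Artifact Generator Agent to produce a concrete email, message, or calendar entry.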
Evidence
- Highest diversity among synthetic datasets: Pairwise Correlation 0.2093 (lower is better), Remote-Clique 0.7898, Entropy 2.8305, ranking first on all three metrics among the compared synthetic datasets
- LLM-as-judge overall quality score of 4.79/5.0, the highest among all compared synthetic and real datasets, including FinePersonas-Email (4.39) and Synthetic-Satellite-Emails (4.64)
- Fine-tuning Mistral-7B-v0.1 on PersonaTrace yields 0.6100 accuracy on Enron email classification vs. 0.5908 for FinePersonas-Email, and a QA ROUGE score of 0.4435 that surpasses all synthetic baselines
- Agent ablation (without vs. with the agent pipeline): email classification accuracy 0.0063 → 0.2733, QA BERTScore 0.2880 → 0.4405
How to Apply
- When short on training data for user-personalized AI assistants: feed a demographic-distribution CSV to the Persona Agent prompt, then run the Event and Artifact agents sequentially on the resulting profiles to generate synthetic email and message datasets
- When fine-tuning email classification or drafting models on open models such as Mistral-7B: LoRA (r=8, α=16, dropout=0.05, lr=5e-5) with just 4,000 synthetic examples can achieve competitive performance on real-world evaluation tasks
- Apply the Generator-Critic loop pattern to other synthetic data pipelines: generate, have a Critic LLM evaluate on three criteria (consistency, realism, fluency), then regenerate with the feedback, for up to 5 iterations
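The Generator-Critic loop described above can be sketched as a simple control flow. Only the structure (three criteria, revise-with-feedback, at most 5 rounds) follows the summary; `generate` and `critique` are stubs standing in for LLM calls, and all names are hypothetical.

```python
MAX_ITERATIONS = 5
CRITERIA = ("consistency", "realism", "fluency")

def generate(prompt: str, feedback: list[str]) -> str:
    """Stub generator: a real system would call the backbone LLM here."""
    return prompt + (" [revised]" * len(feedback))

def critique(artifact: str, criterion: str) -> tuple[bool, str]:
    """Stub critic: passes once the artifact has been revised at least once."""
    ok = "[revised]" in artifact
    return ok, "" if ok else f"fails {criterion}: needs revision"

def generator_critic_loop(prompt: str) -> tuple[str, int]:
    """Generate, collect critic feedback, and revise until all critics pass."""
    feedback: list[str] = []
    for iteration in range(1, MAX_ITERATIONS + 1):
        artifact = generate(prompt, feedback)
        verdicts = [critique(artifact, c) for c in CRITERIA]
        if all(ok for ok, _ in verdicts):
            return artifact, iteration  # all three critics passed
        feedback = [msg for ok, msg in verdicts if not ok]
    return artifact, MAX_ITERATIONS  # best effort after the final round

artifact, rounds = generator_critic_loop("Draft a conference invitation email")
print(rounds)  # the stub critics pass on the second round
```

Swapping the stubs for real LLM calls (one generator prompt, one critic prompt per criterion) reproduces the pattern the paper uses for quality control.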
Code Example
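A possible fine-tuning setup matching the hyperparameters reported above (LoRA r=8, α=16, dropout=0.05, lr=5e-5, ~4,000 synthetic examples), using the `peft` and `transformers` libraries. This is a hedged sketch, not the authors' script: `target_modules`, epoch count, and batch size are assumptions not stated in the source.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

# Backbone reported in the Evidence section
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

lora_config = LoraConfig(
    r=8,                # low-rank dimension (from the summary)
    lora_alpha=16,      # scaling factor (from the summary)
    lora_dropout=0.05,  # from the summary
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"],  # common choice; not stated in source
)
model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="personatrace-lora",
    learning_rate=5e-5,              # from the summary
    num_train_epochs=3,              # assumption; not reported
    per_device_train_batch_size=4,   # assumption; not reported
)
# Pass `model`, `training_args`, and the ~4,000 synthetic examples to a
# `transformers.Trainer` (or `trl.SFTTrainer`) to run the fine-tuning.
```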
Terminology
- Digital footprint: records of an individual's interactions with digital systems (emails, messages, calendar entries, reminders)
- Generator-Critic loop: a quality-control pattern in which generated artifacts are reviewed by critic agents and revised until they pass all criteria
- LoRA (Low-Rank Adaptation): a parameter-efficient fine-tuning method that trains small low-rank matrices instead of the full model weights
- Recall@10: the fraction of queries for which a relevant item appears in the top 10 results
Original Abstract
Digital footprints (records of individuals' interactions with digital systems) are essential for studying behavior, developing personalized applications, and training machine learning models. However, research in this area is often hindered by the scarcity of diverse and accessible data. To address this limitation, we propose a novel method for synthesizing realistic digital footprints using large language model (LLM) agents. Starting from a structured user profile, our approach generates diverse and plausible sequences of user events, ultimately producing corresponding digital artifacts such as emails, messages, calendar entries, reminders, etc. Intrinsic evaluation results demonstrate that the generated dataset is more diverse and realistic than existing baselines. Moreover, models fine-tuned on our synthetic data outperform those trained on other synthetic datasets when evaluated on real-world out-of-distribution tasks.