PersonaTrace: Synthesizing Realistic Digital Footprints with LLM Agents
TL;DR Highlight
An LLM agent framework that starts from personal profiles and automatically generates realistic digital records like emails, messages, and calendar entries.
Who Should Read
ML engineers who lack training data for personalized services or digital assistants and are considering synthetic data generation. Data pipeline engineers in settings where real user data is hard to use because of GDPR and similar regulations.
Core Mechanics
- 3-stage agent pipeline: Persona Agent (profile generation) → Event Agent (event tree expansion) → Artifact Generator Agent (generates emails/messages/calendar/etc.)
- Event Agent recursively expands seed events like "attend academic conference" into sub-event trees like "poster preparation → flight booking → boarding pass receipt" to generate contextually consistent data
- Generator-Critic loop for quality control: three Critic Agents review each generated artifact against consistency, realism, and fluency criteria, and the generator revises until all three pass
- Demographic distributions are sampled from the 2022 American Community Survey (US Census Bureau) to prevent unrealistic biases
- Uses Gemini-1.5-Pro (temperature 0.9) as the common backbone with separate role prompts and constraints for each agent
- Real-service result: a 7% absolute improvement in Recall@10 on an online search product
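The Event Agent's recursive expansion can be sketched as follows. This is a minimal illustration, not the paper's implementation: `expand_event` stands in for an LLM call (here it returns canned sub-events), and all names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    description: str
    children: list["Event"] = field(default_factory=list)

def expand_event(description: str) -> list[str]:
    """Stub for an LLM call that proposes sub-events for a given event."""
    canned = {
        "attend academic conference": [
            "prepare poster", "book flight", "receive boarding pass",
        ],
        "book flight": ["compare fares", "receive booking confirmation email"],
    }
    return canned.get(description, [])

def build_event_tree(description: str, max_depth: int = 3) -> Event:
    """Recursively expand a seed event into a tree of sub-events."""
    node = Event(description)
    if max_depth > 0:
        for sub in expand_event(description):
            node.children.append(build_event_tree(sub, max_depth - 1))
    return node

def leaves(node: Event) -> list[str]:
    """Leaf events are the ones that would yield concrete artifacts."""
    if not node.children:
        return [node.description]
    return [d for child in node.children for d in leaves(child)]

tree = build_event_tree("attend academic conference")
print(leaves(tree))
```

In the full pipeline, each leaf event would then be handed to the Artifact Generator Agent to produce a concrete email, message, or calendar entry.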
Evidence
- Highest diversity among synthetic datasets: Pairwise Correlation 0.2093 (lower is better), Remote-Clique 0.7898, Entropy 2.8305, ranking first on all three metrics among the compared synthetic datasets
- LLM-as-judge overall quality score of 4.79/5.0, the highest among all compared synthetic and real datasets, including FinePersonas-Email (4.39) and Synthetic-Satellite-Emails (4.64)
- Fine-tuning Mistral-7B-v0.1 on PersonaTrace yields 0.6100 accuracy on Enron email classification vs. 0.5908 for FinePersonas-Email, and a QA ROUGE score of 0.4435 that surpasses all synthetic baselines
- Agent ablation (without vs. with the agent pipeline): email classification accuracy 0.0063 → 0.2733, QA BERTScore 0.2880 → 0.4405
How to Apply
- When short on training data for user-personalized AI assistants: feed a demographic-distribution CSV to the Persona Agent prompt, then run the Event and Artifact agents sequentially on the resulting profiles to generate synthetic email and message datasets
- When fine-tuning email classification or drafting models on open models such as Mistral-7B: LoRA (r=8, α=16, dropout=0.05, lr=5e-5) with just 4,000 synthetic examples can achieve competitive performance on real-world evaluation tasks
- Apply the Generator-Critic loop pattern to other synthetic data pipelines: generate, have a Critic LLM evaluate on three criteria (consistency, realism, fluency), then regenerate with the feedback, for up to 5 iterations
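The Generator-Critic loop described above can be sketched as a simple control flow. Only the structure (three criteria, revise-with-feedback, at most 5 rounds) follows the summary; `generate` and `critique` are stubs standing in for LLM calls, and all names are hypothetical.

```python
MAX_ITERATIONS = 5
CRITERIA = ("consistency", "realism", "fluency")

def generate(prompt: str, feedback: list[str]) -> str:
    """Stub generator: a real system would call the backbone LLM here."""
    return prompt + (" [revised]" * len(feedback))

def critique(artifact: str, criterion: str) -> tuple[bool, str]:
    """Stub critic: passes once the artifact has been revised at least once."""
    ok = "[revised]" in artifact
    return ok, "" if ok else f"fails {criterion}: needs revision"

def generator_critic_loop(prompt: str) -> tuple[str, int]:
    """Generate, collect critic feedback, and revise until all critics pass."""
    feedback: list[str] = []
    for iteration in range(1, MAX_ITERATIONS + 1):
        artifact = generate(prompt, feedback)
        verdicts = [critique(artifact, c) for c in CRITERIA]
        if all(ok for ok, _ in verdicts):
            return artifact, iteration  # all three critics passed
        feedback = [msg for ok, msg in verdicts if not ok]
    return artifact, MAX_ITERATIONS  # best effort after the final round

artifact, rounds = generator_critic_loop("Draft a conference invitation email")
print(rounds)  # the stub critics pass on the second round
```

Swapping the stubs for real LLM calls (one generator prompt, one critic prompt per criterion) reproduces the pattern the paper uses for quality control.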
Code Example
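A possible fine-tuning setup matching the hyperparameters reported above (LoRA r=8, α=16, dropout=0.05, lr=5e-5, ~4,000 synthetic examples), using the `peft` and `transformers` libraries. This is a hedged sketch, not the authors' script: `target_modules`, epoch count, and batch size are assumptions not stated in the source.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

# Backbone reported in the Evidence section
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

lora_config = LoraConfig(
    r=8,                # low-rank dimension (from the summary)
    lora_alpha=16,      # scaling factor (from the summary)
    lora_dropout=0.05,  # from the summary
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"],  # common choice; not stated in source
)
model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="personatrace-lora",
    learning_rate=5e-5,              # from the summary
    num_train_epochs=3,              # assumption; not reported
    per_device_train_batch_size=4,   # assumption; not reported
)
# Pass `model`, `training_args`, and the ~4,000 synthetic examples to a
# `transformers.Trainer` (or `trl.SFTTrainer`) to run the fine-tuning.
```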
Terminology
- Digital footprint: records of an individual's interactions with digital systems (emails, messages, calendar entries, reminders)
- Generator-Critic loop: a quality-control pattern in which generated artifacts are reviewed by critic agents and revised until they pass all criteria
- LoRA (Low-Rank Adaptation): a parameter-efficient fine-tuning method that trains small low-rank matrices instead of the full model weights
- Recall@10: the fraction of queries for which a relevant item appears in the top 10 results
Original Abstract
Digital footprints (records of individuals' interactions with digital systems) are essential for studying behavior, developing personalized applications, and training machine learning models. However, research in this area is often hindered by the scarcity of diverse and accessible data. To address this limitation, we propose a novel method for synthesizing realistic digital footprints using large language model (LLM) agents. Starting from a structured user profile, our approach generates diverse and plausible sequences of user events, ultimately producing corresponding digital artifacts such as emails, messages, calendar entries, reminders, etc. Intrinsic evaluation results demonstrate that the generated dataset is more diverse and realistic than existing baselines. Moreover, models fine-tuned on our synthetic data outperform those trained on other synthetic datasets when evaluated on real-world out-of-distribution tasks.