PersonaAgent: Test Time에 LLM Agent와 Personalization을 결합하는 프레임워크

PersonaAgent: When Large Language Model Agents Meet Personalization at Test Time

Jun 6, 2025•Weizhi Zhang, Xinyang Zhang, Chenwei Zhang +12•View PDF

TL;DR Highlight

유저마다 다른 'Persona(시스템 프롬프트)'를 실시간으로 최적화해서 AI 에이전트를 개인화하는 프레임워크

Who Should Read

LLM 기반 챗봇이나 에이전트에 사용자별 맞춤 응답 기능을 붙이려는 백엔드/AI 개발자. 특히 추천 시스템, 검색, Q&A 등 유저 히스토리를 활용해야 하는 서비스를 만드는 분께 유용.

Core Mechanics

유저마다 고유한 system prompt(Persona)를 자동 생성하고, 실제 응답과 예상 응답의 차이를 '텍스트 gradient'로 삼아 테스트 타임에 실시간 업데이트함
Episodic Memory(과거 인터랙션 기록)와 Semantic Memory(유저 프로필 요약) 두 가지를 조합해서 유저 컨텍스트를 관리함
Persona가 메모리와 액션 모듈 사이의 중재자 역할을 해서, 웹 검색이나 메모리 조회 같은 툴 호출 방식 자체도 유저에 맞게 바뀜
Fine-tuning 없이 테스트 타임에만 동작하므로 유저 수가 늘어도 모델 재학습 불필요 - 대규모 서비스에 바로 붙이기 적합
Alignment Batch 3개 + Iteration 1회라는 경량 설정으로도 성능이 충분히 올라가고, 반복이 3회 이상이면 오히려 수렴 후 미미하게 하락
Claude-3.5 Sonnet 기준 평가했고, Mistral-Small 같은 소형 모델에도 동일한 개선 효과가 나타나 모델 무관하게 동작

Evidence

LaMP-1(논문 인용 예측) Accuracy: PersonaAgent 0.919 vs 2위 MemBank 0.862 (약 6.6%p 향상)
LaMP-3(제품 평점 예측) MAE: PersonaAgent 0.241 vs 2위 ICL 0.277 (약 13% 개선), RMSE: 0.509 vs 0.543
Ablation 결과 Action 모듈 제거 시 LaMP-2M Accuracy 0.513 → 0.403으로 21% 급락, 컴포넌트 중 가장 큰 영향
Persona Jaccard Similarity 행렬에서 유저 간 오프-다이어그널 유사도가 대부분 0.4 이하로, 유저별 Persona가 실질적으로 구분됨을 정량 확인

How to Apply

기존 RAG 챗봇에 적용하려면: 유저별로 system prompt 파일을 따로 유지하고, 최근 N개(논문에서는 3개) 인터랙션의 예상 응답 vs 실제 응답 차이를 LLM에게 피드백으로 주어 system prompt를 업데이트하는 루프를 추가하면 됨
유저 프로필 DB가 있는 서비스라면: 기존 프로필을 Semantic Memory 초기값으로 쓰고, 세션 로그를 Episodic Memory로 벡터 DB에 저장한 뒤 쿼리 유사도 기반으로 Top-K를 꺼내 프롬프트에 삽입하는 방식으로 바로 시작 가능
비용이 걱정된다면: Alignment Iteration을 1로 고정하고 Batch Size 3~6 정도만 써도 논문 기준 성능 대부분 확보 가능 - 매 요청마다 최적화하지 않고 N번 인터랙션마다 1회만 돌리는 스케줄링으로 비용 통제

Code Example

snippet

# PersonaAgent 핵심 루프 구현 스케치 (LangChain 기반)

from langchain.chat_models import ChatAnthropic
from langchain.vectorstores import FAISS
from langchain.embeddings import BedrockEmbeddings

# 1. 유저별 Persona(system prompt) 초기화
def init_persona(user_profile_summary: str) -> str:
    return f"""You are a helpful personalized assistant.
User summary: {user_profile_summary}
STRICT RULES:
1. Think step-by-step about what information you need.
2. MUST use at least TWO tools to answer the question.
3. Provide clear, concise responses."""

# 2. Episodic Memory 저장 및 검색
class EpisodicMemory:
    def __init__(self, embeddings):
        self.store = {}  # user_id -> FAISS index
        self.embeddings = embeddings

    def add(self, user_id, query, response, metadata={}):
        # 실제론 FAISS나 다른 벡터DB에 저장
        pass

    def retrieve(self, user_id, query, top_k=4):
        # 유사도 기반 Top-K 인터랙션 반환
        pass

# 3. Test-Time Persona 최적화 핵심 로직
def compute_textual_gradient(llm, question, agent_response, ground_truth):
    """응답 차이를 LLM에게 분석시켜 피드백 생성"""
    prompt = f"""You are a meticulous evaluator of personalized AI agent responses.
Analyze the following and give feedback on how to improve the system prompt.

Question: {question}
Expected Answer: {ground_truth}
Agent Response: {agent_response}

Feedback should focus on:
1. How to improve search keywords for this user.
2. User's prior interactions and preferences.
3. Explicit user profile descriptions not specific to this task.

Feedback:"""
    return llm.predict(prompt)

def update_persona(llm, current_persona, feedbacks):
    """피드백 모아서 Persona 업데이트"""
    aggregated = "\n".join(feedbacks)
    prompt = f"""You are a prompt engineering assistant.
Current system prompt: {current_persona}
Provided Feedback: {aggregated}

Generate an updated system prompt that highlights the user's unique preferences.
New system prompt:"""
    return llm.predict(prompt)

# 4. 테스트 타임 정렬 루프
def test_time_alignment(llm, agent, persona, recent_interactions, n_iter=1):
    for _ in range(n_iter):
        feedbacks = []
        for q, gt in recent_interactions:
            agent_resp = agent.run(q, system_prompt=persona)
            feedback = compute_textual_gradient(llm, q, agent_resp, gt)
            feedbacks.append(feedback)
        persona = update_persona(llm, persona, feedbacks)
    return persona

Terminology

Episodic Memory사람의 '에피소드 기억'에서 따온 개념. '언제, 무엇을 했는지' 구체적인 과거 인터랙션을 타임스탬프와 함께 저장하는 로그. 일기장과 비슷하게 세부 사건을 기록함.

Semantic Memory사람의 '의미 기억'에서 따온 개념. 구체적 사건이 아닌 '이 유저는 SF를 좋아한다'처럼 추상화된 유저 특성/성향 프로필. 여러 에피소드를 요약해서 만듦.

Persona이 논문에서는 각 유저마다 다르게 설정되는 system prompt를 뜻함. AI가 매 응답 전에 읽는 '이 유저에 대한 지침서' 역할을 하고, 테스트 타임에 자동으로 업데이트됨.

Test-Time Alignment모델을 재학습하지 않고, 실제 서비스 중(inference 시점)에 유저 응답 피드백으로 프롬프트를 업데이트하는 방식. 훈련 없이 개인화를 달성하는 핵심 아이디어.

Textual Gradient딥러닝의 수치 gradient를 텍스트로 대체한 개념. '실제 답과 에이전트 답의 차이'를 LLM이 언어로 분석해서 '이렇게 프롬프트를 바꿔라'는 텍스트 피드백을 만들어냄.

PAG (Profile-Augmented Generation)RAG(검색 증강 생성)에서 더 나아가, 유저 프로필 요약까지 프롬프트에 추가하는 방식. PersonaAgent가 비교 대상으로 삼은 기존 개인화 방법 중 하나.

ReActReasoning + Acting의 합성어. LLM이 '생각→행동→관찰' 루프를 반복하면서 툴을 사용하는 에이전트 패턴. 이 논문에서 비교 baseline으로 사용됨.

LaMPLLM Personalization 평가를 위한 벤치마크 데이터셋. 논문 인용 예측, 영화 태깅, 뉴스 분류, 상품 평점 예측 등 실제 유저 데이터 기반 4개 태스크로 구성됨.

Related Resources

Original Abstract (Expand)

Large Language Model (LLM) empowered agents have recently emerged as advanced paradigms that exhibit impressive capabilities in a wide range of domains and tasks. Despite their potential, current LLM agents often adopt a one-size-fits-all approach, lacking the flexibility to respond to users' varying needs and preferences. This limitation motivates us to develop PersonaAgent, the first personalized LLM agent framework designed to address versatile personalization tasks. Specifically, PersonaAgent integrates two complementary components - a personalized memory module that includes episodic and semantic memory mechanisms; a personalized action module that enables the agent to perform tool actions tailored to the user. At the core, the persona (defined as unique system prompt for each user) functions as an intermediary: it leverages insights from personalized memory to control agent actions, while the outcomes of these actions in turn refine the memory. Based on the framework, we propose a test-time user-preference alignment strategy that simulate the latest n interactions to optimize the persona prompt, ensuring real-time user preference alignment through textual loss feedback between simulated and ground-truth responses. Experimental evaluations demonstrate that PersonaAgent significantly outperforms other baseline methods by not only personalizing the action space effectively but also scaling during test-time real-world applications. These results underscore the feasibility and potential of our approach in delivering tailored, dynamic user experiences.