A Comprehensive Survey of Optimization Techniques for LLM-based Agents

A Survey on the Optimization of Large Language Model-based Agents

Mar 16, 2025 • Shangheng Du, Jiabao Zhao, Jinxin Shi +4

TL;DR Highlight

A survey paper that gathers the major optimization techniques for building better LLM agents in one place, from fine-tuning to RL to prompt engineering.

Who Should Read

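ML engineers and researchers who are building or improving LLM-based agent systems. Especially useful for developers deciding which optimization strategy to adopt (fine-tuning vs. RL vs. prompting).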

Core Mechanics

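Agent optimization methods fall into two broad groups: 'parameter-driven' methods that update model parameters directly (fine-tuning, RL, hybrid) and 'parameter-free' methods that leave the parameters untouched (prompt engineering, RAG, multi-agent collaboration).
Four ways to build trajectory data (records of agent behavior) for fine-tuning: expert annotation, generation by a strong model such as GPT-4, agent self-exploration, and multi-agent collaboration. Each makes a different trade-off among quality, cost, and scalability.
RL-based optimization is divided by reward source into environment-based rewards, model-based rewards, and custom reward functions, with PPO and DPO the most widely used algorithms. The dominant base models are the Llama-3, Mistral-7B, and Qwen2.5 families.
A sequential hybrid (SFT, then RL) tends to outperform SFT alone in the surveyed work: warm up with SFT on high-quality trajectories, then refine the policy with PPO or DPO. This is the same paradigm as OpenAI's RFT (Reinforcement Fine-Tuning).
Five families of parameter-free methods: experience-based (ExpeL, Reflexion), feedback-based self-reflection, meta-prompt optimization (OPRO), tool-use optimization, and RAG-based integration of external knowledge.
Evaluation benchmarks vary by domain: math (GSM8K, MATH), code (SWE-bench, HumanEval), web navigation (WebArena, Mind2Web), tool use (T-Eval, ToolEval), and multi-task (AgentBench, GAIA).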
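
The trajectory-construction and sequential-hybrid points above can be made concrete with a small sketch. It assumes each trajectory carries a final environment reward in [0, 1]; the `Trajectory` class, its field names, and both helper functions are hypothetical illustrations, not an API from the paper. Successful trajectories become SFT data, and the best and worst attempts on the same task become a preference pair for DPO.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Trajectory:
    task_id: str        # hypothetical task identifier
    steps: list[str]    # thought/action/observation text, in order
    reward: float       # final environment reward in [0, 1] (assumed)


def filter_for_sft(trajectories, min_reward=1.0):
    """Keep only trajectories whose environment reward meets the threshold."""
    return [t for t in trajectories if t.reward >= min_reward]


def build_preference_pairs(trajectories):
    """Pair the best and worst trajectory of each task as (chosen, rejected)."""
    by_task = defaultdict(list)
    for t in trajectories:
        by_task[t.task_id].append(t)

    pairs = []
    for task_id, group in by_task.items():
        best = max(group, key=lambda t: t.reward)
        worst = min(group, key=lambda t: t.reward)
        if best.reward > worst.reward:  # skip tasks with no contrast
            pairs.append({
                "task_id": task_id,
                "chosen": "\n".join(best.steps),
                "rejected": "\n".join(worst.steps),
            })
    return pairs


trajs = [
    Trajectory("t1", ["Thought: search", "Action: search[weather]", "Observation: 18C"], 1.0),
    Trajectory("t1", ["Thought: guess", "Action: finish[unknown]"], 0.0),
    Trajectory("t2", ["Thought: give up"], 0.0),
]
sft_data = filter_for_sft(trajs)          # warm-up data for the SFT stage
dpo_pairs = build_preference_pairs(trajs)  # preference data for the RL stage
print(len(sft_data), len(dpo_pairs))
```

Task `t2` yields no pair because it has only one attempt, so there is nothing to contrast. A real pipeline would likely also deduplicate trajectories and cap the number of pairs per task.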

Evidence

AgentBank dataset: 16 tasks and 51,287 filtered trajectories, one of the largest agent-tuning datasets.
SMART-Trajectory dataset: 17 tasks and 142,507 trajectories, the largest of the datasets covered in the paper.
Agent Q, built on Llama-3-70B, improves on web agent tasks by applying DPO; the xLAM-v0.1-r model applies the same method.
Hybrid (SFT+RL) strategies in practice: ETO (BC+DPO), AGILE (BC+PPO), SaySelf (SFT+PPO), OPTIMA (SFT+DPO), and others. More than eight papers report improvements over plain SFT.

How to Apply

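To adapt an open-source model (Llama-3-8B, Mistral-7B, etc.) as an agent for a specific domain, apply the steps in this order: generate ReAct-style trajectories with GPT-4 → filter them using environment feedback → fine-tune with LoRA → then align preferences with DPO.
If fine-tuning cost is a concern, try parameter-free methods first: add a Reflexion pattern (self-reflect on failure, then retry) to the prompt, or, as in ExpeL, store past successes and failures in memory so the next run can consult them.
When building a multi-agent system, choose among LangGraph (workflow-centric), AutoGen (conversational collaboration), and CrewAI (role-based teams) according to the task. For complex software engineering, look to the MetaGPT/ChatDev patterns.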
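
The Reflexion-style retry described above can be sketched as a plain control loop. This is a minimal illustration, not the Reflexion authors' implementation: `run_with_reflection`, `act`, and `reflect` are hypothetical names, and the two toy functions stand in for LLM calls so the example runs without a model.

```python
def run_with_reflection(task, act, reflect, max_trials=3):
    """Retry a task, feeding verbal reflections on past failures into each new attempt.

    act(task, reflections) -> (success: bool, trace: str)
    reflect(task, trace)   -> str, a short lesson drawn from the failed trace
    Both callables are hypothetical stand-ins for LLM calls.
    """
    reflections = []  # episodic memory carried across trials
    for trial in range(1, max_trials + 1):
        success, trace = act(task, reflections)
        if success:
            return {"success": True, "trials": trial, "reflections": reflections}
        reflections.append(reflect(task, trace))
    return {"success": False, "trials": max_trials, "reflections": reflections}


# Toy stand-ins so the sketch runs without a model.
def toy_act(task, reflections):
    # Succeeds only once a reflection has told it to use the search tool.
    if any("search" in r for r in reflections):
        return True, "Action: search[weather] -> 18C, clear"
    return False, "Action: finish[unknown]"


def toy_reflect(task, trace):
    return "I answered without evidence; use the search tool first."


result = run_with_reflection("What is the weather today?", toy_act, toy_reflect)
print(result["success"], result["trials"])
```

Persisting `reflections` across tasks rather than discarding them after each run moves the loop toward the ExpeL-style experience memory mentioned above.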

Code Example

# Reflexion-style self-reflection prompt pattern (parameter-free optimization)
system_prompt = """
You are an autonomous agent. Follow this process:
1. THINK: Analyze the task and plan your approach
2. ACT: Execute an action using available tools
3. OBSERVE: Check the result
4. REFLECT: If failed, analyze why and adjust strategy
5. Repeat until task is complete

If you encounter an error:
- Identify what went wrong
- Generate a corrected plan
- Do NOT repeat the same mistake
"""

# Example structure of preference data for DPO
preference_data = {
    "instruction": "Search the web for today's weather and tell me",
    "chosen": [
        {"role": "assistant", "content": "[SEARCH] today's weather\n[RESULT] Seoul 18°C, clear\nFinal answer: It is currently 18°C and clear in Seoul."}
    ],
    "rejected": [
        {"role": "assistant", "content": "I can't get weather information. Please check it yourself."}
    ]
}

# LoRA-based agent fine-tuning setup (Hugging Face PEFT)
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
lora_config = LoraConfig(
    r=16,           # rank
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction (well under 1%) of the weights is trainable
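
As a rough check on how small the trainable fraction is for the LoRA configuration above, the adapter sizes can be counted by hand. The layer shapes below are assumptions based on the published Llama-3.1-8B configuration (hidden size 4096, 32 layers, 8 key/value heads of dimension 128) and should be verified against the model's `config.json`; `lora_params` is a hypothetical helper.

```python
# Assumed Llama-3.1-8B shapes (check the model's config.json before relying on them).
HIDDEN = 4096          # hidden_size
KV_DIM = 1024          # num_key_value_heads (8) * head_dim (128)
LAYERS = 32            # num_hidden_layers
TOTAL_PARAMS = 8.03e9  # approximate base model size
RANK = 16              # r in the LoraConfig above


def lora_params(in_features, out_features, r):
    """LoRA adds A (r x in) and B (out x r) next to a frozen weight."""
    return r * (in_features + out_features)


# target_modules=["q_proj", "v_proj"]: one adapter pair on each, in every layer.
per_layer = lora_params(HIDDEN, HIDDEN, RANK) + lora_params(HIDDEN, KV_DIM, RANK)
trainable = per_layer * LAYERS
fraction = trainable / TOTAL_PARAMS
print(f"{trainable:,} trainable parameters, {fraction:.3%} of the base model")
```

Under these assumptions the adapters add about 6.8 million parameters, roughly 0.08% of the base model. Adding more target modules or raising `r` increases that share proportionally.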

Terminology

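Trajectory: The full record of actions an agent takes while carrying out a task. Think of it as a log of 'thought → action → observation' sequences. Used as training data for fine-tuning.

SFT: Supervised Fine-Tuning. Training by showing the model reference answers to imitate, much like working through solved examples at school. For agents, the model learns from good trajectories.

RL: Reinforcement Learning. Learning to maximize reward through interaction with an environment, similar to learning to raise your score in a game.

PPO: Proximal Policy Optimization. A policy-learning algorithm that optimizes reward while constraining each update so the policy does not change too abruptly. Commonly used for RLHF in LLM fine-tuning.

DPO: Direct Preference Optimization. Optimizes the policy directly by comparing pairs of preferred and dispreferred responses. Simpler to implement than PPO and needs no separate reward model.

LoRA: Low-Rank Adaptation. A parameter-efficient fine-tuning technique that trains small added adapter matrices instead of updating the whole model. Like a pair of glasses, the adapter can be put on and taken off, and it uses far less GPU memory.

ReAct: A blend of Reasoning + Acting. An agent prompting pattern that repeats 'Thought → Action → Observation'. A standard format for constructing agent trajectories.

RAG: Retrieval-Augmented Generation. Retrieves relevant information from external databases or documents and uses it in the LLM's answer, so the model can also answer questions about recent information it was not trained on.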
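
To make the DPO entry above concrete, here is a minimal sketch of the DPO loss for a single preference pair, written in plain Python. It assumes the four inputs are summed sequence log-probabilities under the trained policy and a frozen reference model; the function name and the example numbers are illustrative only, and `beta=0.1` is just a commonly seen default.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))


# Policy identical to the reference: the margin is 0, so the loss is log(2).
baseline = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy favors the chosen response more than the reference does: the loss drops.
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)
print(baseline, improved)
```

No reward model appears anywhere in the computation, which is the simplification over PPO that the entry refers to. A training implementation would compute this over batches with tensors and a numerically stable log-sigmoid.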

Original Abstract

With the rapid development of Large Language Models (LLMs), LLM-based agents have been widely adopted in various fields, becoming essential for autonomous decision-making and interactive tasks. However, current work typically relies on prompt design or fine-tuning strategies applied to vanilla LLMs, which often leads to limited effectiveness in complex agent-related environments. Although numerous recent studies have explored various strategies to optimize LLM-based agents for complex agent tasks, a systematic review summarizing and comparing these methods from a holistic perspective remains lacking. In this survey, we provide a comprehensive review of LLM-based agent optimization approaches, categorizing them into parameter-driven and parameter-free methods. We first focus on parameter-driven optimization, covering fine-tuning-based optimization, reinforcement learning-based optimization, and hybrid strategies, analyzing key aspects such as trajectory data construction, reward function design, and optimization algorithms. Additionally, we briefly discuss parameter-free strategies that optimize agent behavior through prompt engineering and external knowledge retrieval. Finally, we summarize the evaluation for agents, review key applications of LLM-based agents, and discuss the major challenges and promising future directions. A curated collection of the surveyed works is provided at https://github.com/YoungDubbyDu/LLM-Agent-Optimization.