A Comprehensive Survey of LLM-Based Autonomous Agents: Construction, Applications, and Evaluation

A survey on large language model based autonomous agents

Aug 22, 2023•Lei Wang, Chengbang Ma, Xueyang Feng +10

TL;DR Highlight

A comprehensive survey that systematically organizes LLM-based autonomous agents, from architecture design through applications to evaluation.

Who Should Read

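Backend and AI developers who are designing an LLM agent system for the first time or introducing agent patterns into an existing pipeline. Engineers who use frameworks such as AutoGPT or LangChain and want a deeper understanding of their internal structure.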

Core Mechanics

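The survey unifies agent architecture into four modules: Profile (role definition) → Memory → Planning → Action. Most prior work fits into this framework.
For Memory, Hybrid Memory, which combines short-term memory (the context window) with long-term memory (a vector DB), is presented as the most effective design. Memory reading is formalized as a weighted sum of three criteria: recency, relevance, and importance.
Planning divides into methods without feedback (CoT, ToT, etc.) and methods with feedback (ReAct, Reflexion, etc.). For complex tasks, methods that take feedback from the environment, humans, or models are considerably more effective.
Strategies for acquiring agent capabilities are split into fine-tuning (on open-source models such as LLaMA) and approaches without fine-tuning (prompt engineering plus mechanism engineering). For closed models, only the latter is available.
Applications are very broad, spanning social simulation, chemistry experiment automation (ChemCrow), software development (ChatDev, MetaGPT), and robotics (SayCan, Voyager).
Key challenges identified are hallucination, prompt fragility, the knowledge boundary (simulations become inaccurate because the LLM knows too much), and inference efficiency.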
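
To make the four-module framing concrete, here is a minimal sketch of how Profile, Memory, Planning, and Action might be wired together. It is an illustration under stated assumptions, not the survey's reference design: the class names, the `toy_planner` stub (standing in for an LLM call), and the two toy actions are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Profile:
    """Role description that would be injected into the system prompt."""
    role: str
    goal: str

    def to_prompt(self) -> str:
        return f"You are a {self.role}. Your goal: {self.goal}."


@dataclass
class Memory:
    """Append-only store; a real agent would add long-term retrieval."""
    records: List[str] = field(default_factory=list)

    def write(self, item: str) -> None:
        self.records.append(item)

    def read(self, k: int = 3) -> List[str]:
        # Most recent k records, oldest first.
        return self.records[-k:]


@dataclass
class Agent:
    profile: Profile
    memory: Memory
    # Planning: (profile prompt, recent memories, task) -> ordered action names
    planner: Callable[[str, List[str], str], List[str]]
    # Action: action name -> callable that returns an outcome string
    actions: Dict[str, Callable[[], str]]

    def run(self, task: str) -> List[str]:
        plan = self.planner(self.profile.to_prompt(), self.memory.read(), task)
        results = []
        for step in plan:
            outcome = self.actions[step]()
            self.memory.write(f"{step} -> {outcome}")  # feed outcomes back into memory
            results.append(outcome)
        return results


# Hypothetical stand-in for an LLM planner: always returns a fixed two-step plan.
def toy_planner(profile_prompt: str, memories: List[str], task: str) -> List[str]:
    return ["search", "summarize"]


agent = Agent(
    profile=Profile(role="research assistant", goal="answer questions concisely"),
    memory=Memory(),
    planner=toy_planner,
    actions={"search": lambda: "3 documents found", "summarize": lambda: "summary written"},
)
print(agent.run("What is ReAct?"))
```

The planner receives the profile prompt and recent memories as inputs, which mirrors the Profile → Memory → Planning → Action ordering; swapping `toy_planner` for a real model call would leave the rest of the structure unchanged.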

Evidence

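The ToolBench dataset collects 16,464 real-world APIs across 49 categories from RapidAPI Hub; LLaMA fine-tuned on it shows a meaningful improvement in tool use.
The WebShop benchmark uses 1.18 million real Amazon products to evaluate an agent's ability to search for and purchase products, and quantitatively measures LLM agent performance against 13 human workers.
AgentBench is the first systematic framework for evaluating LLMs as agents across diverse real-world environments, and it provides performance comparisons across multiple domains.
MIND2WEB collects more than 2,000 open-ended tasks from 137 real websites spanning 31 domains and uses them to fine-tune web agents.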

How to Apply

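When attaching a memory module to an agent: relying only on the context (short-term memory) runs into context-window overflow, so store important information as embeddings in a vector DB and retrieve it with a weighted sum of recency, relevance, and importance. This is the Hybrid Memory structure.
When automating complex multi-step tasks: rather than generating the entire plan up front (CoT), use the ReAct pattern (repeat think → act → observe) or Reflexion (retry with verbal feedback after a failure). Recovering from failures becomes much easier.
When building a role-based multi-agent system: separate roles such as PM, architect, and developer, as ChatDev and MetaGPT do, and define each agent's profile clearly in its system prompt. This improves collaboration quality and output consistency.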
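
The retry-with-verbal-feedback idea behind Reflexion can be sketched as a small loop. This is a simplified illustration, not the Reflexion authors' implementation: `run_with_reflection`, `toy_attempt`, and `toy_evaluate` are hypothetical names, the stubs stand in for an LLM actor and an environment check, and the "reflection" is just the raw feedback string rather than an LLM-written self-critique.

```python
from typing import Callable, List, Tuple


def run_with_reflection(
    attempt: Callable[[str, List[str]], str],
    evaluate: Callable[[str], Tuple[bool, str]],
    task: str,
    max_trials: int = 3,
) -> Tuple[str, List[str]]:
    """Retry a task, feeding verbal feedback from failed trials into the next one."""
    reflections: List[str] = []
    answer = ""
    for trial in range(1, max_trials + 1):
        answer = attempt(task, reflections)
        ok, feedback = evaluate(answer)
        if ok:
            break
        # Assumption: a real system would ask an LLM to write this reflection.
        reflections.append(f"Trial {trial} failed: {feedback}")
    return answer, reflections


# Hypothetical stub for the LLM actor: wrong until it has at least one reflection.
def toy_attempt(task: str, reflections: List[str]) -> str:
    return "4" if reflections else "5"


# Hypothetical stub for the environment check.
def toy_evaluate(answer: str) -> Tuple[bool, str]:
    return (answer == "4", f"expected 2 + 2 to equal 4, got {answer}")


answer, notes = run_with_reflection(toy_attempt, toy_evaluate, "What is 2 + 2?")
print(answer)
print(notes)
```

Keeping the reflections as plain strings means they can be appended to the next prompt as-is, which is what makes recovery cheap compared with regenerating a full plan.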

Code Example

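import math

# Example agent prompt based on the ReAct pattern
SYSTEM_PROMPT = """
You are an autonomous agent. Follow the Thought-Action-Observation loop.

Format:
Thought: [reasoning about what to do next]
Action: [tool_name(param1, param2)]
Observation: [result from tool]
... (repeat as needed)
Final Answer: [your final response]

Available tools:
- search(query): Search the web
- calculator(expression): Evaluate math
- memory_read(query): Retrieve from memory
- memory_write(content): Store to memory
"""


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Example of Hybrid Memory retrieval scoring
def score_memory(query_embedding, memory_item, current_time):
    """
    Formula from the paper: m* = argmax α*s_rec + β*s_rel + γ*s_imp

    current_time and memory_item['timestamp'] are datetime objects.
    """
    alpha, beta, gamma = 0.3, 0.5, 0.2  # tunable weights

    # Recency: decays with the number of hours since the memory was stored.
    # total_seconds() covers the full interval; .seconds would drop whole days.
    elapsed = current_time - memory_item['timestamp']
    s_recency = 1.0 / (1.0 + max(elapsed.total_seconds(), 0.0) / 3600)

    # Relevance: cosine similarity between query and memory embeddings
    s_relevance = cosine_similarity(query_embedding, memory_item['embedding'])

    # Importance: rated 1-10 by an LLM in advance and stored with the item
    s_importance = memory_item['importance'] / 10.0

    return alpha * s_recency + beta * s_relevance + gamma * s_importance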
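
A prompt in the Thought/Action/Observation format only becomes an agent once something parses the `Action:` line and feeds back an `Observation:`. The following is a minimal sketch of that driver loop, assuming the model emits exactly one `Action: tool(arg)` line per turn. `parse_action`, `react_loop`, `scripted_llm`, and `multiply` are hypothetical names; `scripted_llm` is a canned stand-in for a real model call, and `multiply` is a toy tool used to avoid `eval`.

```python
import re
from typing import Callable, Dict, Optional, Tuple

ACTION_RE = re.compile(r"^Action:\s*(\w+)\((.*)\)\s*$", re.MULTILINE)
FINAL_RE = re.compile(r"^Final Answer:\s*(.+)$", re.MULTILINE)


def parse_action(text: str) -> Optional[Tuple[str, str]]:
    """Return (tool_name, raw_argument) from an 'Action: tool(arg)' line, if any."""
    match = ACTION_RE.search(text)
    return (match.group(1), match.group(2).strip()) if match else None


def react_loop(
    llm: Callable[[str], str],
    tools: Dict[str, Callable[[str], str]],
    question: str,
    max_steps: int = 5,
) -> Optional[str]:
    """Alternate model turns and tool observations until a final answer appears."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        reply = llm(transcript)
        transcript += reply + "\n"
        final = FINAL_RE.search(reply)
        if final:
            return final.group(1).strip()
        action = parse_action(reply)
        if action is None:
            transcript += "Observation: could not parse an action\n"
            continue
        name, arg = action
        tool = tools.get(name)
        observation = tool(arg) if tool else f"unknown tool {name}"
        transcript += f"Observation: {observation}\n"
    return None  # step budget exhausted without a final answer


# Hypothetical scripted model: acts once, then answers after seeing an observation.
def scripted_llm(transcript: str) -> str:
    if "Observation:" not in transcript:
        return "Thought: I need to compute this.\nAction: calculator(6 * 7)"
    return "Thought: I have the result.\nFinal Answer: 42"


# Toy calculator that only handles "a * b" with integer operands.
def multiply(expression: str) -> str:
    left, right = expression.split("*")
    return str(int(left) * int(right))


print(react_loop(scripted_llm, {"calculator": multiply}, "What is 6 * 7?"))
```

Unparseable replies and unknown tool names are reported back as observations instead of raising, so the model gets a chance to correct itself within the step budget.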

Terminology

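ReAct: An agent pattern that repeats Thought → Action → Observation. It resembles cooking, where you repeatedly decide what to do next, add an ingredient, and taste the result.

Reflexion: A technique in which an agent that has failed puts into words why it failed and carries that into the next attempt. It is similar to writing up the exam questions you got wrong so you do not miss them again.

Chain of Thought (CoT): A prompting technique that has the LLM work through intermediate reasoning steps instead of answering directly. It is the same as asking someone to show their work on a math problem.

Tree of Thoughts (ToT): A reasoning method that branches out into several possibilities like a tree and explores them, rather than following a single line of thought. It is similar to calculating several moves ahead in chess.

Hallucination: The phenomenon in which an LLM confidently generates incorrect information as if it were fact. In other words, pretending to know something it does not.

Hybrid Memory: A memory structure that uses short-term memory (the current conversation context) together with long-term memory (past experience stored in a vector DB). It is analogous to a computer using RAM (short-term) and an SSD (long-term) at the same time.

Embodied AI: AI that learns by acting directly in a physical or simulated environment (a robot, a game world, etc.). It learns by doing rather than only from books.

PDDL: Planning Domain Definition Language, a language for formally describing planning problems. An agent uses it to hand a problem to an external planner, which computes an action sequence that leads from the current state to the goal state.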

Original Abstract

Autonomous agents have long been a research focus in academic and industry communities. Previous research often focuses on training agents with limited knowledge within isolated environments, which diverges significantly from human learning processes, and makes the agents hard to achieve human-like decisions. Recently, through the acquisition of vast amounts of Web knowledge, large language models (LLMs) have shown potential in human-level intelligence, leading to a surge in research on LLM-based autonomous agents. In this paper, we present a comprehensive survey of these studies, delivering a systematic review of LLM-based autonomous agents from a holistic perspective. We first discuss the construction of LLM-based autonomous agents, proposing a unified framework that encompasses much of previous work. Then, we present an overview of the diverse applications of LLM-based autonomous agents in social science, natural science, and engineering. Finally, we delve into the evaluation strategies commonly used for LLM-based autonomous agents. Based on the previous studies, we also present several challenges and future directions in this field.