Utility-Guided Agent Orchestration for Efficient LLM Tool Use

Mar 20, 2026 • Boyan Liu, Gongming Zhao, Hongli Xu

TL;DR Highlight

Rather than calling tools without limit as ReAct does, this agent control framework explicitly decides "should I call a tool right now?" using four scores: Gain, Cost, Uncertainty, and Redundancy.

Who Should Read

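Backend developers whose LLM agents call tools so often that token usage, cost, and latency blow up, especially those trying to ship a ReAct-based agent to production without a way to keep cost under control.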

Core Mechanics

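Designs an orchestration layer that uses a utility score to decide explicitly which of five actions to run: respond, retrieve, tool_call, verify, or stop.
Utility formula: U(a|s) = Gain - λ1*StepCost - λ2*Uncertainty - λ3*Redundancy. Because the terms are separate, the effect of each one on agent behavior can be analyzed in isolation.
ReAct has the highest F1 at 0.2662, but it has no explicit criterion for stopping tool calls, which makes cost hard to control. This framework scores slightly lower at F1 0.2360, but the reason it stops is traceable.
Switching the Redundancy term (the penalty for repeated tool calls) to a semantic version cuts token usage from 1294 to 1157 (about 10%) and lowers average tool calls from 1.56 to 1.40.
Ablation: removing expected_gain or the stop policy more than doubles token usage (1273 → 2716), and efficiency (F1/tokens) gets worse rather than better.
Experiments use a local Llama-3.1-8B model, a BM25 retriever, and 200 HotpotQA questions. The adaptive approach consistently outperforms fixed workflows.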
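The utility formula above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `Signals` and `Weights` containers, the function names, and every numeric value are hypothetical, and it assumes each candidate action gets its own gain, step-cost, uncertainty, and redundancy estimate in the range 0 to 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signals:
    """Per-action estimates, each assumed to lie in [0, 1] (hypothetical container)."""
    gain: float
    step_cost: float
    uncertainty: float
    redundancy: float


@dataclass(frozen=True)
class Weights:
    """λ1, λ2, λ3 from the formula; the defaults are arbitrary illustration values."""
    lam1: float = 0.3
    lam2: float = 0.3
    lam3: float = 0.3


def utility(sig: Signals, w: Weights) -> float:
    # U(a|s) = Gain - λ1*StepCost - λ2*Uncertainty - λ3*Redundancy
    return (sig.gain
            - w.lam1 * sig.step_cost
            - w.lam2 * sig.uncertainty
            - w.lam3 * sig.redundancy)


def select_action(candidates: dict[str, Signals], w: Weights) -> tuple[str, dict[str, float]]:
    """Score every candidate action and return the best one plus all scores."""
    scores = {action: utility(sig, w) for action, sig in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Made-up estimates for a single agent step.
candidates = {
    "retrieve": Signals(gain=0.60, step_cost=0.5, uncertainty=0.2, redundancy=0.1),
    "respond": Signals(gain=0.45, step_cost=0.0, uncertainty=0.4, redundancy=0.0),
    "verify": Signals(gain=0.30, step_cost=0.5, uncertainty=0.1, redundancy=0.0),
}

balanced_choice, balanced_scores = select_action(candidates, Weights())
cost_heavy_choice, cost_heavy_scores = select_action(candidates, Weights(lam1=0.6))
print(balanced_choice, cost_heavy_choice)
```

With these made-up numbers, the balanced weights pick retrieve, while doubling λ1 flips the choice to respond. Returning every score alongside the chosen action is what makes a decision traceable: the log shows which term pushed an action down.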

Evidence

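ReAct reaches F1 = 0.2662 versus 0.2360 for the full policy, while ReAct uses 546.6 tokens and the policy uses 1294.2, so the policy is the more expensive of the two → a quality-cost trade-off exists.
Ablation: removing expected_gain raises tokens from 1273 to 2716 (2.1x), while F1 improves only slightly from 0.2383 to 0.2621, so efficiency (F1/tokens) is roughly halved, from 0.000187 to 9.6e-05.
With semantic redundancy, F1 is essentially unchanged (0.237 vs 0.236) while tokens fall from 1294 to 1157 and tool calls from 1.56 to 1.40.
In the strengthened fixed-workflow experiment (search-verify), F1 is the lowest at 0.063, and at 1041 tokens it is also among the more expensive cases → this confirms the limits of fixed workflows.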
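The efficiency and savings figures above follow from simple ratios, which the sketch below recomputes from the reported numbers. The helper names are hypothetical. Because the inputs are already rounded, the last digit can differ slightly from the values quoted above.

```python
def efficiency(f1: float, tokens: float) -> float:
    """Efficiency as used in the ablation: F1 per token."""
    return f1 / tokens


def relative_drop(before: float, after: float) -> float:
    """Fractional reduction going from `before` to `after`."""
    return (before - after) / before


# Ablation: full policy vs. policy with expected_gain removed.
full_eff = efficiency(0.2383, 1273)
ablated_eff = efficiency(0.2621, 2716)
token_ratio = 2716 / 1273

# Semantic redundancy: change in tokens and in average tool calls.
token_saving = relative_drop(1294, 1157)
call_saving = relative_drop(1.56, 1.40)

print(f"efficiency: {full_eff:.3e} -> {ablated_eff:.3e}")
print(f"token ratio after ablation: {token_ratio:.2f}x")
print(f"token saving: {token_saving:.1%}, tool-call saving: {call_saving:.1%}")
```

The ablated efficiency comes out at a little over half the full-policy value, matching the "roughly halved" reading, and the semantic-redundancy savings are a bit over 10% for both tokens and tool calls.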

How to Apply

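Adding a Utility Scorer layer inside a ReAct agent loop: at each step, have the LLM self-estimate expected_gain (0 to 1) and uncertainty (0 to 1), compute StepCost from step_count and Redundancy from similarity to previous queries, and execute only the action with the highest U(a|s).
Parameter tuning: in cost-sensitive production environments, raise λ1 (the StepCost weight) to stop early and aggressively; for quality-critical tasks, raise λ2 (the Uncertainty weight) so the agent retrieves more when it is uncertain.
Implementing the Redundancy term: comparing a new query against previous tool_call queries by cosine similarity (semantic redundancy), rather than by plain exact match, reduces tokens while preserving F1. In the current implementation, however, wall time actually increases, so this suits settings where token cost matters more than latency.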
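The two computed signals from the steps above, StepCost and Redundancy, might look like the following sketch. It rests on loud assumptions: `embed` is a toy bag-of-words stand-in for a real sentence-embedding model, so it captures only lexical overlap rather than true semantic similarity. The linear `step_cost` with a `max_steps` cap is a hypothetical choice, and none of the function names come from the paper.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def exact_redundancy(query: str, history: list[str]) -> float:
    """Baseline: penalize only a verbatim repeat of an earlier query."""
    return 1.0 if query in history else 0.0


def semantic_redundancy(query: str, history: list[str]) -> float:
    """Highest cosine similarity between the new query and any earlier query."""
    q = embed(query)
    return max((cosine(q, embed(h)) for h in history), default=0.0)


def step_cost(step_count: int, max_steps: int = 5) -> float:
    """Assumed linear cost that grows with the step index and saturates at 1."""
    return min(step_count / max_steps, 1.0)


history = ["capital of France"]
query = "France capital city"

print(exact_redundancy(query, history))               # 0.0
print(round(semantic_redundancy(query, history), 3))  # 0.667
print(step_cost(2))                                   # 0.4
```

The reworded query slips past the exact-match check but still scores about 0.67 on the similarity-based check, so it would incur a Redundancy penalty. Swapping `embed` for a real embedding model adds an encoding call per query, which is one plausible source of extra wall time.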

Terminology

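ReAct: A method in which an LLM solves a problem by repeating a think → act → observe loop. It alternates between reasoning and tool use, much as a person thinks while running Google searches. Its weakness is that it does not know when to stop.

Orchestration: Control logic that coordinates when and how multiple agent actions (search, tool calls, answering, and so on) are executed. Like an orchestra conductor, it decides when each instrument (tool) plays.

Utility Score: A score that quantifies the expected benefit of taking an action. Just as a shopper weighs price, satisfaction, and need to decide whether an item is worth buying, the agent works out how much value its next action delivers for its cost.

BM25: A classic retrieval algorithm used in search engines that finds relevant documents based on keyword frequency. It provides fast, lightweight search without vector embeddings.

HotpotQA: A question-answering benchmark dataset for multi-hop reasoning, where the answer must be found by going through several documents. It is often used to evaluate an agent's complex reasoning ability.

F1 Score: The harmonic mean of precision (the share of predicted items that are actually correct) and recall (the share of actual answers that were found). It awards partial credit when only part of the answer is right.

Ablation: An experiment that removes a model's components one at a time to see how performance changes without each part. It is like leaving out one ingredient at a time from a recipe to see how the taste changes.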
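To make the partial-credit behavior in the F1 Score entry concrete, here is a simplified token-level F1. It assumes lowercasing and whitespace tokenization only. Evaluation scripts for QA benchmarks typically also strip punctuation and articles, so treat this as an illustration rather than the scoring used in the paper.

```python
from collections import Counter


def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted answer and a gold answer."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# Two of three gold tokens predicted: precision 1.0, recall 2/3.
partial_score = token_f1("Barack Obama", "Barack Hussein Obama")
print(round(partial_score, 3))  # 0.8
```

A prediction that recovers two of the three gold tokens scores 0.8 instead of 0.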

Related Resources

GitHub: Utility-Guided Agent Orchestration implementation code

Original Abstract

Tool-using large language model (LLM) agents often face a fundamental tension between answer quality and execution cost. Fixed workflows are stable but inflexible, while free-form multi-step reasoning methods such as ReAct may improve task performance at the expense of excessive tool calls, longer trajectories, higher token consumption, and increased latency. In this paper, we study agent orchestration as an explicit decision problem rather than leaving it entirely to prompt-level behavior. We propose a utility-guided orchestration policy that selects among actions such as respond, retrieve, tool call, verify, and stop by balancing estimated gain, step cost, uncertainty, and redundancy. Our goal is not to claim universally best task performance, but to provide a controllable and analyzable policy framework for studying quality-cost trade-offs in tool-using LLM agents. Experiments across direct answering, threshold control, fixed workflows, ReAct, and several policy variants show that explicit orchestration signals substantially affect agent behavior. Additional analyses on cost definitions, workflow fairness, and redundancy control further demonstrate that lightweight utility design can provide a defensible and practical mechanism for agent control.