MADQA: Do Multimodal Agents Navigate PDF Document Collections by Strategic Reasoning or by Random Search?

Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections

Mar 12, 2026 • Łukasz Borchmann, Jordy Van Landeghem, Michał Turski +12

TL;DR Highlight

Tested on MADQA, a benchmark of 2,250 questions over 800 PDFs, even the best AI agents fail to navigate documents strategically the way humans do and instead rely on brute-force repeated search.

Who Should Read

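AI engineers building RAG pipelines or document QA agents over PDF documents, especially developers interested in multi-hop retrieval (search that spans multiple documents) or in methodology for evaluating agent performance.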

Core Mechanics

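The best-performing agent (Gemini 3 Pro BM25 Agent, 82.2%) matches human accuracy (82.2% in the BM25 setting), but the two get different questions right: Cohen's κ is only 0.24, so humans and AI solve the problems in largely different ways.
Humans reach 50% accuracy on their first search query, while Gemini 3 Pro manages only 12% on its first query. The AI has a severe cold-start problem and catches up only after many searches.
The gap to the Oracle (perfect retrieval supplied) is about 18%, which points to retrieval, not reasoning ability, as the bottleneck.
The unconstrained RLM (Recursive Language Models) approach uses 270 million tokens ($850) with Claude Sonnet 4.5 and still scores below the BM25 Agent; constraining the agent to a search tool is more cost-efficient.
Of agent errors, 35.7% are wrong-document retrievals, 28.8% are wrong answers from the right page (comprehension failure), and 23% are the right document but the wrong page (navigation failure). Retrieval itself is the largest bottleneck.
In query reformulation (rewriting the search query), larger cosine drift goes with higher accuracy: Claude Sonnet 4.5 averages a drift of 0.38 versus 0.10 for GPT-4.1 Nano. When a search fails, the agent should switch to a substantially different query.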
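The first point above, equal accuracy but low agreement, can be illustrated with a small calculation. The sketch below computes Cohen's κ for two binary "answered correctly" vectors; the ten-question data is invented for illustration and is not taken from MADQA.

```python
def cohens_kappa(a: list[bool], b: list[bool]) -> float:
    """Cohen's kappa for two binary raters scored on the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of items where both are right or both are wrong.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement implied by each rater's own accuracy.
    p_a = sum(a) / n
    p_b = sum(b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Toy data (hypothetical): both are 80% accurate, but on different questions.
human = [True] * 8 + [False] * 2   # correct on questions 0-7
agent = [False] * 2 + [True] * 8   # correct on questions 2-9

kappa = cohens_kappa(human, agent)
print(f"accuracy: human={sum(human) / 10:.1f}, agent={sum(agent) / 10:.1f}")
print(f"kappa = {kappa:.2f}")  # prints kappa = -0.25
```

Identical accuracy tells you nothing about whether two systems succeed on the same items, which is why the summary reports κ alongside accuracy.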

Evidence

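Gemini 3 Pro BM25 Agent scores 82.2% versus 78.6% for the same model with File Search (static RAG), a 3.6-point advantage for the agentic approach. Claude Sonnet 4.5 RLM (costing $850) reaches 70.5%, below the BM25 Agent's 80.6%.
Human Oracle Retriever reaches 99.4% versus 82.2% for the best agent, an Oracle gap of about 18%. Even with the same BM25 tool, humans have a Kuiper statistic (an effort-calibration measure) of 14.6, lower than every agent (22.9 to 73.2), meaning they are far more efficient.
On multi-hop questions, accuracy is 72.4% when the semantic distance between evidence pages is 0 to 0.15 but falls to 34.8% at 0.6 or above, a drop of about 38 percentage points. Physical page distance has no effect.
58% of questions require understanding visual structure such as tables, forms, and charts, and only 42% can be answered from plain text alone, so multimodal understanding is essential.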

How to Apply

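When implementing retry logic after a failed search in a RAG agent, forcing the new query to have low cosine similarity to the previous one (high drift) tends to improve performance. The retry should approach the search from a different angle rather than simply rephrase.
Rather than letting the agent explore entire documents programmatically without constraints, as RLM does, a constrained agent given a BM25 or vector search tool and capped at 10 steps delivers much better performance for the cost. Constraining the tools is what makes it efficient.
When evaluating multi-hop questions, measure Page F1 (page level) separately from Doc F1 (document level). If Doc F1 is high but Page F1 is low, the agent found the right document but not the right page, a last-mile failure that Doc F1 alone would miss.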
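The drift-based retry idea in the first point can be sketched as a filter over candidate queries. This is a minimal illustration under stated assumptions: drift is taken to be one minus cosine similarity, token-count vectors stand in for whatever embedding model a real system would use, and both `pick_retry_query` and the `min_drift=0.3` threshold are hypothetical choices, not values from the paper.

```python
import math
import re
from collections import Counter
from typing import Optional


def _token_counts(text: str) -> Counter:
    """Bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a: str, b: str) -> float:
    va, vb = _token_counts(a), _token_counts(b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def drift(a: str, b: str) -> float:
    """Assumed definition: 1 - cosine similarity between two queries."""
    return 1.0 - cosine_similarity(a, b)


def pick_retry_query(
    previous: list[str], candidates: list[str], min_drift: float = 0.3
) -> Optional[str]:
    """Return the first candidate that drifts enough from every earlier query."""
    for candidate in candidates:
        if all(drift(candidate, prev) >= min_drift for prev in previous):
            return candidate
    return None  # caller should ask the model for more diverse candidates


previous = ["annual report revenue 2019"]
candidates = [
    "annual report revenue 2019 total",  # near-duplicate of the failed query
    "net sales fiscal year 2019",        # different vocabulary, same intent
]
print(pick_retry_query(previous, candidates))  # prints net sales fiscal year 2019
```

In an agent loop, the candidates would come from the model itself, and a `None` result is a signal to re-prompt for a query that approaches the question from a different angle.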

Code Example

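# Core loop of the BM25 MLLM agent (Python sketch based on Algorithm 1 in the paper).
# Assumptions: search_index, vlm_client, and render_page are caller-supplied,
# hypothetical interfaces, not the API of any specific library. search_index could
# wrap a BM25 engine such as Whoosh.

import base64

SYSTEM_PROMPT = """
You are a document QA assistant with access to a search tool.
The answer is definitely in the documents.
If search returns no results, try different terms (synonyms, abbreviations, rephrasing).

Once relevant pages are found, provide:
1. answer: List of short answer values (exact document words preferred)
2. citations: List of {file, page} dicts
"""

SEARCH_TOOL = {
    "name": "search_documents",
    "description": "Search document collection. Supports boolean ops (AND/OR/NOT), phrases in quotes, wildcards (*). Example: '\"annual report\" AND revenue'",
    "parameters": {"query": {"type": "string"}},
}


def img_to_base64(png_bytes: bytes) -> str:
    """Base64-encode rendered page image bytes for a chat message."""
    return base64.b64encode(png_bytes).decode("ascii")


def bm25_agent(question, search_index, vlm_client, render_page, max_steps=10, top_k=5):
    """Search-then-answer loop.

    Expected (assumed) interfaces:
      search_index.search(query, limit) -> iterable of (file, page) tuples
      vlm_client.chat(messages, tools)  -> object with .type plus either
                                           .answer/.citations or .tool/.args
      render_page(file, page)           -> PNG bytes of that page
    """
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ]

    for step in range(max_steps):
        # On the last step, withhold the tool so the model has to answer.
        force_answer = step == max_steps - 1

        response = vlm_client.chat(
            messages=messages,
            tools=None if force_answer else [SEARCH_TOOL],
        )

        if response.type == "answer":
            return response.answer, response.citations

        if response.type == "tool_call" and response.tool == "search_documents":
            query = response.args["query"]
            # Record the query so later steps can see what was already tried.
            messages.append({"role": "assistant", "content": f"search_documents: {query}"})

            # Run the BM25 search, then render each hit as a page image
            # so the VLM can analyze the page visually.
            results = search_index.search(query, limit=top_k)
            content = [
                {"type": "image", "data": img_to_base64(render_page(file, page))}
                for file, page in results
            ]
            if not content:
                content = [{"type": "text", "text": "No results. Try different terms."}]
            messages.append({"role": "tool", "content": content})

    # Step budget exhausted without a structured answer.
    return [], []

# Key point: when a search fails, switch to a substantially different query.
# Claude Sonnet 4.5's average query drift of 0.38 is central to its performance.
# Repeating near-identical queries, as GPT-4.1 Nano does (drift 0.10), hurts performance.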

Terminology

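MLLM: Multimodal Large Language Model. A large language model that understands images, tables, and layout as well as text. Examples are models such as GPT-4o or Gemini that answer by looking at PDF page images.

BM25: A traditional keyword-based retrieval algorithm. Like pre-Google-era search engines, it finds relevant documents by weighing term frequency and document length. It is faster and more predictable than vector search but weak on semantic similarity.

Multi-hop: A question type that requires combining information across multiple documents or pages. Example: a question whose total can only be computed by consulting both the 2018 report and the 2019 report.

Kuiper statistic: A measure of how efficiently an agent spends its search effort (number of steps). Lower is better: it indicates a sensible allocation, with little effort on easy questions and more on hard ones.

RLM: Recursive Language Model. An approach in which the LLM writes code to analyze entire documents recursively. Flexible in theory, but in practice the cost grows explosively.

Oracle Gap: The difference in accuracy between having a perfect retrieval tool and doing real retrieval. It is about 18% on MADQA, which means retrieval itself is the largest bottleneck for current AI.

Page F1: A score measuring how well the pages an agent cites match the true evidence pages. It is a stricter metric than Doc F1 (document level) and catches last-mile navigation failures.

Classical Test Theory: A method from psychometrics for analyzing test items. It measures each question's difficulty and discriminative power, and is used here to build a test set from the questions that best separate models by ability.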
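The difference between Doc F1 and Page F1 described in the entries above can be made concrete with a small calculation. The sketch below assumes a plain set-based F1 over cited versus gold items, which may differ in detail from the benchmark's evaluation harness; the file names and page numbers are invented.

```python
def f1(predicted: set, gold: set) -> float:
    """Set-based F1 between predicted and gold items."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def doc_and_page_f1(
    cited: list[tuple[str, int]], gold: list[tuple[str, int]]
) -> tuple[float, float]:
    """Return (Doc F1, Page F1) for lists of (file, page) citations."""
    doc_f1 = f1({file for file, _ in cited}, {file for file, _ in gold})
    page_f1 = f1(set(cited), set(gold))
    return doc_f1, page_f1


# Hypothetical multi-hop question with evidence in two reports.
gold = [("report_2018.pdf", 12), ("report_2019.pdf", 7)]
cited = [("report_2018.pdf", 3), ("report_2019.pdf", 7)]  # right docs, one wrong page

doc_f1, page_f1 = doc_and_page_f1(cited, gold)
print(f"Doc F1 = {doc_f1:.2f}, Page F1 = {page_f1:.2f}")  # prints Doc F1 = 1.00, Page F1 = 0.50
```

Here the agent found both documents, so Doc F1 is perfect, while Page F1 exposes the one wrong page.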

Original Abstract

Multimodal agents offer a promising path to automating complex document-intensive workflows. Yet, a critical question remains: do these agents demonstrate genuine strategic reasoning, or merely stochastic trial-and-error search? To address this, we introduce MADQA, a benchmark of 2,250 human-authored questions grounded in 800 heterogeneous PDF documents. Guided by Classical Test Theory, we design it to maximize discriminative power across varying levels of agentic abilities. To evaluate agentic behaviour, we introduce a novel evaluation protocol measuring the accuracy-effort trade-off. Using this framework, we show that while the best agents can match human searchers in raw accuracy, they succeed on largely different questions and rely on brute-force search to compensate for weak strategic planning. They fail to close the nearly 20% gap to oracle performance, persisting in unproductive loops. We release the dataset and evaluation harness to help facilitate the transition from brute-force retrieval to calibrated, efficient reasoning.