Gorilla: Large Language Model Connected with Massive APIs

May 24, 2023•Shishir G. Patil, Tianjun Zhang, Xin Wang +1

TL;DR Highlight

An open-source LLM that writes API calls better than GPT-4, generating accurate code with far fewer hallucinations.

Who Should Read

Backend and ML engineers building features in which an LLM automatically calls external APIs or libraries, especially developers who want to connect HuggingFace, TorchHub, or TensorFlow Hub models to an LLM agent.

Core Mechanics

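Gorilla, a fine-tuned LLaMA-7B, outperforms GPT-4 in API-call accuracy and hallucinates (invents nonexistent APIs) far less often.
Releases APIBench, a new benchmark built from HuggingFace (925), TorchHub (94), and TensorFlow Hub (696) APIs, for a reported total of 1,645 APIs and 16,450 instruction-API pairs.
Retriever-Aware Training: including the API documentation in the prompt during training lets the model adapt at inference time even when the documentation changes.
Without a good retriever, zero-shot fine-tuning works better: attaching a BM25 retriever can lower accuracy by 21 to 47% in some cases.
Understands constraints: it handles compound conditions such as 'a model with fewer than 10M parameters and at least 70% ImageNet accuracy'.
GPT-4 hallucinates severely on HuggingFace (using nonexistent GitHub repo names as model names); Gorilla reduces this substantially.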
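
Retriever-aware training can be pictured as a data-preparation step: each training prompt is the user instruction, optionally followed by the retrieved API document. The sketch below is a minimal illustration, not the authors' code. The function `build_training_example`, the `### User:` / `### Assistant:` markers, and the sample record are assumptions; only the "Use this API documentation for reference:" phrasing comes from this summary.

```python
import json


def build_training_example(instruction, api_call, retrieved_doc=None):
    """Return a prompt/completion pair for supervised fine-tuning.

    With retrieved_doc=None this is the zero-shot (no retriever) format;
    otherwise the retrieved API document is appended to the instruction.
    """
    user_turn = instruction
    if retrieved_doc is not None:
        user_turn += (
            "\nUse this API documentation for reference: "
            + json.dumps(retrieved_doc, sort_keys=True)
        )
    prompt = f"### User: {user_turn}\n### Assistant:"
    return {"prompt": prompt, "completion": " " + api_call}


# Hypothetical record, for illustration only.
doc = {
    "api_name": "fasterrcnn_resnet50_fpn",
    "api_call": "torch.hub.load('pytorch/vision', 'fasterrcnn_resnet50_fpn', pretrained=True)",
}
zero_shot = build_training_example("Detect objects in an image.", doc["api_call"])
with_doc = build_training_example("Detect objects in an image.", doc["api_call"], doc)
```

Training on the `with_doc` form is what would teach a model to read the attached document, so a changed document at inference time can change the generated call.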

Evidence

Zero-shot, Gorilla's accuracy is 20.43% higher than GPT-4 on TorchHub and 10.75% higher than ChatGPT, and up to 83% higher than LLaMA.
Gorilla's zero-shot hallucination rates are 6.98% on TorchHub, 10.95% on HuggingFace, and 5.40% on TensorFlow Hub, far below GPT-4's (36.55%, 37.16%, and 78.65%).
Fine-tuning with an oracle retriever improves performance over training without a retriever by 12.37% on TorchHub and 23.46% on HuggingFace.
Gorilla combined with an oracle retriever reaches 91.26% accuracy on HuggingFace and 94.16% on TensorFlow Hub.
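
The accuracy and hallucination figures above are measured with AST sub-tree matching. A minimal sketch of the idea using Python's standard `ast` module follows. It is not the paper's evaluation code: the helper names and the matching rule (same function name, reference positional arguments as a prefix, every reference keyword argument present with an equal value) are simplifying assumptions, and `densenet999` is a deliberately made-up model name.

```python
import ast


def _dotted_name(node):
    """Return e.g. 'torch.hub.load' for the func node of a call, or None."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


def _signature(call):
    """Reduce a Call node to (name, positional args, keyword args) as AST dumps."""
    args = [ast.dump(a) for a in call.args]
    kwargs = {k.arg: ast.dump(k.value) for k in call.keywords if k.arg}
    return _dotted_name(call.func), args, kwargs


def matches_reference(generated_code, reference_call):
    """True if some call in generated_code contains reference_call as a sub-tree."""
    ref_node = ast.parse(reference_call, mode="eval").body
    if not isinstance(ref_node, ast.Call):
        raise ValueError("reference_call must be a single call expression")
    ref_name, ref_args, ref_kwargs = _signature(ref_node)
    try:
        tree = ast.parse(generated_code)
    except SyntaxError:
        return False  # code that does not parse cannot match
    for node in ast.walk(tree):
        if not isinstance(node, ast.Call):
            continue
        name, args, kwargs = _signature(node)
        if (
            name == ref_name
            and args[: len(ref_args)] == ref_args
            and all(kwargs.get(k) == v for k, v in ref_kwargs.items())
        ):
            return True
    return False


reference = "torch.hub.load('pytorch/vision', 'densenet121')"
good = "import torch\nmodel = torch.hub.load('pytorch/vision', 'densenet121', pretrained=True)"
bad = "import torch\nmodel = torch.hub.load('pytorch/vision', 'densenet999', pretrained=True)"
```

Here `matches_reference(good, reference)` succeeds because the extra `pretrained=True` argument is ignored, while `bad` fails on the model name. Under this scheme, a generated call that matches no reference in the API database would be a candidate hallucination.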

How to Apply

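Build a database of your own API documentation (in JSON), retrieve the relevant document for each user query, and append it to the prompt as 'Use this API documentation for reference: {doc}' before passing it to Gorilla; this lets the model keep up with the latest API changes.
If you have no good retriever, zero-shot Gorilla still matches or beats GPT-4. A low-quality retriever can do more harm than good, so measure its accuracy before adopting it.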
To fine-tune a Gorilla-style model for your own domain APIs (REST APIs, for example), the same pipeline applies: structure the API documentation as JSON → auto-generate instructions with an LLM such as GPT-4 (Self-Instruct) → fine-tune LLaMA-7B.
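
The first two recommendations can be sketched end to end: index the API documents, retrieve the best match for a query, append it to the prompt, and measure the retriever's top-1 accuracy before relying on it. This is a minimal pure-Python sketch under stated assumptions. `BM25Index`, `build_prompt`, `top1_accuracy`, the regex tokenizer, the parameters `k1=1.5` and `b=0.75`, and the three sample documents with their descriptions are all illustrative; none of them come from the paper or from Gorilla's code.

```python
import json
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25Index:
    """Tiny Okapi BM25 index over JSON-style API documents (dicts)."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.docs = docs
        self.k1, self.b = k1, b
        texts = [" ".join(str(v) for v in d.values()) for d in docs]
        self.term_counts = [Counter(tokenize(t)) for t in texts]
        self.lengths = [sum(c.values()) for c in self.term_counts]
        self.avg_len = sum(self.lengths) / len(docs)
        n = len(docs)
        doc_freq = Counter(t for c in self.term_counts for t in c)
        # Smoothed IDF that stays positive even for very common terms.
        self.idf = {
            t: math.log(1 + (n - df + 0.5) / (df + 0.5)) for t, df in doc_freq.items()
        }

    def score(self, query, i):
        counts, length = self.term_counts[i], self.lengths[i]
        norm = self.k1 * (1 - self.b + self.b * length / self.avg_len)
        return sum(
            self.idf[t] * counts[t] * (self.k1 + 1) / (counts[t] + norm)
            for t in tokenize(query)
            if t in counts
        )

    def top1(self, query):
        best = max(range(len(self.docs)), key=lambda i: self.score(query, i))
        return self.docs[best]


def build_prompt(query, doc):
    """Append the retrieved document to the user query."""
    return (
        f"### User: {query}\n"
        f"Use this API documentation for reference: {json.dumps(doc)}\n"
        "### Assistant:"
    )


def top1_accuracy(index, labeled_queries):
    """Fraction of (query, expected api_name) pairs whose top-1 document is correct."""
    hits = sum(index.top1(q)["api_name"] == name for q, name in labeled_queries)
    return hits / len(labeled_queries)


# Hypothetical documents and labels, for illustration only.
docs = [
    {"api_name": "fasterrcnn_resnet50_fpn", "domain": "Object Detection",
     "description": "detect objects and bounding boxes in an image"},
    {"api_name": "densenet121", "domain": "Image Classification",
     "description": "classify an image into ImageNet categories"},
    {"api_name": "deeplabv3_resnet50", "domain": "Semantic Segmentation",
     "description": "segment an image into per-pixel classes"},
]
index = BM25Index(docs)
labeled = [
    ("detect objects in a photo", "fasterrcnn_resnet50_fpn"),
    ("classify this image", "densenet121"),
]
accuracy = top1_accuracy(index, labeled)
prompt = build_prompt(labeled[0][0], index.top1(labeled[0][0]))
```

If `accuracy` on a labeled sample of real queries turns out low, the summary's advice is to prefer the zero-shot prompt over attaching a weak retriever.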

Code Example

# Example prompts for Gorilla-style API calls
import json

from transformers import AutoModelForCausalLM, AutoTokenizer

# [Zero-shot: no retriever]
prompt = """
### User: I want to classify objects in an image using PyTorch.
Write a Python program to call the appropriate API from TorchHub.
### Assistant:
"""

# [With a retriever] The API document is fetched by a retriever and appended to the prompt
retrieved_api_doc = {
    "domain": "Object Detection",
    "framework": "PyTorch",
    "api_name": "fasterrcnn_resnet50_fpn",
    "api_call": "torch.hub.load('pytorch/vision', 'fasterrcnn_resnet50_fpn', pretrained=True)",
    "api_arguments": {"repo_or_dir": "pytorch/vision", "model": "fasterrcnn_resnet50_fpn", "pretrained": True}
}

# Serialize the document as JSON so the prompt does not carry a Python dict repr
prompt_with_retrieval = f"""
### User: I want to detect objects in an image using PyTorch.
Write a Python program to call the appropriate API from TorchHub.
Use this API documentation for reference: {json.dumps(retrieved_api_doc)}
### Assistant:
"""

# Load a Gorilla model from HuggingFace.
# Assumption: this model ID is carried over unverified; check the gorilla-llm page
# on the HuggingFace Hub for the exact name (LLaMA-based releases may be delta
# weights that must be merged with base LLaMA before loading).
MODEL_ID = "gorilla-llm/gorilla-7b-hf-v1"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

inputs = tokenizer(prompt_with_retrieval, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200)
# Decode only the newly generated tokens, skipping the echoed prompt
completion = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(completion, skip_special_tokens=True))

Terminology

Hallucination: When an LLM invents an API or function that does not exist. A typical example is using a nonexistent GitHub repository name as a model name.

Self-Instruct: A technique in which a model (such as GPT-4) generates the training data (instruction-answer pairs) itself, reducing the cost of manual labeling.

AST Sub-Tree Matching: A method that parses code into a tree (AST, Abstract Syntax Tree) and checks whether the API call is correct. It compares correctness structurally instead of running unit tests.

Retriever-Aware Training: Training with the retriever's documents included in the prompt from the start, so the model can respond flexibly when the API documentation changes at inference time.

BM25: A classic keyword-based document retrieval algorithm. An improvement on TF-IDF that accounts for term frequency and document length when ranking relevant documents.

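Zero-shot: Querying the model directly, with no additional examples or hints (here, no retrieved API documentation in the prompt). Fine-tuned Gorilla outperforms GPT-4 even in the zero-shot setting.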

APIBench: The API-call evaluation dataset introduced in this paper. It covers HuggingFace, TorchHub, and TensorFlow Hub, with 1,645 APIs and 16,450 instruction-API pairs in total.
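
The Self-Instruct step defined above can be sketched as prompt assembly plus parsing of the generator's reply. This is a hedged illustration, not the authors' pipeline. `build_self_instruct_prompt`, `to_training_pairs`, the prompt wording, the choice of three in-context seeds, and all sample data are assumptions. The default of ten new instructions per API simply mirrors the ratio of 16,450 pairs to 1,645 APIs stated above. No LLM is called; `fake_reply` stands in for the generator's output.

```python
import json
import random


def build_self_instruct_prompt(api_doc, seed_examples, n_seeds=3, n_new=10, rng=None):
    """Assemble a prompt asking a generator LLM for new instructions for one API."""
    rng = rng or random.Random(0)
    seeds = rng.sample(seed_examples, k=min(n_seeds, len(seed_examples)))
    lines = ["You write realistic user requests that a given API can fulfil.", ""]
    for i, ex in enumerate(seeds, start=1):
        lines.append(f"Example {i}")
        lines.append(f"API: {json.dumps(ex['api_doc'])}")
        lines.append(f"Instruction: {ex['instruction']}")
        lines.append("")
    lines.append(f"API: {json.dumps(api_doc)}")
    lines.append(
        f"Write {n_new} different instructions for this API, one per line, "
        "without naming the API itself."
    )
    return "\n".join(lines)


def to_training_pairs(api_doc, generated_text):
    """Turn the generator's line-separated output into instruction-API pairs."""
    instructions = [ln.strip() for ln in generated_text.splitlines() if ln.strip()]
    return [{"instruction": ins, "api_call": api_doc["api_call"]} for ins in instructions]


# Hypothetical hand-written seeds, for illustration only.
seed_examples = [
    {"api_doc": {"api_name": "densenet121"}, "instruction": "Tell me what animal is in this photo."},
    {"api_doc": {"api_name": "deeplabv3_resnet50"}, "instruction": "Separate the road from the sky in my dashcam frame."},
    {"api_doc": {"api_name": "resnet50"}, "instruction": "Sort my holiday pictures by what they show."},
    {"api_doc": {"api_name": "alexnet"}, "instruction": "Label each product image in my catalog."},
]
target = {
    "api_name": "fasterrcnn_resnet50_fpn",
    "api_call": "torch.hub.load('pytorch/vision', 'fasterrcnn_resnet50_fpn', pretrained=True)",
}
prompt = build_self_instruct_prompt(target, seed_examples)

# Stand-in for the generator's reply; a real pipeline would send `prompt` to an LLM.
fake_reply = "Find every car in this street photo.\n\nCount the people in my picture.\n"
pairs = to_training_pairs(target, fake_reply)
```

Each resulting pair can then be formatted as a fine-tuning example, with or without a retrieved document attached.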

Related Resources

Gorilla official project page (code, model, data, demo): https://gorilla.cs.berkeley.edu

Original Abstract

Large Language Models (LLMs) have seen an impressive wave of advances recently, with models now excelling in a variety of tasks, such as mathematical reasoning and program synthesis. However, their potential to effectively use tools via API calls remains unfulfilled. This is a challenging task even for today's state-of-the-art LLMs such as GPT-4, largely due to their inability to generate accurate input arguments and their tendency to hallucinate the wrong usage of an API call. We release Gorilla, a finetuned LLaMA-based model that surpasses the performance of GPT-4 on writing API calls. When combined with a document retriever, Gorilla demonstrates a strong capability to adapt to test-time document changes, enabling flexible user updates or version changes. It also substantially mitigates the issue of hallucination, commonly encountered when prompting LLMs directly. To evaluate the model's ability, we introduce APIBench, a comprehensive dataset consisting of HuggingFace, TorchHub, and TensorHub APIs. The successful integration of the retrieval system with Gorilla demonstrates the potential for LLMs to use tools more accurately, keep up with frequently updated documentation, and consequently increase the reliability and applicability of their outputs. Gorilla's code, model, data, and demo are available at https://gorilla.cs.berkeley.edu