Gorilla: Large Language Model Connected with Massive APIs
TL;DR Highlight
An open-source LLM that outperforms GPT-4 at API calling, with far lower hallucination and more accurate code generation.
Who Should Read
Backend/ML engineers building LLM systems that automatically call external APIs or libraries, especially developers connecting HuggingFace, TorchHub, or TensorFlow Hub models to LLM agents.
Core Mechanics
- LLaMA-7B fine-tuned as Gorilla outperforms GPT-4 on API call accuracy and has far fewer hallucinations (making up non-existent APIs)
- Releases APIBench — a new benchmark with 1,645 APIs total (HuggingFace: 925, TorchHub: 94, TensorFlow Hub: 696) — 16,450 instruction-API pairs
- Retriever-Aware Training: training with API documentation in the prompt allows automatic adaptation at inference when documentation changes
- Without a high-quality retriever, zero-shot use is better: adding a BM25 retriever can actually hurt accuracy by 21-47% in some cases
- Can understand constraint conditions — handles complex queries like 'model with <10M parameters AND >70% ImageNet accuracy'
- GPT-4 shows severe hallucination on HuggingFace (uses non-existent GitHub repo names as model names) — Gorilla dramatically reduces this
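The constraint handling above can be pictured as a simple filter over an API catalog. A minimal, illustrative sketch — the registry entries and their parameter/accuracy numbers are approximate torchvision figures used only for illustration, and this is not Gorilla's internal mechanism:

```python
# Hypothetical API catalog; params_m and imagenet_acc are approximate.
models = [
    {"api_name": "resnet50", "params_m": 25.6, "imagenet_acc": 76.1},
    {"api_name": "mobilenet_v3_small", "params_m": 2.5, "imagenet_acc": 67.7},
    {"api_name": "efficientnet_b0", "params_m": 5.3, "imagenet_acc": 77.1},
]

def satisfies(m, max_params_m=10.0, min_acc=70.0):
    """Apply the user's constraints: <10M parameters AND >70% ImageNet accuracy."""
    return m["params_m"] < max_params_m and m["imagenet_acc"] > min_acc

candidates = [m["api_name"] for m in models if satisfies(m)]
print(candidates)  # → ['efficientnet_b0']
```

Only efficientnet_b0 satisfies both constraints here: resnet50 is too large, and mobilenet_v3_small falls short on accuracy.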
Evidence
- Zero-shot accuracy on TorchHub: Gorilla beats GPT-4 by 20.43%, ChatGPT by 10.75%, and LLaMA by 83%
- Gorilla zero-shot hallucination rates: TorchHub 6.98%, HuggingFace 10.95%, TensorFlow Hub 5.40%, far below GPT-4's (36.55%, 37.16%, 78.65%)
- Fine-tuning with oracle retriever: +12.37% TorchHub, +23.46% HuggingFace vs training without retriever
- Gorilla + Oracle retriever: HuggingFace 91.26%, TensorFlow Hub 94.16% accuracy
How to Apply
- Build an API documentation DB in JSON; at query time, fetch relevant docs with a retriever and pass them to Gorilla as 'Use this API documentation for reference: {doc}' so the prompt tracks the latest API changes automatically
- Without a good retriever, Gorilla zero-shot outperforms GPT-4; if retriever quality is low, it may be better not to use one at all, so measure end-to-end accuracy before adopting retrieval
- For Gorilla-style fine-tuning on your domain APIs (REST API etc.): structure API docs as JSON → auto-generate instructions with GPT-4/LLaMA (Self-Instruct) → fine-tune LLaMA-7B — the same pipeline applies directly
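The retrieve-then-prompt flow described above can be sketched as follows. This is a simplified stand-in: the API DB entries are toy examples, and the token-overlap scorer is a placeholder for a real BM25 (e.g. the rank_bm25 package) or dense retriever:

```python
# Toy API documentation DB; a real one would hold your full JSON docs.
api_db = [
    {"api_name": "fasterrcnn_resnet50_fpn", "domain": "Object Detection",
     "api_call": "torch.hub.load('pytorch/vision', 'fasterrcnn_resnet50_fpn', pretrained=True)"},
    {"api_name": "deeplabv3_resnet50", "domain": "Semantic Segmentation",
     "api_call": "torch.hub.load('pytorch/vision', 'deeplabv3_resnet50', pretrained=True)"},
]

def retrieve(query, db):
    """Return the doc sharing the most words with the query (BM25 stand-in)."""
    q = set(query.lower().split())
    def score(doc):
        words = set(" ".join(str(v) for v in doc.values()).lower().split())
        return len(q & words)
    return max(db, key=score)

query = "object detection model for PyTorch images"
doc = retrieve(query, api_db)
# Assemble the Gorilla-style prompt with the retrieved documentation inline.
prompt = (f"### User: {query}\n"
          f"Use this API documentation for reference: {doc}\n"
          "### Assistant:")
print(doc["api_name"])  # prints fasterrcnn_resnet50_fpn
```

Swapping the scorer for a stronger retriever changes nothing downstream, which is what makes it easy to A/B the with-retriever and zero-shot setups as recommended above.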
Code Example
# Gorilla-style API call prompt example
# [Zero-shot approach without Retriever]
prompt = """
### User: I want to classify objects in an image using PyTorch.
Write a Python program to call the appropriate API from TorchHub.
### Assistant:
"""
# [Approach with Retriever] - Fetch API documentation via retriever and append to prompt
retrieved_api_doc = {
    "domain": "Object Detection",
    "framework": "PyTorch",
    "api_name": "fasterrcnn_resnet50_fpn",
    "api_call": "torch.hub.load('pytorch/vision', 'fasterrcnn_resnet50_fpn', pretrained=True)",
    "api_arguments": {"repo_or_dir": "pytorch/vision", "model": "fasterrcnn_resnet50_fpn", "pretrained": True}
}
prompt_with_retrieval = f"""
### User: I want to detect objects in an image using PyTorch.
Write a Python program to call the appropriate API from TorchHub.
Use this API documentation for reference: {retrieved_api_doc}
### Assistant:
"""
# Load Gorilla model (from HuggingFace)
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("gorilla-llm/gorilla-7b-hf-v1")
model = AutoModelForCausalLM.from_pretrained("gorilla-llm/gorilla-7b-hf-v1")
inputs = tokenizer(prompt_with_retrieval, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Original Abstract
Large Language Models (LLMs) have seen an impressive wave of advances recently, with models now excelling in a variety of tasks, such as mathematical reasoning and program synthesis. However, their potential to effectively use tools via API calls remains unfulfilled. This is a challenging task even for today's state-of-the-art LLMs such as GPT-4, largely due to their inability to generate accurate input arguments and their tendency to hallucinate the wrong usage of an API call. We release Gorilla, a finetuned LLaMA-based model that surpasses the performance of GPT-4 on writing API calls. When combined with a document retriever, Gorilla demonstrates a strong capability to adapt to test-time document changes, enabling flexible user updates or version changes. It also substantially mitigates the issue of hallucination, commonly encountered when prompting LLMs directly. To evaluate the model's ability, we introduce APIBench, a comprehensive dataset consisting of HuggingFace, TorchHub, and TensorHub APIs. The successful integration of the retrieval system with Gorilla demonstrates the potential for LLMs to use tools more accurately, keep up with frequently updated documentation, and consequently increase the reliability and applicability of their outputs. Gorilla's code, model, data, and demo are available at https://gorilla.cs.berkeley.edu