Gorilla: Large Language Model Connected with Massive APIs
TL;DR Highlight
An open-source LLM that outperforms GPT-4 at API calling, with far lower hallucination and more accurate code generation.
Who Should Read
Backend/ML engineers building LLM systems that automatically call external APIs or libraries, especially developers connecting HuggingFace, TorchHub, or TensorFlow Hub models to LLM agents.
Core Mechanics
- LLaMA-7B fine-tuned as Gorilla outperforms GPT-4 on API call accuracy and has far fewer hallucinations (making up non-existent APIs)
- Releases APIBench — a new benchmark with 1,645 APIs total (HuggingFace: 925, TorchHub: 94, TensorFlow Hub: 696) — 16,450 instruction-API pairs
- Retriever-Aware Training: training with API documentation in the prompt allows automatic adaptation at inference when documentation changes
- Without a high-quality retriever, zero-shot use is better: adding a BM25 retriever can actually hurt accuracy by 21-47% in some cases
- Can understand constraint conditions — handles complex queries like 'model with <10M parameters AND >70% ImageNet accuracy'
- GPT-4 shows severe hallucination on HuggingFace (uses non-existent GitHub repo names as model names) — Gorilla dramatically reduces this
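The constraint handling above can be pictured as a simple filter over an API catalog. A minimal, illustrative sketch — the registry entries and their parameter/accuracy numbers are approximate torchvision figures used only for illustration, and this is not Gorilla's internal mechanism:

```python
# Hypothetical API catalog; params_m and imagenet_acc are approximate.
models = [
    {"api_name": "resnet50", "params_m": 25.6, "imagenet_acc": 76.1},
    {"api_name": "mobilenet_v3_small", "params_m": 2.5, "imagenet_acc": 67.7},
    {"api_name": "efficientnet_b0", "params_m": 5.3, "imagenet_acc": 77.1},
]

def satisfies(m, max_params_m=10.0, min_acc=70.0):
    """Apply the user's constraints: <10M parameters AND >70% ImageNet accuracy."""
    return m["params_m"] < max_params_m and m["imagenet_acc"] > min_acc

candidates = [m["api_name"] for m in models if satisfies(m)]
print(candidates)  # → ['efficientnet_b0']
```

Only efficientnet_b0 satisfies both constraints here: resnet50 is too large, and mobilenet_v3_small falls short on accuracy.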
Evidence
- Zero-shot accuracy on TorchHub: Gorilla beats GPT-4 by 20.43%, ChatGPT by 10.75%, and LLaMA by 83%
- Gorilla zero-shot hallucination rates: TorchHub 6.98%, HuggingFace 10.95%, TensorFlow Hub 5.40%, far below GPT-4's (36.55%, 37.16%, 78.65%)
- Fine-tuning with oracle retriever: +12.37% TorchHub, +23.46% HuggingFace vs training without retriever
- Gorilla + Oracle retriever: HuggingFace 91.26%, TensorFlow Hub 94.16% accuracy
How to Apply
- Build an API documentation DB in JSON; at query time, fetch relevant docs with a retriever and pass them to Gorilla as 'Use this API documentation for reference: {doc}' so the prompt tracks the latest API changes automatically
- Without a good retriever, Gorilla zero-shot outperforms GPT-4; if retriever quality is low, it may be better not to use one at all, so measure end-to-end accuracy before adopting retrieval
- For Gorilla-style fine-tuning on your domain APIs (REST API etc.): structure API docs as JSON → auto-generate instructions with GPT-4/LLaMA (Self-Instruct) → fine-tune LLaMA-7B — the same pipeline applies directly
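The retrieve-then-prompt flow described above can be sketched as follows. This is a simplified stand-in: the API DB entries are toy examples, and the token-overlap scorer is a placeholder for a real BM25 (e.g. the rank_bm25 package) or dense retriever:

```python
# Toy API documentation DB; a real one would hold your full JSON docs.
api_db = [
    {"api_name": "fasterrcnn_resnet50_fpn", "domain": "Object Detection",
     "api_call": "torch.hub.load('pytorch/vision', 'fasterrcnn_resnet50_fpn', pretrained=True)"},
    {"api_name": "deeplabv3_resnet50", "domain": "Semantic Segmentation",
     "api_call": "torch.hub.load('pytorch/vision', 'deeplabv3_resnet50', pretrained=True)"},
]

def retrieve(query, db):
    """Return the doc sharing the most words with the query (BM25 stand-in)."""
    q = set(query.lower().split())
    def score(doc):
        words = set(" ".join(str(v) for v in doc.values()).lower().split())
        return len(q & words)
    return max(db, key=score)

query = "object detection model for PyTorch images"
doc = retrieve(query, api_db)
# Assemble the Gorilla-style prompt with the retrieved documentation inline.
prompt = (f"### User: {query}\n"
          f"Use this API documentation for reference: {doc}\n"
          "### Assistant:")
print(doc["api_name"])  # prints fasterrcnn_resnet50_fpn
```

Swapping the scorer for a stronger retriever changes nothing downstream, which is what makes it easy to A/B the with-retriever and zero-shot setups as recommended above.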
Code Example
# Gorilla-style API call prompt example
# [Zero-shot approach without Retriever]
prompt = """
### User: I want to classify objects in an image using PyTorch.
Write a Python program to call the appropriate API from TorchHub.
### Assistant:
"""
# [Approach with Retriever] - Fetch API documentation via retriever and append to prompt
retrieved_api_doc = {
    "domain": "Object Detection",
    "framework": "PyTorch",
    "api_name": "fasterrcnn_resnet50_fpn",
    "api_call": "torch.hub.load('pytorch/vision', 'fasterrcnn_resnet50_fpn', pretrained=True)",
    "api_arguments": {"repo_or_dir": "pytorch/vision", "model": "fasterrcnn_resnet50_fpn", "pretrained": True}
}
prompt_with_retrieval = f"""
### User: I want to detect objects in an image using PyTorch.
Write a Python program to call the appropriate API from TorchHub.
Use this API documentation for reference: {retrieved_api_doc}
### Assistant:
"""
# Load Gorilla model (from HuggingFace)
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("gorilla-llm/gorilla-7b-hf-v1")
model = AutoModelForCausalLM.from_pretrained("gorilla-llm/gorilla-7b-hf-v1")
inputs = tokenizer(prompt_with_retrieval, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Original Abstract
Large Language Models (LLMs) have seen an impressive wave of advances recently, with models now excelling in a variety of tasks, such as mathematical reasoning and program synthesis. However, their potential to effectively use tools via API calls remains unfulfilled. This is a challenging task even for today's state-of-the-art LLMs such as GPT-4, largely due to their inability to generate accurate input arguments and their tendency to hallucinate the wrong usage of an API call. We release Gorilla, a finetuned LLaMA-based model that surpasses the performance of GPT-4 on writing API calls. When combined with a document retriever, Gorilla demonstrates a strong capability to adapt to test-time document changes, enabling flexible user updates or version changes. It also substantially mitigates the issue of hallucination, commonly encountered when prompting LLMs directly. To evaluate the model's ability, we introduce APIBench, a comprehensive dataset consisting of HuggingFace, TorchHub, and TensorHub APIs. The successful integration of the retrieval system with Gorilla demonstrates the potential for LLMs to use tools more accurately, keep up with frequently updated documentation, and consequently increase the reliability and applicability of their outputs. Gorilla's code, model, data, and demo are available at https://gorilla.cs.berkeley.edu