LLMs Are Human-Level Prompt Engineers: Automatic Prompt Engineer (APE)

Large Language Models Are Human-Level Prompt Engineers

Nov 3, 2022 • Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han +4

TL;DR Highlight

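Give an LLM nothing but input-output examples and the APE algorithm generates and selects the best prompt on its own, matching or beating human-written prompts on all 24 Instruction Induction tasks.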

Who Should Read

ML engineers and LLM app developers who lose time hand-crafting prompts. It applies directly when you want to optimize the system prompt for a specific task.

Core Mechanics

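Proposes the APE (Automatic Prompt Engineer) pipeline: given a few input-output examples, an LLM generates tens to hundreds of prompt candidates, and the highest-scoring one is selected.
With InstructGPT (text-davinci-002), APE matches or exceeds human-written prompts on all 24 Instruction Induction tasks (24/24), with an IQM of 0.810 versus 0.749 for humans.
Zero-Shot Chain-of-Thought prompts can be optimized automatically as well. The APE-discovered 'Let's work this out in a step by step way to be sure we have the right answer.' beats the original 'Let's think step by step.', raising accuracy from 78.7 to 82.0 on MultiArith and from 40.7 to 43.0 on GSM8K.
Prepending the generated prompt to few-shot in-context learning keeps or improves performance on 21 of 24 tasks, and the prompt is up to 5x more token-efficient.
On TruthfulQA, prompts found by APE raise the share of answers that are both truthful and informative to over 40%, compared with 30% for the human-written prompt.
Prompts are tuned to the model that generated them: a prompt produced with InstructGPT performs much worse when run on GPT-3. APE works best when the generating model and the executing model are the same.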
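
The generate-and-select loop described above can be read as a black-box search problem. The following is a sketch of the objective, using notation assumed here for illustration: $\rho$ is a candidate instruction, $(Q, A)$ is a held-out input-output pair, $\mathcal{M}$ is the model that executes the prompt, and $f$ is the chosen score function (execution accuracy in the second expression).

```latex
\rho^{\star} \;=\; \arg\max_{\rho} \; \mathbb{E}_{(Q,A)}\big[\, f(\rho, Q, A) \,\big],
\qquad
f_{\text{exec}}(\rho, Q, A) \;=\; \mathbb{1}\big[\, \mathcal{M}([\rho; Q]) = A \,\big]
```

In practice the expectation is approximated by averaging over a small validation set, and the argmax is taken over the finite pool of candidates the LLM proposed.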

Evidence

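APE (InstructGPT) reaches an IQM of 0.810 across the 24 Instruction Induction tasks, above the human score of 0.749, and is at or above human level on 24/24 tasks.
On 17 of 21 BIG-Bench tasks, zero-shot performance is equal to or higher than with human-written prompts.
MultiArith: original CoT 78.7 → APE CoT 82.0; GSM8K: 40.7 → 43.0.
Sampling just 64 prompt candidates is enough to reach human-level performance, and performance increases monotonically with the number of samples.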

How to Apply

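To optimize the system prompt for a specific task: prepare 5 to 10 input-output examples, have GPT-4 or InstructGPT generate 50 prompt candidates with the 'The instruction was <COMPLETE>' template, run each candidate on a validation set and rank by accuracy, then adopt the top-ranked prompt.
To improve a Chain-of-Thought prompt: keep only the problems that 'Let's think step by step.' answers correctly to build a CoT dataset, then use APE to explore variations that begin with 'Let's' and pick the best-performing prefix.
If the initial prompt candidates fall short, apply Iterative Monte Carlo Search: run 3 to 5 rounds of a loop that rewrites and resamples the high-scoring prompts using the template 'Generate a variation of the following instruction while keeping the semantic meaning.'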
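
The resampling loop in the last step can be sketched as follows. This is a minimal illustration, not the authors' implementation: `score` and `propose` are hypothetical callables standing in for the validation-set scorer and the LLM call, and `keep_top` and `variations_per_survivor` are assumed knobs that the source does not specify.

```python
from typing import Callable, Dict, List, Tuple

# Resampling template quoted in the text; the Input/Output framing is an assumption.
RESAMPLE_TEMPLATE = (
    "Generate a variation of the following instruction while keeping the "
    "semantic meaning.\n\nInput: {instruction}\n\nOutput:"
)


def iterative_search(
    candidates: List[str],
    score: Callable[[str], float],    # hypothetical: prompt -> validation score
    propose: Callable[[str], str],    # hypothetical: meta-prompt -> new instruction
    rounds: int = 3,
    keep_top: int = 5,
    variations_per_survivor: int = 4,
) -> Tuple[str, float]:
    """Each round, keep the best candidates and resample variations of them."""
    if not candidates:
        raise ValueError("candidates must not be empty")

    cache: Dict[str, float] = {}

    def cached_score(prompt: str) -> float:
        # Scoring calls the model, so evaluate each distinct prompt only once.
        if prompt not in cache:
            cache[prompt] = score(prompt)
        return cache[prompt]

    pool = list(dict.fromkeys(candidates))  # de-duplicate, keep order
    for _ in range(rounds):
        survivors = sorted(pool, key=cached_score, reverse=True)[:keep_top]
        variations = [
            propose(RESAMPLE_TEMPLATE.format(instruction=s))
            for s in survivors
            for _ in range(variations_per_survivor)
        ]
        pool = list(dict.fromkeys(survivors + variations))

    best = max(pool, key=cached_score)
    return best, cached_score(best)
```

Survivors are carried into the next round alongside their variations, so the best score in the pool never decreases from one round to the next.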

Code Example

# APE core flow: Python sketch. Assumes the openai>=1.0 SDK and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()
# Model used in the paper; it may no longer be served, so swap in an available completion model.
MODEL = "text-davinci-002"

def generate_prompt_candidates(demos, n=50):
    """
    Sample n candidate instructions that could explain the demos.
    demos: input-output examples as [(input, output), ...]
    """
    demo_str = "\n".join(f"Input: {i}\nOutput: {o}" for i, o in demos)
    meta_prompt = (
        f"I gave a friend an instruction and {len(demos)} inputs. "
        "The friend read the instruction and wrote an output for every one of the inputs.\n"
        f"Here are the input-output pairs:\n{demo_str}\n"
        "The instruction was"
    )

    candidates = []
    for _ in range(n):
        resp = client.completions.create(
            model=MODEL,
            prompt=meta_prompt,
            max_tokens=50,
            temperature=0.9,  # high temperature for diverse candidates
        )
        candidates.append(resp.choices[0].text.strip())
    return candidates

def score_candidate(instruction, val_demos, model=MODEL):
    """Score a prompt by execution accuracy (exact match) on held-out demos."""
    if not val_demos:
        raise ValueError("val_demos must not be empty")
    correct = 0
    for q, a in val_demos:
        prompt = f"Instruction: {instruction}\nInput: {q}\nOutput:"
        resp = client.completions.create(model=model, prompt=prompt, max_tokens=20, temperature=0)
        pred = resp.choices[0].text.strip()
        if pred == a:
            correct += 1
    return correct / len(val_demos)

if __name__ == "__main__":
    # Main APE loop: generate candidates, score each one, keep the best
    train_demos = [("cat", "c"), ("dog", "d"), ("apple", "a")]  # toy examples
    val_demos = [("banana", "b"), ("orange", "o")]

    candidates = generate_prompt_candidates(train_demos, n=50)
    scores = [(c, score_candidate(c, val_demos)) for c in candidates]
    best_prompt, best_score = max(scores, key=lambda x: x[1])
    print(f"Best prompt: {best_prompt} (score: {best_score:.2f})")

# Template for Zero-shot CoT optimization: <INSERT> marks the slot APE fills in
# after "Let's"; {question} and {reasoning} are filled per example via str.format().
cot_meta_prompt = """
Instruction: Answer the following question.
Q: {question}
A: Let's <INSERT>. {reasoning}
"""

Terminology

APE: Short for Automatic Prompt Engineer. An automated system in which an LLM generates candidate prompts and picks the good ones, instead of a person writing prompts by hand.

Instruction Induction: Inferring, from only a few input-output examples, the natural-language instruction that describes the task. Example: given (cat→c, dog→d), recovering the instruction 'write the first letter of the word'.

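Zero-Shot Chain-of-Thought (Zero-Shot CoT): A technique in which appending a single line such as 'think step by step', with no examples, gets the LLM to show its intermediate reasoning and solve complex problems. It elicits reasoning ability through the prompt alone, without fine-tuning.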

IQM (Interquartile Mean): The mean of the scores that remain after discarding the top and bottom 25%. A stable performance metric that is less sensitive to extreme outliers.
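
A minimal sketch of the computation, assuming the simplest variant that drops whole items only (a quarter of the list, rounded down, from each end). Other implementations weight the boundary items fractionally, so results can differ slightly on small samples. The function name is illustrative.

```python
from statistics import mean
from typing import Sequence


def interquartile_mean(scores: Sequence[float]) -> float:
    """Mean of the values left after dropping the lowest and highest quarter."""
    if not scores:
        raise ValueError("scores must not be empty")
    ordered = sorted(scores)
    cut = len(ordered) // 4  # whole items to drop from each end
    return mean(ordered[cut:len(ordered) - cut])


# One extreme outlier (100) barely matters once the outer quarters are dropped.
print(interquartile_mean([0, 1, 2, 3, 4, 5, 6, 100]))  # 3.5
```

With fewer than four scores nothing is dropped and the function reduces to the plain mean.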

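Execution Accuracy: The fraction of cases in which the model's output matches the correct answer when the generated prompt is actually run on the model. It is the score APE uses to judge which prompt is better.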

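Black-box Optimization: Finding an optimum by observing only inputs and outputs, without knowledge of the internal structure. In APE, the best instruction is searched for by manipulating only the prompt text, leaving the LLM's internal weights untouched.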

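Iterative Monte Carlo Search: An iterative search that makes small variations of the current good candidates to produce better ones. It resembles hiking by trying small steps in several directions from your current position to find higher ground.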

Related Resources

APE GitHub code repository

Original Abstract

By conditioning on natural language instructions, large language models (LLMs) have displayed impressive capabilities as general-purpose computers. However, task performance depends significantly on the quality of the prompt used to steer the model, and most effective prompts have been handcrafted by humans. Inspired by classical program synthesis and the human approach to prompt engineering, we propose Automatic Prompt Engineer (APE) for automatic instruction generation and selection. In our method, we treat the instruction as the "program," optimized by searching over a pool of instruction candidates proposed by an LLM in order to maximize a chosen score function. To evaluate the quality of the selected instruction, we evaluate the zero-shot performance of another LLM following the selected instruction. Experiments on 24 NLP tasks show that our automatically generated instructions outperform the prior LLM baseline by a large margin and achieve better or comparable performance to the instructions generated by human annotators on 19/24 tasks. We conduct extensive qualitative and quantitative analyses to explore the performance of APE. We show that APE-engineered prompts can be applied to steer models toward truthfulness and/or informativeness, as well as to improve few-shot learning performance by simply prepending them to standard in-context learning prompts. Please check out our webpage at https://sites.google.com/view/automatic-prompt-engineer.