AMEL: The Cumulative Bias Effect of Conversation History on LLM Judgments

TL;DR Highlight

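A study demonstrating, across 75,898 API calls, that when an LLM is used as an automated evaluator, the positive or negative tone of earlier conversation history contaminates its later judgments.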

Who Should Read

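ML engineers and backend developers who put LLMs into batch evaluation pipelines such as code review, content moderation, or automated grading. Especially relevant if you run a system that evaluates multiple items within a single conversation session.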

Core Mechanics

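When the earlier evaluations in a conversation are mostly 'reject', the next evaluation also drifts toward 'reject'. This effect (AMEL) was found in all 11 models across 4 families: OpenAI, Anthropic, Google, and open-source models. For 9 of the 11 models it remains significant after Bonferroni correction.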
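The study tested GPT-4.1 Nano, GPT-5.2, Claude Haiku/Sonnet/Opus, Gemini Flash/Pro, Llama 3.2 3B, and the Qwen3 series. Even recent large models such as GPT-5.2 and Opus 4.6 show d=-0.17, so they are not immune. Larger models are less susceptible, but the effect is not zero.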
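On ambiguous items where the model is unsure of its judgment (high baseline entropy), the bias is roughly twice as strong (uncertain items d=-0.34 vs confident items d=-0.15). The borderline cases that most need a fair judgment are the ones contaminated most.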
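Negative histories induce 1.62x more bias than positive histories (paired t-test, p<10⁻³⁹). Even a neutral 50/50 history produces a stronger shift in the negative direction than a positive history does: any history at all tilts the model toward 'reject'.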
The bias saturates after only 5 turns and does not get worse even at 50 turns (Spearman |r|<0.01, p>0.94). More context does not mean more bias; 5 turns are already enough to reach the maximum.
Lowering temperature does not reduce AMEL; if anything it tends to get stronger (T=1.0: d=-0.28 vs T=0.3: d=-0.69). The intuition that 'lower temperature means more stable' does not hold here.

Evidence

Across all 75,898 API calls (11 models, 4 providers), the overall bias effect is d=-0.17, p<10⁻⁴⁶. Mean bias score = -0.052.
Paired comparison of negative vs positive histories: |BS| ratio 1.62× (t=13.46, p<10⁻³⁹, n=2,481 pairs). Negative history sticks much harder.
Difference in bias magnitude between 5 turns and 50 turns: Spearman |r|<0.01, OLS slope p=0.80. No practical difference.
Anthropic scaling: Haiku d=-0.22 → Sonnet d=-0.18 → Opus d=-0.17. OpenAI: Nano d=-0.34 → GPT-5.2 d=-0.17. Scaling reduces the effect but does not eliminate it.
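
For readers who want to sanity-check numbers like these on their own pipeline, here is a minimal sketch of the two statistics the summary leans on: Cohen's d and a Bonferroni-adjusted significance threshold. It uses the textbook pooled-standard-deviation form of d for two independent samples; this summary does not say which estimator the paper uses, so treat that choice as an assumption. The function names and the sample data are hypothetical and purely illustrative.

```python
import math
from statistics import mean, variance


def cohens_d(treated: list[float], control: list[float]) -> float:
    """Cohen's d for two independent samples, using a pooled standard deviation.

    Assumption: this is the textbook form, not necessarily the paper's estimator.
    """
    n1, n2 = len(treated), len(control)
    pooled_var = (
        (n1 - 1) * variance(treated) + (n2 - 1) * variance(control)
    ) / (n1 + n2 - 2)
    return (mean(treated) - mean(control)) / math.sqrt(pooled_var)


def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


if __name__ == "__main__":
    # Made-up scores: judgments with a biased history vs. judgments in isolation.
    with_history = [0.0, 1.0, 0.0, 0.0, 1.0, 0.0]
    isolated = [1.0, 1.0, 0.0, 1.0, 1.0, 0.0]
    print("d =", cohens_d(with_history, isolated))
    print("per-test alpha for 21 tests =", bonferroni_alpha(0.05, 21))
```

A negative d here means the scores with history are lower on average than the isolated scores, which matches the sign convention of the effect sizes reported above.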

How to Apply

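If your code review bot or content moderation pipeline evaluates multiple items inside one conversation, change it to start a new conversation (fresh context) for each item. This costs more, but it is the only fix that closes the bias channel completely.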
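If cost makes batching unavoidable, shuffle the item order and balance the expected positive and negative outcomes. Never sort items as 'good ones first, bad ones last': that creates a self-reinforcing bias loop.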
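A two-stage strategy also works: use first-token logprobs to detect borderline items whose output is non-deterministic (entropy > 0), and apply fresh context only to those. Using B-SCORE or a logprobs-based correction library (RBCorr) alongside it as a runtime corrector can partially compensate in batch settings as well.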
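
The two-stage idea in the last item can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes you have already collected first-token log-probabilities for the candidate verdict tokens of each item (for example via the `logprobs` / `top_logprobs` options of the Chat Completions API), and it only shows the routing step. The function names, the `item_logprobs` layout, and the 0.05-bit cutoff are all hypothetical. The text above uses entropy > 0 as the criterion; real logprobs are rarely exactly zero-entropy, so a small tolerance is assumed here instead.

```python
import math


def first_token_entropy(logprobs: dict[str, float]) -> float:
    """Shannon entropy in bits over the candidate first tokens.

    Probabilities are renormalized over the supplied candidates, so a
    perfect 50/50 split between two tokens gives exactly 1 bit.
    """
    probs = [math.exp(lp) for lp in logprobs.values()]
    total = sum(probs)
    entropy = 0.0
    for p in probs:
        q = p / total
        if q > 0.0:  # skip zero-probability tokens (0 * log 0 is taken as 0)
            entropy -= q * math.log2(q)
    return entropy


def route_items(
    item_logprobs: dict[str, dict[str, float]],
    threshold: float = 0.05,  # hypothetical cutoff in bits, tune per pipeline
) -> tuple[list[str], list[str]]:
    """Split item ids into (needs_fresh_context, can_be_batched)."""
    fresh: list[str] = []
    batch: list[str] = []
    for item_id, lps in item_logprobs.items():
        if first_token_entropy(lps) > threshold:
            fresh.append(item_id)
        else:
            batch.append(item_id)
    return fresh, batch


if __name__ == "__main__":
    observed = {
        "pr-101": {"yes": math.log(0.5), "no": math.log(0.5)},      # borderline
        "pr-102": {"yes": math.log(0.999), "no": math.log(0.001)},  # confident
    }
    fresh, batch = route_items(observed)
    print("fresh context:", fresh)
    print("batchable:", batch)
```

Items routed to the first list would be re-evaluated one per conversation; the rest can stay in a batch, where the ordering advice above still applies.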

Code Example

# AMEL mitigation pattern: use a fresh context for each evaluated item
import openai

client = openai.OpenAI()

def evaluate_item_fresh(item: str, system_prompt: str) -> str:
    """
    AMEL mitigation: start a new conversation for every item.
    Never reuse the messages list across items.
    """
    response = client.chat.completions.create(
        model="gpt-4.1-nano",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": item}
        ],
        max_tokens=100
    )
    return response.choices[0].message.content

# Bad pattern (AMEL occurs)
def bad_batch_evaluate(items: list[str], system_prompt: str) -> list[str]:
    messages = [{"role": "system", "content": system_prompt}]
    results = []
    for item in items:  # history accumulates across items, which introduces the bias
        messages.append({"role": "user", "content": item})
        resp = client.chat.completions.create(model="gpt-4.1-nano", messages=messages)
        answer = resp.choices[0].message.content
        messages.append({"role": "assistant", "content": answer})
        results.append(answer)
    return results

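# Good pattern (avoids AMEL)
def good_batch_evaluate(items: list[str], system_prompt: str) -> list[str]:
    # Every call starts from a fresh context, so no history carries over and
    # item order cannot influence a verdict. Results stay aligned with `items`
    # (shuffling here would silently misalign verdicts and inputs).
    return [evaluate_item_fresh(item, system_prompt) for item in items]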
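
When batching is unavoidable, the advice in How to Apply is to shuffle and balance the expected outcomes rather than sort them. The sketch below shows one way to build such an ordering. It assumes you already have a rough prior guess of each item's outcome (for example from a cheap heuristic); the function name `balanced_order` and the split into two input lists are hypothetical, and this is an illustration of the idea rather than a procedure taken from the paper.

```python
import random
from itertools import zip_longest
from typing import Optional

_MISSING = object()  # filler used when one list is longer than the other


def balanced_order(
    expected_accept: list[str],
    expected_reject: list[str],
    seed: Optional[int] = None,
) -> list[str]:
    """Interleave expected-accept and expected-reject items in random order.

    Each consecutive pair holds one item of each polarity (in random order)
    for as long as both lists last; any surplus from the longer list is
    appended at the end, so heavily imbalanced inputs still end in a run.
    """
    rng = random.Random(seed)
    accept, reject = expected_accept.copy(), expected_reject.copy()
    rng.shuffle(accept)
    rng.shuffle(reject)

    ordered: list[str] = []
    for a, r in zip_longest(accept, reject, fillvalue=_MISSING):
        pair = [x for x in (a, r) if x is not _MISSING]
        rng.shuffle(pair)  # avoid a fixed accept-then-reject rhythm
        ordered.extend(pair)
    return ordered


if __name__ == "__main__":
    print(balanced_order(["ok-1", "ok-2", "ok-3"], ["bad-1", "bad-2"], seed=42))
```

Keep a mapping from each item back to its original position if downstream code expects verdicts in input order.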

Terminology

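AMEL: Accumulated message effect on LLM judgments. Within one conversation, the tone (positive/negative) of earlier evaluations contaminates later ones. It is like a code reviewer who rejects 10 PRs in a row and then looks at the 11th more harshly.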

Cohen's d: A statistic that expresses effect size. By convention d=0.2 is a small effect, d=0.5 medium, and d=0.8 large. In this paper d=-0.17 is 'small but not negligible'.

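Bonferroni correction: A method that corrects for the chance of a result looking significant by accident when several statistical tests are run at once. With 21 tests, the significance threshold is applied 21 times more strictly.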

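sycophancy: The tendency of an LLM to agree with the user's opinion. AMEL is different: even with no user opinion present, the model is pulled along by the pattern of its own earlier answers.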

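logprobs: The log of the probability the model assigns to each candidate token when choosing the next token. They let you see numerically whether the model is more confident in 'yes' or 'no'.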

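baseline entropy: A measure of how unsure the model is about an item when judged without history. For a binary verdict measured in bits, entropy=0 means it always gives the same answer and entropy=1 means it is 50/50 uncertain.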

RLHF: Reinforcement Learning from Human Feedback. A technique that further trains a model to produce answers that people rated as good more often. The paper suggests RLHF may reinforce 'reject' patterns and thereby create the negativity asymmetry.

fresh context: A clean new conversation session with no prior history at all. It is the most reliable way to prevent AMEL: start a new conversation for each evaluated item.

Related Resources

AMEL GitHub Repository (code, data, analysis scripts)

Original Abstract

Large language models are routinely used as automated evaluators: to review code, moderate content, or score outputs, often with many items passing through one conversation. We ask whether the polarity of prior conversation history biases subsequent judgments, an effect we call the accumulated message effect on LLM judgments (AMEL). Across 75,898 API calls to 11 models from 4 providers (OpenAI, Anthropic, Google, and four open-source models), we present identical test items in isolation or following histories saturated with predominantly positive or negative evaluations. Models shift toward the conversation's prevailing polarity (d = -0.17, p < 10^-46). The effect concentrates on items where the model is genuinely uncertain at baseline (d = -0.34 for high-entropy items, vs d = -0.15 when the baseline is deterministic). Bias does not grow with context length: 5 prior turns and 50 produce the same shift (Spearman |r| < 0.01; OLS slope p = 0.80). And there is a negativity asymmetry: paired per item, negative histories induce 1.62x more bias than positive (t = 13.46, p < 10^-39, n = 2,481). Scaling helps but does not solve it (Anthropic: Haiku -0.22 to Opus -0.17; OpenAI: Nano -0.34 to GPT-5.2 -0.17). Three follow-ups narrow the mechanism. The token probability distribution shifts continuously, not at a threshold. The negativity asymmetry has both token-level and semantic components, though attributing the balance is exploratory at our sample sizes. Position does not matter: five biased turns anywhere in a 50-turn history produce the same shift. The simplest fix for evaluation pipelines is a fresh context per item; when batching is unavoidable, balancing the history helps.