An RLHF Framework for Code-Generation LLMs That Aligns Crowd-sourced Human Feedback with Bayesian Inference (cRLHF)

Aligning Crowd-sourced Human Feedback for Reinforcement Learning on Code Generation by Large Language Models

Mar 19, 2025 • M. Wong, C. Tan

TL;DR Highlight

cRLHF is a framework that improves LLM code generation quality by aggregating several people's line-level code ratings in a Bayesian way, with no separate reward model.

Who Should Read

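ML engineers who want to apply RLHF to code generation models, in particular teams that want to use crowd-sourced feedback without paying the cost of training a reward model. It is especially relevant when bringing a GitHub Copilot-style tool in-house or fine-tuning a domain-specific code LLM.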

Core Mechanics

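cRLHF: multiple annotators label each line of code as correct or wrong, and Bayesian inference aggregates the labels into a reward score automatically, so the separate reward model training step disappears
Honeypot questions (problems whose answers are known in advance) automatically measure each annotator's reliability value pi, giving a built-in quality filter that down-weights inaccurate feedback
Fine-tuning with TRL + LoRA + PPO was applied to 17 models (up to 15B parameters), including WizardCoder-15B, StarCoder2-15B, CodeLlama-13B/7B, and DeepSeek-Coder-6.7B
The cRLHF effect is more pronounced for larger models, while small models (such as PolyCoder-160M) show little change
L1 regularization (sparse regularization) automatically filters out noisy annotators and prevents overfitting
Annotator reliability can also be updated by optimizing a regularized logistic regression (a one-shot global estimate) instead of iterating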
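
The last two points can be illustrated with a minimal sketch, under stated assumptions: each annotator's weight w_i stands in for logit(pi_i), the weights are fit in one pass on lines with known answers, and an L1 penalty pushes the weights of uninformative annotators to zero. The function name `fit_reliability_l1`, the proximal-gradient solver, and the hyperparameters `lam`, `lr`, and `steps` are illustrative choices, not the paper's exact objective or optimizer.

```python
import numpy as np

def fit_reliability_l1(votes, labels, lam=0.02, lr=0.1, steps=2000):
    """
    votes:  (n_items, n_annotators) array of +1 / -1 votes (0 = no vote)
    labels: (n_items,) known ground truth, +1 correct / -1 wrong
    Returns w, where w[i] plays the role of logit(pi_i) for annotator i.
    """
    votes = np.asarray(votes, dtype=float)
    labels = np.asarray(labels, dtype=float)
    w = np.zeros(votes.shape[1])
    for _ in range(steps):
        margin = labels * (votes @ w)
        # sigma(-margin), written with tanh for numerical stability.
        sig = 0.5 * (1.0 - np.tanh(margin / 2.0))
        # Gradient of the mean logistic loss log(1 + exp(-margin)).
        grad = -(votes * (labels * sig)[:, None]).mean(axis=0)
        w = w - lr * grad
        # Soft-thresholding: proximal step for the penalty lam * ||w||_1.
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)
    return w

# Made-up data: 3 annotators voting on 6 honeypot lines.
votes = [[1, 1, 1], [-1, -1, 1], [1, -1, -1],
         [-1, -1, -1], [1, 1, -1], [-1, 1, 1]]
labels = [1, -1, 1, -1, 1, -1]
print(fit_reliability_l1(votes, labels))
```

A weight that ends at exactly zero means that annotator's votes are ignored in the aggregate, which is one way to read the "automatic filtering" described above.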

Evidence

Average over 17 models on the HumanEval benchmark: Pass@1 +0.2%, Pass@10 +0.3%, Pass@100 +1.2%
Average on the MBPP benchmark: Pass@1 +0.2%, Pass@10 +0.6%, Pass@100 +0.6%
StarCoder2-15B on MBPP: Pass@100 84.2% → 84.6%, Pass@10 43.3% → 44.1%
On HumanEval, 12 of the 17 models improved on Pass@10; on MBPP, 10 models improved on both Pass@10 and Pass@100

How to Apply

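When building a code annotation platform, set the first N problems as honeypots (problems whose answers you know but do not reveal) to initialize each annotator's pi value; low-quality participants can then be filtered out automatically
When fine-tuning an existing code LLM with the TRL library's PPO Trainer plus LoRA, cRLHF's aligned score (s = number of correct lines / total number of lines) can be fed in directly as the reward value in place of a reward model
If you have internal code review data or teammates' evaluation results, the same Bayesian aggregation logic can turn multiple reviewers' opinions into a single reward score that can be reused for fine-tuning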
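
The honeypot step above could look roughly like the sketch below. It assumes that the initial pi is simply a smoothed accuracy on the honeypot lines and that annotators at or below a cutoff are dropped. The names `init_pi_from_honeypots` and `filter_annotators`, the pseudo-counts, and the 0.5 cutoff are all hypothetical; the summary does not specify how the paper initializes pi.

```python
def init_pi_from_honeypots(answers, truths, prior_hits=1, prior_misses=1):
    """
    answers: one annotator's +1/-1 labels on the honeypot lines
    truths:  known +1/-1 ground truth for the same lines
    Returns a smoothed accuracy used as the initial pi.
    The pseudo-counts (an assumption) keep pi strictly between 0 and 1,
    so logit(pi) stays finite.
    """
    hits = sum(1 for a, t in zip(answers, truths) if a == t)
    return (hits + prior_hits) / (len(truths) + prior_hits + prior_misses)

def filter_annotators(pi_by_annotator, min_pi=0.5):
    """Keep annotators whose estimated reliability exceeds min_pi (assumed cutoff)."""
    return {name: pi for name, pi in pi_by_annotator.items() if pi > min_pi}

# Made-up honeypot results for three annotators.
truths = [1, -1, 1, 1, -1]
honeypot_answers = {
    "alice": [1, -1, 1, 1, -1],   # 5 of 5 match the known answers
    "bob":   [1, 1, 1, -1, -1],   # 3 of 5 match
    "carol": [-1, 1, -1, 1, 1],   # 1 of 5 match
}
pi_by_annotator = {name: init_pi_from_honeypots(ans, truths)
                   for name, ans in honeypot_answers.items()}
trusted = filter_annotators(pi_by_annotator)
print(pi_by_annotator)
print(sorted(trusted))
```

Dropping low-pi annotators is only one option: in a log-odds aggregate, an annotator with pi below 0.5 contributes a negative weight, so their votes are effectively inverted rather than ignored.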

Code Example

import numpy as np

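EPS = 1e-6  # keeps logit finite at p = 0 or p = 1

def logit(p):
    # Without clipping, p = 1.0 (the default p_bar in update_pi) divides by zero.
    p = np.clip(p, EPS, 1 - EPS)
    return np.log(p / (1 - p))

def logit_inv(x):
    # Sigmoid, equal to exp(x) / (1 + exp(x)) but stable for large |x|.
    return 0.5 * (1 + np.tanh(x / 2))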

def bayesian_aggregate(annotations, pi_values):
    """
    annotations: list of +1 (correct) or -1 (wrong) from each annotator
    pi_values:   list of each annotator's reliability score (0~1)
    returns:     P(line is correct | all annotations)
    """
    score = sum(eps * logit(pi) for eps, pi in zip(annotations, pi_values))
    return logit_inv(score)

def update_pi(pi, correctness, p_bar=1.0, lam=1.0):
    """
    pi:          current annotator reliability
    correctness: mu = +1 if annotation was right, -1 if wrong
    p_bar:       system confidence in ground truth (1.0 for honeypot)
    lam:         factor that scales the log-odds update
    """
    return logit_inv(logit(pi) + lam * correctness * logit(p_bar))

# Example: three annotators rate a single line of code
annotations = [1, 1, -1]       # annotators 1 and 2 say correct, annotator 3 says wrong
pi_values   = [0.8, 0.7, 0.3]  # reliability of each annotator

prob_correct = bayesian_aggregate(annotations, pi_values)
print(f"Probability that this line is correct: {prob_correct:.3f}")

# reward score = number of correct lines / total number of lines
# This value is used directly as the PPO reward
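
The closing comment describes the reward only in words. A minimal sketch of that step follows, assuming a line counts as correct when its aggregated probability exceeds 0.5. The function names and the `threshold` parameter are hypothetical, and each pi must lie strictly between 0 and 1.

```python
import math

def line_probability(annotations, pi_values):
    """P(line is correct | votes), using the same log-odds sum as bayesian_aggregate."""
    score = sum(eps * math.log(pi / (1 - pi))
                for eps, pi in zip(annotations, pi_values))
    return 1 / (1 + math.exp(-score))

def aligned_score(lines_annotations, pi_values, threshold=0.5):
    """Fraction of lines whose aggregated probability exceeds threshold (assumed cutoff)."""
    if not lines_annotations:
        return 0.0
    correct = sum(1 for votes in lines_annotations
                  if line_probability(votes, pi_values) > threshold)
    return correct / len(lines_annotations)

# Made-up votes from three annotators on a four-line snippet.
pi_values = [0.8, 0.7, 0.3]
snippet_votes = [
    [1, 1, -1],
    [-1, -1, 1],
    [1, 1, 1],
    [1, -1, -1],
]
reward = aligned_score(snippet_votes, pi_values)
print(f"aligned score s = {reward:.2f}")
```

The resulting scalar is what would be passed to the PPO step as the reward for the generated snippet; how it is wired into a particular trainer depends on the library version in use.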

Terminology

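RLHF: A technique that further trains an LLM using human judgments of "this one is better" as the reward signal. ChatGPT's human-like conversational style was also trained this way.

PPO: Proximal Policy Optimization. An RL training algorithm that improves the model's policy a little at a time without changing it too abruptly. It is like turning the steering wheel only slightly while learning to drive.

Pass@k: The probability that at least one of k code samples generated by the LLM passes the tests. Pass@1 allows a single attempt; Pass@100 counts as a success if any one of 100 attempts is correct.

LoRA: A lightweight fine-tuning technique that trains only small added adapter matrices instead of retraining the whole model. It uses far less GPU memory while achieving similar results.

Bayesian inference: A method that updates a probability by combining prior knowledge with new evidence. It is useful for handling, mathematically, the uncertainty that an annotator may be wrong.

Honeypot question: A problem with an already known answer, quietly mixed in among the tasks to measure how trustworthy a participant is. It is similar to a trap item in a survey that asks you to check "I am a robot".

SFT: Supervised Fine-Tuning. A method that trains the model to imitate given example answers. It is used before RLHF to build the model's basic capabilities first.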
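
For the Pass@k entry above, a common way to estimate the metric is to generate n samples per problem (n at least k), count the c samples that pass, and compute 1 - C(n - c, k) / C(n, k). The sketch below implements that estimator. This summary does not say which estimator the paper used, so treat this as the widely used convention from the HumanEval evaluation setup rather than the authors' exact procedure.

```python
from math import comb

def pass_at_k(n, c, k):
    """
    n: number of samples generated for a problem
    c: number of those samples that pass the tests
    k: attempt budget
    Returns 1 - C(n - c, k) / C(n, k), the chance that a random
    size-k subset of the n samples contains at least one passing sample.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset must include a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Made-up counts: 200 samples for one problem, 20 of which pass.
print(pass_at_k(n=200, c=20, k=1))
print(pass_at_k(n=200, c=20, k=10))
```

A benchmark score is then the mean of this value over all problems.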

Original Abstract

This paper studies how AI-assisted programming and large language models (LLM) improve software developers' ability via AI tools (LLM agents) like Github Copilot and Amazon CodeWhisperer, while integrating human feedback to enhance reinforcement learning (RLHF) with crowd-sourced computation to enhance text-to-code generation. Additionally, we demonstrate that our Bayesian optimization framework supports AI alignment in code generation by distributing the feedback collection burden, highlighting the value of collecting human feedback of good quality. Our empirical evaluations demonstrate the efficacy of this approach, showcasing how LLM agents can be effectively trained for improved text-to-code generation. Our Bayesian optimization framework can be designed for general domain-specific languages, promoting the alignment of large language model capabilities with human feedback in AI-assisted programming for code generation.