Large Language Models For Text Classification: Case Study And Comprehensive Review

Jan 14, 2025 • A. Kostina, M. Dikaiakos, Dimosthenis Stefanidis +1

TL;DR Highlight

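A benchmark study that compares 10 LLMs, including GPT-4 and Llama3, against traditional ML models and measures the trade-off between text classification performance and speed.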

Who Should Read

Backend/ML engineers building a text classification pipeline, or developers weighing when to choose an LLM over a traditional ML model.

Core Mechanics

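On the more complex 3-class task, GPT-4-turbo (87.6%) and Llama3 70B (87.1%) outperformed RoBERTa (83.8%) and SVM (68.7%), but at a steep cost in inference time: about 2500 s for GPT-4-turbo versus 15 s for RoBERTa.
On the simpler binary task (fake news detection), RoBERTa reached 93.0% and beat GPT-4-turbo (83.7%); NB and SVM, at 88-90%, also outperformed most of the LLMs.
CoT (Chain-of-Thought, prompting the model to reason step by step) improved performance most consistently; combined with Few-shot, it raised F1 by up to 22.2 percentage points over plain ZS.
The Role-Playing + Naming-the-Assistant combination had inconsistent effects: it helped some models and hurt others.
AWQ-quantized Mistral-OO (AWQ is a technique that compresses model weights to 4-8 bits) scored 4.5% higher on Employee Reviews than the unquantized standard Mistral, so quantization does not necessarily mean a loss in performance.
Weaker models were more sensitive to changes in prompt wording (for Llama2, the gap between the best and worst F1 on the same task reached 42.3%).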

Evidence

3-class classification: GPT-4-turbo 87.6%, Llama3 70B 87.1%, RoBERTa 83.8%, SVM 68.7% (weighted F1-score)
Binary classification: RoBERTa 93.0%, Llama3 70B 94.4%, NB 90.0%, SVM 88.8%, GPT-4-turbo 83.7% at best
Inference time: GPT-4-turbo ~2500 s, RoBERTa 15 s, SVM/NB under 1 s (on 1000 Employee Reviews samples)
Prompt effect: a combination that includes CoT (FS+CoT+RP+NA) improved F1 by up to 22.2 percentage points over plain ZS (Employee Reviews, Xwin model)
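
The scores above are weighted F1-scores. As a rough illustration of how that metric is computed, here is a minimal pure-Python sketch; the function name `weighted_f1` and the toy labels are assumptions made for this example and do not come from the paper.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighting each class by its support."""
    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += f1 * n / total  # larger classes contribute more
    return score


# Toy data, made up for illustration only
y_true = ["remote", "remote", "remote", "onsite", "none"]
y_pred = ["remote", "remote", "onsite", "onsite", "none"]
print(round(weighted_f1(y_true, y_pred), 3))  # 0.813
```

In practice a library routine such as scikit-learn's `f1_score(y_true, y_pred, average="weighted")` would normally be used; the sketch only makes the weighting explicit.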

How to Apply

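For binary classification tasks where cost and speed matter (spam, sentiment analysis, etc.), a fine-tuned RoBERTa or an SVM is more practical than an LLM: accuracy is comparable and inference is roughly two orders of magnitude or more faster (15 s for RoBERTa versus about 2500 s for GPT-4-turbo on 1000 Employee Reviews samples).
For complex multiclass tasks that do call for an LLM, do not start from plain ZS; test ZS+CoT or the FS+CoT+RP+NA combination first.
If the model budget is limited, try an AWQ-quantized model: as the Mistral-OO case suggests, when the fine-tuning dataset is of high quality, performance can hold up or even improve after quantization.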
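
To compare prompt variants such as ZS, ZS+CoT and FS+CoT+RP+NA side by side, one option is a small builder that toggles each technique. This is only a sketch: `build_prompt`, `VARIANTS` and the shortened prompt strings are hypothetical names and wording introduced here, not the paper's exact prompts.

```python
# Hypothetical, shortened prompt pieces; substitute your own task wording.
ROLE = "You are Robert, an experienced human resources employee."  # RP + NA
BASE = 'Classify the review as "working remotely", "not working remotely" or "not mentioned".'
COT = "Think step by step and look for keywords that indicate the work location."
EXAMPLES = 'Input: Fully remote team.\nOutput: "working remotely"'


def build_prompt(review, *, cot=False, few_shot=False, role=False):
    """Return (system, user) prompt strings for one technique combination."""
    system = ROLE if role else ""
    parts = [BASE]
    if cot:
        parts.append(COT)
    if few_shot:
        parts.append(EXAMPLES)
    user = "\n".join(parts) + f'\n\nInput: "{review}"\nOutput:'
    return system, user


# Each variant maps to the flags that switch its techniques on.
VARIANTS = {
    "ZS": {},
    "ZS+CoT": {"cot": True},
    "FS+CoT+RP+NA": {"cot": True, "few_shot": True, "role": True},
}

for name, flags in VARIANTS.items():
    system, user = build_prompt("Office-only, no flexibility.", **flags)
    print(f"--- {name} ---\n{system}\n{user}\n")
```

Running every variant over the same labeled sample and comparing weighted F1 gives a quick read on which combination suits a given model.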

Code Example

# Example of the prompt structure used in the paper (Employee Reviews classification)
# FS + CoT + Role-Playing + Naming-the-Assistant combination (few-shot examples are included below)

system_prompt = """You are Robert, an AI expert who is an experienced human resource employee,
with years of experience."""

base_instruction = """Analyze the provided employee review and determine/classify
whether the employee is working from home (i.e. remotely), not remotely,
or the work location is not mentioned.
Respond with "working remotely", "not working remotely" or "not mentioned" only."""

cot_instruction = """Think step by step. Search for keywords (i.e. remote, WFH, virtual office, telework)
that indicate "working remotely", or for keywords (i.e. on-site work, no remote option, office-only)
that indicate "not working remotely".
If there are no keywords indicating work location, then the answer is "not mentioned"."""

few_shot_examples = """
### Example:
Input: Focused on Social Justice, less on business success. Mandatory in the office days with no flexibility.
Output: "not working remotely"

Input: Great company, fully remote team spread across the globe. WFH policy is excellent.
Output: "working remotely"

Input: Nice culture and good benefits. Salary is competitive.
Output: "not mentioned"
"""

review = "<actual review text>"

final_prompt = f"""
### Instruction:
{base_instruction}
{cot_instruction}
{few_shot_examples}

### Input:
"{review}"

### Response:
"""

# OpenAI API example (requires the OPENAI_API_KEY environment variable)
from openai import OpenAI
client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": final_prompt}
    ],
    temperature=0  # set to 0 for reproducibility, as in the paper
)
print(response.choices[0].message.content)
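
With CoT prompting, the model may return reasoning text around the final answer, so the raw response usually needs to be mapped back to one of the three labels before scoring. The following is a heuristic sketch, not the paper's procedure; `normalize_label` and `LABEL_RE` are hypothetical names introduced here.

```python
import re
from typing import Optional

# "not working remotely" is listed first so that the shorter label
# "working remotely", which is a substring of it, is never matched inside it.
LABEL_RE = re.compile(r"not working remotely|working remotely|not mentioned")


def normalize_label(raw: str) -> Optional[str]:
    """Return the last label mentioned in the response, or None if there is none."""
    matches = LABEL_RE.findall(raw.lower())
    return matches[-1] if matches else None


print(normalize_label('"Not working remotely".'))
print(normalize_label("Keywords found: WFH. Answer: working remotely"))
print(normalize_label("I cannot tell."))
```

Taking the last match assumes the model states its final answer at the end; responses that return `None` can be counted as errors or retried.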

Terminology

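weighted F1-score: A metric that combines precision and recall. Per-class F1 scores are averaged with weights proportional to each class's number of samples, so under class imbalance the larger classes count for more and overall performance is summarized in a single number.

Zero-shot (ZS): Giving the model only an instruction, with no examples, and asking it to answer directly. For example, saying only 'Decide whether this article is fake news.'

Few-shot (FS): Providing a few examples together with the instruction. Similar to showing a few sample problems before an exam.

Chain-of-Thought (CoT): A prompting technique that tells the model to 'think step by step', so that it works through a complex problem via intermediate reasoning steps.

Quantization: Compressing model weights from 32 bits down to 4-8 bits. It reduces file size and memory usage and can speed up inference, possibly at the cost of a small loss in accuracy.

RoBERTa: A model that trains BERT for longer and on more data. It is strong on understanding tasks such as text classification, and light and fast compared with LLMs.

Pareto Frontier: The set of optimal models for which neither of the two objectives, performance and speed, can be improved without sacrificing the other. Models on this frontier are the efficient choices.

DPO (Direct Preference Optimization): A training method that tunes a model to produce the answers people prefer. A simpler alternative to RLHF.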
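
The Pareto frontier defined above can be computed directly from (F1, inference time) pairs. The sketch below uses the Employee Reviews figures quoted in the Evidence section; the SVM time of 1.0 s is an upper bound (the source only says "under 1 s"), and `hypothetical-LLM` is a made-up point added to show a dominated model.

```python
def pareto_frontier(models):
    """Return names of models not dominated on (higher F1, lower time)."""
    frontier = []
    for name, f1, secs in models:
        # A model is dominated if another is at least as good on both
        # objectives and strictly better on at least one of them.
        dominated = any(
            (f2 >= f1 and s2 <= secs) and (f2 > f1 or s2 < secs)
            for _, f2, s2 in models
        )
        if not dominated:
            frontier.append(name)
    return frontier


models = [
    ("GPT-4-turbo", 87.6, 2500.0),
    ("RoBERTa", 83.8, 15.0),
    ("SVM", 68.7, 1.0),                 # "under 1 s" in the source; upper bound
    ("hypothetical-LLM", 80.0, 900.0),  # made-up point, not from the paper
]
print(pareto_frontier(models))  # ['GPT-4-turbo', 'RoBERTa', 'SVM']
```

Here the made-up model drops out because RoBERTa is both more accurate and faster, while the other three each represent a different point on the accuracy/speed trade-off.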

Original Abstract

Unlocking the potential of Large Language Models (LLMs) in data classification represents a promising frontier in natural language processing. In this work, we evaluate the performance of different LLMs in comparison with state-of-the-art deep-learning and machine-learning models, in two different classification scenarios: i) the classification of employees' working locations based on job reviews posted online (multiclass classification), and ii) the classification of news articles as fake or not (binary classification). Our analysis encompasses a diverse range of language models differentiating in size, quantization, and architecture. We explore the impact of alternative prompting techniques and evaluate the models based on the weighted F1-score. Also, we examine the trade-off between performance (F1-score) and time (inference response time) for each language model to provide a more nuanced understanding of each model's practical applicability. Our work reveals significant variations in model responses based on the prompting strategies. We find that LLMs, particularly Llama3 and GPT-4, can outperform traditional methods in complex classification tasks, such as multiclass classification, though at the cost of longer inference times. In contrast, simpler ML models offer better performance-to-time trade-offs in simpler binary classification tasks.