BTZSC: A Benchmark for Zero-Shot Text Classification Across Cross-Encoders, Embedding Models, Rerankers and LLMs

Mar 12, 2026 • Ilias Aarab

TL;DR Highlight

Which model should you use to classify text without labeled data? This benchmark directly compares 38 models across 22 datasets.

Who Should Read

ML engineers and backend developers who want to build a text classification pipeline without labeling costs, especially for zero-shot sentiment analysis, intent detection, or topic classification.

Core Mechanics

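Qwen3-Reranker-8B ranks first overall with macro F1 0.72: +12 F1 points and +14 accuracy points over the strongest NLI cross-encoder.
Embedding models offer the best speed-accuracy trade-off: gte-large-en-v1.5 (F1 0.62) beats every NLI cross-encoder while remaining fast.
NLI cross-encoders (the BART-MNLI style of architecture) plateau even as model size grows: the best score is F1 0.60, from deberta-v3-large-nli-triplet.
LLMs become competitive from about 4B parameters: Qwen3-4B reaches F1 0.65 and Mistral-Nemo-12B reaches 0.67, still 5 points below Qwen3-Reranker-8B.
Embedding models gain almost nothing from scaling: Qwen3-Embedding-8B scores F1 0.59 versus 0.58 for the 0.6B variant.
Emotion classification is the hardest task: F1 is roughly 0.25 to 0.5 across all model families, far below topic and sentiment classification.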

Evidence

Qwen3-Reranker-8B: macro F1 0.72, accuracy 0.76. First among all 38 models, +12 F1 points over the best NLI model, deberta-v3-large-nli-triplet (F1 0.60).
gte-large-en-v1.5: macro F1 0.62, above every NLI cross-encoder, with much faster inference than rerankers and LLMs (near the top of the Pareto frontier).
BTZSC vs. MTEB rankings: Kendall τ = 0.69 (p < 10^-8). Model rankings on the two benchmarks agree strongly, which supports the reliability of the results.
Even Qwen3-Reranker-0.6B (F1 0.61) exceeds every NLI cross-encoder on F1, so small rerankers also beat the older approach.
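
The rank-agreement statistic cited above can be illustrated with a small sketch. It computes Kendall's τ in its tau-a form, which assumes no ties, for two score lists over the same models. The scores are made-up placeholders, not values from BTZSC or MTEB, and the paper may use a different variant (for example tau-b).

```python
from itertools import combinations

def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / total pairs. Assumes no ties."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        # Same sign in both lists means the pair is ordered the same way.
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical scores for five models on two benchmarks (illustrative only).
bench_a = [0.72, 0.62, 0.60, 0.65, 0.58]
bench_b = [0.70, 0.66, 0.59, 0.64, 0.61]
tau = kendall_tau_a(bench_a, bench_b)
print(f"tau = {tau:.2f}")
```

A value near 1 means the two benchmarks order the models almost identically; a value near 0 means the orderings are unrelated.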

How to Apply

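For latency-sensitive real-time classification, use gte-large-en-v1.5 or gte-modernbert-base directly as a cosine-similarity zero-shot classifier: verbalize each label as a sentence such as 'The sentiment of this review is {label}' and compute the cosine similarity between the label embeddings and the text embedding.
When accuracy matters most and batch processing is acceptable, use Qwen3-Reranker-8B: pass the input text as the query and the verbalized labels as documents, then classify by the reranking scores.
If your LLM pipeline already uses the Qwen3 family, Qwen3-4B can serve as a zero-shot classifier with a multiple-choice prompt: attach A/B/C options to the labels and select by next-token probability.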
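
The multiple-choice approach in the last item could look roughly like the sketch below. The prompt template, the helper names `build_mc_prompt` and `pick_label`, and the log-probability values are all assumptions made for illustration; they are not the paper's prompt or results. In a real pipeline the log-probabilities would come from the LLM's next-token distribution.

```python
import math
import string

def build_mc_prompt(text, labels):
    """Format a multiple-choice prompt with one letter per label (hypothetical template)."""
    if len(labels) > len(string.ascii_uppercase):
        raise ValueError("this sketch supports at most 26 labels")
    letters = list(string.ascii_uppercase[: len(labels)])
    options = "\n".join(f"{letter}. {label}" for letter, label in zip(letters, labels))
    prompt = (
        f"Text: {text}\n"
        "Which label best describes the text?\n"
        f"{options}\n"
        "Answer:"
    )
    return prompt, letters

def pick_label(letter_logprobs, letters, labels):
    """Choose the label whose option letter has the highest next-token log-probability."""
    best = max(letters, key=lambda letter: letter_logprobs.get(letter, -math.inf))
    return labels[letters.index(best)]

labels = ["positive", "negative", "neutral"]
prompt, letters = build_mc_prompt("The battery died after two days.", labels)

# Placeholder log-probabilities for the token right after "Answer:".
fake_logprobs = {"A": -2.3, "B": -0.2, "C": -1.9}
prediction = pick_label(fake_logprobs, letters, labels)
print(prediction)
```

Restricting the comparison to the option letters avoids parsing free-form generations and yields exactly one label per input.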

Code Example

# Zero-shot sentiment classification with gte-large-en-v1.5
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('Alibaba-NLP/gte-large-en-v1.5', trust_remote_code=True)

text = "This product exceeded all my expectations. Highly recommend!"

# Verbalize each label as a natural-language sentence
label_descriptions = [
    "The overall sentiment within the Amazon product review is positive",
    "The overall sentiment within the Amazon product review is negative"
]

# Compute L2-normalized embeddings for the text and the label sentences
text_emb = model.encode(text, normalize_embeddings=True)
label_embs = model.encode(label_descriptions, normalize_embeddings=True)

# Classify by cosine similarity (a dot product, since the vectors are normalized)
scores = text_emb @ label_embs.T
predicted_label = ["positive", "negative"][np.argmax(scores)]
print(f"Predicted: {predicted_label}, Scores: {scores}")

# Qwen3-Reranker-8B approach (using transformers directly)
# query = text, documents = label_descriptions
# Compute a relevance score from the yes/no token logits, then take the argmax
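
The scoring step described in the reranker comments above can be sketched without loading a model. Assuming the reranker has already produced one pair of "yes"/"no" token logits per (text, label) pair, a two-way softmax gives a relevance score and the argmax gives the prediction. The logits and the function names here are placeholders for illustration; the model's actual prompt template and token handling are not shown.

```python
import math

def yes_probability(yes_logit, no_logit):
    """Two-way softmax over the 'yes' and 'no' token logits; returns P(yes)."""
    m = max(yes_logit, no_logit)  # subtract the max for numerical stability
    e_yes = math.exp(yes_logit - m)
    e_no = math.exp(no_logit - m)
    return e_yes / (e_yes + e_no)

def classify_with_reranker(logits_per_label):
    """logits_per_label maps label -> (yes_logit, no_logit) for the (text, label) pair."""
    scores = {label: yes_probability(y, n) for label, (y, n) in logits_per_label.items()}
    return max(scores, key=scores.get), scores

# Placeholder logits standing in for real reranker outputs.
fake_logits = {"positive": (4.1, -1.3), "negative": (-2.0, 3.5)}
label, scores = classify_with_reranker(fake_logits)
print(label, scores)
```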

Terminology

Zero-Shot Classification: Predicting labels the model never saw during training. When a new category appears, you can use it immediately by changing the label names, with no retraining.

NLI Cross-Encoder: A model trained on natural language inference (NLI) data. It takes the text and a label description as a pair and judges whether the text matches that label.

Reranker: A model that re-orders search results. In RAG it re-scores the top-k documents; here the labels are treated as candidate documents and scored for relevance to the text.

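Macro F1: The unweighted average of per-class F1 scores. Because every class counts equally, minority-class performance is reflected fairly even under class imbalance.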
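
As a worked illustration of this definition, the sketch below computes macro F1 from scratch on a tiny made-up example. The labels and predictions are invented for illustration, and the benchmark's own evaluation code may differ in details such as how absent classes are handled.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in y_true or y_pred."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 if the denominator is 0.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

# A classifier that always predicts the majority class.
y_true = ["pos", "pos", "pos", "pos", "neg"]
y_pred = ["pos", "pos", "pos", "pos", "pos"]
score = macro_f1(y_true, y_pred)
print(f"macro F1 = {score:.2f}")
```

Here accuracy is 0.80, yet macro F1 is only about 0.44 because the minority class has an F1 of 0.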

Embedding Model: A model that maps text to a high-dimensional vector, trained so that semantically similar texts lie close together in the vector space.

Verbalization: Expressing a numeric label as a natural-language sentence, e.g. class 1 → 'The sentiment is positive'. A core technique in zero-shot classification.

Instruction-Tuned LLM: A language model further trained to follow natural-language instructions, such as GPT-style models that carry out what you ask.

MTEB: Massive Text Embedding Benchmark, a standard benchmark that evaluates embedding models across many tasks. Unlike BTZSC, its classification tasks use some labeled data.

Original Abstract

Zero-shot text classification (ZSC) offers the promise of eliminating costly task-specific annotation by matching texts directly to human-readable label descriptions. While early approaches have predominantly relied on cross-encoder models fine-tuned for natural language inference (NLI), recent advances in text-embedding models, rerankers, and instruction-tuned large language models (LLMs) have challenged the dominance of NLI-based architectures. Yet, systematically comparing these diverse approaches remains difficult. Existing evaluations, such as MTEB, often incorporate labeled examples through supervised probes or fine-tuning, leaving genuine zero-shot capabilities underexplored. To address this, we introduce BTZSC, a comprehensive benchmark of 22 public datasets spanning sentiment, topic, intent, and emotion classification, capturing diverse domains, class cardinalities, and document lengths. Leveraging BTZSC, we conduct a systematic comparison across four major model families, NLI cross-encoders, embedding models, rerankers and instruction-tuned LLMs, encompassing 38 public and custom checkpoints. Our results show that: (i) modern rerankers, exemplified by Qwen3-Reranker-8B, set a new state-of-the-art with macro F1 = 0.72; (ii) strong embedding models such as GTE-large-en-v1.5 substantially close the accuracy gap while offering the best trade-off between accuracy and latency; (iii) instruction-tuned LLMs at 4-12B parameters achieve competitive performance (macro F1 up to 0.67), excelling particularly on topic classification but trailing specialized rerankers; (iv) NLI cross-encoders plateau even as backbone size increases; and (v) scaling primarily benefits rerankers and LLMs over embedding models. BTZSC and accompanying evaluation code are publicly released to support fair and reproducible progress in zero-shot text understanding.