Large Language Models For Text Classification: Case Study And Comprehensive Review
TL;DR Highlight
A benchmark study comparing 10 LLMs, including GPT-4 and Llama3, against traditional ML models for text classification, measuring the real performance/speed trade-offs.
Who Should Read
Backend/ML engineers building text classification pipelines, or developers deciding between LLMs and traditional ML models.
Core Mechanics
- For complex 3-class classification, GPT-4-turbo (87.6%) and Llama3 70B (87.1%) outperformed RoBERTa (83.8%) and SVM (68.7%), but inference time was 2500s vs 15s
- For simple binary classification (fake news detection), fine-tuned RoBERTa reached 93.0% vs GPT-4-turbo's 83.7%; among LLMs only Llama3 70B (94.4%) scored higher, and NB/SVM achieved 88-90%, beating most LLMs
- CoT (Chain-of-Thought) consistently gave the most reliable performance boost; combining it with Few-shot (and other techniques) improved F1 by up to 22.2 percentage points over basic Zero-Shot (ZS)
- Role-Playing + Naming-the-Assistant combo had wildly inconsistent effects — helpful for some models, harmful for others
- AWQ quantized Mistral-OO scored 4.5% higher than standard Mistral on Employee Reviews — quantization doesn't always mean performance loss
- Lower-performing models are more sensitive to prompt wording changes (Llama2 had a 42.3% F1 gap between best and worst prompts on the same task)
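All of the comparisons above use the weighted F1-score, i.e. per-class F1 averaged with each class weighted by its support. A minimal pure-Python sketch of the metric (not from the paper; equivalent to scikit-learn's `f1_score(..., average="weighted")`):

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights proportional to class support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for c in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += f1 * support[c] / total
    return score
```

Note that weighted F1 can mask poor performance on rare classes less than macro F1 does, which matters when the "not mentioned" class dominates.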
Evidence
- 3-class classification: GPT-4-turbo 87.6%, Llama3 70B 87.1%, RoBERTa 83.8%, SVM 68.7% (weighted F1-score)
- Binary classification: RoBERTa 93.0%, Llama3 70B 94.4%, NB 90.0%, SVM 88.8%, GPT-4-turbo best 83.7%
- Inference time: GPT-4-turbo ~2500s, RoBERTa 15s, SVM/NB under 1s (1000 Employee Reviews samples)
- Prompt effect: the FS+CoT+RP+NA combination achieved up to a 22.2-percentage-point F1 improvement over basic ZS (Employee Reviews, Xwin model)
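The inference-time figures above can be reproduced with a simple wall-clock harness around any classifier callable, whether that is an API call or a local model. A minimal sketch (not from the paper):

```python
import time

def time_inference(classify, samples):
    """Run classify() over all samples and return (predictions, elapsed seconds)."""
    start = time.perf_counter()
    preds = [classify(s) for s in samples]
    elapsed = time.perf_counter() - start
    return preds, elapsed
```

For API-backed models, remember that the elapsed time includes network latency and rate limiting, not just model compute, so compare runs from the same environment.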
How to Apply
- For binary classification tasks where cost/speed matters (spam, sentiment), fine-tuned RoBERTa or SVM is more practical than LLMs — similar accuracy, hundreds of times faster.
- For complex multi-class tasks where you need an LLM, don't start with basic ZS — test ZS+CoT or FS+CoT+RP+NA combos first.
- When model budget is limited, try AWQ quantized model versions — good fine-tuning data quality can actually improve performance after quantization.
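To illustrate how cheap the traditional baselines are, here is a tiny multinomial Naive Bayes classifier with Laplace smoothing in pure Python, roughly the kind of NB baseline the paper benchmarks (a didactic sketch; in practice use scikit-learn's `MultinomialNB` with TF-IDF features):

```python
import math
from collections import Counter

class NaiveBayes:
    """Multinomial Naive Bayes over whitespace tokens, with add-one smoothing."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.word_counts = {c: Counter() for c in self.classes}
        for doc, y in zip(docs, labels):
            self.word_counts[y].update(doc.lower().split())
        n = len(labels)
        self.prior = {c: math.log(labels.count(c) / n) for c in self.classes}
        self.total = {c: sum(self.word_counts[c].values()) for c in self.classes}
        self.vocab = {w for c in self.classes for w in self.word_counts[c]}
        return self

    def predict(self, doc):
        V = len(self.vocab)
        best, best_lp = None, float("-inf")
        for c in self.classes:
            lp = self.prior[c]
            for w in doc.lower().split():
                # Add-one (Laplace) smoothed log-likelihood per token
                lp += math.log((self.word_counts[c][w] + 1) / (self.total[c] + V))
            if lp > best_lp:
                best, best_lp = c, lp
        return best
```

Training and inference both run in milliseconds on thousands of documents, which is the speed side of the trade-off the paper quantifies.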
Code Example
# Example prompt structure used in the paper (Employee Reviews classification)
# FS + CoT + Role-Playing + Naming-the-Assistant combination
system_prompt = """You are Robert, an AI expert who is an experienced human resource employee,
with years of experience."""
base_instruction = """Analyze the provided employee review and determine/classify
whether the employee is working from home (i.e. remotely), not remotely,
or the work location is not mentioned.
Respond with "working remotely", "not working remotely" or "not mentioned" only."""
cot_instruction = """Think step by step. Search for keywords (i.e. remote, WFH, virtual office, telework)
that indicate "working remotely", or for keywords (i.e. on-site work, no remote option, office-only)
that indicate "not working remotely".
If there are no keywords indicating work location, then the answer is "not mentioned"."""
few_shot_examples = """
### Example:
Input: Focused on Social Justice, less on business success. Mandatory in the office days with no flexibility.
Output: "not working remotely"
Input: Great company, fully remote team spread across the globe. WFH policy is excellent.
Output: "working remotely"
Input: Nice culture and good benefits. Salary is competitive.
Output: "not mentioned"
"""
review = "<actual review text>"
final_prompt = f"""
### Instruction:
{base_instruction}
{cot_instruction}
{few_shot_examples}
### Input:
"{review}"
### Response:
"""
# OpenAI API example
from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": final_prompt},
    ],
    temperature=0,  # set to 0 for reproducibility, as in the paper
)
print(response.choices[0].message.content)
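Even with the instruction "Respond with ... only", LLMs sometimes wrap the label in quotes, punctuation, or extra words, so the response should be normalized before scoring. A hypothetical helper (not from the paper) for mapping the raw response onto the three expected labels:

```python
def parse_label(raw):
    """Map a free-form model response onto one of the three expected labels."""
    text = raw.strip().lower()
    # Check the longer label first: "not working remotely" contains
    # "working remotely" as a substring, so order matters here.
    for label in ("not working remotely", "working remotely", "not mentioned"):
        if label in text:
            return label
    # Assumed fallback for unparseable responses; alternatively, count
    # such responses as classification errors.
    return "not mentioned"
```

How unparseable responses are handled (fallback class vs. error) can shift the reported F1, so state the choice when reproducing results.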
Original Abstract
Unlocking the potential of Large Language Models (LLMs) in data classification represents a promising frontier in natural language processing. In this work, we evaluate the performance of different LLMs in comparison with state-of-the-art deep-learning and machine-learning models, in two different classification scenarios: i) the classification of employees' working locations based on job reviews posted online (multiclass classification), and ii) the classification of news articles as fake or not (binary classification). Our analysis encompasses a diverse range of language models differing in size, quantization, and architecture. We explore the impact of alternative prompting techniques and evaluate the models based on the weighted F1-score. Also, we examine the trade-off between performance (F1-score) and time (inference response time) for each language model to provide a more nuanced understanding of each model's practical applicability. Our work reveals significant variations in model responses based on the prompting strategies. We find that LLMs, particularly Llama3 and GPT-4, can outperform traditional methods in complex classification tasks, such as multiclass classification, though at the cost of longer inference times. In contrast, simpler ML models offer better performance-to-time trade-offs in simpler binary classification tasks.