Claim Automation using Large Language Model
TL;DR Highlight
An 8B model, far smaller than GPT-5.2, achieves 92% accuracy on warranty claim processing (on a high-quality data subset) with just LoRA fine-tuning, beating every commercial LLM tested.
Who Should Read
ML/backend engineers looking to integrate LLMs into operational pipelines in regulated domains such as insurance and finance. Also useful for developers deciding between prompt engineering and fine-tuning for domain-specific text generation tasks.
Core Mechanics
- LoRA fine-tuning DeepSeek-R1-Distill-Llama-8B on 2M automotive warranty claim records outperforms GPT-5.2, GPT-4.1, GPT-4o-mini, Claude Haiku 4.5, and Gemini-2.5-Flash on BERT cosine similarity
- Prompt engineering alone achieved only a 6.5% format-compliance rate for structured outputs; LoRA fine-tuning raised it to 100%, making outputs immediately usable in structured automation pipelines
- Human-evaluation accuracy: non-fine-tuned models plateau at 56-64%, while the fine-tuned model reaches 81.5% (92% on the high-quality data subset), a qualitatively different performance tier
- Tire sidewall damage predictions (2,953 cases): DeepSeek+Prompt incorrectly predicted "repair" in 262 cases versus just 1 for the fine-tuned model, strong evidence that the model internalized the domain's operational rules
- Evaluation metrics comparison: surface-similarity metrics such as BLEU and edit distance correlate with human judgment far worse than BERT cosine similarity and LLM-as-a-Judge (Spearman ρ 0.733 and 0.724)
- A modular design assigns the LLM only an intermediate task (generating claim corrective actions) rather than end-to-end processing, which keeps the pipeline auditable and compliant with regulatory requirements
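The modular-design idea above can be sketched as a deterministic gatekeeper in front of the pipeline: the LLM produces only the corrective-action text, and rule-based code decides whether that output may proceed automatically or must go to a human adjuster. The output format and routing labels below are illustrative assumptions, not the paper's actual schema.

```python
import re

# Hypothetical gatekeeper: the LLM's only job is the intermediate artifact
# (corrective-action text); a deterministic validator controls what enters
# the automated pipeline. The "Corrective action:" format is an assumption.
ACTION_PATTERN = re.compile(r"^Corrective action:\s*(\S.*)$")

def route_llm_output(raw_output: str) -> tuple[str, str]:
    """Return ("automated", action) if the output is well-formed,
    otherwise ("human_review", raw_output) so an adjuster decides."""
    match = ACTION_PATTERN.match(raw_output.strip())
    if match:
        return ("automated", match.group(1).strip())
    return ("human_review", raw_output)

print(route_llm_output("Corrective action: Replaced tire and balanced wheel."))
print(route_llm_output("I think you should maybe repair it?"))
```

Because the validator is plain code, every routing decision is reproducible and auditable, which is exactly what scoping the LLM to an intermediate task buys you in a regulated setting.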
Evidence
- Format compliance: fine-tuned (M4) 100% vs Qwen-Instruct+Prompt (M3) 86.5% vs DeepSeek+Prompt (M2) 6.5%
- Accuracy (Acc Valid): fine-tuned 81.5% vs best non-fine-tuned (DeepSeek+Prompt) 64.4%; on HQ subset 92.0% vs 71.2%
- Average BERT cosine similarity (1,500 cases): fine-tuned 0.869 > Gemini-2.5-Flash 0.799 > GPT-4o-mini 0.787 > Claude Haiku 4.5 0.757 > GPT-4.1 0.749 > GPT-5.2 0.719
- High-quality prediction ratio (κ=0.77): fine-tuned 79.1% vs Gemini-2.5-Flash 68.5% vs GPT-5.2 47.2%
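The metric-vs-human agreement figures above (e.g. Spearman ρ 0.733 for BERT cosine similarity) come from rank correlation between automated scores and human ratings. A minimal sketch of that computation, using the tie-free Spearman formula and made-up scores (not the paper's data):

```python
# Spearman rank correlation between an automated metric and human ratings,
# for samples without ties: rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)).
def spearman_rho(xs, ys):
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

metric_scores = [0.91, 0.62, 0.77, 0.40, 0.85]  # e.g. BERT cosine similarity
human_scores = [5, 2, 3, 1, 4]                  # e.g. 1-5 adjuster ratings
print(spearman_rho(metric_scores, human_scores))  # perfectly rank-aligned -> 1.0
```

In practice `scipy.stats.spearmanr` handles ties for you; the point is that a ρ of 0.73 means the metric orders outputs much the way human judges do, which BLEU and edit distance fail to match here.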
How to Apply
- In regulated domains, consider LoRA fine-tuning a local open-source model (Llama/DeepSeek family) with Hugging Face PEFT + TRL's SFTTrainer instead of calling external LLM APIs. This paper's defaults: r=32, α=32, lr=6e-5, 1 epoch, AdamW, FP16
- Narrowly scoping the LLM's role to generating structured intermediate output units (e.g., corrective-action text), rather than handling the full pipeline, makes validation and auditing easier; in this study it also yielded 100% format compliance
- When evaluating LLM output quality, default to BERT cosine similarity (all-mpnet-base-v2) or LLM-as-a-Judge rather than BLEU; both correlate far more strongly with human judgment
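The BERT-cosine-similarity scoring in the last point reduces to embedding the predicted and ground-truth corrective actions and taking the cosine of the two vectors. In practice the embeddings would come from `SentenceTransformer("all-mpnet-base-v2").encode(...)`; the toy 4-d vectors below stand in for those 768-d embeddings so the arithmetic itself is visible.

```python
import math

# Cosine similarity between two embedding vectors; with all-mpnet-base-v2
# embeddings of the predicted and gold corrective actions, this is the
# "BERT cosine similarity" score used throughout the evaluation.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

pred_vec = [0.1, 0.8, 0.3, 0.4]   # toy embedding of the predicted action
gold_vec = [0.1, 0.7, 0.35, 0.4]  # toy embedding of the ground-truth action
print(round(cosine_similarity(pred_vec, gold_vec), 3))  # -> 0.995
```

Scores near 1.0 indicate near-identical meaning even when the wording differs, which is why this metric tracks human judgment better than surface-overlap metrics like BLEU.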
Code Example
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer, SFTConfig
# 1. Load base model
model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="float16")
# 2. LoRA configuration (paper defaults)
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.0,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
# 3. Training data format: instruction (Complaint + Cause) + response (Correction)
def format_claim(example):
    return {
        "text": f"""You are given a warranty claim description.
Your task: Output ONLY the corrective action.
Claim description: {example['complaint']} {example['cause']}
Corrective action: {example['correction']}"""
    }
# 4. Train with SFTTrainer (note: by default the loss covers the full
# sequence; add a completion-only collator to restrict it to response tokens)
trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset.map(format_claim),
    args=SFTConfig(
        per_device_train_batch_size=8,
        gradient_accumulation_steps=4,
        learning_rate=6e-5,
        num_train_epochs=1,
        fp16=True,
        max_seq_length=2048,
    ),
)
trainer.train()
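At inference time the prompt must mirror the fine-tuning template up to, but not including, the response, or the adapter's learned format will not trigger. A small helper keeps the two in sync; this is a sketch, and the trailing cue should match whatever label your training template places before the correction text.

```python
# Build the inference prompt to match the training template minus the response.
def build_inference_prompt(complaint: str, cause: str) -> str:
    return (
        "You are given a warranty claim description.\n"
        "Your task: Output ONLY the corrective action.\n"
        f"Claim description: {complaint} {cause}\n"
    )

prompt = build_inference_prompt(
    "Customer reports vibration at highway speed.",
    "Tire sidewall damage found on inspection.",
)
# The prompt string would then be tokenized and passed to model.generate(...)
print(prompt)
```

Keeping prompt construction in one function shared by the training formatter and the inference path avoids the train/serve template drift that silently degrades format compliance.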
Original Abstract
While Large Language Models (LLMs) have achieved strong performance on general-purpose language tasks, their deployment in regulated and data-sensitive domains, including insurance, remains limited. Leveraging millions of historical warranty claims, we propose a locally deployed governance-aware language modeling component that generates structured corrective-action recommendations from unstructured claim narratives. We fine-tune pretrained LLMs using Low-Rank Adaptation (LoRA), scoping the model to an initial decision module within the claim processing pipeline to speed up claim adjusters' decisions. We assess this module using a multi-dimensional evaluation framework that combines automated semantic similarity metrics with human evaluation, enabling a rigorous examination of both practical utility and predictive accuracy. Our results show that domain-specific fine-tuning substantially outperforms commercial general-purpose and prompt-based LLMs, with approximately 80% of the evaluated cases achieving near-identical matches to ground-truth corrective actions. Overall, this study provides both theoretical and empirical evidence to prove that domain-adaptive fine-tuning can align model output distributions more closely with real-world operational data, demonstrating its promise as a reliable and governable building block for insurance applications.