Claim Automation using Large Language Model
TL;DR Highlight
An 8B model, far smaller than GPT-5.2, achieves 92% accuracy on warranty claim processing (on a high-quality data subset) with just LoRA fine-tuning, beating every commercial LLM tested.
Who Should Read
ML/backend engineers looking to integrate LLMs into operational pipelines in regulated domains such as insurance and finance. Also useful for developers deciding between prompt engineering and fine-tuning for domain-specific text generation tasks.
Core Mechanics
- LoRA fine-tuning DeepSeek-R1-Distill-Llama-8B on 2M automotive warranty claim records outperforms GPT-5.2, GPT-4.1, GPT-4o-mini, Claude Haiku 4.5, and Gemini-2.5-Flash on BERT cosine similarity
- Prompt engineering alone achieved only a 6.5% format-compliance rate for structured outputs; LoRA fine-tuning raised it to 100%, making outputs immediately usable in structured automation pipelines
- Human-evaluation accuracy: non-fine-tuned models plateau at 56-64%, while the fine-tuned model reaches 81.5% (92% on the high-quality data subset), a qualitatively different performance tier
- Tire sidewall damage predictions (2,953 cases): DeepSeek+Prompt incorrectly predicted "repair" in 262 cases versus just 1 for the fine-tuned model, strong evidence that the model internalized the domain's operational rules
- Evaluation metrics comparison: surface-similarity metrics such as BLEU and edit distance correlate with human judgment far worse than BERT cosine similarity and LLM-as-a-Judge (Spearman ρ 0.733 and 0.724)
- A modular design assigns the LLM only an intermediate task (generating claim corrective actions) rather than end-to-end processing, which keeps the pipeline auditable and compliant with regulatory requirements
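The modular-design idea above can be sketched as a deterministic gatekeeper in front of the pipeline: the LLM produces only the corrective-action text, and rule-based code decides whether that output may proceed automatically or must go to a human adjuster. The output format and routing labels below are illustrative assumptions, not the paper's actual schema.

```python
import re

# Hypothetical gatekeeper: the LLM's only job is the intermediate artifact
# (corrective-action text); a deterministic validator controls what enters
# the automated pipeline. The "Corrective action:" format is an assumption.
ACTION_PATTERN = re.compile(r"^Corrective action:\s*(\S.*)$")

def route_llm_output(raw_output: str) -> tuple[str, str]:
    """Return ("automated", action) if the output is well-formed,
    otherwise ("human_review", raw_output) so an adjuster decides."""
    match = ACTION_PATTERN.match(raw_output.strip())
    if match:
        return ("automated", match.group(1).strip())
    return ("human_review", raw_output)

print(route_llm_output("Corrective action: Replaced tire and balanced wheel."))
print(route_llm_output("I think you should maybe repair it?"))
```

Because the validator is plain code, every routing decision is reproducible and auditable, which is exactly what scoping the LLM to an intermediate task buys you in a regulated setting.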
Evidence
- Format compliance: fine-tuned (M4) 100% vs Qwen-Instruct+Prompt (M3) 86.5% vs DeepSeek+Prompt (M2) 6.5%
- Accuracy (Acc Valid): fine-tuned 81.5% vs best non-fine-tuned (DeepSeek+Prompt) 64.4%; on HQ subset 92.0% vs 71.2%
- Average BERT cosine similarity (1,500 cases): fine-tuned 0.869 > Gemini-2.5-Flash 0.799 > GPT-4o-mini 0.787 > Claude Haiku 4.5 0.757 > GPT-4.1 0.749 > GPT-5.2 0.719
- High-quality prediction ratio (κ=0.77): fine-tuned 79.1% vs Gemini-2.5-Flash 68.5% vs GPT-5.2 47.2%
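The metric-vs-human agreement figures above (e.g. Spearman ρ 0.733 for BERT cosine similarity) come from rank correlation between automated scores and human ratings. A minimal sketch of that computation, using the tie-free Spearman formula and made-up scores (not the paper's data):

```python
# Spearman rank correlation between an automated metric and human ratings,
# for samples without ties: rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)).
def spearman_rho(xs, ys):
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

metric_scores = [0.91, 0.62, 0.77, 0.40, 0.85]  # e.g. BERT cosine similarity
human_scores = [5, 2, 3, 1, 4]                  # e.g. 1-5 adjuster ratings
print(spearman_rho(metric_scores, human_scores))  # perfectly rank-aligned -> 1.0
```

In practice `scipy.stats.spearmanr` handles ties for you; the point is that a ρ of 0.73 means the metric orders outputs much the way human judges do, which BLEU and edit distance fail to match here.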
How to Apply
- In regulated domains, consider LoRA fine-tuning a local open-source model (Llama/DeepSeek family) with Hugging Face PEFT + TRL's SFTTrainer instead of calling external LLM APIs. This paper's defaults: r=32, α=32, lr=6e-5, 1 epoch, AdamW, FP16
- Narrowly scoping the LLM's role to generating structured intermediate output units (e.g., corrective-action text), rather than handling the full pipeline, makes validation and auditing easier; in this study it also yielded 100% format compliance
- When evaluating LLM output quality, default to BERT cosine similarity (all-mpnet-base-v2) or LLM-as-a-Judge rather than BLEU; both correlate far more strongly with human judgment
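The BERT-cosine-similarity scoring in the last point reduces to embedding the predicted and ground-truth corrective actions and taking the cosine of the two vectors. In practice the embeddings would come from `SentenceTransformer("all-mpnet-base-v2").encode(...)`; the toy 4-d vectors below stand in for those 768-d embeddings so the arithmetic itself is visible.

```python
import math

# Cosine similarity between two embedding vectors; with all-mpnet-base-v2
# embeddings of the predicted and gold corrective actions, this is the
# "BERT cosine similarity" score used throughout the evaluation.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

pred_vec = [0.1, 0.8, 0.3, 0.4]   # toy embedding of the predicted action
gold_vec = [0.1, 0.7, 0.35, 0.4]  # toy embedding of the ground-truth action
print(round(cosine_similarity(pred_vec, gold_vec), 3))  # -> 0.995
```

Scores near 1.0 indicate near-identical meaning even when the wording differs, which is why this metric tracks human judgment better than surface-overlap metrics like BLEU.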
Code Example
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer, SFTConfig
# 1. Load base model
model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="float16")
# 2. LoRA configuration (paper defaults)
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.0,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
# 3. Training data format: instruction (Complaint + Cause) + response (Correction)
def format_claim(example):
    return {
        "text": f"""You are given a warranty claim description.
Your task: Output ONLY the corrective action.
Claim description: {example['complaint']} {example['cause']}
Corrective action: {example['correction']}"""
    }
# 4. Train with SFTTrainer (note: by default the loss covers the full
# sequence; add a completion-only collator to restrict it to response tokens)
trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset.map(format_claim),
    args=SFTConfig(
        per_device_train_batch_size=8,
        gradient_accumulation_steps=4,
        learning_rate=6e-5,
        num_train_epochs=1,
        fp16=True,
        max_seq_length=2048,
    ),
)
trainer.train()
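At inference time the prompt must mirror the fine-tuning template up to, but not including, the response, or the adapter's learned format will not trigger. A small helper keeps the two in sync; this is a sketch, and the trailing cue should match whatever label your training template places before the correction text.

```python
# Build the inference prompt to match the training template minus the response.
def build_inference_prompt(complaint: str, cause: str) -> str:
    return (
        "You are given a warranty claim description.\n"
        "Your task: Output ONLY the corrective action.\n"
        f"Claim description: {complaint} {cause}\n"
    )

prompt = build_inference_prompt(
    "Customer reports vibration at highway speed.",
    "Tire sidewall damage found on inspection.",
)
# The prompt string would then be tokenized and passed to model.generate(...)
print(prompt)
```

Keeping prompt construction in one function shared by the training formatter and the inference path avoids the train/serve template drift that silently degrades format compliance.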
Original Abstract
While Large Language Models (LLMs) have achieved strong performance on general-purpose language tasks, their deployment in regulated and data-sensitive domains, including insurance, remains limited. Leveraging millions of historical warranty claims, we propose a locally deployed governance-aware language modeling component that generates structured corrective-action recommendations from unstructured claim narratives. We fine-tune pretrained LLMs using Low-Rank Adaptation (LoRA), scoping the model to an initial decision module within the claim processing pipeline to speed up claim adjusters' decisions. We assess this module using a multi-dimensional evaluation framework that combines automated semantic similarity metrics with human evaluation, enabling a rigorous examination of both practical utility and predictive accuracy. Our results show that domain-specific fine-tuning substantially outperforms commercial general-purpose and prompt-based LLMs, with approximately 80% of the evaluated cases achieving near-identical matches to ground-truth corrective actions. Overall, this study provides both theoretical and empirical evidence to prove that domain-adaptive fine-tuning can align model output distributions more closely with real-world operational data, demonstrating its promise as a reliable and governable building block for insurance applications.