Strategies for Span Labeling with Large Language Models
TL;DR Highlight
An empirical study of which LLM prompting format to use for text span labeling tasks such as NER and grammatical error detection (XML tagging, index prediction, or JSON matching), plus LOGITMATCH, a constrained decoding method that eliminates span-matching errors at the source
Who Should Read
Backend/ML engineers building NER, error detection, and information extraction pipelines with LLMs. Directly applicable for developers choosing prompt formats when matching specific text spans from LLM output.
Core Mechanics
- Classifies LLM span-labeling strategies into three families: wrapping spans in XML tags (Tagging), outputting character positions (Indexing), and matching span text content in JSON (Matching)
- LLMs do not reliably copy text verbatim: silent typo corrections and case changes make Matching output mismatch the original text. Tagging suffers the same issue, but heuristic post-processing can partially recover from it
- LLMs cannot reliably compute character indices: Indexing often produces wrong positions that ignore word boundaries. Inserting position markers into the input text (INDEX-ENRICHED) helps, but can hurt the model's task performance
- LOGITMATCH: constrains the decoding vocabulary to tokens from the input while generating the JSON `text` field. Implementable without fine-tuning as a vLLM LogitsProcessor
- XML tagging is the most stable method overall, and is consistently superior for GEC (grammatical error correction); its downside is using more output tokens than Matching
- Enforcing structured output is not always beneficial: a forced format can suppress the model's spontaneous chain-of-thought and actually degrade performance
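The core LOGITMATCH mechanic above can be sketched as a logits-masking step: while decoding the JSON `text` field, every vocabulary token that does not occur in the tokenized input gets its logit set to negative infinity. Below is a minimal pure-Python illustration of that masking logic; the function and variable names are mine, and a real implementation would plug into vLLM's LogitsProcessor interface (receiving a torch tensor) and additionally track which contiguous input span is being copied, not just the input vocabulary.

```python
import math

def mask_logits_to_input(logits, input_token_ids):
    """Keep logits only for tokens that occur in the input; -inf elsewhere.

    logits: list of floats indexed by vocabulary id (an illustrative
    stand-in for the tensor a vLLM LogitsProcessor would receive).
    input_token_ids: token ids of the original input text.
    """
    allowed = set(input_token_ids)
    return [
        logit if token_id in allowed else -math.inf
        for token_id, logit in enumerate(logits)
    ]

# Toy vocabulary of 6 tokens; the input contains only tokens 1, 3, and 4.
logits = [0.5, 2.0, 1.0, 0.1, -0.3, 0.9]
masked = mask_logits_to_input(logits, input_token_ids=[1, 3, 4, 3])

# Greedy decoding over the masked logits can now only pick input tokens.
best = max(range(len(masked)), key=lambda i: masked[i])
```

Note that restricting the vocabulary alone is weaker than full span matching: it prevents out-of-input tokens, but a complete implementation must also constrain continuations to valid contiguous input spans.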
Evidence
- Direct index prediction (INDEX) performs worst across all tasks on open LLMs, with F1 below 24% (Qwen3-8B NER: 17.8%; Llama-3.3-70B NER: 23.3%)
- Index-enriched input improves NER performance by 21-45 percentage points (Qwen3-8B: 17.8→39.6; Llama-3.3-70B: 23.3→59.3)
- On the CPL task (spans with duplicate occurrences), adding an occurrence index improves matching by 30-40 percentage points (Qwen3-8B: MATCH 30.6 → MATCH-OCC 73.4)
- Enabling reasoning (Think mode) on Qwen3-8B dramatically improves LOGITMATCH: NER hard F1 71.4→84.2, GEC 15.8→35.8
How to Apply
- For basic span labeling, start with XML tagging: it is the most stable strategy, especially for tasks like GEC where precise span boundaries matter. Instruct the model explicitly to 'copy the entire input and wrap the relevant spans with tags'
- If you use JSON Matching on text where words appear multiple times (log parsing, repetitive patterns), add an occurrence field; this yielded a 30-40 percentage-point improvement on CPL tasks
- If running local LLMs via vLLM, consider the LOGITMATCH LogitsProcessor: it eliminates Matching alignment failures at the source, even for non-standard tokenized text (e.g. NLP-preprocessed input)
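The occurrence-field advice above implies a post-processing step: mapping each JSON match back to character offsets in the original text, using the occurrence index to pick the right duplicate. A minimal sketch (the helper name is mine, not from the paper):

```python
def align_span(text, span, occurrence=1):
    """Return (start, end) character offsets of the `occurrence`-th
    (1-based) exact occurrence of `span` in `text`, or None if absent."""
    start = -1
    for _ in range(occurrence):
        # Resume searching one character past the previous hit.
        start = text.find(span, start + 1)
        if start == -1:
            return None
    return (start, start + len(span))

sentence = "The Paris agreement was signed in Paris."
first = align_span(sentence, "Paris", occurrence=1)
second = align_span(sentence, "Paris", occurrence=2)
```

Without the occurrence field, a plain `text.find(span)` always resolves a repeated span to its first occurrence, which is exactly the ambiguity the CPL results quantify.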
Code Example
# Three strategy prompt examples
# 1. XML Tagging — most stable
tagging_prompt = """
Extract named entities (PER, ORG, LOC) from the text.
Surround spans with XML tags. Copy the ENTIRE input text including non-tagged parts.
Example:
Input: Turing was born in London.
Output: <entity type="PER">Turing</entity> was born in <entity type="LOC">London</entity>.
Input: {input_text}
Output:"""
# 2. JSON Matching: token-efficient. If filling via str.format, escape the JSON braces below as {{ }}.
matching_prompt = """
Extract named entities (PER, ORG, LOC) from the text.
Return a valid JSON array only. Use exact text from input.
Example:
Input: Turing was born in London.
Output: [{"text": "Turing", "label": "PER"}, {"text": "London", "label": "LOC"}]
Input: {input_text}
Output:"""
# 3. JSON Matching + occurrence — handles duplicate spans
matching_occ_prompt = """
Extract named entities. Include occurrence index to disambiguate repeated spans.
Example:
Input: The Paris agreement was signed in Paris.
Output: [{"text": "Paris", "label": "ORG", "occurrence": 1},
{"text": "Paris", "label": "LOC", "occurrence": 2}]
Input: {input_text}
Output:"""
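Tagging output needs the inverse operation: stripping the XML tags and recovering character offsets in the untagged text. A minimal sketch for the `<entity>` convention used in the tagging prompt above; it assumes well-formed, non-nested tags and an otherwise verbatim copy of the input, which (per the findings above) is exactly what heuristic post-processing sometimes has to repair.

```python
import re

TAG_RE = re.compile(r'<entity type="(?P<label>[A-Z]+)">(?P<span>.*?)</entity>')

def parse_tagged(output):
    """Strip <entity> tags; return (plain_text, [(start, end, label), ...])."""
    spans, plain, cursor = [], [], 0
    for m in TAG_RE.finditer(output):
        plain.append(output[cursor:m.start()])      # untagged text before the span
        start = sum(len(p) for p in plain)          # offset in the de-tagged text
        plain.append(m.group("span"))
        spans.append((start, start + len(m.group("span")), m.group("label")))
        cursor = m.end()
    plain.append(output[cursor:])                   # trailing untagged text
    return "".join(plain), spans

text, spans = parse_tagged(
    '<entity type="PER">Turing</entity> was born in '
    '<entity type="LOC">London</entity>.'
)
```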
# LOGITMATCH: implemented as a vLLM LogitsProcessor (local LLMs only)
# Reference implementation: https://github.com/semindan/span_labeling
Terminology
Related Resources
Original Abstract
Large language models (LLMs) are increasingly used for text analysis tasks, such as named entity recognition or error detection. Unlike encoder-based models, however, generative architectures lack an explicit mechanism to refer to specific parts of their input. This leads to a variety of ad-hoc prompting strategies for span labeling, often with inconsistent results. In this paper, we categorize these strategies into three families: tagging the input text, indexing numerical positions of spans, and matching span content. To address the limitations of content matching, we introduce LogitMatch, a new constrained decoding method that forces the model's output to align with valid input spans. We evaluate all methods across four diverse tasks. We find that while tagging remains a robust baseline, LogitMatch improves upon competitive matching-based methods by eliminating span matching issues and outperforms other strategies in some setups.