Strategies for Span Labeling with Large Language Models
TL;DR Highlight
An empirical study of which LLM prompting format to use for text span labeling tasks such as NER and grammatical error detection (XML tagging, index prediction, or JSON matching), plus LOGITMATCH, a constrained decoding method that eliminates span-matching errors at the source
Who Should Read
Backend/ML engineers building NER, error detection, and information extraction pipelines with LLMs. Directly applicable for developers choosing prompt formats when matching specific text spans from LLM output.
Core Mechanics
- Classifies LLM span-labeling strategies into three families: wrapping spans in XML tags (Tagging), outputting character positions (Indexing), and matching span text content in JSON (Matching)
- LLMs do not reliably copy text verbatim: silent typo corrections and case changes make Matching output mismatch the original text. Tagging suffers the same issue, but heuristic post-processing can partially recover from it
- LLMs cannot reliably compute character indices: Indexing often produces wrong positions that ignore word boundaries. Inserting position markers into the input text (INDEX-ENRICHED) helps, but can hurt the model's task performance
- LOGITMATCH: constrains the decoding vocabulary to tokens from the input while generating the JSON `text` field. Implementable without fine-tuning as a vLLM LogitsProcessor
- XML tagging is the most stable method overall, and is consistently superior for GEC (grammatical error correction); its downside is using more output tokens than Matching
- Enforcing structured output is not always beneficial: a forced format can suppress the model's spontaneous chain-of-thought and actually degrade performance
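The core LOGITMATCH mechanic above can be sketched as a logits-masking step: while decoding the JSON `text` field, every vocabulary token that does not occur in the tokenized input gets its logit set to negative infinity. Below is a minimal pure-Python illustration of that masking logic; the function and variable names are mine, and a real implementation would plug into vLLM's LogitsProcessor interface (receiving a torch tensor) and additionally track which contiguous input span is being copied, not just the input vocabulary.

```python
import math

def mask_logits_to_input(logits, input_token_ids):
    """Keep logits only for tokens that occur in the input; -inf elsewhere.

    logits: list of floats indexed by vocabulary id (an illustrative
    stand-in for the tensor a vLLM LogitsProcessor would receive).
    input_token_ids: token ids of the original input text.
    """
    allowed = set(input_token_ids)
    return [
        logit if token_id in allowed else -math.inf
        for token_id, logit in enumerate(logits)
    ]

# Toy vocabulary of 6 tokens; the input contains only tokens 1, 3, and 4.
logits = [0.5, 2.0, 1.0, 0.1, -0.3, 0.9]
masked = mask_logits_to_input(logits, input_token_ids=[1, 3, 4, 3])

# Greedy decoding over the masked logits can now only pick input tokens.
best = max(range(len(masked)), key=lambda i: masked[i])
```

Note that restricting the vocabulary alone is weaker than full span matching: it prevents out-of-input tokens, but a complete implementation must also constrain continuations to valid contiguous input spans.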
Evidence
- Direct index prediction (INDEX) performs worst across all tasks on open LLMs, with F1 below 24% (Qwen3-8B NER: 17.8%; Llama-3.3-70B NER: 23.3%)
- Index-enriched input improves NER performance by 21-45 percentage points (Qwen3-8B: 17.8→39.6; Llama-3.3-70B: 23.3→59.3)
- On the CPL task (spans with duplicate occurrences), adding an occurrence index improves matching by 30-40 percentage points (Qwen3-8B: MATCH 30.6 → MATCH-OCC 73.4)
- Enabling reasoning (Think mode) on Qwen3-8B dramatically improves LOGITMATCH: NER hard F1 71.4→84.2, GEC 15.8→35.8
How to Apply
- For basic span labeling, start with XML tagging: it is the most stable strategy, especially for tasks like GEC where precise span boundaries matter. Instruct the model explicitly to 'copy the entire input and wrap the relevant spans with tags'
- If you use JSON Matching on text where words appear multiple times (log parsing, repetitive patterns), add an occurrence field; this yielded a 30-40 percentage-point improvement on CPL tasks
- If running local LLMs via vLLM, consider the LOGITMATCH LogitsProcessor: it eliminates Matching alignment failures at the source, even for non-standard tokenized text (e.g. NLP-preprocessed input)
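The occurrence-field advice above implies a post-processing step: mapping each JSON match back to character offsets in the original text, using the occurrence index to pick the right duplicate. A minimal sketch (the helper name is mine, not from the paper):

```python
def align_span(text, span, occurrence=1):
    """Return (start, end) character offsets of the `occurrence`-th
    (1-based) exact occurrence of `span` in `text`, or None if absent."""
    start = -1
    for _ in range(occurrence):
        # Resume searching one character past the previous hit.
        start = text.find(span, start + 1)
        if start == -1:
            return None
    return (start, start + len(span))

sentence = "The Paris agreement was signed in Paris."
first = align_span(sentence, "Paris", occurrence=1)
second = align_span(sentence, "Paris", occurrence=2)
```

Without the occurrence field, a plain `text.find(span)` always resolves a repeated span to its first occurrence, which is exactly the ambiguity the CPL results quantify.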
Code Example
# Three strategy prompt examples
# 1. XML Tagging — most stable
tagging_prompt = """
Extract named entities (PER, ORG, LOC) from the text.
Surround spans with XML tags. Copy the ENTIRE input text including non-tagged parts.
Example:
Input: Turing was born in London.
Output: <entity type="PER">Turing</entity> was born in <entity type="LOC">London</entity>.
Input: {input_text}
Output:"""
# 2. JSON Matching: token-efficient. If filling via str.format, escape the JSON braces below as {{ }}.
matching_prompt = """
Extract named entities (PER, ORG, LOC) from the text.
Return a valid JSON array only. Use exact text from input.
Example:
Input: Turing was born in London.
Output: [{"text": "Turing", "label": "PER"}, {"text": "London", "label": "LOC"}]
Input: {input_text}
Output:"""
# 3. JSON Matching + occurrence — handles duplicate spans
matching_occ_prompt = """
Extract named entities. Include occurrence index to disambiguate repeated spans.
Example:
Input: The Paris agreement was signed in Paris.
Output: [{"text": "Paris", "label": "ORG", "occurrence": 1},
{"text": "Paris", "label": "LOC", "occurrence": 2}]
Input: {input_text}
Output:"""
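Tagging output needs the inverse operation: stripping the XML tags and recovering character offsets in the untagged text. A minimal sketch for the `<entity>` convention used in the tagging prompt above; it assumes well-formed, non-nested tags and an otherwise verbatim copy of the input, which (per the findings above) is exactly what heuristic post-processing sometimes has to repair.

```python
import re

TAG_RE = re.compile(r'<entity type="(?P<label>[A-Z]+)">(?P<span>.*?)</entity>')

def parse_tagged(output):
    """Strip <entity> tags; return (plain_text, [(start, end, label), ...])."""
    spans, plain, cursor = [], [], 0
    for m in TAG_RE.finditer(output):
        plain.append(output[cursor:m.start()])      # untagged text before the span
        start = sum(len(p) for p in plain)          # offset in the de-tagged text
        plain.append(m.group("span"))
        spans.append((start, start + len(m.group("span")), m.group("label")))
        cursor = m.end()
    plain.append(output[cursor:])                   # trailing untagged text
    return "".join(plain), spans

text, spans = parse_tagged(
    '<entity type="PER">Turing</entity> was born in '
    '<entity type="LOC">London</entity>.'
)
```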
# LOGITMATCH: implemented as a vLLM LogitsProcessor (local LLMs only)
# Reference implementation: https://github.com/semindan/span_labeling
Terminology
Related Resources
Original Abstract
Large language models (LLMs) are increasingly used for text analysis tasks, such as named entity recognition or error detection. Unlike encoder-based models, however, generative architectures lack an explicit mechanism to refer to specific parts of their input. This leads to a variety of ad-hoc prompting strategies for span labeling, often with inconsistent results. In this paper, we categorize these strategies into three families: tagging the input text, indexing numerical positions of spans, and matching span content. To address the limitations of content matching, we introduce LogitMatch, a new constrained decoding method that forces the model's output to align with valid input spans. We evaluate all methods across four diverse tasks. We find that while tagging remains a robust baseline, LogitMatch improves upon competitive matching-based methods by eliminating span matching issues and outperforms other strategies in some setups.