Cross-Context Review: Improving LLM Output Quality by Separating Production and Review Sessions
TL;DR Highlight
LLMs are markedly worse at catching errors in their own outputs when the review happens in the session that produced them. Moving the review to a fresh session with no shared context raises error-detection F1 from 24.6% to 28.6%.
Who Should Read
Engineers building LLM-based review or quality-checking pipelines, and researchers studying self-correction and self-evaluation in language models.
Core Mechanics
- LLMs reviewing their own outputs in the same conversation context miss errors they just made: contextual bias leads them to endorse, rather than catch, their own mistakes
- Starting a fresh session (no shared context with the generation session) significantly improves error detection rate
- Fresh-session review achieved 28.6% F1 on catching generation errors, vs 24.6% for same-session self-review (p=0.008, d=0.52)
- The effect holds across multiple model families and task types
- This has direct implications for how LLM quality-check pipelines should be architected
- Simple prompt-based self-correction in the same session is largely ineffective
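The core architectural difference is small enough to sketch in a few lines. This is an illustrative example (not code from the paper); the message dictionaries follow the OpenAI chat format, and `build_review_messages` is a hypothetical helper:

```python
# Illustrative sketch: the only difference between same-session and
# cross-context review is whether the reviewer sees the production history.
def build_review_messages(artifact: str, production_history: list[dict],
                          cross_context: bool) -> list[dict]:
    """Build the message list for a review call."""
    review_turn = {"role": "user",
                   "content": f"Review this output for errors:\n{artifact}"}
    if cross_context:
        # Fresh session: the reviewer sees only the artifact itself.
        return [review_turn]
    # Same session: the reviewer inherits the context that produced the
    # artifact, which biases it toward endorsing its own mistakes.
    return production_history + [review_turn]
```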
Evidence
- Cross-Context Review (CCR) reached 28.6% F1, beating same-session Self-Review (24.6%, p=0.008, d=0.52), repeated Self-Review (21.7%, p<0.001, d=0.72), and context-aware Subagent Review (23.8%, p=0.004, d=0.57) across 360 reviews
- Reviewing twice in the same session did not beat reviewing once (p=0.11), ruling out repetition as the explanation; the gain comes from context separation itself
- Results consistent across the multiple LLM families tested
- Task-level analysis shows the gap is widest for factual and reasoning errors
How to Apply
- When building LLM review pipelines, always use a separate API call / fresh session context for the reviewer, never the same conversation thread
- Consider using a different model instance or system prompt for the reviewer role to maximize independence from the generator
- Same-session 'check your work' prompting is unreliable — architect review as a distinct, stateless step
Code Example
# CCR Review Prompt Template (based on paper Appendix A)
CCR_REVIEW_PROMPT = """
Review the following {artifact_type} from a fresh perspective:
1. Factual accuracy: Are numbers, names, dates, and technical claims correct?
2. Internal consistency: Are there contradictions or terminology mismatches?
3. Contextual fitness: Would this work correctly in its intended environment?
4. Audience perspective: Could the target reader misinterpret any part?
5. Completeness: Is anything important missing?
For each issue found, provide:
- Location (line number or section)
- Description of the error
- Type (FACT/CONS/CTXT/RCVR/MISS)
- Severity (Critical/Major/Minor)
- Suggested fix
--- ARTIFACT START ---
{artifact_content}
--- ARTIFACT END ---
"""
# CCR Application Example (OpenAI API)
from openai import OpenAI

def cross_context_review(artifact_content: str, artifact_type: str = "code") -> str:
    """
    Run the review in a fresh context. The API itself is stateless:
    what creates the fresh session is that the messages list contains
    only the review prompt, never the production conversation history.
    """
    client = OpenAI()
    prompt = CCR_REVIEW_PROMPT.format(
        artifact_type=artifact_type,
        artifact_content=artifact_content,
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            # Important: do NOT include the production session's message history!
            {"role": "user", "content": prompt}
        ],
    )
    return response.choices[0].message.content
# Usage
generated_code = """
def get_business_days(start, end):
    count = 0
    for d in range((end - start).days):
        if (start + timedelta(d)).weekday() >= 6:  # Bug: should be >= 5
            continue
        count += 1
    return count
"""

review_result = cross_context_review(generated_code, artifact_type="Python function")
print(review_result)
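Because the prompt requests a structured issue list, a pipeline can gate on the reported severities, for example by blocking release while Critical or Major issues remain. A minimal sketch (my addition, not from the paper; `has_blocking_issues` is a hypothetical helper that assumes the reviewer follows the "Severity (Critical/Major/Minor)" line format from the template):

```python
import re

def has_blocking_issues(review_text: str) -> bool:
    """Return True if the review reports any Critical or Major issue.

    Scans for severity annotations like "Severity: Critical" or
    "Severity (Major)" as produced by the CCR prompt template.
    """
    severities = re.findall(
        r"Severity\s*[:(]?\s*(Critical|Major|Minor)",
        review_text,
        flags=re.IGNORECASE,
    )
    return any(s.lower() in ("critical", "major") for s in severities)
```

A generator-reviewer loop could then regenerate the artifact until this returns False, or escalate to a human after a fixed number of attempts.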
Original Abstract
Large language models struggle to catch errors in their own outputs when the review happens in the same session that produced them. This paper introduces Cross-Context Review (CCR), a straightforward method where the review is conducted in a fresh session with no access to the production conversation history. We ran a controlled experiment: 30 artifacts (code, technical documents, presentation scripts) with 150 injected errors, tested under four review conditions -- same-session Self-Review (SR), repeated Self-Review (SR2), context-aware Subagent Review (SA), and Cross-Context Review (CCR). Over 360 reviews, CCR reached an F1 of 28.6%, outperforming SR (24.6%, p=0.008, d=0.52), SR2 (21.7%, p<0.001, d=0.72), and SA (23.8%, p=0.004, d=0.57). The SR2 result matters most for interpretation: reviewing twice in the same session did not beat reviewing once (p=0.11), which rules out repetition as an explanation for CCR's advantage. The benefit comes from context separation itself. CCR works with any model, needs no infrastructure, and costs only one extra session.