GlyphBanana: Advancing Precise Text Rendering Through Agentic Workflows
TL;DR Highlight
A training-free agent pipeline that accurately renders text in images — even mathematical formulas and rare CJK characters.
Who Should Read
Developers building image generation pipelines that need accurate text rendering (design tools, document generation, localization), and researchers working on visual text generation.
Core Mechanics
- Existing text-to-image models struggle to render text accurately, especially mathematical formulas and rare characters
- Proposed an agent pipeline that decomposes text rendering into: layout planning, character-level rendering, and composition
- No training or fine-tuning required — works by orchestrating existing tools and models
- Handles complex typography challenges: mathematical symbols, CJK characters, mixed scripts
- The pipeline produces images with accurately placed, correctly rendered text overlays
- Agent dynamically selects appropriate rendering strategies based on text content type
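The decomposition above can be sketched as a minimal orchestration loop. The strategy heuristics and box placement below are illustrative assumptions, not the paper's actual logic:

```python
# Hypothetical sketch of the three stages (layout planning, character-level
# rendering, composition); names and heuristics are assumptions for illustration.

def classify_content(text: str) -> str:
    """Pick a rendering strategy from the text content type."""
    if any(ch in text for ch in "\\^_{}"):              # LaTeX-style markup
        return "formula"
    if any("\u4e00" <= ch <= "\u9fff" for ch in text):  # CJK ideographs
        return "cjk"
    return "latin"

def plan_render(items: list[str]) -> list[dict]:
    """Stage 1: layout planning. Stages 2-3 (glyph rendering, composition)
    are left as stubs; here each item just gets a stacked bounding box."""
    regions = []
    for i, text in enumerate(items):
        regions.append({
            "content": text,
            "strategy": classify_content(text),
            "bbox": [0.1, round(0.1 + 0.2 * i, 2), 0.9, round(0.25 + 0.2 * i, 2)],
        })
    return regions

regions = plan_render(["E=mc^2", "你好"])
```

Keeping the strategy choice separate from layout makes it easy to route formulas, CJK, and Latin text through different renderers.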
Evidence
- Significantly outperforms base text-to-image models on text accuracy metrics for both Latin and CJK scripts
- Correctly renders mathematical formulas that base models completely fail on
- Zero-shot generalization to rare characters not seen during development
- User studies confirm perceived quality improvement for text-heavy image generation tasks
How to Apply
- Use this pipeline for any image generation task requiring accurate text — replace direct model inference with the agent pipeline
- The modular design lets you swap in different base models without changing the text rendering logic
- For math-heavy content (textbooks, scientific figures), this approach is currently the most reliable training-free solution
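The swap point can be sketched by passing the base T2I model in as a callable, so the text-rendering logic never changes when you switch backends. The names here are assumptions, not the repo's API:

```python
from typing import Callable

def make_pipeline(generate_background: Callable[[str], str]):
    """Bind any base T2I backend; the text-rendering logic stays fixed."""
    def run(prompt: str, texts: list[str]) -> dict:
        background = generate_background(prompt)  # swappable T2I call
        # glyph layout, rendering, and composition would happen here,
        # independent of which backend produced `background`
        return {"background": background, "texts": texts}
    return run

# Any backend with the same signature plugs in unchanged:
fake_t2i = lambda prompt: f"<image: {prompt}>"
pipeline = make_pipeline(fake_t2i)
result = pipeline("a street sign at dusk, no text visible", ["STOP"])
```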
Code Example
# GlyphBanana Typography Analysis Prompt (can be used directly with VLM)
typography_analysis_prompt = """
You are an expert in image typography analysis.
Given a reference image with a 5x5 grid and coordinate annotations,
analyze the natural text rendering style and overall scene.
Then plan the best typography layout for each text/formula item.
Critical constraints:
- Bounding boxes must remain flat and frontal (no perspective distortion)
- Red grid lines are positioning aids only — ignore them for style description
- Grid coordinates: {0.0, 0.2, 0.4, 0.6, 0.8, 1.0} on each axis
Per-region output fields:
- content: target text or formula
- bbox: [xmin, ymin, xmax, ymax] in [0,1]
- font: from registered font list or 'auto'
- font_weight: light/regular/bold
- font_size_ratio: scalar in [0.1, 1.0]
- color: white/black/red/blue/green/yellow/orange/brown/gray/gold/silver/purple/pink
- is_latex: boolean
- alignment: left/center/right
- rotation: degrees (0 = horizontal)
Return strict JSON with keys: image_analysis, text_regions
"""
# Generate Clean Prompt (keep only background, remove text instructions)
clean_prompt_template = """
Remove ALL quoted text, formulas, and text-rendering instructions from the prompt.
Keep ONLY the scene/background/style description.
Add 'no text visible' at the end.
Example:
Input: A classroom blackboard displays "E=mc²" in elegant chalk writing.
Output: An empty classroom blackboard as background, clear and without any text. No text visible.
Input prompt: {user_prompt}
Output ONLY the cleaned prompt, nothing else.
"""
# Generate Style Prompt (for text-background harmonization)
style_prompt_template = """
Generate a SHORT image-editing instruction (10-30 words).
Goal: restyle foreground text to harmonize with the background
while keeping the background untouched.
Do NOT move, resize, or alter any text content or position.
Background style: {background_style}
Dominant colors: {colors}
Text style hint: {hint}
Output ONLY the instruction in English, 10-30 words.
"""Terminology
Related Resources
Original Abstract
Despite recent advances in generative models driving significant progress in text rendering, accurately generating complex text and mathematical formulas remains a formidable challenge. This difficulty primarily stems from the limited instruction-following capabilities of current models when encountering out-of-distribution prompts. To address this, we introduce GlyphBanana, alongside a corresponding benchmark specifically designed for rendering complex characters and formulas. GlyphBanana employs an agentic workflow that integrates auxiliary tools to inject glyph templates into both the latent space and attention maps, facilitating the iterative refinement of generated images. Notably, our training-free approach can be seamlessly applied to various Text-to-Image (T2I) models, achieving superior precision compared to existing baselines. Extensive experiments demonstrate the effectiveness of our proposed workflow. Associated code is publicly available at https://github.com/yuriYanZeXuan/GlyphBanana.