GlyphBanana: Advancing Precise Text Rendering Through Agentic Workflows
TL;DR Highlight
A training-free agent pipeline that accurately renders text in images — even mathematical formulas and rare CJK characters.
Who Should Read
Developers building image generation pipelines that need accurate text rendering (design tools, document generation, localization), and researchers working on visual text generation.
Core Mechanics
- Existing text-to-image models struggle to render text accurately, especially mathematical formulas and rare characters
- Proposed an agent pipeline that decomposes text rendering into: layout planning, character-level rendering, and composition
- No training or fine-tuning required — works by orchestrating existing tools and models
- Handles complex typography challenges: mathematical symbols, CJK characters, mixed scripts
- The pipeline produces images with accurately placed, correctly rendered text overlays
- Agent dynamically selects appropriate rendering strategies based on text content type
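The decomposition above can be sketched as a minimal orchestration loop. The strategy heuristics and box placement below are illustrative assumptions, not the paper's actual logic:

```python
# Hypothetical sketch of the three stages (layout planning, character-level
# rendering, composition); names and heuristics are assumptions for illustration.

def classify_content(text: str) -> str:
    """Pick a rendering strategy from the text content type."""
    if any(ch in text for ch in "\\^_{}"):              # LaTeX-style markup
        return "formula"
    if any("\u4e00" <= ch <= "\u9fff" for ch in text):  # CJK ideographs
        return "cjk"
    return "latin"

def plan_render(items: list[str]) -> list[dict]:
    """Stage 1: layout planning. Stages 2-3 (glyph rendering, composition)
    are left as stubs; here each item just gets a stacked bounding box."""
    regions = []
    for i, text in enumerate(items):
        regions.append({
            "content": text,
            "strategy": classify_content(text),
            "bbox": [0.1, round(0.1 + 0.2 * i, 2), 0.9, round(0.25 + 0.2 * i, 2)],
        })
    return regions

regions = plan_render(["E=mc^2", "你好"])
```

Keeping the strategy choice separate from layout makes it easy to route formulas, CJK, and Latin text through different renderers.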
Evidence
- Significantly outperforms base text-to-image models on text accuracy metrics for both Latin and CJK scripts
- Correctly renders mathematical formulas that base models completely fail on
- Zero-shot generalization to rare characters not seen during development
- User studies confirm perceived quality improvement for text-heavy image generation tasks
How to Apply
- Use this pipeline for any image generation task requiring accurate text — replace direct model inference with the agent pipeline
- The modular design lets you swap in different base models without changing the text rendering logic
- For math-heavy content (textbooks, scientific figures), this approach is currently the most reliable training-free solution
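The swap point can be sketched by passing the base T2I model in as a callable, so the text-rendering logic never changes when you switch backends. The names here are assumptions, not the repo's API:

```python
from typing import Callable

def make_pipeline(generate_background: Callable[[str], str]):
    """Bind any base T2I backend; the text-rendering logic stays fixed."""
    def run(prompt: str, texts: list[str]) -> dict:
        background = generate_background(prompt)  # swappable T2I call
        # glyph layout, rendering, and composition would happen here,
        # independent of which backend produced `background`
        return {"background": background, "texts": texts}
    return run

# Any backend with the same signature plugs in unchanged:
fake_t2i = lambda prompt: f"<image: {prompt}>"
pipeline = make_pipeline(fake_t2i)
result = pipeline("a street sign at dusk, no text visible", ["STOP"])
```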
Code Example
# GlyphBanana Typography Analysis Prompt (can be used directly with VLM)
typography_analysis_prompt = """
You are an expert in image typography analysis.
Given a reference image with a 5x5 grid and coordinate annotations,
analyze the natural text rendering style and overall scene.
Then plan the best typography layout for each text/formula item.
Critical constraints:
- Bounding boxes must remain flat and frontal (no perspective distortion)
- Red grid lines are positioning aids only — ignore them for style description
- Grid coordinates: {0.0, 0.2, 0.4, 0.6, 0.8, 1.0} on each axis
Per-region output fields:
- content: target text or formula
- bbox: [xmin, ymin, xmax, ymax] in [0,1]
- font: from registered font list or 'auto'
- font_weight: light/regular/bold
- font_size_ratio: scalar in [0.1, 1.0]
- color: white/black/red/blue/green/yellow/orange/brown/gray/gold/silver/purple/pink
- is_latex: boolean
- alignment: left/center/right
- rotation: degrees (0 = horizontal)
Return strict JSON with keys: image_analysis, text_regions
"""
# Generate Clean Prompt (keep only background, remove text instructions)
clean_prompt_template = """
Remove ALL quoted text, formulas, and text-rendering instructions from the prompt.
Keep ONLY the scene/background/style description.
Add 'no text visible' at the end.
Example:
Input: A classroom blackboard displays "E=mc²" in elegant chalk writing.
Output: An empty classroom blackboard as background, clear and without any text. No text visible.
Input prompt: {user_prompt}
Output ONLY the cleaned prompt, nothing else.
"""
# Generate Style Prompt (for text-background harmonization)
style_prompt_template = """
Generate a SHORT image-editing instruction (10-30 words).
Goal: restyle foreground text to harmonize with the background
while keeping the background untouched.
Do NOT move, resize, or alter any text content or position.
Background style: {background_style}
Dominant colors: {colors}
Text style hint: {hint}
Output ONLY the instruction in English, 10-30 words.
"""Terminology
Related Resources
Original Abstract
Despite recent advances in generative models driving significant progress in text rendering, accurately generating complex text and mathematical formulas remains a formidable challenge. This difficulty primarily stems from the limited instruction-following capabilities of current models when encountering out-of-distribution prompts. To address this, we introduce GlyphBanana, alongside a corresponding benchmark specifically designed for rendering complex characters and formulas. GlyphBanana employs an agentic workflow that integrates auxiliary tools to inject glyph templates into both the latent space and attention maps, facilitating the iterative refinement of generated images. Notably, our training-free approach can be seamlessly applied to various Text-to-Image (T2I) models, achieving superior precision compared to existing baselines. Extensive experiments demonstrate the effectiveness of our proposed workflow. Associated code is publicly available at https://github.com/yuriYanZeXuan/GlyphBanana.