Letting AI play my game – building an agentic test harness to help play-testing | AI Paper Digest

TL;DR Highlight

IndieGameAgent automatically playtests games using an LLM, solving a QA bottleneck for solo developers.

Who Should Read

Solo indie game developers, or those building applications with text-based interfaces, seeking to automate testing environments with AI agents.

Core Mechanics

The original post was inaccessible because of Vercel security restrictions, so this summary relies on community comments, which indicate that the author built an 'agentic test harness' where an LLM directly plays and tests the game.
The core idea involves a separate text-only renderer that converts game state into text, allowing the LLM to understand the game without 'seeing' the screen visually.
This text renderer approach is praised as an ingenious design, circumventing the 'visual grounding problem' where AI must analyze screenshots or DOMs.
The architecture leverages MCP (Model Context Protocol) to enable the agent to directly access and manipulate the game's actual state.
This approach mirrors E2E testing, but with an LLM agent as the tester, uncovering unexpected bugs and game balance issues without pre-defined scripts.
Commenters noted that starting the agent with only CLI usage instructions and no prior game context gives it a fresh perspective, with an effect similar to rubber-duck debugging.
Commenters also reported that agents enable a workflow in which features are implemented and verified by E2E tests while the developer is away.
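The loop described above (serialize state to text, ask the agent for an action, apply it, repeat) can be sketched roughly as follows. This is a minimal illustration, not the author's implementation: the toy game, `renderText`, `runEpisode`, and `scriptedAgent` are all hypothetical names, and `scriptedAgent` is a deterministic stand-in for the point where a real harness would call an LLM.

```javascript
// Hypothetical toy game: state is plain data with no rendering dependencies.
function createGame() {
  return { turn: 0, hp: 10, room: 'cellar', done: false };
}

function availableActions(state) {
  return state.done ? [] : ['wait', 'climb'];
}

// Pure state transition: returns a new state and rejects illegal actions.
function applyAction(state, action) {
  if (!availableActions(state).includes(action)) {
    throw new Error(`Illegal action: ${action}`);
  }
  const next = { ...state, turn: state.turn + 1 };
  if (action === 'wait') next.hp -= 1;
  if (action === 'climb') {
    next.room = 'exit';
    next.done = true;
  }
  if (next.hp <= 0) next.done = true;
  return next;
}

// Text-only renderer: the only view of the game the agent ever receives.
function renderText(state) {
  return [
    `Turn: ${state.turn}`,
    `HP: ${state.hp}`,
    `Room: ${state.room}`,
    `Actions: ${availableActions(state).join(', ')}`,
  ].join('\n');
}

// Stand-in for an LLM call (assumption): waits twice, then climbs.
function scriptedAgent(observation) {
  const turn = Number(/Turn: (\d+)/.exec(observation)[1]);
  return turn < 2 ? 'wait' : 'climb';
}

// Harness loop: observe, decide, act, and record a transcript for later review.
function runEpisode(agent, maxTurns = 20) {
  let state = createGame();
  const transcript = [];
  while (!state.done && state.turn < maxTurns) {
    const observation = renderText(state);
    const action = agent(observation);
    transcript.push({ observation, action });
    state = applyAction(state, action);
  }
  return { state, transcript };
}
```

Calling `runEpisode(scriptedAgent)` returns the final state plus a transcript of observation and action pairs. Because the agent only ever sees the output of `renderText`, swapping the scripted policy for a model call should not require touching the game logic.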

Evidence

Some suggested using a Monte Carlo headless simulator instead of an LLM, citing speed and cost advantages for deterministic games with parallelizable simulations.
A developer testing AI on a real-time physics-based 2D game found browser MCP impractical because objects flew off-screen before the AI could capture screenshots, and opted for a hybrid API.
An E2E web test user shared a token optimization tip: switching from raw DOM to accessibility-tree references reduced token usage tenfold and improved agent accuracy.
Another user found that providing agents with both source code and live browser snapshots simultaneously maximized test quality, avoiding false positives from code-only or browser-only approaches.
A user connecting an MCP server to a MUD saw Claude Code agents collaboratively building new sections in separate windows.
A team introducing agents to a Pokémon-style MMORPG received negative feedback: 'I won't waste precious tokens playing a game'.

How to Apply

If building a text-based or turn-based game, completely separate game logic and rendering, creating a dedicated renderer to serialize game state into text. This simplifies building an agentic test harness by eliminating visual processing requirements.
For non-real-time, deterministic games, consider a Monte Carlo simulation instead of costly LLMs for faster, more efficient balance tuning.
To reduce token costs in LLM-based testing, provide structured text, such as accessibility-tree references or key state values, instead of raw browser or game state.
If you want the agent to self-verify implementations, instruct it to 'write E2E tests and confirm with screenshots' during code generation, enabling autonomous implementation-verification loops.
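As a rough illustration of the Monte Carlo alternative, the sketch below estimates a win rate by running many headless duels with a seeded random number generator. The duel rules, the unit shape (`hp`, `minDmg`, `maxDmg`), and the function names are hypothetical and chosen only for this example; `mulberry32` is a small, widely used seedable PRNG, included so that runs are reproducible.

```javascript
// Small seedable PRNG (mulberry32) so that simulation runs are reproducible.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Uniform integer damage in [minDmg, maxDmg].
function rollDamage(rng, unit) {
  return unit.minDmg + Math.floor(rng() * (unit.maxDmg - unit.minDmg + 1));
}

// One headless duel (hypothetical rules): the player strikes first each round.
// Returns true if the player wins; a duel that hits maxRounds counts as a loss.
function simulateDuel(rng, player, enemy, maxRounds = 100) {
  let playerHp = player.hp;
  let enemyHp = enemy.hp;
  for (let round = 0; round < maxRounds; round++) {
    enemyHp -= rollDamage(rng, player);
    if (enemyHp <= 0) return true;
    playerHp -= rollDamage(rng, enemy);
    if (playerHp <= 0) return false;
  }
  return false;
}

// Monte Carlo estimate: the fraction of simulated duels that the player wins.
function estimateWinRate(player, enemy, trials, seed) {
  const rng = mulberry32(seed);
  let wins = 0;
  for (let i = 0; i < trials; i++) {
    if (simulateDuel(rng, player, enemy)) wins++;
  }
  return wins / trials;
}
```

A balance check might call `estimateWinRate(player, enemy, 10000, 42)` for each candidate stat change and flag matchups whose win rate falls outside a target band. No model call is involved, so the cost per trial is only the game logic itself.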

Code Example

// Example architecture pattern mentioned in the community

// 1. Separate renderer to serialize game state to text
function textRenderer(gameState) {
  return [
    `Turn: ${gameState.turn}`,
    `Player HP: ${gameState.player.hp}/${gameState.player.maxHp}`,
    `Location: ${gameState.currentRoom.name}`,
    `Available actions: ${gameState.availableActions.join(', ')}`,
    `Inventory: ${gameState.player.inventory.map(i => i.name).join(', ')}`,
  ].join('\n');
}

// 2. In-process MCP server pattern (e.g. an ECS/Fargate environment, with no stdio process boundary)
// create_sdk_mcp_server + @tool decorator style (these are Python names from the Claude Agent SDK, not JavaScript)
// Keep the browser handle within the tool definition's scope

// 3. Token saving with accessibility-tree based references
// raw DOM (token waste):
// <div id="enemy-hp-bar" class="hp-bar" data-value="80" ...>
// accessibility-tree reference (token saving):
// e1: [button] "Attack" e2: [button] "Flee" e3: [text] "Enemy HP: 80/100"
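The reference format in item 3 could be produced by a small serializer along these lines. This is only a sketch: the element shape (`role`, `name`) and the `buildRefs` helper are assumptions made for illustration, not the API of any particular browser tool.

```javascript
// Hypothetical input: interactive elements already extracted from the UI.
const sampleElements = [
  { role: 'button', name: 'Attack' },
  { role: 'button', name: 'Flee' },
  { role: 'text', name: 'Enemy HP: 80/100' },
];

// Assign short refs (e1, e2, ...) and keep a lookup table so that a ref
// chosen by the agent can be mapped back to the underlying element.
function buildRefs(elements) {
  const refs = new Map();
  const lines = elements.map((el, i) => {
    const ref = `e${i + 1}`;
    refs.set(ref, el);
    return `${ref}: [${el.role}] "${el.name}"`;
  });
  return { text: lines.join('\n'), refs };
}
```

The agent would be shown only `buildRefs(sampleElements).text` and would answer with a ref such as `e1`, which the harness resolves through the `refs` map. Attributes such as ids and class names never reach the prompt.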

Terminology

Agentic Test Harness: An automated testing environment where an AI agent interacts with software as a human would, exploring various scenarios independently of scripted tests.

MCP (Model Context Protocol): A protocol enabling AI models to use external tools (file systems, browsers, game APIs) in a standardized way, connecting AI and tools with a defined interface.

Monte Carlo Simulation: A method using thousands to millions of random simulations to estimate probabilistic outcomes, used in game balance testing to identify issues and average win rates.

Headless: A mode where software runs in the background without a graphical user interface (GUI), enabling faster simulations by running game logic without rendering.

Visual Grounding: The ability of AI to accurately understand what it 'sees' on a screen. Issues arise when AI misinterprets game UI elements from screenshots alone.

GOAP (Goal-Oriented Action Planning): A technique for AI characters to automatically plan actions to achieve goals. A traditional AI approach for game NPCs, distinct from LLMs.
