Letting AI play my game – building an agentic test harness to help play-testing
TL;DR Highlight
IndieGameAgent automatically playtests games using an LLM, solving a QA bottleneck for solo developers.
Who Should Read
Solo indie game developers, and anyone building applications with text-based interfaces, who want to automate testing with AI agents.
Core Mechanics
- The original post is inaccessible (blocked by a Vercel security check), but community comments confirm the author built an 'agentic test harness' in which an LLM directly plays and tests the game.
- The core idea involves a separate text-only renderer that converts game state into text, allowing the LLM to understand the game without 'seeing' the screen visually.
- This text renderer approach is praised as an ingenious design, circumventing the 'visual grounding problem' where AI must analyze screenshots or DOMs.
- The architecture leverages MCP (Model Context Protocol) to enable the agent to directly access and manipulate the game's actual state.
- This approach mirrors E2E testing, but with an LLM agent as the tester, uncovering unexpected bugs and game balance issues without pre-defined scripts.
- The community shared that starting the agent with only CLI usage instructions—without prior game context—provides a fresh perspective akin to rubber-duck debugging.
- Real-world experience shows that agents enable a workflow in which features are implemented and verified against E2E tests while the developer is away.
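The mechanics above can be sketched as a tiny harness loop. Everything here is illustrative: `decideAction` is a stub standing in for the real LLM call, and the `game` interface (`renderText`, `availableActions`, `apply`, `isOver`) is a hypothetical shape, not the author's actual API.

```javascript
// Stub standing in for the LLM call; here it just picks the first
// available action so the loop is runnable without a model.
function decideAction(observation, actions) {
  return actions[0];
}

// Minimal agentic play-test loop: observe (text render), decide, act.
// Returns a transcript that doubles as a bug report when a run goes wrong.
function playTest(game, maxTurns = 20) {
  const transcript = [];
  for (let turn = 0; turn < maxTurns && !game.isOver(); turn++) {
    const observation = game.renderText();      // text-only renderer
    const actions = game.availableActions();
    const action = decideAction(observation, actions);
    transcript.push({ turn, observation, action });
    game.apply(action);
  }
  return transcript;
}
```

In the real harness the decision step would be an LLM call (e.g. over MCP); because the observation is plain text, no screenshot or DOM analysis is needed.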
Evidence
- Some commenters suggested a headless Monte Carlo simulator instead of an LLM, citing speed and cost advantages for deterministic games whose simulations parallelize well.
- A developer testing AI on a real-time, physics-based 2D game found browser MCP impractical (objects flew off-screen before the AI could capture a screenshot) and opted for a hybrid API instead.
- An E2E web-testing user shared a token-optimization tip: switching from raw DOM to accessibility-tree references cut token usage tenfold and improved agent accuracy.
- Another user found that giving agents both the source code and live browser snapshots at the same time maximized test quality, avoiding the false positives of code-only or browser-only approaches.
- A user who connected an MCP server to a MUD watched Claude Code agents collaboratively build new sections in separate windows, while a team that introduced agents to a Pokémon-style MMORPG got negative feedback: "I won't waste precious tokens playing a game."
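The accessibility-tree tip can be illustrated with a minimal sketch. The `toA11yRefs` helper and its node shape are assumptions for illustration, not an API from the thread:

```javascript
// Hypothetical sketch: flatten a simplified accessibility tree into the
// short role/name references the agent can cite in its next action,
// instead of handing it raw DOM markup.
function toA11yRefs(nodes) {
  // nodes: [{ role, name }, ...]
  return nodes
    .map((n, i) => `e${i + 1}: [${n.role}] "${n.name}"`)
    .join('\n');
}
```

A raw DOM node like `<div id="enemy-hp-bar" class="hp-bar" data-value="80" ...>` collapses to a single short line such as `e3: [text] "Enemy HP: 80/100"`, which is where the reported tenfold token saving comes from.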
How to Apply
- If you are building a text-based or turn-based game, separate game logic from rendering completely and write a dedicated renderer that serializes game state to text. This removes the visual-processing requirement and makes an agentic test harness far simpler to build.
- For non-real-time, deterministic games, consider Monte Carlo simulation instead of a costly LLM for faster, cheaper balance tuning.
- To cut token costs in LLM-based testing, provide structured text (accessibility-tree references or key state values) rather than raw browser or game state.
- If you want the agent to self-verify its work, instruct it to "write E2E tests and confirm with screenshots" during code generation, enabling an autonomous implement-and-verify loop.
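The Monte Carlo suggestion might look like the following sketch for a simple duel system. The unit stats and the damage formula are made-up examples; the point is that a headless simulator can run thousands of trials in the time one LLM turn would take:

```javascript
// One simulated duel: the attacker strikes first, damage is the unit's
// atk scaled by a random factor in [0.5, 1.5). Returns true if a wins.
function duel(a, b) {
  let hpA = a.hp, hpB = b.hp;
  while (hpA > 0 && hpB > 0) {
    hpB -= a.atk * (0.5 + Math.random());
    if (hpB <= 0) break;
    hpA -= b.atk * (0.5 + Math.random());
  }
  return hpA > 0;
}

// Headless Monte Carlo balance check: run many duels and report the
// attacker's win rate. A mirror match drifting far from ~50% (beyond the
// expected first-strike edge) signals a balance problem.
function winRate(a, b, trials = 10000) {
  let wins = 0;
  for (let i = 0; i < trials; i++) if (duel(a, b)) wins++;
  return wins / trials;
}
```

A grossly over-statted attacker should win every trial, and a mirror match should land in a mixed range; deviations from that are exactly the balance issues the thread suggests hunting for without an LLM.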
Code Example
// Example architecture pattern mentioned in the community
// 1. Separate renderer to serialize game state to text
function textRenderer(gameState) {
  return [
    `Turn: ${gameState.turn}`,
    `Player HP: ${gameState.player.hp}/${gameState.player.maxHp}`,
    `Location: ${gameState.currentRoom.name}`,
    `Available actions: ${gameState.availableActions.join(', ')}`,
    `Inventory: ${gameState.player.inventory.map(i => i.name).join(', ')}`,
  ].join('\n');
}
// 2. In-process MCP server pattern (for ECS/Fargate environments without
//    stdio process boundaries): register tools via create_sdk_mcp_server
//    with the @tool decorator, and keep the browser handle alive in the
//    tool definitions' enclosing scope.
// 3. Token saving with accessibility-tree based references
//    Raw DOM (token waste):
//      <div id="enemy-hp-bar" class="hp-bar" data-value="80" ...>
//    Accessibility-tree references (token saving):
//      e1: [button] "Attack"  e2: [button] "Flee"  e3: [text] "Enemy HP: 80/100"
Related Papers
Show HN: adamsreview – better multi-agent PR reviews for Claude Code
An open-source plugin for Claude Code that runs up to seven parallel sub-agents, each reviewing a PR from a different perspective, and even applies automatic fixes. It claims to catch more real bugs than the built-in /review or CodeRabbit, though the community voiced skepticism about its complexity and practical value.
How Fast Does Claude, Acting as a User Space IP Stack, Respond to Pings?
An experiment that had Claude Code parse raw IP packets and construct ICMP echo replies so that it actually responds to pings—an entertaining case that pushes the idea of 'Markdown is the code, the LLM is the processor' all the way down to the network-stack level.
Show HN: Git for AI Agents
A version-control tool that automatically tracks every tool call made by AI coding agents (Claude Code and others) and supports blame down to which prompt wrote which line of code.
Principles for agent-native CLIs
A write-up of principles for designing CLI tools that AI agents can use well; as agents drive CLIs as tools more and more often, these design practices are becoming practically important.
Agent-harness-kit scaffolding for multi-agent workflows (MCP, provider-agnostic)
A scaffolding tool that orchestrates multiple AI agents collaborating in distinct roles, letting you stand up a multi-agent pipeline quickly with zero configuration, much like Vite.
Show HN: Tilde.run – Agent sandbox with a transactional, versioned filesystem
A tool that provides an isolated sandbox where an AI agent can touch real production data and still be rolled back, unifying GitHub/S3/Google Drive into a single version-controlled filesystem.