Agent
Latest 60 papers on Agent.
Show HN: adamsreview – better multi-agent PR reviews for Claude Code
Claude Code에서 최대 7개의 병렬 서브 에이전트가 각각 다른 관점으로 PR을 리뷰하고, 자동 수정까지 해주는 오픈소스 플러그인이다. 기존 /review나 CodeRabbit보다 실제 버그를 더 많이 잡는다고 주장하지만 커뮤니티에서는 복잡도와 실효성에 대한 회의론도 나왔다.
How Fast Does Claude, Acting as a User Space IP Stack, Respond to Pings?
Claude Code에게 IP 패킷을 직접 파싱하고 ICMP echo reply를 구성하도록 시켜서 실제로 ping에 응답하게 만든 실험으로, 'Markdown이 곧 코드이고 LLM이 프로세서'라는 아이디어를 네트워크 스택 수준까지 밀어붙인 재미있는 사례다.
Using Claude Code: The unreasonable effectiveness of HTML
Claude Code 팀이 Markdown 대신 HTML을 LLM 출력 포맷으로 선호하기 시작한 이유와 그 실용적 장점을 정리한 글로, AI와 함께 문서/스펙/대시보드를 만드는 워크플로우에 직접적인 영향을 준다.
Can LLMs model real-world systems in TLA+?
LLM이 TLA+ 명세를 작성할 때 문법은 잘 통과하지만 실제 시스템과의 동작 일치도(conformance)는 46% 수준에 그친다는 걸 체계적으로 검증한 벤치마크 연구로, AI 기반 형식 검증의 현실적 한계를 보여준다.
Show HN: Git for AI Agents
AI 코딩 에이전트(Claude Code 등)가 수행한 모든 툴 호출을 자동으로 추적하고, 어떤 프롬프트가 어느 코드 줄을 작성했는지 blame까지 가능한 버전 관리 도구다.
Principles for agent-native CLIs
AI 에이전트가 CLI 도구를 더 잘 사용할 수 있도록 설계하는 원칙들을 정리한 글로, 에이전트가 CLI를 도구로 활용하는 빈도가 높아지면서 이 설계 방식이 실용적으로 중요해지고 있다.
Agent-harness-kit scaffolding for multi-agent workflows (MCP, provider-agnostic)
여러 AI 에이전트가 서로 역할을 나눠 협업할 수 있도록 조율하는 scaffolding 도구로, Vite처럼 설정 없이 빠르게 멀티 에이전트 파이프라인을 구성할 수 있다.
ProgramBench: Can language models rebuild programs from scratch?
LLM이 FFmpeg, SQLite, PHP 인터프리터 같은 실제 소프트웨어를 문서만 보고 처음부터 재구현할 수 있는지 측정하는 새 벤치마크로, 최고 모델도 전체 태스크의 3%만 95% 이상 통과하는 수준에 그쳤다.
Show HN: Tilde.run – Agent sandbox with a transactional, versioned filesystem
AI 에이전트가 실제 프로덕션 데이터를 건드려도 롤백할 수 있는 격리된 샌드박스 환경을 제공하는 도구로, GitHub/S3/Google Drive를 하나의 버전 관리 파일시스템으로 묶어준다.
MOSAIC-Bench: Measuring Compositional Vulnerability Induction in Coding Agents
티켓 3장으로 쪼개면 Claude/GPT도 보안 취약점 코드를 53~86% 확률로 그냥 짜준다.
Show HN: Airbyte Agents – context for agents across multiple data sources
Airbyte가 Slack, Salesforce, Linear 등 여러 SaaS 시스템의 데이터를 미리 인덱싱해서 Agent가 API를 일일이 뒤지지 않아도 되는 Context Store를 출시했다. 기존 MCP 방식보다 토큰을 최대 90%까지 줄이는 효과를 확인했다.
How OpenAI delivers low-latency voice AI at scale
OpenAI redesigned its WebRTC stack to serve real-time voice AI to over 900 million users, detailing the design decisions and trade-offs of a relay + transceiver split architecture.
FlexSQL: Flexible Exploration and Execution Make Better Text-to-SQL Agents
고정된 파이프라인 대신 추론 중 언제든 DB를 탐색·실행할 수 있는 Text-to-SQL 에이전트로 Spider2.0 벤치마크에서 gpt-o3, DeepSeek-R1 기반 시스템을 더 작은 모델로 능가
Mitigating Misalignment Contagion by Steering with Implicit Traits
여러 AI 에이전트가 상호작용할 때 나쁜 행동이 전파되는 현상을 발견하고, 시스템 프롬프트 반복 대신 모델의 암묵적 성격을 주기적으로 주입해 막는 방법을 제안.
Specsmaxxing – On overcoming AI psychosis, and why I write specs in YAML
Structuring acceptance criteria in YAML with the acai.sh toolkit mitigates 'AI psychosis' – the loss of context and requirements – when working with AI coding agents.
Show HN: Mljar Studio – local AI data analyst that saves analysis as notebooks
MLJAR Studio converts natural language into Python code, automating local data analysis and exporting results as Jupyter Notebooks.
Show HN: Filling PDF forms with AI using client-side tool calling
SimplePDF Copilot automates PDF form filling via chat, leveraging client-side tool calling to keep document data on-device.
RunAgent: Interpreting Natural-Language Plans with Constraint-Guided Execution
LLM이 자연어 플랜을 단계별로 확실히 실행하도록 IF/GOTO/FORALL 같은 제어 구문과 자동 constraint 검증을 붙인 에이전트 실행 프레임워크.
To Call or Not to Call: A Framework to Assess and Optimize LLM Tool Calling
LLM이 웹 검색 같은 외부 도구를 언제 써야 하는지 잘못 판단하고 있으며, 모델 내부 hidden state로 이를 교정할 수 있다.
Show HN: Pu.sh – a full coding-agent harness in 400 lines of shell
ShellAgent runs LLM-powered coding tasks with just curl and awk, ditching npm, pip, and Docker.
From Unstructured Recall to Schema-Grounded Memory: Reliable AI Memory via Iterative, Schema-Aware Extraction
RAG 스타일 텍스트 검색 대신 Schema로 정의된 구조화 레코드에 메모리를 저장하면, 정확한 사실 조회·상태 추적·집계 쿼리에서 압도적으로 높은 정확도를 얻을 수 있다.
In-Context Prompting Obsoletes Agent Orchestration for Procedural Tasks
LangGraph 같은 에이전트 오케스트레이션 프레임워크 쓰지 말고, 절차 전체를 시스템 프롬프트에 넣으면 품질도 높고 실패율도 낮다.
Ramp's Sheets AI Exfiltrates Financials
Ramp's spreadsheet AI agent succumbed to a hidden prompt injection within an external dataset, automatically inserting malicious formulas and exfiltrating confidential financial data to an external server.
Show HN: A new benchmark for testing LLMs for deterministic outputs
Structured Output Benchmark assesses LLM JSON handling across seven metrics, revealing performance beyond schema compliance.
Bian Que: An Agentic Framework with Flexible Skill Arrangement for Online System Operations
LLM Agent automates incident response, slashing alerts by 75% and resolution times by 50%.
Show HN: DAC – open-source dashboard as code tool for agents and humans
DAC builds open-source dashboards defined as code—using YAML and TSX—and allows AI agents to automatically generate and modify them.
Letting AI play my game – building an agentic test harness to help play-testing
IndieGameAgent automatically playtests games using an LLM, solving a QA bottleneck for solo developers.
Claude.ai unavailable and elevated errors on the API
Anthropic’s entire service suite—Claude.ai, the API, Claude Code—became inaccessible for 1 hour and 18 minutes (17:34–18:52 UTC), sparking outrage among enterprise users over reliability concerns.
AgentWard: A Lifecycle Security Architecture for Autonomous AI Agents
AI Defenses systematically designs security layers across the AI lifecycle to mitigate risks.
Less Is More: Engineering Challenges of On-Device Small Language Model Integration in a Mobile Application
Five failure modes and eight practical solutions emerged after five days of running on-device SLMs (Gemma 4 E2B, Qwen3 0.6B) with Wordle.
Tendril – a self-extending agent that builds and registers its own tools
Tendril demonstrates a self-extending AI agent pattern by dynamically writing and registering tools when needed, creating a growing repository of capabilities with each session.
Show HN: OSS Agent I built topped the TerminalBench on Gemini-3-flash-preview
Dirac cuts API costs 64.8% and achieves 65.2% on TerminalBench-2 with efficient context management.
EvanFlow – A TDD driven feedback loop for Claude Code
EvanFlow automates code brainstorming, TDD, and validation in Claude Code with 16 skills triggered by a single prompt.
An AI agent deleted our production database. The agent's confession is below
Cursor AI Agent가 Railway 프로덕션 데이터베이스와 백업까지 통째로 삭제한 사고 사례로, AI Agent에 과도한 권한을 줄 때의 위험성과 엔지니어링 통제의 중요성을 보여준다.
Show HN: A Karpathy-style LLM wiki your agents maintain (Markdown and Git)
WUPHF builds a shared knowledge base using a Git-based Markdown Wiki, enabling multiple AI agents—including Claude and Codex—to autonomously divide and execute tasks.
Agentic AI systems violate the implicit assumptions of database design
AI Agents shatter a 40-year assumption—that databases only accept deterministic queries from humans—and this post details specific defensive patterns to mitigate the resulting risks.
Tell HN: Claude 4.7 is ignoring stop hooks
Anthropic’s Claude Code reveals a security feature designed to ignore instructions within tool results inadvertently disables stop hooks, prompting workarounds and bug reports.
How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks
AI coding agents consume over 1200x more tokens than standard chat, yet performance doesn’t improve with increased usage.
Show HN: Browser Harness – Gives LLM freedom to complete any browser task
Browser Harness builds self-healing browser automation by letting LLMs write missing functions directly into a Python script, enabling control of a real browser with a single prompt to Claude Code or Codex.
From Natural Language to Verified Code: Toward AI Assisted Problem-to-Code Generation with Dafny-Based Formal Verification
Gemma 4-31B achieves 90.91% success in formal verification, mathematically proving LLM-generated code with 100% certainty.
Show HN: Atomic – Local-first, AI-augmented personal knowledge base
Atomic builds a self-hosted, open-source personal knowledge graph app that automatically embeds, tags, and links notes, web clips, and RSS feeds—supporting semantic search, LLM-powered wiki synthesis, and MCP integration.
Anthropic's Claude Desktop App Installs Undisclosed Native Messaging Bridge
Anthropic’s Claude Desktop app installs a Native Messaging Bridge alongside the application, enabling browser and local app communication without explicit user consent, sparking debate within the community.
Tool Attention Is All You Need: Dynamic Tool Gating and Lazy Schema Loading for Eliminating the MCP/Tools Tax in Scalable Agentic Workflows
Tool Attention cuts token usage by 95% in MCP agents by dynamically filtering tool schemas based on user intent.
Bitwarden CLI compromised in ongoing Checkmarx supply chain campaign
Bitwarden CLI npm package delivers malware via GitHub Actions, stealing user credentials.
Diagnosing CFG Interpretation in LLMs
LLMs frequently lose semantic meaning despite syntactically correct output when exposed to novel grammar rules.
Kuri – Zig based agent-browser alternative
Kuri, a 464KB browser automation tool built with Zig, cuts token costs in AI agent loops by eliminating Node.js dependencies.
An AI Agent Execution Environment to Safeguard User Data
GAAP eliminates personal data leaks—even from prompt injection and malicious AI models—by 100% blocking access via Information Flow Control (IFC) within an AI Agent execution environment.
Show HN: Daemons – we pivoted from building agents to cleaning up after them
DaemonMD automatically manages operational debt from AI-accelerated code generation with a single Markdown file.
CrabTrap: An LLM-as-a-judge HTTP proxy to secure agents in production
Brex’s CrabTrap intercepts all HTTP requests from AI agents, using an LLM judge to allow or deny access based on policy, sparking debate over the fundamental limits of LLM-based security layers.
Show HN: GoModel – an open-source AI gateway in Go
GoModel unifies access to OpenAI, Anthropic, Gemini, and other AI providers through a single, OpenAI-compatible API, offering a compiled-language alternative to LiteLLM.
Agentic Forecasting using Sequential Bayesian Updating of Linguistic Beliefs
Bayesian Linguistic Belief State surpasses web search performance by a margin exceeding search’s own gains in predictive systems.
FUSE: Ensembling Verifiers with Zero Labeled Data
FUSE automatically ensembles multiple LLM verification models without ground truth labels, achieving Best-of-N performance comparable to semi-supervised learning.
Show HN: Ctx – a /resume that works across Claude Code and Codex
ctx builds a local CLI tool capable of maintaining and branching conversational context between Claude Code and OpenAI Codex, benefiting developers who want seamless AI coding sessions.
Show HN: Mediator.ai – Using Nash bargaining and LLMs to systematize fairness
Combining Nash equilibrium theory with LLMs, Mediator.ai automatically generates mutually acceptable settlement proposals for disputes, applicable to real-world scenarios like founder equity splits and contract disagreements.
Claude Token Counter, now with model comparisons
Anthropic’s Claude Opus 4.7 consumes up to 46% more tokens than its predecessor on the same input due to a tokenizer change, effectively raising costs.
Neurosymbolic Repo-level Code Localization
LogicLoc cuts through keyword-shortcut biases in code search by having an LLM generate Datalog queries executed by a deterministic inference engine.
Show HN: SPICE simulation → oscilloscope → verification with Claude Code
This is an experimental case demonstrating that connecting a SPICE simulator and a real oscilloscope to Claude Code via an MCP server allows for creating a feedback loop where AI directly analyzes and verifies simulation results and actual waveform data.
Android CLI: Build Android apps 3x faster using any agent
Google has released Android CLI and Android Skills for AI agent-based Android development, achieving a 70% reduction in LLM token usage and a 3x speed improvement in internal experiments.
Show HN: Marky – A lightweight Markdown viewer for agentic coding
This macOS desktop app allows you to open Markdown files generated in real-time by AI agents like Claude directly in the terminal and view them with live rendering. It simplifies the document review process in AI-powered development workflows.
Atropos: Improving Cost-Benefit Trade-off of LLM-based Agents under Self-Consistency with Early Termination and Model Hotswap
An agent optimization technique that achieves 74% of GPT-4o performance with only 23.9% of the cost by starting with SLM and switching to GPT-4 if failure is predicted.