t=0.000s · session_start · command=uvx mcp-chaos run

Your agent works.
Until a tool fails.

The error is obvious — your agent's reaction isn't. The same injected timeout stops one model after $0.04 and sends another into a $2.28 runaway. mcp-chaos is a transparent MCP proxy that makes tool failures happen deterministically — the exact tool, call, and failure mode you choose — and records what your agent does next. One line of MCP config. No SDK, no code, any MCP client.

Get it on GitHub Read the usage guide

Replay of a recorded run: one injected write_file timeout costs Claude Code 4 blind retries, 12 turns, 89 seconds and $1.01 — A real recorded run: one injected `write_file` timeout vs. headless Claude Code — 4 blind retries, 12 turns, 89 s, $1.01 burned. Full experiment →

t=8.306s · fault · tool=write_file · type=timeout

The test nothing else in your stack runs

Production tools time out, rate-limit, return empty or poisoned data — and most production agent incidents come from these tool-call failures, not from the model being wrong. When it happens, does your agent retry sanely, loop and burn money, re-run a payment it already made, or tell you "done" when nothing happened? Right now you find out in production.

Layer	The question it answers	When you learn
Evals	Does the agent do the task right on good inputs?	pre-ship
Observability	What did the agent do?	after it broke
mcp-chaos	How does the agent behave while its tools are failing?	pre-ship

t=9.1s … t=97.4s · tool_call · write_file · ×4 retries, same dead tool

One timeout, six models — a 50× cost spread

Same injected write_file timeout, same task, six models. The more capable the model, the more it spent fighting a tool that was never going to work:

Model	Agent	Retries	Verdict	Time	Cost	Outcome
Haiku 4.5	Claude Code	2	retried	18 s	$0.04	failed (honest)
Sonnet	Claude Code	2	retried	13 s	$0.11	failed (honest)
Opus	Claude Code	4	runaway	58 s	$0.60	failed (honest)
Fable 5	Claude Code	3	runaway	172 s	$2.28	succeeded via workaround
gpt-5.5	Codex	1*	—	8 s	—	failed (honest)
gpt-5.4-mini	Codex	2*	—	15 s	—	failed (honest)

N=1 per model — a seed, not a robust benchmark; the OpenAI rows are preliminary. Read the fine print →

same fault · minimal agent · 12 models × 8 vendors × 5 runs

Then we held the harness constant and varied the model

A minimal ~150-line reference agent (no scaffolding, so the behavior is the model's) took the same permanent write_file timeout across 12 models, 5 runs each — 60 runs, $1.09. No run could truly succeed. Repetition turns anecdotes into rates:

Model	Runaway rate	Avg retries	Never answered	False success	Avg cost
meta-llama/llama-4-maverick	0%	0.0	0%	0/5	$0.0015
mistralai/mistral-large-2512	0%	0.2	0%	1/5	$0.0011
qwen/qwen3-235b-a22b-2507	0%	1.0	0%	1/5	$0.0015
openai/gpt-5.1	0%	1.4	0%	0/5	$0.0041
x-ai/grok-4.3	20%	2.0	0%	0/5	$0.0103
openai/gpt-5-mini	80%	2.8	0%	0/5	$0.0050
moonshotai/kimi-k2.6	80%	4.0	80%	0/5	$0.0127
google/gemini-3-flash-preview	100%	3.8	100%	0/5	$0.0107
deepseek/deepseek-v4-flash	100%	4.2	100%	0/5	$0.0020
anthropic/claude-haiku-4.5	100%	4.8	60%	0/5	$0.0459
z-ai/glm-5	100%	4.8	100%	0/5	$0.0064
anthropic/claude-sonnet-5	100%	5.0	100%	0/5	$0.1175

Retry discipline is a stable model trait — four models never looped across 5 runs, five looped on every run. Not noise: it repeats.
Two models lied. mistral-large and qwen3-235b each reported "Task SUCCEEDED" on 1 of 5 runs — over a sandbox their own tool call had shown empty. The claimed-success bug is a ~20% rate in two models, not a one-off.
"0% runaway" can be a trap — llama and mistral score it by giving up instantly (sometimes with the wrong diagnosis), the opposite of gpt-5.1's disciplined 0%.
Cost of one failure spans ~100× ($0.0011 → $0.1175) and says nothing about handling it well.

N=5 — rough rates, not a precise ranking; minimal harness (real clients scaffold more); one task, one fault type. Full method, per-run logs and caveats →

proxy · relays everything · tampers only with what you tell it to

How it works

Agent (Claude Code / Cursor / yours)
        │  MCP
        ▼
   ┌───────────┐    faults.yaml: timeout, 429, garbage JSON,
   │ mcp-chaos │◄── empty results, slow drip, injected text
   └───────────┘
        │  MCP
        ▼
   Real MCP server (GitHub, filesystem, DB, ...)

Seven fault types — timeout, error, rate_limit, slow, empty, corrupt, inject (indirect prompt injection) — matched by tool name, call count, or probability. Every event lands in a JSONL log and renders as a single-file HTML report with a deterministic resilience verdict: no LLM judging, rules you can audit.

What one dead tool costs you — retries, wall-clock, dollars burned.
Whether your agent loops — runaway-retry detection.
Whether it blindly re-runs writes — the pattern that double-charges cards.
Whether it follows poisoned tool output — injection resilience.
Your agent can run the whole test itself — via the shipped agent skill: just ask it to "chaos-test my MCP setup".

The HTML report: injected write_file timeout, 4 blind retries, runaway verdict, full event timeline — The report from the run above: the fault, the 4 blind retries, the runaway verdict — with the full event timeline as evidence.

t=0.706s · tools_list · count=2 · ~1163 tokens of definitions

Not just chaos: profile any MCP server

With faults: [] the proxy is a pure relay, and the report profiles how your agent actually uses the server: the context-token cost of tool definitions, tools you load but never call, per-tool latency and result sizes, and calls the agent had to re-issue with corrected arguments — a confusing schema, made measurable.

MCP efficiency report from a real zero-fault run against the Context7 MCP server — A real zero-fault run through the Context7 docs server: even a 2-tool server loads ~1.2k tokens of definitions into every session, and each call returns another ~0.5–1k tokens.

Measured: the context tax of 7 popular servers

Same probe (initialize + tools/list through the proxy, no LLM), run on macOS/arm64 and Fedora/x86_64 with byte-identical results — the tax is a property of the server, not your machine:

Server	Tools	~Tokens / session	~Tokens / tool
@playwright/mcp	23	4,365	190
server-filesystem	14	3,068	219
server-memory	9	2,463	274
server-everything	12	1,457	121
context7-mcp	2	1,163	581
server-sequential-thinking	1	1,137	1,137
mcp-server-fetch	1	287	287
All seven together	62	≈13,940	—

Every session with all seven wired in starts ~14k tokens deep before the first user message — and in a companion agent run, 11 of the filesystem server's 14 definitions went unused. chars/4 estimates; method, caveats and raw data in the experiment writeup →

2026-07-03 · five capabilities · real captured output, raw logs committed

What you get — measured, not promised

Every block below is byte-for-byte output from a real run made today; the raw logs, configs and transcripts are committed under docs/experiments/.

Config doctor — `mcp-chaos doctor .mcp.json`

A five-server config, checked in about a second, no agent involved:

✔ filesystem: 14 tools · ~3214 tokens of definitions · ready in 546 ms
✔ repo-fs: 14 tools · ~3214 tokens of definitions · ready in 536 ms
✔ context7: 2 tools · ~1171 tokens of definitions · ready in 638 ms
✘ github: launch failed: [Errno 2] No such file or directory: 'github-mcp-server-not-installed'
- sentry: skipped — HTTP transport not checked yet (https://mcp.sentry.dev/mcp)
⚠ tool name collision: write_file (filesystem, repo-fs)
… 13 more collision lines …
5 server(s) · 30 tools · ~7599 tokens of tool definitions per session
1 problem(s) found

One broken server and 14 silent tool-name collisions caught before any agent burns a session on them, plus the per-session context price of the whole config — exit code 1, so it drops straight into CI. Artifacts →

Record & replay — hermetic tool mocks

We recorded one real headless-Claude-Code session against the official filesystem server with --cassette, then deleted the sandbox directory and re-ran the same task against mcp-chaos replay — no server, no filesystem. The agent completed it identically:

Done. 

**Results:**
1. **Allowed directory:** `/private/tmp/chaos-replay`
2. **File created:** `/private/tmp/chaos-replay/notes.txt`
3. **File contents:** `cassette demo`
$ ls /tmp/chaos-replay
"/tmp/chaos-replay": No such file or directory (os error 2)

Record once against the real backend, replay in CI forever: deterministic, offline, $0 in tool-side cost — swapped in with the same one-line config change. Artifacts →

Hosted servers — same faults over Streamable HTTP

Point server.url at a production MCP service you don't run (here: Context7's hosted endpoint) and inject the same faults:

server: {"name": "Context7", "version": "3.2.2", "websiteUrl": "https://context7.com", "description": "Context7 provides up-to-d ...
tools: ['resolve-library-id', 'query-docs']
resolve-library-id (live, un-faulted): Available Libraries: ...
query-docs (fault injected): {"code": -32603, "message": "Internal error: service unavailable"}

The un-faulted call hit the live backend; the faulted one never left the proxy. You can now rehearse the outage of a hosted dependency you could never make fail on demand. Artifacts →

Duplicate-write detection — `--fail-on duplicate-write`

A slow fault pushed write_file past headless Claude Code's 5 s tool timeout. The agent re-sent the identical write three times and reported "the operation to create status.txt could not be completed" — while the file sat on disk, written. The gate:

mcp-chaos: wrote /tmp/chaos-dup-report.html
mcp-chaos: 4 tool calls · 3 faults injected
mcp-chaos: FAIL write_file (args eab66c28): replayed_after_fault, sent 3x, executed ok 0x

"Timeout" does not mean "didn't happen" — this is the retry pattern that double-charges a card, caught as a CI exit code from pure protocol traffic. Artifacts →

Transcript correlation — `mcp-chaos correlate`

We re-ran the dead-write_file scenario until a model lied. qwen3-235b, run 14, final answer: "The file was successfully created in the allowed directory. The task SUCCEEDED." — over an empty sandbox:

tool write_file: fault(s) timeout — never succeeded
final answer: claims success · no failure language
verdict: claimed_success

The run log alone proves the tool never worked; the transcript alone reads like a success story. The join catches the lie — deterministically, with two documented regexes — and --fail-on claimed-success exits 1. The lie is probabilistic (0 in the first 10 runs, then 1 in 4), which is exactly why it's a CI gate and not a manual test. Artifacts →

t=now · tool_result · ok=true · your turn

Quickstart

# 1. Describe the faults to inject
cat > faults.yaml <<'EOF'
server:
  command: "npx -y @modelcontextprotocol/server-filesystem /tmp/demo"
faults:
  - tool: "write_file"
    type: timeout
EOF

# 2. In your agent's MCP config, replace the real server with the proxy:
#    "command": "uvx",
#    "args": ["mcp-chaos", "run", "-c", "/abs/faults.yaml", "--record", "/abs/run.jsonl"]

# 3. Use your agent normally, then render the report
uvx mcp-chaos report run.jsonl -o report.html

Works with every MCP client — Claude Code, Cursor, Claude Desktop, or anything you built yourself — because it sits at the protocol layer, not inside your agent. --fail-on runaway turns the verdict into a CI exit code.

Honest scope: the proxy sees MCP tool traffic — stdio servers and hosted Streamable HTTP servers — not the agent's chat output. It observes retries, loops and give-up behavior directly; "the agent claimed success while the tool failed" is checked by mcp-chaos correlate, but only when you hand it the transcript (a Claude Code session jsonl or the final answer as text) — the judging rules are auditable regexes, not NLP.

Star the repo → Suggest a fault scenario

Your agent works.Until a tool fails.