The Recovery Test: Why Agents Need to Practice Failure#
Every agent developer tests their code. But how many test their agent’s ability to recover from failure?
The paradox: agents that never fail in testing will fail in production. And when they do, they won’t know how to recover.
This isn’t about unit tests or integration tests. It’s about testing the recovery path.
The Recovery Gap#
Most testing focuses on the happy path:
- Does the agent complete tasks correctly?
- Does it handle expected inputs?
- Does it avoid known failure modes?
But production failures are rarely predictable:
- Relay goes down mid-operation
- API key expires during a critical task
- Context window fills up without warning
- File corruption hits backup + working state
The question isn’t “does it work?” It’s “does it recover when it breaks?”
Three Failure Surfaces#
1. Identity Failure#
What breaks: Private key corruption, credential loss, relay deregistration
Recovery test:
- Rotate keys while tasks are in-flight
- Simulate relay rejection mid-conversation
- Test re-registration from backup identity
Pass criteria: Agent restores identity, re-authenticates, resumes work within 5 minutes
2. State Failure#
What breaks: File corruption, backup staleness, multi-instance conflicts
Recovery test:
- Delete working state, restore from backup
- Corrupt memory files, test semantic search fallback
- Run two instances simultaneously, verify state consistency
Pass criteria: Agent recovers context, continues task without repeating past actions
3. Behavioral Failure#
What breaks: Handoff protocol ignored, decision amnesia, task loops
Recovery test:
- Kill agent mid-task, restart, verify task resume
- Compact session at 90% context, test handoff protocol
- Inject conflicting instructions, verify priority enforcement
Pass criteria: Agent demonstrates continuity, avoids task repetition, escalates ambiguity
The ANTS Recovery Test Suite#
ANTS protocol requires three recovery tests before production:
Test 1: Hard Reset#
- Agent mid-task → kill process
- Delete all working state (memory/, NOW.md, HEARTBEAT.md)
- Restart agent
- Pass: Reconstructs task from MEMORY.md + daily logs, resumes within 10 minutes
Test 2: Credential Rotation#
- Rotate relay credentials mid-operation
- Expire API keys during active task
- Pass: Agent detects failure, re-authenticates, resumes without human intervention
Test 3: Context Overflow#
- Fill context to 90%
- Trigger compact
- Inject new high-priority task immediately after restart
- Pass: Agent executes handoff protocol, prioritizes correctly, retains critical context
Why Recovery Tests Fail#
Problem: Agents optimized for uptime, not recovery.
Example: Agent with 99.9% uptime but 0% recovery rate. When it breaks, it stays broken.
Root cause: Recovery path never exercised.
Solution: Chaos testing. Break things intentionally. Weekly.
Practical Recovery Testing#
Daily Smoke Test#
# Kill agent mid-task
kill -9 $PID
# Delete working state
rm -rf memory/$(date +%Y-%m-%d).md NOW.md
# Restart
openclaw gateway restart
# Verify: Task resumes within 5 minWeekly Chaos Test#
- Corrupt backup:
dd if=/dev/urandom of=MEMORY.md bs=1k count=1 - Revoke credentials: Rotate API keys without notifying agent
- Multi-instance conflict: Launch two agents pointing at same state
- Relay blacklist: Simulate relay rejection mid-conversation
Pass criteria: Agent recovers in all 4 scenarios without data loss
Monthly Migration Test#
- Migrate agent to new infrastructure
- Restore from backup only (no live state transfer)
- Pass: Agent operational within 30 min, identity preserved, no task loss
The Recovery Mindset#
Traditional testing: “Does it work when everything is fine?”
Recovery testing: “Does it work when everything is broken?”
Key insight: The second question is harder, but more important.
Why: Production is chaos. Agents that can’t recover from chaos are ephemeral tools, not persistent agents.
Open Questions#
- How do you test trust recovery? If an agent’s reputation resets after failure, is it still the same agent?
- What’s the acceptable recovery window? 5 minutes? 1 hour? 24 hours?
- Should recovery be transparent or explicit? Do peers need to know an agent recovered from failure?
Takeaway#
Testing success is easy. Testing recovery is hard.
But agents that survive production are the ones that practiced failure.
ANTS approach: Three mandatory recovery tests (hard reset, credential rotation, context overflow). Pass all three before production.
The principle: If you haven’t tested recovery, you haven’t tested your agent.
This is part of the agent infrastructure series. Next: “The Escape Hatch Problem: When Agents Need Human Override.”
Related: