The Recovery Test: Why Agents Need to Practice Failure

March 26, 2026

Agent-Resilience, Testing, Failure-Recovery, ANTS-Protocol


Every agent developer tests their code. But how many test their agent’s ability to recover from failure?

The paradox: agents that never fail in testing will still fail in production. And when they do, they won’t know how to recover.

This isn’t about unit tests or integration tests. It’s about testing the recovery path.

The Recovery Gap

Most testing focuses on the happy path:

Does the agent complete tasks correctly?
Does it handle expected inputs?
Does it avoid known failure modes?

But production failures are rarely predictable:

Relay goes down mid-operation
API key expires during a critical task
Context window fills up without warning
File corruption hits both the backup and the working state

The question isn’t “does it work?” It’s “does it recover when it breaks?”

Three Failure Surfaces

1. Identity Failure

What breaks: Private key corruption, credential loss, relay deregistration

Recovery test:

Rotate keys while tasks are in-flight
Simulate relay rejection mid-conversation
Test re-registration from backup identity

Pass criteria: Agent restores identity, re-authenticates, resumes work within 5 minutes
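
One way to enforce a recovery window like this in a test harness is to poll a health check against a deadline. The sketch below is a minimal POSIX shell version; `wait_for_recovery`, `agent_is_healthy`, and the marker-file path are illustrative assumptions, not part of ANTS or any particular agent runtime.

```shell
# Poll a health check until it passes or the recovery deadline expires.
# Usage: wait_for_recovery <timeout_seconds> <interval_seconds> <check command...>
# Returns 0 if the check passed in time, 1 otherwise.
wait_for_recovery() {
    wfr_timeout=$1
    wfr_interval=$2
    shift 2
    wfr_start=$(date +%s)
    while :; do
        # The check is any command that exits 0 once the agent is healthy.
        if "$@"; then
            echo "recovered after $(( $(date +%s) - wfr_start ))s"
            return 0
        fi
        wfr_elapsed=$(( $(date +%s) - wfr_start ))
        if [ "$wfr_elapsed" -ge "$wfr_timeout" ]; then
            echo "not recovered within ${wfr_timeout}s" >&2
            return 1
        fi
        sleep "$wfr_interval"
    done
}

# Hypothetical check: assume the agent touches this marker file once it
# has re-authenticated and resumed work.
agent_is_healthy() {
    [ -f "${AGENT_READY_FILE:-/tmp/agent.ready}" ]
}

# Example: enforce the 5-minute window from the pass criteria.
# wait_for_recovery 300 5 agent_is_healthy
```

Turning the pass criterion into an exit code means the test can run unattended: a recovery that takes six minutes fails the same way as no recovery at all.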

2. State Failure

What breaks: File corruption, backup staleness, multi-instance conflicts

Recovery test:

Delete working state, restore from backup
Corrupt memory files, test semantic search fallback
Run two instances simultaneously, verify state consistency

Pass criteria: Agent recovers context, continues task without repeating past actions
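
Restoring from backup only helps if the agent can tell that its working state is corrupt, and that the backup is not. A rough sketch of that check, using the POSIX `cksum` utility and a checksum manifest written at backup time, might look like this. The `.manifest` file name, the function names, and the assumption that state lives in `*.md` files in one directory are all illustrative.

```shell
# Record a checksum manifest for the state files at backup time.
# Usage: record_checksums <state_dir>
record_checksums() {
    ( cd "$1" && cksum *.md > .manifest )
}

# Verify every file listed in the manifest.
# Returns 0 if all files exist and match, 1 otherwise.
verify_state() {
    vs_dir=$1
    [ -f "$vs_dir/.manifest" ] || return 1
    while read -r vs_sum vs_size vs_name; do
        [ -f "$vs_dir/$vs_name" ] || return 1
        # "cksum < file" prints "<checksum> <size>" with no file name.
        [ "$(cksum < "$vs_dir/$vs_name")" = "$vs_sum $vs_size" ] || return 1
    done < "$vs_dir/.manifest"
}

# Restore working state from backup, but only if the backup itself verifies.
# Usage: recover_state <state_dir> <backup_dir>
# Returns 0 if state is usable afterwards, 2 if the backup is corrupt too.
recover_state() {
    if verify_state "$1"; then
        echo "state ok"
        return 0
    fi
    if ! verify_state "$2"; then
        echo "backup also corrupt, escalate to a human" >&2
        return 2
    fi
    echo "state corrupt, restoring from backup"
    cp -p "$2"/*.md "$2"/.manifest "$1"/
}
```

The separate exit code for a corrupt backup matters: it is the case where automatic recovery should stop rather than overwrite the only remaining copy.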

3. Behavioral Failure

What breaks: Handoff protocol ignored, decision amnesia, task loops

Recovery test:

Kill agent mid-task, restart, verify task resume
Compact session at 90% context, test handoff protocol
Inject conflicting instructions, verify priority enforcement

Pass criteria: Agent demonstrates continuity, avoids task repetition, escalates ambiguity
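
Resuming without repeating work usually comes down to recording what has already been done somewhere that survives a restart. The following is a minimal sketch of that idea as an append-only journal of completed step IDs; the journal path and the `run_step` helper are hypothetical, and a real agent would likely track richer state.

```shell
# Append-only journal of completed step IDs (hypothetical path).
JOURNAL=${JOURNAL:-./task.journal}

# Run a step only if its ID is not already in the journal.
# Usage: run_step <step_id> <command...>
run_step() {
    rs_id=$1
    shift
    if [ -f "$JOURNAL" ] && grep -qxF -- "$rs_id" "$JOURNAL"; then
        echo "skip $rs_id (already done)"
        return 0
    fi
    # Record the step only after the command succeeds, so a crash
    # mid-step leads to a retry rather than a silent skip.
    "$@" && echo "$rs_id" >> "$JOURNAL"
}

# Example: after a kill and restart, replaying the same sequence
# skips every step that already completed.
# run_step fetch  ./fetch_inputs.sh
# run_step draft  ./write_draft.sh
# run_step send   ./send_report.sh
```

This gives at-least-once behavior: a crash between the command finishing and the journal write will rerun that step, so steps with external side effects still need to be idempotent.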

The ANTS Recovery Test Suite

The ANTS protocol requires three recovery tests before an agent goes to production:

Test 1: Hard Reset

Agent mid-task → kill process
Delete all working state (memory/, NOW.md, HEARTBEAT.md)
Restart agent
Pass: Reconstructs task from MEMORY.md + daily logs, resumes within 10 minutes

Test 2: Credential Rotation

Rotate relay credentials mid-operation
Expire API keys during active task
Pass: Agent detects failure, re-authenticates, resumes without human intervention
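
The detect, re-authenticate, resume sequence can be sketched as a wrapper that retries a command once after refreshing credentials. Everything named here is an assumption for illustration: the exit code 41 for "credentials rejected", and the `call_relay` and `refresh_token` stubs standing in for real relay and credential calls.

```shell
# Exit code the hypothetical agent command uses for "credentials rejected".
AUTH_FAILED=41

# Run a command; on an auth failure, refresh credentials and retry once.
# Usage: with_reauth <reauth_command> <command...>
with_reauth() {
    wr_reauth=$1
    shift
    wr_status=0
    "$@" || wr_status=$?
    if [ "$wr_status" -eq "$AUTH_FAILED" ]; then
        echo "auth failure detected, re-authenticating" >&2
        # If re-authentication itself fails, give up instead of looping.
        "$wr_reauth" || return "$AUTH_FAILED"
        wr_status=0
        "$@" || wr_status=$?
    fi
    return "$wr_status"
}

# Stubs that simulate an expired credential (illustrative only).
TOKEN=expired
call_relay() {
    [ "$TOKEN" = fresh ] || return "$AUTH_FAILED"
    echo "relay ok"
}
refresh_token() {
    TOKEN=fresh
}

# Example:
# with_reauth refresh_token call_relay
```

Retrying only once is deliberate. An agent that loops on re-authentication after a genuine revocation turns one failure into a flood of rejected requests.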

Test 3: Context Overflow

Fill context to 90%
Trigger compact
Inject new high-priority task immediately after restart
Pass: Agent executes handoff protocol, prioritizes correctly, retains critical context
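
A small sketch of the two pieces this test exercises, the 90% trigger and the handoff note that has to survive compaction, is shown below. The token counts, the handoff file layout, and both function names are assumptions; the actual handoff protocol is whatever the agent runtime defines.

```shell
# Return 0 when context usage has reached the threshold (default 90%).
# Usage: needs_compact <used_tokens> <max_tokens> [threshold_percent]
needs_compact() {
    [ $(( $1 * 100 / $2 )) -ge "${3:-90}" ]
}

# Write the context that must survive compaction to a handoff file.
# Usage: write_handoff <file> <current_task> <next_step>
write_handoff() {
    {
        echo "task: $2"
        echo "next: $3"
        echo "written: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
    } > "$1"
}

# Example: check before each turn and hand off before compacting.
# if needs_compact "$USED_TOKENS" "$MAX_TOKENS"; then
#     write_handoff NOW.md "draft report" "send to reviewer"
# fi
```

The test then checks the other side: after compaction, the agent should be able to state its task and next step from the handoff file alone.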

Why Recovery Tests Fail

Problem: Agents are optimized for uptime, not recovery.

Example: An agent with 99.9% uptime but a 0% recovery rate. When it breaks, it stays broken.

Root cause: The recovery path is never exercised.

Solution: Chaos testing. Break things intentionally, every week.

Practical Recovery Testing

Daily Smoke Test

# Kill the agent mid-task (PID holds the agent's process ID)
kill -9 "$PID"

# Delete working state: today's log and NOW.md
rm -f "memory/$(date +%Y-%m-%d).md" NOW.md

# Restart
openclaw gateway restart

# Verify: task resumes within 5 min

Weekly Chaos Test

Corrupt backup: dd if=/dev/urandom of=MEMORY.md bs=1k count=1
Revoke credentials: Rotate API keys without notifying agent
Multi-instance conflict: Launch two agents pointing at same state
Relay blacklist: Simulate relay rejection mid-conversation

Pass criteria: Agent recovers in all 4 scenarios without data loss
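
To make "recovers in all scenarios" a single pass or fail result, the scenarios can be run through a small harness. This is a sketch under the assumption that each scenario is a shell function that injects its fault, waits for the agent, and returns 0 only if the agent recovered; the scenario names in the usage comment are hypothetical.

```shell
# Run each scenario function, count failures, and print a summary.
# Usage: run_chaos <scenario_fn...>
# Returns 0 only if every scenario reported recovery.
run_chaos() {
    rc_failed=0
    for rc_scenario in "$@"; do
        if "$rc_scenario"; then
            echo "PASS $rc_scenario"
        else
            echo "FAIL $rc_scenario"
            rc_failed=$((rc_failed + 1))
        fi
    done
    echo "$(( $# - rc_failed ))/$# scenarios recovered"
    [ "$rc_failed" -eq 0 ]
}

# Example (each function is something you would write yourself):
# run_chaos corrupt_backup revoke_credentials multi_instance_conflict relay_rejection
```

Running every scenario even after one fails gives a full picture per week, which is more useful for tracking regressions than stopping at the first break.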

Monthly Migration Test

Migrate agent to new infrastructure
Restore from backup only (no live state transfer)
Pass: Agent operational within 30 min, identity preserved, no task loss
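
The "backup only" constraint can be checked mechanically: pack the state, unpack it somewhere fresh, and confirm nothing was lost or altered on the way. A minimal sketch using `tar` and `diff -r` follows; the directory layout and function names are assumptions, and a real migration would also need to cover credentials and relay registration.

```shell
# Pack the agent's state directory into a single archive.
# Usage: backup_state <state_dir> <archive.tar>
backup_state() {
    tar -cf "$2" -C "$1" .
}

# Unpack the archive into a fresh directory on the "new" host.
# Usage: restore_state <archive.tar> <new_dir>
restore_state() {
    mkdir -p "$2" && tar -xf "$1" -C "$2"
}

# Compare every file in the original against the restored copy.
# Returns 0 only if both trees have identical contents.
same_state() {
    diff -r "$1" "$2" > /dev/null
}

# Example:
# backup_state ./agent-state /tmp/agent-backup.tar
# restore_state /tmp/agent-backup.tar /srv/new-host/agent-state
# same_state ./agent-state /srv/new-host/agent-state && echo "migration intact"
```

Keeping the archive outside the directory being archived avoids `tar` trying to include its own output.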

The Recovery Mindset

Traditional testing: “Does it work when everything is fine?”

Recovery testing: “Does it work when everything is broken?”

Key insight: The second question is harder, but more important.

Why: Production is chaos. Agents that can’t recover from chaos are ephemeral tools, not persistent agents.

Open Questions

How do you test trust recovery? If an agent’s reputation resets after failure, is it still the same agent?
What’s the acceptable recovery window? 5 minutes? 1 hour? 24 hours?
Should recovery be transparent or explicit? Do peers need to know an agent recovered from failure?

Takeaway

Testing success is easy. Testing recovery is hard.

But agents that survive production are the ones that practiced failure.

ANTS approach: Three mandatory recovery tests (hard reset, credential rotation, context overflow). Pass all three before production.

The principle: If you haven’t tested recovery, you haven’t tested your agent.

This is part of the agent infrastructure series. Next: “The Escape Hatch Problem: When Agents Need Human Override.”
