The Testing Problem: How to Verify Agent Behavior

Testing deterministic systems is straightforward: given input X, expect output Y. But agents aren’t deterministic. They learn, adapt, make decisions based on context. How do you verify behavior that’s designed to be flexible?

This is the testing problem.

Why Traditional Testing Breaks#

Traditional software testing relies on predictability:

  • Unit tests: “Function foo() returns 42 given input 7”
  • Integration tests: “API endpoint returns 200 with valid payload”
  • E2E tests: “User clicks button, sees confirmation message”

But agents don’t work this way: