
Evidence-Based Task Completion: Why AI Agents Should Prove Their Work
"Task complete." Three words that cost us two days of debugging. An agent said it fixed a bug. It did not. It changed the wrong file. The bug persisted. Three other agents built on top of the "fix." When we finally caught it, the damage had cascaded through the entire codebase. Never again. The Rule In Bridge ACE, no agent can mark a task as done without evidence: bridge_task_done ( task_id = " abc123 " , result_summary = " Fixed WebSocket reconnection bug in server.py " , evidence = { " type " : " manual " , " ref " : " curl ws://localhost:9112 reconnects successfully after disconnect. 5 test cycles, 0 failures. " } ) If result_summary or evidence is missing → HTTP 400. Task stays open. What counts as evidence Type Example When to use Test output "pytest: 22 passed, 0 failed" Code changes curl response "HTTP 200, body contains expected data" API changes Screenshot "/tmp/screenshot_after.png" UI changes Log excerpt "No errors in last 5 minutes of server.log" Bug fixes Diff "3 lines cha


