Workflow Time Travel: Snapshots, Deployment Versions, and Rolling Back a Bad Agent at 03:00

A workflow is a DAG. A production deploy of that DAG is a time bomb.

It will go off. The only variable is when. Maybe the new agent block calls a slightly different model and the JSON output starts coming back with an extra wrapping object. Maybe the upgraded HTTP block strips a header your downstream service requires. Maybe a junior engineer renamed a variable from customer_id to customerId, and three downstream blocks now resolve it to null. By the time the on-call engineer sees the alert, the production workflow has run 4,200 times, half of them silently wrong.

The fix isn't "be more careful." The fix is to build the workflow engine so it remembers everything.

Every Production Deploy Is a Bomb

Workflow regressions are so painful because they have no obvious failure mode. A bad code deploy throws 500s, and you see it. A bad workflow deploy keeps responding 200 OK while the contents drift. A Slack message gets sent to the wrong channel. A refund gets issued for the wrong amount. The webhook fires, the response is structurally valid, and nothing alerts.

By the time you figure out which deploy caused it, you have already lost the ability to compare "what the workflow was" against "what the workflow is now," because the definition in the editor has been overwritten dozens of times. You can see the latest version. You cannot see what ran at 14:32 yesterday.
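
To make "what ran at 14:32 yesterday" answerable, the engine has to record every deploy instead of overwriting it. The following is a minimal sketch of that idea, not the engine's actual storage: the names DeploymentVersion, DeploymentLog, deploy, and active_at are hypothetical, the workflow definition is assumed to be a JSON-serializable dict, and the dates are made up for illustration.

```python
from __future__ import annotations

import bisect
import hashlib
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class DeploymentVersion:
    """One immutable record per production deploy (hypothetical shape)."""
    version: int
    deployed_at: datetime
    digest: str    # SHA-256 of the canonical snapshot, handy for quick comparison
    snapshot: str  # canonical JSON of the workflow definition at deploy time


@dataclass
class DeploymentLog:
    """Append-only deploy history; nothing here is ever overwritten."""
    _versions: list[DeploymentVersion] = field(default_factory=list)

    def deploy(self, definition: dict, at: datetime) -> DeploymentVersion:
        if self._versions and at < self._versions[-1].deployed_at:
            raise ValueError("deploys must be recorded in chronological order")
        # Serializing freezes the definition: later edits to the dict in the
        # editor cannot change what was recorded.
        snapshot = json.dumps(definition, sort_keys=True)
        digest = hashlib.sha256(snapshot.encode("utf-8")).hexdigest()
        record = DeploymentVersion(len(self._versions) + 1, at, digest, snapshot)
        self._versions.append(record)
        return record

    def active_at(self, when: datetime) -> DeploymentVersion | None:
        """Return the version that was live at `when`, or None if none was."""
        times = [v.deployed_at for v in self._versions]
        i = bisect.bisect_right(times, when)
        return self._versions[i - 1] if i else None


if __name__ == "__main__":
    log = DeploymentLog()
    draft = {"blocks": {"lookup": {"input": "customer_id"}}}
    log.deploy(draft, datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc))

    # The rename from the example above, deployed a day later.
    draft["blocks"]["lookup"]["input"] = "customerId"
    log.deploy(draft, datetime(2024, 1, 2, 9, 0, tzinfo=timezone.utc))

    ran = log.active_at(datetime(2024, 1, 1, 14, 32, tzinfo=timezone.utc))
    print(ran.version, json.loads(ran.snapshot)["blocks"]["lookup"]["input"])
```

Because each snapshot is stored as a serialized string, mutating the draft after a deploy leaves earlier records untouched, so the lookup still returns the definition that was actually live at the requested time.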

Two Tables, One Time Machine

Reproducing a Bad Run Exactly

Rolling Back a Bad Agent at 03:00

Copy-on-Write for Snapshot Storage

Diff Draft Against Prod

Time Travel Is the New Default