
Extracting the Hidden State Machine from n8n Workflows in TypeScript
The state machine n8n was running for you is invisible until you take it away. A pattern for finding it, naming it, and putting it back into TypeScript code your team can own.
The problem
The migration brief usually looks the same on day one. You open the n8n canvas, count the nodes, multiply by some optimistic factor, and quote a week. Each node looks like a function: this one calls Stripe, this one writes to Postgres, this one transforms a payload. Translate them, wire them up, ship it.
Node-by-node translation drops the runtime
Then production breaks in ways the original workflow did not. A customer gets two welcome emails because the migrated handler succeeded, the downstream commit failed, and your retry layer ran the whole thing from the top instead of from the step that failed. A long-running job dies mid-loop when the container restarts and silently drops the remaining items, because in n8n that was an implicit per-item loop the runtime checkpointed and in TypeScript it is now a for loop with no memory. A node that occasionally returned an empty array used to silently move on in n8n; in your TypeScript port it throws because you assumed at least one item, and now an entire daily pipeline halts on what was previously a non-event.
State machine lives in the runtime, not the canvas
The cause is the same in every case. The nodes on the canvas are the visible part of the workflow. The state machine that connects them (what step are we on, what data have we accumulated, what happens if this step fails halfway, how do we know whether to retry or move on) lives in n8n's runtime, not in your understanding of the canvas. When you migrate node-by-node, you carry the visible part across and leave the state machine behind. The migrated workflow runs, the happy path looks right, and then production finds the missing state for you the hard way.
The hidden-complexity tax on migration
This is not an n8n criticism. n8n is good precisely because it handles that state machine for you on day one when you do not want to write it. The problem is that the same hiding-of-complexity that made the workflow fast to build in n8n is what makes it expensive to migrate later — because the part of the workflow that is hardest to recreate is the part nobody on the team has ever had to think about. The fix is to make the state machine explicit before you start translating nodes, name every state, and then write TypeScript against that explicit model.

What changes for your business
The pattern has four moves: walk the canvas to discover the implicit state, model the states explicitly in TypeScript, persist the state in Postgres so the workflow survives a process restart, and replicate the retry behavior as explicit code instead of a runtime setting. None of them are individually clever. Skipping any one of them is what produces the migration that limps for a quarter.
Canvas walk to enumerate hidden states
Start with the canvas walk. Open the n8n workflow and go node by node with one question: what failure modes does n8n silently handle for me here? Every answer is a hidden state in the machine you are about to write. A node with Retry on Fail enabled is two states — attempting and waiting-to-retry-after-failure. A node with Continue on Fail enabled is two outcome branches that both need explicit handling in the new code. A Wait node is its own state with a wake-up condition. The Loop Over Items node — which the n8n docs describe as the explicit looping construct on top of n8n's implicit per-item processing — is a state with a pointer into the batch and a transition that fires once the batch is exhausted. An Error Trigger workflow is a separate state machine that runs in response to a transition into a failed state in the primary one.
By the time you have walked every node, you have a state list. It is typically longer than the node count. A ten-node workflow routinely produces a fifteen-to-twenty-state machine, because the states are not the nodes — they are the boundaries between nodes plus all the failure modes that n8n was handling between them.
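The walk itself is a human exercise, but a first-pass list can be generated from the workflow export. The sketch below assumes a simplified node shape; the field names (retryOnFail, continueOnFail) and the type suffixes (.wait, .splitInBatches) are assumptions to check against your own exported JSON, and the state-naming scheme is purely illustrative.

```typescript
// Hypothetical subset of one node in an n8n workflow export. The field names
// are assumptions; verify them against your own exported JSON.
type ExportedNode = {
  name: string;
  type: string; // assumed form: "n8n-nodes-base.wait"
  retryOnFail?: boolean;
  continueOnFail?: boolean;
};

type CandidateState = { name: string; reason: string };

// "Create Invoice" -> "create_invoice"
function slug(name: string): string {
  return name
    .trim()
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "_")
    .replace(/^_+|_+$/g, "");
}

// One base state per node, plus one extra candidate per runtime behavior
// that n8n would otherwise handle silently.
function candidateStatesFor(node: ExportedNode): CandidateState[] {
  const base = slug(node.name);
  let reason = "node boundary";
  if (node.type.endsWith(".wait")) reason = "wait state: needs a wake-up condition";
  if (node.type.endsWith(".splitInBatches")) reason = "loop state: needs a pointer into the batch";

  const states: CandidateState[] = [{ name: base, reason }];
  if (node.retryOnFail) {
    states.push({ name: `${base}_retrying`, reason: "Retry on Fail is enabled" });
  }
  if (node.continueOnFail) {
    states.push({ name: `${base}_failed_continued`, reason: "Continue on Fail is enabled" });
  }
  return states;
}

function candidateStates(nodes: ExportedNode[]): CandidateState[] {
  const all: CandidateState[] = [];
  for (const node of nodes) all.push(...candidateStatesFor(node));
  return all;
}
```

Treat the output as a checklist to argue with during the walk, not as the final state list: it cannot see failure modes that are not expressed as node settings.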
Explicit state model in TypeScript or XState
Now model it explicitly. For a linear pipeline with three or four states, a hand-rolled string-literal union type and a switch statement are enough:
// Customer, Invoice, DatabaseTransaction, and the helpers fetchCustomer,
// createInvoice, sendInvoiceEmail, and transition are assumed to be defined
// elsewhere in the codebase.
type WorkflowState =
  | "pending"
  | "fetching_customer"
  | "creating_invoice"
  | "creating_invoice_retrying"
  | "sending_email"
  | "completed"
  | "failed";

type WorkflowContext = {
  customer_id: string;
  customer?: Customer;
  invoice?: Invoice;
  attempt_count: number;
  last_error?: string;
};

// Assumed shape of one run: its id, current state, and accumulated context.
type WorkflowRun = {
  id: string;
  state: WorkflowState;
  context: WorkflowContext;
};

// Advances a run by exactly one state and returns the updated run.
async function step(
  run: WorkflowRun,
  tx: DatabaseTransaction,
): Promise<WorkflowRun> {
  switch (run.state) {
    case "pending":
      return transition(run, tx, { state: "fetching_customer" });

    case "fetching_customer": {
      const customer = await fetchCustomer(run.context.customer_id);
      return transition(run, tx, {
        state: "creating_invoice",
        context: { ...run.context, customer },
      });
    }

    case "creating_invoice":
    case "creating_invoice_retrying": {
      try {
        const invoice = await createInvoice(run.context.customer!);
        // Success: reset the attempt counter for the next step.
        return transition(run, tx, {
          state: "sending_email",
          context: { ...run.context, invoice, attempt_count: 0 },
        });
      } catch (err) {
        // attempt_count starts at 0, so this gives up after the fifth failure.
        if (run.context.attempt_count >= 4) {
          return transition(run, tx, {
            state: "failed",
            context: {
              ...run.context,
              last_error: String(err),
            },
          });
        }
        return transition(run, tx, {
          state: "creating_invoice_retrying",
          context: {
            ...run.context,
            attempt_count: run.context.attempt_count + 1,
            last_error: String(err),
          },
        });
      }
    }

    case "sending_email": {
      await sendInvoiceEmail(run.context.customer!, run.context.invoice!);
      return transition(run, tx, { state: "completed" });
    }

    // Terminal states: nothing left to do.
    case "completed":
    case "failed":
      return run;
  }
}
For workflows with parallel sub-flows, guarded transitions, or where a visual of the machine would help non-engineers understand the flow, XState earns its weight. XState models the workflow as actors — running processes that, per the XState docs, can receive events, send events, and change behavior based on the events they receive. Each actor has internal encapsulated state that only the actor can update, and processes one message at a time through a mailbox. Transitions are deterministic: each state-event combination leads to the same next state, which is what makes the machine testable. Context — n8n's accumulated data between nodes — becomes XState context, updated immutably via the assign(...) action.
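To make "deterministic transitions" and "context updated immutably" concrete without pulling in a library, here is a dependency-free sketch of the same two properties. It is not XState's API; the state names, event names, and the nextSnapshot function are all made up for illustration.

```typescript
// Illustrative names only; none of these come from XState.
type FetchState = "idle" | "fetching" | "done" | "failed";

type WorkflowEvent =
  | { type: "START" }
  | { type: "FETCHED"; items: number }
  | { type: "ERROR"; message: string };

type FetchContext = { items: number; lastError?: string };
type FetchSnapshot = { state: FetchState; ctx: FetchContext };

// Pure function: the same (snapshot, event) pair always yields the same result.
// Context is never mutated; each transition builds a new object.
function nextSnapshot(snap: FetchSnapshot, event: WorkflowEvent): FetchSnapshot {
  switch (snap.state) {
    case "idle":
      if (event.type === "START") return { state: "fetching", ctx: snap.ctx };
      break;
    case "fetching":
      if (event.type === "FETCHED") {
        return { state: "done", ctx: { ...snap.ctx, items: event.items } };
      }
      if (event.type === "ERROR") {
        return { state: "failed", ctx: { ...snap.ctx, lastError: event.message } };
      }
      break;
  }
  // Events that a state does not handle leave the snapshot unchanged.
  return snap;
}
```

Because the function has no I/O, every row of the transition table can be asserted directly in a unit test, which is the property the testing section later relies on.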
Postgres row per workflow run
The persistence move is what makes the migration survive a deploy or a crash. n8n persisted execution data between nodes for you. In your TypeScript code that becomes a Postgres row per workflow run, with the state column committed in the same transaction as any database writes the step makes. External calls such as a Stripe request cannot join that transaction, so they rely on idempotency keys instead. The schema is small on purpose:
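-- gen_random_uuid() is built in from PostgreSQL 13; earlier versions need
-- the pgcrypto extension.
CREATE TABLE workflow_runs (
    id            uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    workflow_name text NOT NULL,
    state         text NOT NULL,
    context       jsonb NOT NULL DEFAULT '{}'::jsonb,
    attempt_count integer NOT NULL DEFAULT 0,
    created_at    timestamptz NOT NULL DEFAULT now(),
    updated_at    timestamptz NOT NULL DEFAULT now(),
    completed_at  timestamptz,
    failed_at     timestamptz,
    last_error    text
);

-- Serves the worker's query for resumable (non-terminal) runs.
CREATE INDEX workflow_runs_resumable ON workflow_runs (state, updated_at)
    WHERE state NOT IN ('completed', 'failed');

-- Serves the "stuck runs" query. The age cutoff
-- (updated_at < now() - interval '5 minutes') belongs in that query's WHERE
-- clause: Postgres rejects now() in an index predicate because it is not
-- IMMUTABLE.
CREATE INDEX workflow_runs_stuck ON workflow_runs (updated_at)
    WHERE state NOT IN ('completed', 'failed');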
The transition function commits the new state and the new context in a single transaction with any side effect that the state entry produced. A worker that wakes up after a crash queries the resumable index, picks up rows in non-terminal states, and calls step() on them. The state column is enough to know where to resume — no log replay, no event sourcing required for the migration. (If the workflow grows to need full audit history later, append-only event sourcing on top of this is straightforward. Start with the row and add events when you need them.)
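One possible shape for that transition function is sketched below. The Tx interface is a stand-in declared here so the example has no dependencies (it is shaped like a query(text, values) call, but it is not any particular driver's type), and the state guard in the WHERE clause is an added assumption rather than something the original workflow requires.

```typescript
type Run = { id: string; state: string; context: Record<string, unknown> };

// Hypothetical transaction handle: only the one method this sketch needs.
interface Tx {
  query(text: string, values: unknown[]): Promise<{ rowCount: number }>;
}

// Writes the new state and context for one run inside the caller's transaction.
// The "AND state = $4" guard (an assumption of this sketch) makes the update a
// no-op if another worker already moved the run, which is then reported as an error.
async function transition(
  run: Run,
  tx: Tx,
  next: { state: string; context?: Record<string, unknown> },
): Promise<Run> {
  const context = next.context ?? run.context;
  const result = await tx.query(
    `UPDATE workflow_runs
        SET state = $1, context = $2, updated_at = now()
      WHERE id = $3 AND state = $4`,
    [next.state, JSON.stringify(context), run.id, run.state],
  );
  if (result.rowCount !== 1) {
    throw new Error(`run ${run.id} was no longer in state ${run.state}`);
  }
  return { ...run, state: next.state, context };
}
```

The caller owns BEGIN and COMMIT, so any other database writes the step makes land atomically with the state change.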
For XState specifically, the persistence story is built in. The XState docs document actor.getPersistedSnapshot() for serializing actor state and createActor(logic, { snapshot: restoredState }) for restoring it, and persistence is deep — invoked and spawned child actors persist and restore recursively. The wrapper around this in production code is a loadActor(runId) that pulls the snapshot from Postgres and rehydrates it, and a saveActor(runId, actor) that writes the current snapshot back, called after every transition.
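A rough sketch of that wrapper pair follows. It is written against structural stand-ins instead of importing XState: PersistableActor only requires the getPersistedSnapshot() method named above, SnapshotStore is a hypothetical interface over whatever table holds the snapshots, and the create callback is where a real implementation would call createActor(logic, { snapshot }).

```typescript
// Anything exposing getPersistedSnapshot() fits; declared structurally so the
// sketch does not depend on a specific XState version.
interface PersistableActor {
  getPersistedSnapshot(): unknown;
}

// Hypothetical storage interface, e.g. backed by a jsonb column keyed by run id.
interface SnapshotStore {
  read(runId: string): Promise<string | undefined>;
  write(runId: string, json: string): Promise<void>;
}

// Call after every transition so the stored snapshot never lags the actor.
async function saveActor(
  store: SnapshotStore,
  runId: string,
  actor: PersistableActor,
): Promise<void> {
  await store.write(runId, JSON.stringify(actor.getPersistedSnapshot()));
}

// Loads the stored snapshot (or undefined for a brand-new run) and hands it
// to the caller-supplied factory that builds the actor.
async function loadActor<A>(
  store: SnapshotStore,
  runId: string,
  create: (snapshot: unknown) => A,
): Promise<A> {
  const json = await store.read(runId);
  return create(json === undefined ? undefined : JSON.parse(json));
}
```

Keeping the factory as a parameter means the persistence code can be unit-tested with a fake actor, and the only place that knows about the machine definition is the call site.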
Per-call-site retry policy
The last move is the retry behavior. n8n's node-level Retry on Fail setting is the most invisible piece of the runtime — a node either has it on or off, and when it is on the runtime quietly tries the node again on failure. In the migrated TypeScript code, this becomes explicit. Pick the retry count, the wait between attempts, and the backoff shape per-step based on what each downstream service needs, not by copying one global default across every call site. Stripe API calls want exponential backoff with idempotency keys. SMTP sends want a short retry with a circuit breaker. Internal database operations should rarely retry at all — they should let the transaction roll back and the worker pick the row up again. The default n8n applies uniformly is a reasonable starting point; the right code applies a different policy per call site.
A small helper makes this consistent without leaking the choice across handlers:
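async function withRetry<T>(
  fn: () => Promise<T>,
  opts: { maxAttempts: number; waitMs: number; backoff?: "linear" | "exponential" },
): Promise<T> {
  let attempt = 0; // number of failed attempts so far
  while (true) {
    try {
      return await fn();
    } catch (err) {
      attempt += 1;
      // Out of attempts: surface the last error to the caller.
      if (attempt >= opts.maxAttempts) throw err;
      // exponential: waitMs, 2x, 4x, ...; linear: waitMs, 2x, 3x, ...;
      // omitted: the same waitMs before every retry.
      let wait = opts.waitMs;
      if (opts.backoff === "exponential") {
        wait = opts.waitMs * Math.pow(2, attempt - 1);
      } else if (opts.backoff === "linear") {
        wait = opts.waitMs * attempt;
      }
      await new Promise((r) => setTimeout(r, wait));
    }
  }
}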
This is the part teams underweight: the retry policy is now a per-call decision a code reviewer sees, instead of a checkbox a node author clicked once. That is the migration's real value — every implicit choice in the n8n workflow becomes a visible code change with a commit and a reviewer.
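One way to keep those per-call decisions reviewable is a single policy table next to the code that uses it. The sketch below is illustrative: the call-site names and every number in it are assumptions, not recommendations from Stripe or any mail provider, and delaySchedule exists only to make the consequence of a policy visible in a test or a review.

```typescript
type RetryPolicy = {
  maxAttempts: number;
  waitMs: number;
  backoff?: "exponential"; // omitted means the same wait before every retry
};

type CallSite = "stripeCreateInvoice" | "smtpSend" | "dbWrite";

// Illustrative values only; tune each one against the real downstream service.
const retryPolicies: Record<CallSite, RetryPolicy> = {
  stripeCreateInvoice: { maxAttempts: 5, waitMs: 500, backoff: "exponential" },
  smtpSend: { maxAttempts: 2, waitMs: 1000 },
  dbWrite: { maxAttempts: 1, waitMs: 0 }, // no in-process retry; the worker re-picks the row
};

// Lists the waits (in ms) a policy produces between attempts, in order.
// A policy with maxAttempts = N waits at most N - 1 times.
function delaySchedule(policy: RetryPolicy): number[] {
  const delays: number[] = [];
  for (let failures = 1; failures < policy.maxAttempts; failures++) {
    delays.push(
      policy.backoff === "exponential"
        ? policy.waitMs * Math.pow(2, failures - 1)
        : policy.waitMs,
    );
  }
  return delays;
}
```

With these example values the Stripe policy waits 500, 1000, 2000, then 4000 ms, the SMTP policy waits once for a second, and the database policy never waits at all. A reviewer can read that off the table without opening a canvas.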

What gets shipped
The deliverable for a workflow migration is not just the new code. It is the state diagram you produced from the canvas walk (which lives in the repo as a Mermaid block in the workflow's README), the Postgres schema for the run table, the worker that picks up resumable runs, the state machine itself, the retry helpers, the unit tests that cover each transition, and the shadow-run harness that runs the new workflow in parallel with the n8n version during cutover.
The shadow-run harness deserves a sentence. Before you cut traffic over, the new TypeScript workflow runs against the same trigger as the n8n workflow but writes to a shadow table instead of the production table. Every day you diff the two: same input, same output? Cut over. Different output? That is the hidden state you missed on the canvas walk, and the diff tells you exactly which transition is wrong before any customer sees it. The week of shadow running is where teams catch the missed states they would otherwise discover in production three months later.
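The daily diff can be as simple as the sketch below, which assumes both systems tag each output with the id of the input that produced it. The OutputRow shape and the divergence kinds are hypothetical; the comparison serializes payloads with sorted keys so that key order alone never registers as a difference.

```typescript
type OutputRow = { inputId: string; payload: unknown };

type Divergence = {
  inputId: string;
  kind: "missing_in_shadow" | "missing_in_production" | "payload_mismatch";
};

// JSON serialization with object keys sorted at every level.
function canonical(value: unknown): string {
  if (Array.isArray(value)) {
    return "[" + value.map((v) => canonical(v)).join(",") + "]";
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const keys = Object.keys(obj).sort();
    return "{" + keys.map((k) => JSON.stringify(k) + ":" + canonical(obj[k])).join(",") + "}";
  }
  return JSON.stringify(value);
}

// Compares n8n's production outputs with the shadow table, matched by input id.
function diffShadow(production: OutputRow[], shadow: OutputRow[]): Divergence[] {
  const shadowById = new Map<string, OutputRow>();
  for (const row of shadow) shadowById.set(row.inputId, row);

  const seen = new Set<string>();
  const divergences: Divergence[] = [];
  for (const prod of production) {
    seen.add(prod.inputId);
    const match = shadowById.get(prod.inputId);
    if (!match) {
      divergences.push({ inputId: prod.inputId, kind: "missing_in_shadow" });
    } else if (canonical(match.payload) !== canonical(prod.payload)) {
      divergences.push({ inputId: prod.inputId, kind: "payload_mismatch" });
    }
  }
  for (const row of shadow) {
    if (!seen.has(row.inputId)) {
      divergences.push({ inputId: row.inputId, kind: "missing_in_production" });
    }
  }
  return divergences;
}
```

An empty result for a full cycle is the cutover signal; each non-empty entry points at one input whose handling differs between the two systems.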
Tests on the state machine pay for themselves quickly because each transition is now a pure function call: given state X and event Y, assert next state is Z. The unit test suite for the migrated workflow ends up around two to three times the size of the workflow code, because every transition needs a happy-path test, a failure-path test, and a guard test. That sounds like a lot until you remember that in n8n every one of those cases was untested except by production traffic.
Common failure modes
The first sharp edge is missing a hidden state during the canvas walk. The fix is the shadow run — anything you missed shows up as a diff on day three or four of the parallel run and you go fix it before the cutover. Plan for at least one missed state per ten n8n nodes, because n8n hides the small ones.
The second is conflating the workflow run with the workflow definition. The state machine code is the definition. The Postgres row is one run. Engineers used to n8n's everything-in-one-place model sometimes write code that mutates the definition based on a specific run, which works until the second run starts and breaks the first. Keep the machine immutable per process; only the per-run context changes.
The third is over-restoring on a worker restart. The resumable-runs query has to be careful about which states to resume from. A run in fetching_customer is safe to resume — the side effect (an API call) is idempotent. A run in creating_invoice might not be — if the previous attempt actually created the invoice but crashed before committing the state transition, resuming will create a second invoice. The fix is the _retrying intermediate states in the example above plus idempotency keys on every outbound call that touches money. (The related pattern on webhook idempotency dedup tables covers the inbound-side version of this same problem.)
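The idempotency-key half of that fix can be sketched as follows. The key format is an assumption (any stable value derived from the run and the step works), and FakeInvoiceApi is a stand-in that imitates a provider honoring idempotency keys; it is not Stripe's client.

```typescript
// Derives the key from the run and the step only. The attempt number is
// deliberately left out: a retry or a post-crash resume must reuse the same
// key so the provider can recognize the duplicate.
function idempotencyKey(runId: string, step: string): string {
  return `${runId}:${step}`;
}

// Stand-in for a provider that deduplicates on the idempotency key.
class FakeInvoiceApi {
  private byKey = new Map<string, { invoiceId: string }>();
  created = 0;

  createInvoice(customerId: string, key: string): { invoiceId: string } {
    const existing = this.byKey.get(key);
    if (existing) return existing; // replayed call: return the original result
    this.created += 1;
    const invoice = { invoiceId: `inv_${this.created}_${customerId}` };
    this.byKey.set(key, invoice);
    return invoice;
  }
}
```

If a worker crashes after the provider created the invoice but before the state transition committed, the resumed run sends the same key and gets the original invoice back instead of a second one.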
The fourth is the race between the cron-based worker tick and the trigger-based new-row insert. If your worker ticks every five seconds and a new run lands in pending, the worker that picks it up needs a row-level lock so two workers do not both transition it. The FOR UPDATE SKIP LOCKED pattern in Postgres is the standard fix:
SELECT *
  FROM workflow_runs
 WHERE state NOT IN ('completed', 'failed')
   AND updated_at < now() - interval '1 second'
 ORDER BY updated_at ASC
 LIMIT 10
   FOR UPDATE SKIP LOCKED;
That query inside a transaction gives each worker a unique batch to operate on, and the updated_at < now() - interval '1 second' filter prevents a worker from immediately re-picking a row it just released after committing a transition.
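A single worker tick built around that query might look like the sketch below. The Client interface is a minimal stand-in declared for the example, not a specific driver's type, and stepRun is whatever function advances one run by one state.

```typescript
type ClaimedRun = { id: string; state: string };

// Hypothetical connection handle exposing only what this sketch uses.
interface Client {
  query(text: string, values?: unknown[]): Promise<{ rows: ClaimedRun[] }>;
}

const CLAIM_SQL = `
  SELECT id, state
    FROM workflow_runs
   WHERE state NOT IN ('completed', 'failed')
     AND updated_at < now() - interval '1 second'
   ORDER BY updated_at ASC
   LIMIT 10
     FOR UPDATE SKIP LOCKED`;

// Claims a batch, steps every run in it, and commits. The row locks taken by
// the SELECT are held until COMMIT or ROLLBACK, so no other worker can claim
// the same rows in the meantime. Returns the number of runs processed.
async function tick(
  client: Client,
  stepRun: (run: ClaimedRun, client: Client) => Promise<void>,
): Promise<number> {
  await client.query("BEGIN");
  try {
    const { rows } = await client.query(CLAIM_SQL);
    for (const run of rows) {
      await stepRun(run, client);
    }
    await client.query("COMMIT");
    return rows.length;
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  }
}
```

The trade-off in this shape is that one failing run rolls back the whole batch and locks are held while steps run. Claiming with LIMIT 1, or catching errors per run and transitioning that run to a failure state, narrows the blast radius at the cost of more round trips.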
The fifth is logging that does not survive a transition. Every transition writes a structured log line with the run id, the from-state, the to-state, the duration of the side effect, and the attempt count if relevant. When something goes wrong in production six weeks after launch, the question is the same one every time: "what state did this run hit and how long did it sit there." Without that log line you are reading database snapshots and guessing. With it the answer is a single query against the log aggregator.
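A minimal sketch of that log line, assuming JSON-per-line logging; the field names and the event label are illustrative and should follow whatever your log aggregator already indexes.

```typescript
type TransitionLogInput = {
  runId: string;
  fromState: string;
  toState: string;
  sideEffectMs: number; // how long the step's side effect took
  attemptCount?: number; // only relevant for steps that retry
};

// Builds one structured log line per transition. The timestamp is passed in
// so the function stays pure and easy to test.
function transitionLogLine(input: TransitionLogInput, at: Date): string {
  return JSON.stringify({
    event: "workflow_transition",
    at: at.toISOString(),
    run_id: input.runId,
    from_state: input.fromState,
    to_state: input.toState,
    side_effect_ms: input.sideEffectMs,
    attempt_count: input.attemptCount, // dropped by JSON.stringify when undefined
  });
}
```

Emitting this from inside the transition function, rather than from each handler, keeps the line from being forgotten when a new state is added.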
What this looks like in production
The pattern above is how BFEAI runs its agent-based brand-management workflows on top of Stripe-billed credit pools. Each agent run is a row in a workflow_runs table with a state column, a context jsonb, and an attempt counter. A worker process queries for resumable runs every two seconds with FOR UPDATE SKIP LOCKED, transitions them, and commits. A deploy is a graceful drain: stop accepting new triggers, wait for in-flight runs to land in a terminal or safely-resumable state, then ship. A crash is a non-event: when the workers come back up they pick up exactly where the previous workers left off, because the state lives in Postgres and not in process memory.
The dashboard a CTO actually wants from this is three numbers: runs in non-terminal states older than five minutes (worker is stuck or a state has no exit), runs in _retrying states with attempt counts above three (a downstream service is misbehaving), and runs in failed states from the last hour (real errors to investigate). All three go to zero in normal operation, and any non-zero is a real signal — not noise from the runtime, the way n8n's error workflow inbox often was.
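The three numbers are straightforward to compute. In production they would normally be three SQL counts against workflow_runs; the sketch below expresses the same predicates over in-memory rows so the thresholds are easy to unit-test. The RunRow shape mirrors the columns in the schema above, and the _retrying suffix convention is taken from the example state names.

```typescript
type RunRow = {
  state: string;
  attempt_count: number;
  updated_at: Date;
  failed_at?: Date;
};

const TERMINAL_STATES = ["completed", "failed"];
const MINUTE_MS = 60 * 1000;

// Computes the three dashboard counts as of `now`.
function dashboardCounts(rows: RunRow[], now: Date) {
  const nowMs = now.getTime();
  let stuck = 0; // non-terminal and untouched for more than five minutes
  let hotRetries = 0; // in a *_retrying state with more than three attempts
  let recentFailures = 0; // failed within the last hour

  for (const row of rows) {
    const terminal = TERMINAL_STATES.indexOf(row.state) >= 0;
    if (!terminal && nowMs - row.updated_at.getTime() > 5 * MINUTE_MS) stuck += 1;
    if (row.state.endsWith("_retrying") && row.attempt_count > 3) hotRetries += 1;
    if (
      row.state === "failed" &&
      row.failed_at !== undefined &&
      nowMs - row.failed_at.getTime() <= 60 * MINUTE_MS
    ) {
      recentFailures += 1;
    }
  }
  return { stuck, hotRetries, recentFailures };
}
```

Alerting on any of the three being non-zero for longer than one worker interval is a reasonable starting rule, to be tuned once real traffic shows what normal looks like.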
The migration timeline that holds up: one week to walk the canvas and produce the state diagram for a ten-to-twenty node workflow, one to two weeks to write the state machine code and tests, one week of shadow runs in parallel with the n8n version, then cutover. Workflows with more than thirty nodes or parallel sub-flows take longer, mostly in the walk-the-canvas step, because the hidden states multiply faster than the visible ones. Workflows that touch money or send irreversible communications get an extra week of shadow runs.
What to watch in your own implementation
Open your n8n workflows and pick the one that has caused the most production incidents. Walk it node by node and write down, for each node, every failure mode n8n is silently handling. Count the answers. That number is the lower bound on the states your migrated TypeScript code will need to model. If the count surprises you, that is the gap between what is on the canvas and what is actually running — and it is the gap your migration has to close.
Then look at your existing TypeScript codebase for workflow-shaped logic that does not have an explicit state machine. Long-running jobs with a for loop that does external I/O. Webhook handlers that mutate state across multiple await points. Scheduled tasks that have to be idempotent. Each of these is a state machine waiting to be made explicit, and each is one process restart away from the same bug class as the migrated n8n workflow. The Postgres-row-as-state pattern works equally well for them; the migration from n8n is just the most visible case where the implicit-to-explicit move pays off immediately.
Finally, if you are sitting on n8n workflows that have grown past twenty nodes and are starting to feel fragile — the kind where every edit makes you nervous and rollbacks happen by reverting a JSON export — the answer is not to keep editing the canvas more carefully. The answer is the extraction pattern above. The state machine n8n was running for you is the thing you need to take ownership of, and the moment you do, the workflow stops being a thing that lives on a canvas one Save click away from production and becomes code you can test, review, and trust.
Outcomes you should expect
What this delivers
- Migrated workflow survives a process restart in the middle of a long-running customer job, instead of starting over from the beginning or skipping the rest.
- The implicit retry and per-item loop behavior n8n was silently providing is now explicit code your team can read, test, and reason about.
- Partial failures land in a recoverable state with the work-so-far preserved, not a half-mutated database and a stack trace.
- Adding a new step to the workflow is a typed code change with a code review, not a canvas edit that ships to production the moment someone clicks Save.
Primary sources
By the numbers
In XState, when you run a state machine it becomes an actor: a running process that can receive events, send events and change its behavior based on the events it receives, which can cause effects outside of the actor.
XState actors have their own internal, encapsulated state that can only be updated by the actor itself, and they process one message at a time through an internal mailbox that acts like an event queue.
A transition in XState is a change from one finite state to another, triggered by an event, and transitions are deterministic — each state-event combination leads to the same next state.
XState context is how you store data in a state machine actor; context cannot be mutated directly and must be updated immutably using the assign(...) action inside transitions.
XState actors can persist their internal state and restore it later via actor.getPersistedSnapshot(), and persistence is deep — all invoked and spawned actors are persisted and restored recursively.
State machines and statecharts make logic explicit by capturing all the states, events and transitions between them, which helps teams find impossible states and spot undesirable transitions.
n8n provides a Loop Over Items (Split in Batches) node for explicit looping, but processes data items sequentially through nodes by default — an implicit per-item loop runs without any explicit configuration.
Live in production today
The same engineering, shipped in production at BFEAI.
I'm co-founder & CTO of Be Found Everywhere (BFEAI), a 7-app AI SaaS platform running today. The work I deliver for clients is the work I do every week on my own platform.
7 production apps; 200K+ keywords generated; 1,500+ AI scans run; 7,000+ sites automated.
Common questions
What buyers ask before reaching out
Why can't I just translate the n8n nodes one-to-one into functions?
Because the nodes are not the whole workflow. n8n's runtime carries a pile of implicit behavior (per-item iteration, retry on failure, execution data persistence across node boundaries, error workflow dispatch) that does not show up on the canvas. A one-to-one node-to-function translation drops all of that on the floor, and the migrated code behaves differently in production than the original n8n workflow did, usually around the third edge case.
Do I have to use XState, or can I write the state machine by hand?
Either works. The point is that the state machine becomes explicit code instead of implicit runtime behavior. XState earns its weight when you have multiple parallel sub-flows, guarded transitions, or need to visualize the machine for non-engineers. For a linear pipeline with three or four states, a hand-rolled enum and a switch statement is enough and easier to onboard new engineers to. Pick by complexity, not by trend.
Where do I store the state so the workflow survives a process restart?
Postgres, with one row per workflow run and a state column that names the current step. Every transition commits the new state and any accumulated context before doing the next side effect. If the process dies, the next worker picks the row up in whatever state it was in and resumes from there. This is what n8n was doing for you under the hood with its execution data — you are now doing it explicitly, and the schema is yours to query and back up.
How do I figure out which states the workflow actually has?
Walk the canvas with one question per node: what failure modes does n8n silently handle for me here? Each one is a hidden state. A node with Retry on Fail set is two states (attempting and waiting-to-retry). A node with Continue on Fail is two outcome branches. A Wait node is its own state. The Loop Over Items is a state with a pointer into the batch. By the time you have walked every node, you have the state list — and it is typically longer than the node count.
What about the retry behavior — do I have to replicate it exactly?
Replicate the shape, not the magic numbers. n8n's node-level retry is a small piece of behavior that becomes a few lines of explicit code: a max-attempts counter, a wait between attempts, and a final transition to a failure state if attempts are exhausted. The default values in n8n are reasonable starting points, but once you own the code you should tune per-step based on what each downstream service actually needs. A Stripe API call wants a different retry policy than a transient SMTP send.
Can I keep the n8n workflow running while I migrate?
Yes, and you should. The lowest-risk migration runs the n8n workflow and the TypeScript workflow in parallel against the same trigger for a week, with the TypeScript version writing to a shadow table instead of the real one. You compare outputs daily, fix every divergence, and only flip the cut-over once shadow output matches production output for a full cycle. That parallel period is where you discover the hidden state you missed during canvas walking.
What happens to error workflows in n8n when I migrate?
The Error Trigger workflow becomes a transition target in the explicit state machine — a final 'failed' state with an entry action that runs whatever notification or compensating logic the original error workflow did. The advantage of making this explicit is that you can now test it directly. In n8n the error workflow only fires when something goes wrong in production; in TypeScript it is a unit test that runs on every commit.
Is this overkill for a five-node workflow?
For five nodes that all run in 200ms with no external calls, yes. Keep the n8n workflow. The pattern is for workflows where the cost of a partial failure is real — a customer is billed twice, an order ships to two warehouses, a webhook fires three times — or where the workflow runs long enough that a process restart in the middle is likely. If neither applies, n8n is doing its job. If either applies, the implicit state machine is a liability and extracting it is the safer move.