Production engineering pattern

Strangler Fig Migration from Zapier to Custom Code: Cut Over Safely

An engineering pattern for retiring a Zapier polling trigger one workflow at a time — parallel run, output comparison, feature-flag cutover, and a one-flip rollback if the new code misbehaves.

The problem

The CTO wants off Zapier. The reasons are familiar: the monthly bill is climbing past what the same compute would cost in-house, a silent failure last quarter cost two hours of manual reconciliation, and the workflow logic has grown branches the no-code editor cannot express cleanly. The conclusion is correct. The execution is where most teams get hurt.

Big-bang cutover ignores undocumented behavior

The naive approach is a big-bang cutover. Pick a maintenance window, rewrite the Zap as TypeScript, deploy the new code, disable the Zap, and pray. This works for trivial Zaps and breaks for everything else, because a Zap that has been in production for a year has accumulated implicit behavior nobody documented — a retry quirk that smoothed over a flaky upstream, a filter step that excluded a customer who was supposed to be excluded for a reason no current engineer remembers, a path branch that nobody noticed handled the edge case until that edge case stopped getting handled.

Observability asymmetry at cutover

The second problem is observability asymmetry. The Zap has been running in production long enough that its outputs are the de-facto specification of what the workflow does. The new code has been tested locally and in staging, but it has not yet seen the production input distribution. The first time it does will be the cutover, and any divergence from the Zap's behavior will land directly on customers — a missing Slack notification, a duplicate row in HubSpot, a Stripe charge that did not fire, a customer-onboarding email sent twice.

Rollback as a rebuild project

The third problem is rollback risk. Once the Zap is deleted, getting it back is rebuilding it from scratch — the multi-step UI, the field mappings, the filter conditions, the connected accounts. If the cutover goes wrong at 2pm on a Tuesday, the rollback is hours of work, not minutes, and during those hours the workflow is offline.

Failed migration compounds the bill

The business consequence is the same shape every time. The migration takes longer than the estimate, ships with one or two regressions that take a week to surface, and leaves the team gun-shy about the next migration. The next CTO who looks at the Zapier bill sees the prior attempt and decides the line item is the lesser evil. The bill compounds for another year.

What changes for your business

The strangler fig pattern, named by Martin Fowler after the vine that gradually replaces its host tree, is the alternative. Build the replacement code path alongside the Zap. Wire both to consume the same source event. Run them in parallel long enough to prove the new path matches the old. Flip a feature flag to route real traffic through the new code. Keep the Zap disabled but undeleted as the rollback path. Delete the Zap only after the new code has been live for a documented stability window with no production incidents.

The four mechanical pieces are the shadow path, the comparison window, the cutover switch, and the rollback contract. Each one has a concrete implementation that the rest of the pattern assumes.

Shadow path with no real side effects

The shadow path is the new TypeScript code, deployed to production, listening to the same source events the Zap listens to, but configured to record what it would do without doing it. For a Zap that posts to Slack, the shadow path computes the Slack payload and writes it to a comparison table — shadow_actions — instead of calling the Slack API. For a Zap that writes a HubSpot contact, the shadow path computes the contact body and writes that to shadow_actions. The shadow path runs every time a source event fires, in parallel with the Zap, and the only side effect it has is the row in shadow_actions.

type ShadowAction = {
  id: string;
  source_event_id: string;
  workflow_slug: string;
  intended_action: {
    target: "slack" | "hubspot" | "stripe" | "email";
    payload: unknown;
  };
  recorded_at: Date;
};

async function handleSourceEvent(event: SourceEvent): Promise<void> {
  const intended = computeWorkflowAction(event);

  // Shadow mode: record the intended action, do not perform it.
  if (await getCutoverState(event.workflow_slug) === "shadow") {
    await db.shadowActions.insert({
      id: crypto.randomUUID(),
      source_event_id: event.id,
      workflow_slug: event.workflow_slug,
      intended_action: intended,
      recorded_at: new Date(),
    });
    return;
  }

  // Live mode: perform the action. performAction is assumed to claim
  // event.id in the shared idempotency table before firing, so a Zap that
  // is still enabled around the moment of cutover cannot double-fire it.
  await performAction(intended, { source_event_id: event.id });
}

Comparison window and diff job

The comparison window is a reconciliation job that runs on a schedule — typically hourly — and diffs the actions the Zap actually took against the actions the shadow path proposed for the same source events. A clean diff over the comparison window is the cutover signal. The default window is 7 days of clean diffs, which catches the weekly business cycle. For workflows that fire less than weekly, the window covers at least two full natural cycles instead of seven calendar days.

type DiffResult = {
  source_event_id: string;
  zap_action: unknown | null;
  shadow_action: unknown | null;
  diff_kind: "match" | "missing-in-shadow" | "missing-in-zap" | "payload-divergence";
};

async function diffWindow(
  workflowSlug: string,
  windowStart: Date,
  windowEnd: Date,
): Promise<DiffResult[]> {
  const zapRuns = await loadZapRuns(workflowSlug, windowStart, windowEnd);
  const shadowRuns = await db.shadowActions.where({
    workflow_slug: workflowSlug,
    recorded_at: { gte: windowStart, lte: windowEnd },
  });

  const byEvent = new Map<string, { zap?: unknown; shadow?: unknown }>();
  for (const z of zapRuns) {
    byEvent.set(z.source_event_id, { zap: z.action });
  }
  for (const s of shadowRuns) {
    const slot = byEvent.get(s.source_event_id) ?? {};
    slot.shadow = s.intended_action;
    byEvent.set(s.source_event_id, slot);
  }

  const results: DiffResult[] = [];
  for (const [sourceEventId, { zap, shadow }] of byEvent) {
    if (!zap) results.push({ source_event_id: sourceEventId, zap_action: null, shadow_action: shadow ?? null, diff_kind: "missing-in-zap" });
    else if (!shadow) results.push({ source_event_id: sourceEventId, zap_action: zap, shadow_action: null, diff_kind: "missing-in-shadow" });
    else if (!deepEqual(zap, shadow)) results.push({ source_event_id: sourceEventId, zap_action: zap, shadow_action: shadow, diff_kind: "payload-divergence" });
    else results.push({ source_event_id: sourceEventId, zap_action: zap, shadow_action: shadow, diff_kind: "match" });
  }
  return results;
}
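
The window-length rule stated before the diff job (7 days by default, at least two natural cycles for infrequent workflows) can be written down as a small helper so the threshold is not left to memory. This is a minimal sketch; `comparisonWindowDays` and its parameter are hypothetical names, not part of the code above.

```typescript
// Hypothetical helper: how many days the comparison window should span.
// firingIntervalDays is the typical gap between trigger firings
// (1 for daily or more often, 7 for weekly, 30 for roughly monthly).
const DEFAULT_WINDOW_DAYS = 7;
const MIN_NATURAL_CYCLES = 2;

function comparisonWindowDays(firingIntervalDays: number): number {
  if (!Number.isFinite(firingIntervalDays) || firingIntervalDays <= 0) {
    throw new RangeError("firingIntervalDays must be a positive number");
  }
  // Never shorter than the 7-day default, never shorter than two cycles.
  const twoCycles = Math.ceil(firingIntervalDays * MIN_NATURAL_CYCLES);
  return Math.max(DEFAULT_WINDOW_DAYS, twoCycles);
}
```

A daily workflow keeps the 7-day default, while a weekly one stretches to 14 days because two firings are the minimum for a meaningful comparison.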

Capturing Zap actions for the diff

Getting the Zap's actions into a queryable place is its own small piece of work. The cleanest mechanism is to add a webhook step to the end of the Zap that POSTs the action it just took to a /zap-runs endpoint you own. The endpoint writes one row per Zap run with the source event ID, the action payload, and a timestamp. That row is what the diff joins against. The Zap is still doing the real work; the webhook step is observability.
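
A minimal sketch of what the receiving side of that webhook step might look like, written framework-free so it can sit behind whatever HTTP server you already run. The field names in the request body, the `handleZapRunPost` function, and the in-memory `zapRuns` array are assumptions for illustration; the real body shape is whatever you configure in the Zap's final step.

```typescript
// Hypothetical row shape for what the Zap's final webhook step reports.
type ZapRun = {
  source_event_id: string;
  workflow_slug: string;
  action: unknown;
  received_at: Date;
};

// In-memory stand-in for the zap_runs table; swap for a real database.
const zapRuns: ZapRun[] = [];

type HandlerResult = { status: number; body: string };

// Handler for POST /zap-runs. It only records what the Zap reports; it never
// performs the action, because the Zap has already done the real work.
function handleZapRunPost(rawBody: string, now: Date = new Date()): HandlerResult {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawBody);
  } catch {
    return { status: 400, body: "invalid JSON" };
  }
  if (typeof parsed !== "object" || parsed === null) {
    return { status: 400, body: "expected a JSON object" };
  }
  const { source_event_id, workflow_slug, action } = parsed as Record<string, unknown>;
  if (typeof source_event_id !== "string" || source_event_id === "") {
    return { status: 400, body: "source_event_id is required" };
  }
  if (typeof workflow_slug !== "string" || workflow_slug === "") {
    return { status: 400, body: "workflow_slug is required" };
  }
  zapRuns.push({ source_event_id, workflow_slug, action: action ?? null, received_at: now });
  return { status: 201, body: "recorded" };
}
```

Rejecting rows without a `source_event_id` matters more than it looks: a row the diff cannot join on is indistinguishable from a missing Zap run.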

Cutover switch as one config row

The cutover switch is a single configuration row — a feature flag, a cutover_state column in a workflow_cutover table, an environment variable — that the shadow handler reads on every invocation. When the diff has been clean for the comparison window, an engineer flips the row from shadow to live for the workflow. From the next event onward, the custom code performs the action, and the Zap is disabled in the Zapier dashboard but not deleted.
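
One way the config row and its flip might be modeled, as a sketch under stated assumptions: the `cutoverRows` map stands in for the workflow_cutover table, the audit entry shape is hypothetical, and defaulting unknown workflows to shadow is a design choice rather than something the pattern requires.

```typescript
type CutoverState = "shadow" | "live";

type FlagFlip = {
  workflow_slug: string;
  from: CutoverState;
  to: CutoverState;
  flipped_by: string;
  flipped_at: Date;
};

// In-memory stand-ins for the workflow_cutover table and its audit trail.
// A database-backed version would be async; callers can still `await` these.
const cutoverRows = new Map<string, CutoverState>();
const flipLog: FlagFlip[] = [];

// A workflow with no row reads as "shadow", so a missing or mistyped slug
// can never cause the new code to perform a real side effect.
function getCutoverState(workflowSlug: string): CutoverState {
  return cutoverRows.get(workflowSlug) ?? "shadow";
}

// Flip the row and record who flipped it. Returns null when the row is
// already in the requested state, so a repeated flip is harmless.
function setCutoverState(
  workflowSlug: string,
  to: CutoverState,
  flippedBy: string,
  now: Date = new Date(),
): FlagFlip | null {
  const from = getCutoverState(workflowSlug);
  if (from === to) return null;
  cutoverRows.set(workflowSlug, to);
  const flip: FlagFlip = { workflow_slug: workflowSlug, from, to, flipped_by: flippedBy, flipped_at: now };
  flipLog.push(flip);
  return flip;
}
```

Cutover is `setCutoverState(slug, "live", who)` and rollback is the same call with `"shadow"`. Disabling or re-enabling the Zap itself stays a manual step in the Zapier dashboard.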

Rollback contract via inverse flip

The rollback contract is the inverse flip. If anything misbehaves after cutover — a customer complaint, a monitor firing, an unexpected exception in the new code — the engineer flips the row back to shadow and re-enables the Zap in Zapier. The Zap was disabled, not deleted, so re-enabling it is one click and the source-to-action path is restored within one polling cycle. The Zap stays in disabled-but-undeleted limbo for a documented stability window post-cutover — typically another 7-14 days of clean production — before it is finally deleted.

More on this

Polling triggers versus native webhooks

The latency profile of a Zap is not a free choice. Polling triggers check the source on a fixed cycle that varies by plan, and the per-Zap minimum is 1 minute on the highest tiers. That is the floor — every workflow built on a Zapier polling trigger has a 1-15 minute window between when the source event happens and when the Zap notices. For most internal automations the latency is fine. For customer-facing automations — onboarding flows, real-time notifications, payment-state changes — it shows up as user-perceived lag.

The strangler fig migration is also the opportunity to flip from polling to webhooks where the source supports them. Stripe, HubSpot, Slack, GitHub, Shopify, and most modern SaaS APIs publish webhook subscriptions you can register against. Once registered, the source app POSTs to your endpoint within seconds of the event, no polling required, and the per-run cost is zero on the source side.

The webhook handler that replaces the polling Zap looks like a normal custom-code workflow: receive the payload, verify the signature, dedupe on the event ID, perform the action. The dedupe step is non-negotiable, because Stripe documents that webhook endpoints might occasionally receive the same event more than once, and most other webhook publishers have similar at-least-once semantics. The shared idempotency table from the shadow path doubles as the dedupe layer here — the same source_event_id key, the same insert-if-absent check, the same short-circuit behavior on collision.

async function handleNativeWebhook(
  raw: Buffer,
  signature: string,
  workflowSlug: string,
): Promise<void> {
  const event = verifyAndParse(raw, signature);

  await db.transaction(async (tx) => {
    const inserted = await tx.processedEvents.insertIfAbsent({
      source_event_id: event.id,
      workflow_slug: workflowSlug,
      received_at: new Date(),
    });
    if (!inserted) return; // already processed by an earlier delivery

    const intended = computeWorkflowAction(event);
    await performAction(intended, { source_event_id: event.id });
  });
}
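
The handler above leaves `verifyAndParse` undefined. A minimal sketch follows, assuming a generic scheme in which the provider signs the raw request body with HMAC-SHA256 and sends the hex digest in a header. That scheme, the extra `secret` parameter, and the `SourceEvent` shape are assumptions. Real providers each document their own format (Stripe, for example, also signs a timestamp), so use the provider's documented scheme or official SDK in production.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical minimal event shape; real payloads carry more fields.
type SourceEvent = { id: string; type: string; data: unknown };

// Assumption: signature = hex(HMAC-SHA256(secret, raw body)).
function verifyAndParse(raw: Buffer, signature: string, secret: string): SourceEvent {
  const expected = createHmac("sha256", secret).update(raw).digest();
  const received = Buffer.from(signature, "hex");
  // timingSafeEqual throws on a length mismatch, so compare lengths first.
  if (received.length !== expected.length || !timingSafeEqual(received, expected)) {
    throw new Error("webhook signature mismatch");
  }
  // Parse only after the signature checks out.
  const parsed = JSON.parse(raw.toString("utf8")) as Partial<SourceEvent>;
  if (typeof parsed.id !== "string" || parsed.id === "") {
    throw new Error("webhook payload has no event id");
  }
  return { id: parsed.id, type: String(parsed.type ?? "unknown"), data: parsed.data ?? null };
}
```

Verification runs against the raw bytes, not a re-serialized object, because any change in whitespace or key order would alter the digest.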

For sources that do not publish webhooks, the polling Zap is replaced with a custom cron job that hits the same endpoint Zapier was polling, on the same or shorter interval, with your own dedupe and error handling. You do not get the real-time benefit, but the cost and observability benefits land regardless — the per-run cost moves from Zapier task pricing to your own compute bill, and the failures show up in your logs instead of an email alert that lands ten hours after the fact.
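
The core of such a cron job is deciding which records in a polled page are new. A minimal sketch of that decision follows, kept pure so it can be tested without a network. The record shape, the `selectNewRecords` name, and the assumption that the source exposes ISO 8601 UTC timestamps in a uniform format (so string comparison orders them correctly) are all illustrative.

```typescript
// Hypothetical shape of one record returned by the polled endpoint.
type PolledRecord = { id: string; updated_at: string };

type PollState = { cursor: string; seenIds: Set<string> };

// One polling tick: given the page the source returned, pick the records
// not yet processed and advance the cursor. The input state is not mutated.
function selectNewRecords(
  page: PolledRecord[],
  state: PollState,
): { fresh: PolledRecord[]; next: PollState } {
  const seenIds = new Set(state.seenIds);
  const fresh: PolledRecord[] = [];
  let cursor = state.cursor;
  for (const record of page) {
    if (seenIds.has(record.id)) continue; // overlap from re-polling the same range
    seenIds.add(record.id);
    fresh.push(record);
    if (record.updated_at > cursor) cursor = record.updated_at;
  }
  return { fresh, next: { cursor, seenIds } };
}
```

The scheduler, the HTTP call, and persistence of `PollState` are left out. In a real job the seen-ID set would live in the shared idempotency table rather than in memory.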

Common failure modes

The first failure mode is incomplete shadow coverage. The shadow handler covers the happy path but skips an error branch the Zap handles via a Filter step. The diff comes back clean for six days because the error branch did not fire, then the seventh day a real error fires, the Zap routes it correctly, and the shadow path silently drops it. The fix is to seed the comparison window with synthetic events that exercise every branch of the Zap before the window starts, so a missing branch shows up as a divergence immediately instead of waiting for a real production event to hit it.
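
A sketch of how that seeding could be checked mechanically: list the Zap's branches, tag each synthetic event with the branch it is meant to exercise, and refuse to start the window until every branch has at least one clean match. The branch names and the `branchesWithoutCleanMatch` helper are hypothetical.

```typescript
type DiffKind = "match" | "missing-in-shadow" | "missing-in-zap" | "payload-divergence";

// A synthetic event and the Zap branch it was crafted to exercise.
type SeededEvent = { source_event_id: string; branch: string };

// Returns every branch for which no seeded event produced a clean match,
// either because nothing was seeded for it or because the diff diverged.
function branchesWithoutCleanMatch(
  branches: string[],
  seeded: SeededEvent[],
  diffKindByEvent: Map<string, DiffKind>,
): string[] {
  const clean = new Set<string>();
  for (const event of seeded) {
    if (diffKindByEvent.get(event.source_event_id) === "match") {
      clean.add(event.branch);
    }
  }
  return branches.filter((branch) => !clean.has(branch));
}
```

An empty result is the precondition for starting the comparison window. A non-empty one names exactly which branch the shadow handler still drops.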

The second is the timestamp-divergence false positive. The Zap stamps its output with a processed_at timestamp derived from its own run time; the shadow path stamps the same field with its own run time. The two are minutes apart because of the polling lag. The diff flags every event as a payload-divergence. The fix is to normalize the diff function — strip timestamps, request IDs, and any other field that is provenance rather than payload, before comparing. The diff is about behavior, not about which system wrote the row.
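
A minimal sketch of that normalization, assuming the provenance fields can be identified by name. The field list and the `payloadsMatch` helper are illustrative and would replace the plain `deepEqual` call in the diff job.

```typescript
// Field names treated as provenance rather than payload. Hypothetical list;
// adjust it per workflow.
const PROVENANCE_FIELDS = new Set(["processed_at", "recorded_at", "request_id"]);

// Recursively drop provenance fields and sort object keys, so two payloads
// that differ only in key order or run metadata serialize identically.
function normalizeForDiff(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(normalizeForDiff);
  if (value !== null && typeof value === "object" && !(value instanceof Date)) {
    const source = value as Record<string, unknown>;
    const out: Record<string, unknown> = {};
    for (const key of Object.keys(source).sort()) {
      if (PROVENANCE_FIELDS.has(key)) continue;
      out[key] = normalizeForDiff(source[key]);
    }
    return out;
  }
  return value;
}

function payloadsMatch(zapPayload: unknown, shadowPayload: unknown): boolean {
  return JSON.stringify(normalizeForDiff(zapPayload)) === JSON.stringify(normalizeForDiff(shadowPayload));
}
```

Array order is deliberately left significant here. If a workflow emits lists whose order carries no meaning, sort them in the normalizer as well.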

The third is the side-effect-leaked-from-shadow bug. The shadow handler calls a function that, three layers deep, fires an analytics event. The Zap also fires the analytics event. The analytics event count doubles for the comparison window, and someone in marketing notices before the engineer does. The fix is mechanical: the shadow handler runs through a feature-flag-gated wrapper that throws if any code under it attempts a network call to a side-effect domain. The test for that wrapper is part of the shadow path's CI suite.
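
One possible shape for that guard, as a sketch: a check called at the top of the single HTTP helper every outbound request goes through. The host list is an assumption (the last entry is a placeholder for your analytics endpoint), as are the function and error names.

```typescript
// Example hosts whose APIs cause customer-visible side effects. The last
// entry is a placeholder for whatever analytics endpoint your code calls.
const SIDE_EFFECT_HOSTS = ["slack.com", "api.hubapi.com", "api.stripe.com", "analytics.example.com"];

class ShadowSideEffectError extends Error {
  constructor(url: string) {
    super(`shadow mode attempted a side-effect call to ${url}`);
    this.name = "ShadowSideEffectError";
  }
}

// Matches the listed host itself and any subdomain of it.
function isSideEffectHost(hostname: string): boolean {
  return SIDE_EFFECT_HOSTS.some((h) => hostname === h || hostname.endsWith(`.${h}`));
}

// Throws before the request is sent if the workflow is in shadow mode and
// the target is a side-effect host. Live mode passes everything through.
function assertAllowedOutbound(url: string, cutoverState: "shadow" | "live"): void {
  if (cutoverState !== "shadow") return;
  if (isSideEffectHost(new URL(url).hostname)) {
    throw new ShadowSideEffectError(url);
  }
}
```

The guard only helps if there is exactly one HTTP helper. Code that reaches the network through a vendor SDK bypasses it, so the CI test should exercise the full shadow handler rather than the guard alone.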

The fourth is the not-deleted-but-not-disabled Zap. Cutover flips the feature flag, but the Zap stays enabled in the Zapier dashboard because the engineer disabled the wrong Zap by mistake. Both paths perform the action; customers get duplicate Slack messages, duplicate emails, duplicate Stripe charges. The dedupe table from the shadow path catches this if both paths route through it, but if the Zap bypasses the dedupe table (which it typically does — the Zap was written before the dedupe table existed) then the protection does not apply. The fix is procedural: the cutover checklist includes a screenshot of the Zap in Disabled state, taken by the engineer, attached to the cutover PR.

The fifth is the polling-cycle race at cutover. The Zap polls every minute. The engineer flips the cutover flag at 14:30:15. An event occurs at 14:30:20, five seconds after the flip, so the new code is already live and processes it. The Zap has not been disabled yet, and its next poll at 14:30:45 picks up the same event. Both paths process it. The dedupe table catches it if both paths write to the table; the symptom otherwise is one duplicate execution per cutover. The fix is to leave the Zap in disabled state for a full polling cycle before considering the cutover complete, and to make sure the dedupe table is the first thing both paths touch.

What this looks like in production

At BFEAI the strangler fig pattern is the default migration playbook for any workflow currently running on Zapier or Make. The shadow_actions table, the comparison job, and the cutover switch are the same code shape every time — only the workflow-specific payload computation changes. The 7-day comparison window is the default; for low-volume workflows that fire weekly or monthly, the window stretches to two natural cycles.

The dashboard that matters most has three numbers per workflow in flight: percentage of source events with a clean diff over the trailing 7 days, oldest unresolved divergence in days, and current cutover state. A workflow with 100% clean diff for 7 days and zero unresolved divergences is ready to flip; anything else is not. The engineer flipping the switch has those three numbers on screen at the moment of the flip, and the PR that flips the flag links to the diff report for the comparison window.
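
The first two of those numbers and the resulting go/no-go can be derived from the diff rows. A minimal sketch follows; the `resolved` flag (a divergence an engineer has explained and closed) and the `cutoverReadiness` name are assumptions, not fields of the `DiffResult` type shown earlier.

```typescript
type DiffKind = "match" | "missing-in-shadow" | "missing-in-zap" | "payload-divergence";

// `resolved` is a hypothetical triage flag set when a divergence is closed.
type DiffRow = { source_event_id: string; diff_kind: DiffKind; recorded_at: Date; resolved: boolean };

type Readiness = { cleanPercent: number; oldestUnresolvedDays: number | null; ready: boolean };

const DAY_MS = 24 * 60 * 60 * 1000;

function cutoverReadiness(rows: DiffRow[], now: Date, windowDays = 7): Readiness {
  const windowStart = now.getTime() - windowDays * DAY_MS;
  const inWindow = rows.filter((r) => r.recorded_at.getTime() >= windowStart);
  const clean = inWindow.filter((r) => r.diff_kind === "match").length;
  const cleanPercent = inWindow.length === 0 ? 0 : (clean / inWindow.length) * 100;

  // Unresolved divergences count regardless of age.
  const unresolved = rows.filter((r) => r.diff_kind !== "match" && !r.resolved);
  const oldestUnresolvedDays =
    unresolved.length === 0
      ? null
      : Math.floor((now.getTime() - Math.min(...unresolved.map((r) => r.recorded_at.getTime()))) / DAY_MS);

  // An empty window is not evidence of correctness, so it is never "ready".
  const ready = inWindow.length > 0 && clean === inWindow.length && unresolved.length === 0;
  return { cleanPercent, oldestUnresolvedDays, ready };
}
```

Treating an empty window as not ready guards against flipping a workflow whose /zap-runs reporting silently stopped.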

The runbook for "cutover went wrong" has three steps, in this order. Step one: flip the cutover flag back to shadow. Step two: re-enable the Zap in the Zapier dashboard. Step three: file a bug against the new code with the source event ID and the divergence the diff reports. The first two steps take under sixty seconds combined, which is the explicit budget for the runbook — anything longer and the workflow is broken for paying customers during the recovery window. The third step happens after the workflow is restored, not during the incident.

One thing worth flagging about the keep-alive window. The Zap stays disabled-but-undeleted for 7-14 days after cutover not because the new code is suspect, but because the Zapier connector itself can be hard to recreate — the OAuth grants, the field mappings, the conditional steps. Keeping the Zap as a one-click rollback is cheap insurance against the new code surfacing a bug that only production data triggers. Once the stability window closes without incident, the Zap is deleted in a separate PR, the per-task cost goes to zero, and the workflow is fully off the no-code stack.

The other operational detail is what to log. Every shadow-path invocation, every cutover state read, every flag flip, every dedupe-table insertion gets a structured log line with the workflow slug, the source event ID, and the cutover state at time of read. When a divergence shows up on day six of the comparison window, the log is the answer to "what was the system state when the divergence was recorded." Without it, the divergence investigation starts at "we cannot reconstruct what the cutover flag said when this event arrived," and the entire comparison window has to be restarted.
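
A sketch of what one such log line and the later lookup might look like, using JSON lines. The `kind` values, field names, and both helpers are hypothetical.

```typescript
type CutoverLogKind = "shadow_invocation" | "cutover_state_read" | "flag_flip" | "dedupe_insert";

type CutoverLogEntry = {
  kind: CutoverLogKind;
  workflow_slug: string;
  source_event_id: string | null; // null for flag flips, which are not tied to an event
  cutover_state: "shadow" | "live";
  at: Date;
};

// One JSON object per line, with a stable field set so log queries stay simple.
function formatLogLine(entry: CutoverLogEntry): string {
  return JSON.stringify({
    ts: entry.at.toISOString(),
    kind: entry.kind,
    workflow_slug: entry.workflow_slug,
    source_event_id: entry.source_event_id,
    cutover_state: entry.cutover_state,
  });
}

// Answers "what did the flag say when this event arrived?" from raw lines.
function stateAtEvent(lines: string[], sourceEventId: string): string | null {
  for (const line of lines) {
    const parsed = JSON.parse(line) as { kind?: string; source_event_id?: string | null; cutover_state?: string };
    if (parsed.kind === "cutover_state_read" && parsed.source_event_id === sourceEventId) {
      return parsed.cutover_state ?? null;
    }
  }
  return null;
}
```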

What to watch in your own implementation

Open your Zapier dashboard and sort by tasks per month. The top three Zaps by volume are typically eighty percent of the bill and are typically the right migration candidates — high volume means a clean signal for the diff, and high cost means the per-task savings post-cutover are immediate. Pick one. Build the shadow path against it before touching any other Zap.

Look at the source app for that Zap. If it publishes a webhook subscription, the new path is a webhook handler and the latency drops from polling-cycle to seconds. If it does not, the new path is a cron job that hits the same endpoint Zapier was polling, on the same interval, with your own dedupe. Either way, the shadow handler shape is the same.

Add the webhook step to the end of the existing Zap that POSTs the action it just took to a /zap-runs endpoint you own. That endpoint is the source of truth for the Zap's behavior during the comparison window. Without it, the diff has nothing to diff against, and the comparison window cannot start.

Stand up the shared idempotency table before you stand up the shadow path. Every workflow side effect, on every path, routes through insertIfAbsent(source_event_id) before performing the action. The table is what makes parallel running safe, what makes cutover non-atomic without being dangerous, and what catches the polling-cycle race at the moment of the flip. It is the smallest component in the system and the one that, if missing, makes every other piece of the pattern fragile.
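
A minimal in-memory sketch of the claim-then-perform contract. In a real deployment the claim would be a single atomic statement against a table with a unique key on the workflow and event ID (in PostgreSQL, for example, `INSERT ... ON CONFLICT DO NOTHING`). The `performOnce` wrapper and the key format are assumptions.

```typescript
// In-memory stand-in for the shared idempotency table.
const processedEvents = new Set<string>();

// Returns true if this call claimed the event, false if it was already claimed.
function insertIfAbsent(workflowSlug: string, sourceEventId: string): boolean {
  const key = `${workflowSlug}:${sourceEventId}`;
  if (processedEvents.has(key)) return false;
  processedEvents.add(key);
  return true;
}

// Claim first, then act. A second delivery of the same event, from either
// path, is short-circuited before any side effect runs.
function performOnce(
  workflowSlug: string,
  sourceEventId: string,
  action: () => void,
): "performed" | "skipped" {
  if (!insertIfAbsent(workflowSlug, sourceEventId)) return "skipped";
  action();
  return "performed";
}
```

Claiming before acting trades a possible duplicate for a possible miss: if the process dies between the claim and the action, the event is marked processed but never handled. Running the claim and the action in one database transaction, as the webhook handler earlier does, closes that gap when the action can be rolled back with it.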

Finally, write the cutover checklist as a markdown file in the repo before the first cutover, not after the first incident. The checklist has the comparison-window threshold, the dashboard URL, the flag name, the Zap ID, the rollback steps, and the post-cutover keep-alive duration. Every subsequent cutover uses the same checklist with the workflow-specific values filled in. The checklist is the artifact that turns a one-off migration into a repeatable playbook, and the repeatable playbook is what lets a team migrate fifteen Zaps in a quarter instead of one Zap a quarter.

Outcomes you should expect

What this delivers

  • Cut the source-to-action latency on the migrated workflow from a 1-15 minute Zapier poll to seconds via a native webhook subscription.
  • Move one trigger at a time with a documented rollback (re-enable the Zap) so the cutover risk per Zap is bounded to a single workflow, not the whole automation stack.
  • Catch behavioral divergence before customers do — the parallel-run window compares the Zap output and the new code output for 7 days and surfaces drift while both paths are still writing.
  • Stop paying for tasks on workflows the custom code already handles — once a Zap is retired, the per-operation cost goes to zero for that path while the source webhook fires for free.

Primary sources

By the numbers

  • Martin Fowler's Strangler Fig Application pattern describes incremental modernization where new functionality is built separately from the legacy system and behavior is gradually moved into the new code until the legacy code can be retired.

  • Zapier's polling triggers check the source app on a fixed cycle that varies by plan — 15 minutes on the Free plan, with shorter intervals on paid plans down to 1 minute on Team and Company tiers.

  • Custom polling intervals are an advanced setting on higher Zapier plans, with selectable values between 1 and 15 minutes — meaning the floor for Zapier-polled workflows is one-minute latency, not real-time.

  • Stripe attempts to deliver webhook events to your destination for up to three days with an exponential back off in live mode, and does not guarantee delivery in the order events are generated.

  • Stripe documents that webhook endpoints might occasionally receive the same event more than once and recommends guarding against duplicates by logging the event IDs you have processed and skipping already-logged events.

Live in production today

The same engineering, shipped in production at BFEAI.

I'm co-founder & CTO of Be Found Everywhere (BFEAI), a 7-app AI SaaS platform running today. The work I deliver for clients is the work I do every week on my own platform.

  • 7 production apps
  • 200K+ keywords generated
  • 1,500+ AI scans run
  • 7,000+ sites automated

Common questions

What buyers ask before reaching out

What is the strangler fig pattern in the context of Zapier migration?

The strangler fig pattern, named by Martin Fowler after the vine that gradually replaces its host tree, is a way to retire a legacy system by building the replacement alongside it and moving traffic one piece at a time. Applied to Zapier, you keep the existing Zap running while you stand up a custom-code path that consumes the same source event. Both run in parallel for a comparison window, and you only delete the Zap once the new path has matched its output for a documented period.

How long should I run the Zap and the custom code in parallel?

The default I use is 7 days of clean parallel running, where 'clean' means output match on every triggered execution with zero unexplained divergences. Seven days catches the weekly business cycle — Monday-morning surges, Friday-afternoon edge cases, weekend low-volume runs — without dragging the migration out for months. For workflows that only fire weekly or monthly, the window has to cover at least two full natural cycles instead of seven days, because you need the comparison to span repeated executions of the trigger, not calendar time.

How do I compare the Zap output to the custom-code output without duplicating side effects?

Route the new code through a shadow path that does everything except the final external side effect — the Slack post, the email send, the Stripe charge. The shadow handler computes what it would have done, writes the proposed action to a comparison table, and then exits. A reconciliation job diffs the actions the Zap actually took against the actions the shadow handler proposed. Only after the diff is clean for the comparison window does the shadow path get promoted to the live path and the Zap gets disabled.

What does the rollback path look like if the new code fails after cutover?

Cutover is a single config flip — a feature flag, an environment variable, or a row in a routing table — that points the trigger at either the Zap or the custom code. Rollback is the same flip in reverse. Because the Zap is not deleted at cutover, only disabled, re-enabling it is one Zapier dashboard action, and the source-to-action path is restored within one polling cycle. The Zap only gets deleted after a documented stability window post-cutover, typically another 7-14 days.

What if the source app does not support native webhooks?

Then you replace the polling Zap with a custom cron job that hits the same endpoint Zapier was polling, on the same or shorter interval, but with your own dedupe and error handling. You do not get the real-time benefit, but you get the cost, reliability, and observability benefits — the per-run cost moves from Zapier task pricing to your own compute, and the failures show up in your logs instead of an email alert. For sources that support webhooks but require setup work, the webhook subscription is part of the migration, not a follow-up.

How do I prevent the new code from double-firing if the cutover happens mid-event?

Both paths write to a shared idempotency table keyed on the source event ID before performing any side effect. If the Zap has already processed event_id X and the new path receives the same event_id X after cutover, the insert collides and the side effect is short-circuited. The dedupe key is whatever uniquely identifies the source event — a Stripe event.id, a webhook delivery ID, a database row UUID. The shared table is the safety net that makes cutover non-atomic without making it dangerous.

Why migrate Zapier polling to webhooks if the polling cycle is already short on paid plans?

Even at the 1-minute minimum on Team and Company plans, you are paying per task for a trigger the source app would push to you for free over a webhook. A workflow that fires 50,000 times a month at one task per fire is 50,000 tasks of Zapier billing for a path that, on a custom webhook, costs the compute to process the payload and nothing else. The latency improvement is real (seconds instead of up to a full one-minute polling interval), but the cost and observability improvements are usually the bigger numbers on the spreadsheet.

How do I sequence multi-Zap migrations under the strangler fig pattern?

Start with the Zap that has the highest cost or the worst failure mode and the cleanest output you can diff. That gives you the migration playbook with the highest signal-to-noise. Once one cutover is in production and the rollback path has been exercised, every subsequent migration uses the same shadow-table, comparison-window, feature-flag, post-cutover-keep-alive sequence. The order is risk-weighted, not arbitrary — leave the workflows with murky outputs or low business stakes for last, because the playbook will be sharpest by then.
