Stripe Dunning Recovery: A State Machine That Survives Real Failures

An engineering pattern for the explicit state machine behind failed payment recovery — from first decline through Smart Retries to grace period to cancellation.

The problem

The leak shows up in the cohort retention chart. Customers who signed up in January are at 87% by month two. Customers who signed up in February are at 79%. Nothing changed in onboarding, nothing changed in product, and the support queue does not show a spike. Then someone pulls the subscription data and the answer is sitting there: a chunk of those February signups churned because a payment failed and the app responded by cutting them off immediately, or did nothing for three weeks and then sent a single "your account has been canceled" email that they did not open.

Default behavior masquerading as a designed flow

Stripe failed-payment handling is one of those features where the default behavior is almost good enough that teams forget to design the actual flow. The webhook fires, somebody wires it to a Slack channel, and for the first six months everything looks fine because the retry succeeded on its own. Then a real edge case happens — a hard decline on a card that will not authorize again, a soft decline during a customer's billing-card-in-transit week, a subscription that stays in past_due for two months while accruing invoices nobody is collecting on — and the gap between what Stripe is doing and what the application is doing becomes a finance problem.

Hidden cost beyond involuntary churn

The hidden cost is bigger than the visible cost. The visible cost is the involuntary churn — customers who would have happily kept paying if the recovery flow had given them a chance to update a card. The hidden costs are the false negatives: subscriptions in unpaid state where the product still works because the access check was wired to a stale cache, customers in past_due seeing zero in-app messaging because the front end reads its own subscription mirror that is six hours behind, and the periodic refund-and-apology when a customer notices three months later that they were billed during a window the team thought was canceled.

Three decisions, one source of truth

What makes this hard is that the symptom (a churned customer) is downstream of three different decisions — when to retry, what to say in-app during retries, and when to actually revoke access. Each one has to read from the same source of truth or they drift, and the source of truth is Stripe's subscription.status combined with the decline code on the most recent invoice attempt. The pattern below makes that source of truth explicit and gives every other system in the stack — billing emails, product access, in-app banners, finance reports — exactly one place to read it from.

What changes for your business

Model dunning as an explicit state machine that wraps Stripe's subscription lifecycle. The states are the ones Stripe already publishes — active, past_due, unpaid, canceled — and the transitions are driven by Stripe webhooks landing in your handler. The wrapper exists for one reason: every other part of the application reads from the wrapper, not from a cached subscription object scattered across services. When Stripe transitions a subscription, exactly one row in your database updates, and every downstream surface (in-app banner, access middleware, dunning email scheduler, finance audit table) reads from that row.

Subscription state table and transitions log

Start with the state table. It mirrors Stripe but carries the extra context you need to drive messaging and access decisions:

CREATE TABLE subscription_state (
  customer_id              text PRIMARY KEY,
  stripe_subscription_id   text NOT NULL UNIQUE,
  status                   text NOT NULL,        -- mirrors Stripe subscription.status
  previous_status          text,
  status_changed_at        timestamptz NOT NULL,
  last_invoice_id          text,
  last_decline_code        text,                 -- e.g. 'expired_card', 'insufficient_funds'
  last_decline_category    text,                 -- 'soft' | 'hard' | 'authentication' | 'none'
  next_retry_at            timestamptz,          -- from invoice.next_payment_attempt
  retry_attempt_count      int  NOT NULL DEFAULT 0,
  grace_period_ends_at     timestamptz,          -- set when entering past_due
  access_level             text NOT NULL,        -- 'full' | 'limited' | 'revoked'
  updated_at               timestamptz NOT NULL DEFAULT now()
);

CREATE TABLE subscription_state_transitions (
  id                       bigserial PRIMARY KEY,
  customer_id              text NOT NULL,
  from_status              text,
  to_status                text NOT NULL,
  trigger_event_id         text NOT NULL,        -- Stripe event.id that caused the transition
  trigger_event_type       text NOT NULL,
  decline_code             text,
  occurred_at              timestamptz NOT NULL,
  metadata                 jsonb
);

Append-only transitions ledger

The transitions table is append-only. Every state change is one row. That table is what finance reads when they want to know "how many customers went from past_due to canceled last month and what was the decline code on each one." Without it, you are reconstructing that answer from invoice metadata and Stripe logs, and the answer is usually a guess.

Knowable status inputs from Stripe

The Stripe documentation is explicit about which subscription statuses exist and what causes each transition, so the state machine has a small, knowable set of inputs. invoice.payment_failed is the canonical trigger for active to past_due. The Dashboard setting under Billing > Revenue recovery determines what happens once retries are exhausted: the subscription moves from past_due to unpaid, moves from past_due to canceled, or stays past_due. Stripe's recommended Smart Retries default is eight attempts within two weeks, and the default outcome after those attempts fail is automatic cancellation. That outcome is a per-account setting, so your code has to handle whichever one is configured.

Webhook handler as single write path

The webhook handler is the single entry point. It receives the event, classifies the transition, updates the state row in one transaction with the transitions audit row, and emits a domain event for downstream consumers:

type SubscriptionStatus =
  | "active"
  | "past_due"
  | "unpaid"
  | "canceled"
  | "incomplete"
  | "incomplete_expired"
  | "trialing"
  | "paused";

type DeclineCategory = "soft" | "hard" | "authentication" | "none";

const HARD_DECLINE_CODES = new Set([
  "expired_card",
  "incorrect_number",
  "incorrect_cvc",
  "incorrect_zip",
  "incorrect_pin",
  "stolen_card",
  "lost_card",
  "restricted_card",
  "invalid_account",
  "card_not_supported",
]);

const SOFT_DECLINE_CODES = new Set([
  "insufficient_funds",
  "card_velocity_exceeded",
  "processing_error",
  "issuer_not_available",
  "reenter_transaction",
]);

function classifyDecline(code: string | null): DeclineCategory {
  if (!code) return "none";
  if (HARD_DECLINE_CODES.has(code)) return "hard";
  if (SOFT_DECLINE_CODES.has(code)) return "soft";
  if (code === "authentication_required" || code === "authentication_not_handled") {
    return "authentication";
  }
  // Unknown codes are treated as soft by default — Stripe will retry, and a
  // hard-decline misclassification is worse than a soft one.
  return "soft";
}

async function handleInvoicePaymentFailed(
  event: Stripe.Event,
  invoice: Stripe.Invoice,
): Promise<void> {
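  const subscriptionId = invoice.subscription as string;
  // The decline reason for a failed invoice payment lives on the
  // PaymentIntent's last_payment_error. This assumes an API version where
  // invoice.subscription and invoice.payment_intent exist, and that the
  // invoice was re-fetched with expand: ["payment_intent"], because webhook
  // payloads carry only the PaymentIntent ID.
  const paymentIntent = typeof invoice.payment_intent === "object"
    ? invoice.payment_intent
    : null;
  const lastError = paymentIntent?.last_payment_error;
  // Prefer the issuer's decline_code; fall back to the general error code.
  const declineCode = lastError?.decline_code ?? lastError?.code ?? null;
  const category = classifyDecline(declineCode);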

  await db.transaction(async (tx) => {
    const inserted = await tx.processedEvents.insertIfAbsent({
      event_id: event.id,
      received_at: new Date(),
    });
    if (!inserted) return;

    const current = await tx.subscriptionState.findByStripeId(subscriptionId);
    if (!current) return;

    // A hard decline means continuing to retry against this card is wasted
    // budget. Mark the row so the dunning email scheduler uses the
    // card-update template instead of the generic retry template.
    const nextRetryAt = category === "hard"
      ? null
      : invoice.next_payment_attempt
        ? new Date(invoice.next_payment_attempt * 1000)
        : null;

    await tx.subscriptionState.update(current.customer_id, {
      status: "past_due",
      previous_status: current.status,
      status_changed_at: new Date(),
      last_invoice_id: invoice.id,
      last_decline_code: declineCode,
      last_decline_category: category,
      next_retry_at: nextRetryAt,
      retry_attempt_count: current.retry_attempt_count + 1,
      grace_period_ends_at: current.grace_period_ends_at
        ?? addDays(new Date(), 14), // mirrors default 2-week Smart Retries window
      access_level: "limited",
      updated_at: new Date(),
    });

    await tx.subscriptionStateTransitions.insert({
      customer_id: current.customer_id,
      from_status: current.status,
      to_status: "past_due",
      trigger_event_id: event.id,
      trigger_event_type: event.type,
      decline_code: declineCode,
      occurred_at: new Date(),
      metadata: { invoice_id: invoice.id, attempt: invoice.attempt_count },
    });

    await emitDomainEvent({
      type: "subscription.payment_failed",
      customer_id: current.customer_id,
      decline_category: category,
      attempt: invoice.attempt_count,
    });
  });
}

Access level as a derived column

Notice the access_level field. It is stored on the state row, but it is always derived from status inside the same transaction that writes status; nothing sets it independently. The middleware that gates product features reads access_level and does not read Stripe's subscription object directly. When a customer transitions to past_due they get limited (read-only, banner shown). When they transition to unpaid or canceled they get revoked (hard wall, card-update CTA). When they pay and Stripe sends customer.subscription.updated with status back to active, the handler sets access_level to full in the same transaction. There is no other path that writes to access_level. That constraint is what makes the system safe to reason about: every access decision is one column read.
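One way to keep that single write path honest is to put the status-to-access mapping in one pure function that the handler calls whenever it writes status. The sketch below is illustrative only: accessLevelFor is a hypothetical name, and the treatment of trialing, incomplete, incomplete_expired, and paused is an assumption that the text above does not specify.

```typescript
type SubscriptionStatus =
  | "active"
  | "past_due"
  | "unpaid"
  | "canceled"
  | "incomplete"
  | "incomplete_expired"
  | "trialing"
  | "paused";

type AccessLevel = "full" | "limited" | "revoked";

// Hypothetical helper: the only place where a status becomes an access level.
function accessLevelFor(status: SubscriptionStatus): AccessLevel {
  switch (status) {
    case "active":
    case "trialing": // assumption: trials get full access
      return "full";
    case "past_due":
      return "limited"; // read-only, banner shown
    case "unpaid":
    case "canceled":
      return "revoked"; // hard wall, card-update CTA
    case "incomplete":
    case "incomplete_expired":
    case "paused": // assumption: no access until a payment succeeds
      return "revoked";
  }
}
```

Because the switch is exhaustive over SubscriptionStatus, adding a new status to the union fails type-checking until someone decides its access level.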

Email scheduler keyed on state

The dunning email scheduler reads from the state table on a cron. For each customer in past_due with next_retry_at in the future, it sends the "we will retry on X" email at a sensible cadence (typically 24 hours before the next retry and immediately after a retry fails). For customers in past_due with last_decline_category = 'hard', it sends the card-update email instead — there is no point waiting for a retry that will not succeed. For customers transitioning into unpaid, it sends the "your access has been suspended" email and the in-app middleware starts returning the wall. For customers transitioning into canceled, it sends the reactivation email with the re-subscribe link.
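The template choice described above can be sketched as a pure function over the state row. This is a minimal illustration, not the production scheduler: the template names and the pickDunningTemplate helper are hypothetical, and cadence and de-duplication are deliberately left out.

```typescript
type DeclineCategory = "soft" | "hard" | "authentication" | "none";

interface DunningRow {
  status: "active" | "past_due" | "unpaid" | "canceled";
  last_decline_category: DeclineCategory;
  next_retry_at: Date | null;
}

// Hypothetical template identifiers.
type Template =
  | "retry_notice"
  | "card_update"
  | "access_suspended"
  | "reactivation";

function pickDunningTemplate(row: DunningRow, now: Date): Template | null {
  switch (row.status) {
    case "past_due":
      // A hard decline will not clear on retry, so ask for a new card.
      if (row.last_decline_category === "hard") return "card_update";
      // Otherwise only announce a retry that is actually scheduled.
      if (row.next_retry_at !== null && row.next_retry_at.getTime() > now.getTime()) {
        return "retry_notice";
      }
      return null;
    case "unpaid":
      return "access_suspended";
    case "canceled":
      return "reactivation";
    case "active":
      return null;
  }
}
```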

More on this

Common failure modes

The first one is treating the first failed payment as a customer crisis. A first failure is overwhelmingly a soft decline, and Smart Retries — Stripe's AI-driven retry layer — will recover most of them silently within the retry window. Acting on the first failure means surfacing a panic message to a customer whose card was going to clear on the next attempt anyway. The state machine's grace_period_ends_at field exists to enforce this: nothing user-visible escalates past the "limited / informational banner" tier until the grace period closes or Stripe transitions the subscription past past_due on its own.

The second is reading subscription.status from a cached mirror instead of the state row. Teams will sync Stripe subscriptions into their own data warehouse for analytics, and over time some application code starts reading from the warehouse mirror because it is faster. The mirror is six hours behind the webhook. A customer pays at 14:00, the webhook fires at 14:00:02, the state row updates at 14:00:03, and the product gives them access at 14:00:03. The warehouse mirror updates at 20:00. Any code reading from the mirror tells the customer they are still suspended for six hours after they paid. Pick one source of truth — the state row — and route every read through it.

The third is mis-classifying decline codes. The decline_code on an invoice is not consistently the most useful field; sometimes the underlying charge failure_code carries the real reason while the invoice surface shows a generic card_declined. The classifier has to inspect both, and the default behavior on an unknown code should be soft (let Stripe retry) rather than hard (stop retrying). A hard-decline misclassification means you stopped trying on a card that would have cleared; a soft-decline misclassification means Stripe burns one extra retry attempt against a dead card, which costs almost nothing.
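A small sketch of the "inspect both" rule: given the code seen on the invoice side and the failure code from the underlying charge, prefer whichever one is specific. pickDeclineCode is a hypothetical helper, and treating card_declined and generic_decline as the generic codes is an assumption to adjust against your own decline data.

```typescript
// Codes that say "declined" without saying why.
const GENERIC_CODES = new Set(["card_declined", "generic_decline"]);

// Hypothetical helper: returns the most specific code available, or null.
function pickDeclineCode(
  invoiceCode: string | null,
  chargeFailureCode: string | null,
): string | null {
  const candidates = [invoiceCode, chargeFailureCode].filter(
    (c): c is string => c !== null && c !== "",
  );
  const specific = candidates.find((c) => !GENERIC_CODES.has(c));
  // Fall back to a generic code rather than dropping the signal entirely.
  return specific ?? candidates[0] ?? null;
}
```

The result then feeds the classifier, which still falls back to soft for anything it does not recognize.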

The fourth is treating a scheduled cancellation as an immediate one. Setting cancel_at_period_end = true lets the subscription run through the period the customer already paid for; Stripe sends customer.subscription.updated when the cancellation is scheduled and only sends customer.subscription.deleted once the subscription actually ends. If the state machine revokes access as soon as it sees cancel_at_period_end = true or a future cancel_at on an update event, customers who scheduled an end-of-period cancellation lose access early. The fix is to read cancel_at and cancel_at_period_end from the subscription on every cancellation-related event and only revoke when the cancellation is actually effective.
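The distinction between a scheduled and an effective cancellation can be written as two predicates over the subscription fields. This is a sketch under assumptions: SubscriptionSnapshot is a hypothetical local type that carries only the Stripe subscription fields used here (status, cancel_at_period_end, cancel_at, ended_at), with timestamps in Unix seconds.

```typescript
interface SubscriptionSnapshot {
  status: string;
  cancel_at_period_end: boolean;
  cancel_at: number | null; // Unix seconds
  ended_at: number | null; // Unix seconds
}

// True only once the subscription has actually ended.
function isCancellationEffective(sub: SubscriptionSnapshot, nowSeconds: number): boolean {
  if (sub.status === "canceled") return true;
  return sub.ended_at !== null && sub.ended_at <= nowSeconds;
}

// True while a cancellation is pending and the paid period is still running.
function isCancellationScheduled(sub: SubscriptionSnapshot, nowSeconds: number): boolean {
  if (isCancellationEffective(sub, nowSeconds)) return false;
  return sub.cancel_at_period_end || sub.cancel_at !== null;
}
```

Access is revoked only when isCancellationEffective returns true; a scheduled cancellation is a reason to change the banner, not the access level.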

The fifth is the incomplete-vs-incomplete_expired confusion. A new subscription that fails its first payment lands in incomplete and the customer has a 23-hour window to make a successful payment before it moves to incomplete_expired. Those subscriptions do not bill the customer — they exist in Stripe but they are not collectible. The state machine has to know that incomplete_expired is terminal for new-signup flows and should route the customer back to the onboarding payment step, not into the dunning recovery flow that exists for established subscriptions.

The sixth is failing to drain in-flight webhooks before a manual intervention. Support cancels a subscription manually in the Stripe Dashboard. The dashboard fires the cancellation webhook, but before it lands, an in-flight invoice.payment_failed from earlier in the day arrives first. The state machine transitions to past_due, then milliseconds later transitions to canceled. The downstream consumers see both events and may send two emails. Make every state transition idempotent on the trigger event ID (the processedEvents dedupe is the mechanism) and order downstream emails on the resulting state, not on every transition that produced it.
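The event-ID dedupe covers duplicate deliveries; late or out-of-order deliveries need a second guard. One possible sketch compares the Stripe event's created timestamp with the last one applied to the row. Both last_event_created (a column the schema above does not have) and the shouldApply helper are hypothetical, and reactivation is assumed to go through its own write path rather than this check.

```typescript
interface IncomingEvent {
  id: string;
  created: number; // Stripe event.created, Unix seconds
  targetStatus: string; // status this event would write
}

interface RowVersion {
  status: string;
  last_event_created: number | null; // hypothetical extra column
}

function shouldApply(event: IncomingEvent, row: RowVersion): boolean {
  // canceled is terminal for a subscription, so a late payment_failed
  // must not drag the row back to past_due.
  if (row.status === "canceled" && event.targetStatus !== "canceled") return false;
  // Skip anything older than the newest event already applied.
  if (row.last_event_created !== null && event.created < row.last_event_created) {
    return false;
  }
  return true;
}
```

Events created in the same second compare as equal and are applied, which is why the terminal-state check comes first.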

What this looks like in production

At BFEAI we run a usage-based credit billing model on top of Stripe subscriptions, which means failed payments are not just a future-revenue problem — they are a current-quota problem. A customer in past_due is a customer who might be actively using credits they have not paid for, and the state machine has to make the access decision in milliseconds on every API call. The state row is the read path: one indexed lookup by customer_id, one column check on access_level, decision made. That column is the only thing the request path needs to know.

The dashboard a CTO wants on top of this has four numbers. Customers in past_due (the current dunning pool). Customers in past_due with last_decline_category = 'hard' (cannot recover without customer action). Recovery rate over the last 30 days, defined as the count of past_due to active transitions divided by the count of active to past_due transitions in the same window. And cancellation lead time, defined as median hours between the first failed payment and the eventual canceled transition for cohorts that ended in cancellation. Those four numbers tell you whether the recovery flow is working and where the leak is.
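The two derived metrics can be computed directly from transitions rows. The sketch below works on rows already loaded into memory and uses hypothetical function names; in practice the same logic would more likely be a SQL query against subscription_state_transitions. It assumes the first transition into past_due marks the first failed payment.

```typescript
interface Transition {
  customer_id: string;
  from_status: string | null;
  to_status: string;
  occurred_at: Date;
}

// past_due -> active count divided by active -> past_due count in [start, end).
function recoveryRate(rows: Transition[], start: Date, end: Date): number | null {
  let entered = 0;
  let recovered = 0;
  for (const r of rows) {
    const t = r.occurred_at.getTime();
    if (t < start.getTime() || t >= end.getTime()) continue;
    if (r.from_status === "active" && r.to_status === "past_due") entered += 1;
    if (r.from_status === "past_due" && r.to_status === "active") recovered += 1;
  }
  return entered === 0 ? null : recovered / entered;
}

// Median hours from first entry into past_due to the canceled transition.
function medianCancellationLeadHours(rows: Transition[]): number | null {
  const firstFailure = new Map<string, number>();
  const canceledAt = new Map<string, number>();
  for (const r of rows) {
    const t = r.occurred_at.getTime();
    if (r.to_status === "past_due") {
      const prev = firstFailure.get(r.customer_id);
      if (prev === undefined || t < prev) firstFailure.set(r.customer_id, t);
    } else if (r.to_status === "canceled") {
      canceledAt.set(r.customer_id, t);
    }
  }
  const hours: number[] = [];
  canceledAt.forEach((end, customer) => {
    const start = firstFailure.get(customer);
    if (start !== undefined && end >= start) hours.push((end - start) / 3600000);
  });
  if (hours.length === 0) return null;
  hours.sort((a, b) => a - b);
  const mid = Math.floor(hours.length / 2);
  const upper = hours[mid] as number;
  if (hours.length % 2 === 1) return upper;
  return ((hours[mid - 1] as number) + upper) / 2;
}
```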

The recovery-rate number is the one that drives product decisions. If it is above 70%, Smart Retries plus the in-app messaging are doing their job. If it sits below 50%, something is structurally wrong — either the in-app banner is invisible to the customer, the card-update flow is too many clicks, or the decline-code classifier is misrouting customers who could have updated their cards into a generic retry loop. The metric is downstream of all three; investigation starts by segmenting the recovery rate by decline_code and looking for the bucket where the rate collapses.

The alert that pages a human is not "a payment failed" — payments fail constantly and Smart Retries handles it. The alert that matters is "more than N customers transitioned from past_due to canceled in the last hour" — a spike in canceled-from-past_due means either Stripe is having a regional incident, a payment processor is offline for a particular card brand, or the team just shipped a change that broke the recovery flow. The state machine's transitions table makes that query trivial; without it, you are joining invoice events across Stripe webhooks and your application logs to reconstruct the same answer.
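As an illustration of how small that check is once the transitions table exists, here is a sketch over in-memory rows. The function name and the threshold parameter are hypothetical; a real deployment would run the equivalent query on a schedule and page when it returns true.

```typescript
interface TransitionEvent {
  from_status: string | null;
  to_status: string;
  occurred_at: Date;
}

// True when more than `threshold` customers went past_due -> canceled
// in the hour ending at `now`.
function canceledFromPastDueSpike(
  rows: TransitionEvent[],
  now: Date,
  threshold: number,
): boolean {
  const cutoff = now.getTime() - 60 * 60 * 1000;
  const count = rows.filter((r) => {
    const t = r.occurred_at.getTime();
    return (
      r.from_status === "past_due" &&
      r.to_status === "canceled" &&
      t > cutoff &&
      t <= now.getTime()
    );
  }).length;
  return count > threshold;
}
```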

The emails are templated against the state transitions. There is no email-sending code anywhere in the webhook handler. The handler updates the state row and writes the transitions audit row. A separate worker reads the transitions table on a cron, decides whether the transition warrants an email based on the from-to pair and the decline category, and sends through the transactional email provider with idempotency keyed on (customer_id, transition_id, template). That separation matters: when the email provider has an outage, the state machine keeps working and the emails catch up when the provider comes back. When the state machine has a bug, the emails do not get sent for transitions that should not have happened.
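The idempotency key is what makes catch-up after a provider outage safe. A minimal sketch of the worker's de-duplication step follows; the names are hypothetical, and an in-memory Set stands in for whatever durable store or provider-side idempotency mechanism you actually use.

```typescript
interface EmailCandidate {
  customer_id: string;
  transition_id: number;
  template: string;
}

// Key on (customer_id, transition_id, template), as described above.
function idempotencyKey(c: EmailCandidate): string {
  return `${c.customer_id}:${c.transition_id}:${c.template}`;
}

// Returns only the candidates not sent before, and records them as sent.
function planSends(
  candidates: EmailCandidate[],
  alreadySent: Set<string>,
): EmailCandidate[] {
  const toSend: EmailCandidate[] = [];
  for (const c of candidates) {
    const key = idempotencyKey(c);
    if (alreadySent.has(key)) continue;
    alreadySent.add(key);
    toSend.push(c);
  }
  return toSend;
}
```

Running the worker twice over the same transitions produces sends only on the first pass, so a retry after a crash cannot double-email a customer.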

The reactivation flow on the canceled side is its own design problem worth calling out. A canceled subscription in Stripe is terminal — you cannot un-cancel it. The reactivation flow has to create a new subscription, which means the in-app flow that lets a returning customer "resume their plan" is actually creating a new Stripe subscription with the same price IDs and capturing a new payment method. The state row gets a new stripe_subscription_id, the access_level returns to full, and a transitions row is written from canceled to active with a synthetic trigger event tagged "manual_reactivation." That tag is what lets the cohort analysis later distinguish naturally-active customers from reactivated ones, which behave differently on churn.

What to watch in your own implementation

Open your codebase and search for direct reads of subscription.status outside the webhook handler. Every one of them is a potential drift source — a code path that might be reading a cached value instead of the state table. Replace each with a read from the state table or the access_level column derived from it. The migration is mechanical and the benefit is that you suddenly have one place to look when access behavior surprises you.

Then look at your handling of invoice.payment_failed. Does the handler distinguish hard from soft declines, or does every failure look the same? If the latter, you are burning Smart Retries budget against expired and stolen cards that will not authorize, and your customers are getting "we will retry" emails for cards that will never clear. Even a small classifier (the hard-decline set is short) moves a meaningful share of failures into the card-update flow where they belong.

Then look at your subscription cancellation path. Does the application revoke access as soon as a cancellation is scheduled (a customer.subscription.updated event with cancel_at_period_end = true or a future cancel_at), or does it wait until the cancellation is actually effective, which Stripe signals with customer.subscription.deleted? The former is the bug that quietly cuts off customers who paid through the period; the latter is the right behavior.

Finally, look at your past_due state. How many subscriptions are sitting in past_due right now and how long have they been there? Anything older than the configured retry window is either stuck (your Dashboard setting is "leave past_due" and you forgot you set it) or actively bleeding (Stripe is generating new invoices for a customer who is not paying). Either case is a number to take to your team. The state machine surfaces the question; the Dashboard setting determines the answer.
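A sketch of that audit over state rows, assuming status_changed_at records when the row first entered past_due. If your handler refreshes that timestamp on every failed attempt, use the first past_due transition from the transitions table instead. The function name and the retryWindowDays parameter are hypothetical.

```typescript
interface StateRowAge {
  customer_id: string;
  status: string;
  status_changed_at: Date;
}

// Lists past_due rows older than the configured retry window, oldest first.
function findStuckPastDue(
  rows: StateRowAge[],
  now: Date,
  retryWindowDays: number,
): { customer_id: string; daysInState: number }[] {
  const msPerDay = 24 * 60 * 60 * 1000;
  return rows
    .filter((r) => r.status === "past_due")
    .map((r) => ({
      customer_id: r.customer_id,
      daysInState: Math.floor((now.getTime() - r.status_changed_at.getTime()) / msPerDay),
    }))
    .filter((r) => r.daysInState > retryWindowDays)
    .sort((a, b) => b.daysInState - a.daysInState);
}
```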

Outcomes you should expect

What this delivers

  • Failed payments recover through Stripe's Smart Retries window without losing the customer to a single declined charge.
  • Subscription status transitions (active to past_due to unpaid or canceled) drive in-app messaging and access automatically — no manual ops involvement.
  • Hard declines like expired_card and stolen_card route to a card-update flow instead of burning retry budget against a card that will not succeed.
  • Finance gets a deterministic answer to 'when does revenue actually leak' because the dunning state machine writes one audit row per state transition per customer.

Primary sources

By the numbers

  • Stripe's default Smart Retries recommendation is 8 retry attempts within 2 weeks, with configurable windows of 1 week, 2 weeks, 3 weeks, 1 month, or 2 months.
  • Subscriptions cancel automatically after up to eight unsuccessful billing attempts by default, and the exact number is configurable in Dashboard subscription settings.
  • After Smart Retries are exhausted, a subscription transitions to canceled, unpaid, or remains past_due depending on the configured failure setting in the Dashboard.
  • A subscription enters past_due when payment on the latest finalized invoice fails or is not attempted, and it continues to create invoices while in that state.
  • When a subscription is marked unpaid, the latest invoice stays open, new invoices continue to generate, but Stripe stops attempting payment because retries already ran during past_due.
  • An incomplete subscription that does not receive a successful payment within 23 hours of creation moves to incomplete_expired, and those subscriptions do not bill the customer.
  • The canceled state is terminal: a canceled subscription cannot be updated except for metadata and cancellation_details, and resubscribing requires creating a new subscription with new payment information.

Live in production today

The same engineering, shipped in production at BFEAI.

I'm co-founder & CTO of Be Found Everywhere (BFEAI), a 7-app AI SaaS platform running today. The work I deliver for clients is the work I do every week on my own platform.

  • 7 production apps
  • 200K+ keywords generated
  • 1,500+ AI scans run
  • 7,000+ sites automated

Common questions

What buyers ask before reaching out

What is the difference between past_due and unpaid in Stripe?

Both mean the customer has not paid. The difference is what Stripe is doing about it. In past_due, Stripe is still actively retrying the latest invoice under your Smart Retries schedule and new billing periods continue to generate invoices. In unpaid, retries are done — Stripe has exhausted the configured attempts, the latest invoice stays open, and new invoices keep generating but no payment is attempted. unpaid is the signal to revoke product access; past_due is the signal to nudge the customer.

How many times will Stripe retry a failed subscription payment by default?

The default Smart Retries recommendation is 8 attempts spread across 2 weeks, and the documented configurable windows are 1 week, 2 weeks, 3 weeks, 1 month, or 2 months. You can also disable Smart Retries entirely and configure up to 3 custom retries with explicit day offsets between attempts. Most SaaS teams I see start on the default and only move to custom schedules once they have decline-code data showing the AI is over- or under-retrying their particular customer base.

Should I cancel a subscription on the first failed payment?

No. A first failure is overwhelmingly a soft decline — insufficient funds today, an issuer that is briefly unreachable, a network 3DS handshake that did not complete. Canceling on the first failure throws away the revenue that Smart Retries would have recovered for you. The right behavior is to enter past_due, let the retry schedule run, change the in-app messaging, and only act once Stripe transitions the subscription to unpaid or canceled on its own.

What is the difference between a hard decline and a soft decline?

A soft decline is a temporary failure that may succeed on retry: insufficient_funds, processing_error, issuer_not_available. A hard decline is permanent: expired_card, incorrect_number, stolen_card, lost_card, restricted_card. Authentication failures such as authentication_required get their own category in the classifier, because the charge can only succeed after the customer completes authentication. Retrying a hard decline burns your retry budget against a card that will not authorize. The dunning state machine inspects the decline code on each attempt and, on a hard decline, skips the retry messaging and routes directly to a card-update flow.

Should I run my own retry schedule or use Smart Retries?

Start with Smart Retries. The behavioral signals — device count, time-of-day optimization, country-specific retry timing — are not signals you can replicate by hand. Move to a custom schedule when you have a specific reason: a regulated geography that requires specific notification cadence, a B2B contract that promises a particular grace period, or decline-code data showing the default schedule is misaligned with your customers. Even then, keep the same state machine on top — only the retry-trigger logic changes.

What happens to a subscription that stays in past_due forever?

It depends on the Dashboard setting under Billing > Revenue recovery. After the retry window closes, the subscription transitions to canceled, transitions to unpaid, or remains in past_due indefinitely while continuing to generate invoices. The default is to cancel after up to eight unsuccessful billing attempts. Most SaaS teams want canceled or unpaid as the terminal state — leaving a subscription in past_due forever means new invoices keep accruing for a customer who is not paying.

How should in-app messaging change as the customer moves through dunning?

The state determines the tone. In past_due, the message is informational — a banner that says 'your last payment did not go through, we will retry on date X.' In unpaid, the message is the access wall — the product is read-only or hidden behind a card-update CTA. In canceled, the message is reactivation — re-onboarding flow, new payment method capture, and a re-create-subscription call. Tying messages directly to subscription.status means the in-app state and the billing state cannot drift.

Why model dunning as a state machine instead of a series of if-statements in the webhook handler?

Because the webhook handler is reactive — it only knows about the event it just received. A state machine is the durable record of where each customer is in the recovery flow, which lets you ask 'how many customers are currently in past_due with one retry remaining' as a query instead of a guess. It also makes the in-app messaging, the email cadence, and the access-control checks all read from one source of truth instead of three subtly different copies of the same logic.
