Production engineering pattern

LLM Model Fallback Chain: Stay Up When Your Primary Model Is Down

An engineering pattern for keeping LLM-powered features responsive when Claude is overloaded, OpenAI is rate-limited, or a single model is misbehaving — built around error classification, ordered failover, and per-call spend caps.

The problem

The incident usually looks the same. A customer kicks off a feature that depends on an LLM call — a summarizer, an agent, a chatbot. Your code POSTs to Anthropic. Anthropic returns a 529 because the API is temporarily overloaded across all users. Your SDK wrapper bubbles a 500 to the frontend. Your dashboard lights up. The on-call engineer opens the runbook, which says "wait for Anthropic to recover," and that is the entire mitigation. The feature is down. The customer files a ticket. You ship a Slack post-mortem two hours later that admits there was nothing you could do.

Single provider as single point of failure

The bug is not Anthropic being overloaded. The bug is that your application treated a single provider as a single point of failure. Anthropic documents 529 explicitly as an overload condition that can occur when APIs experience high traffic across all users — it is a normal operating state of a shared service, not an edge case. OpenAI returns 503 with similar semantics. Any production SaaS that calls one model from one provider and treats the call as "it works or the page is broken" has the same bug, regardless of which provider it picked.

Rate-limit and timeout blast radius

The same shape shows up at smaller scales. A burst of usage from one tenant pushes you over the per-model RPM limit on Claude Opus and you get a 429 with a retry-after header. The header says wait 14 seconds. Every customer call in flight is now blocked for 14 seconds, your latency P99 doubles, and the support queue fills with "is the app slow?" messages. A network blip causes a 504 timeout on a long agent call. Your retry layer fires the same call again, doubling your spend on that one request. The user gets the same delay either way.

Compounded failure probability

The business consequence is that your LLM feature is at best as available as the least available of: the model you picked, the provider's API, your account's rate-limit tier, and your retry policy. All four have to work for a call to succeed, so their availabilities multiply and the combined failure probability is higher than that of any single one. The pattern below gets that failure probability back down by trading "depend on one model" for "have a graded, evaluated, spend-capped sequence of models you can route to under load."
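The arithmetic behind that claim can be sketched in a few lines. The availability figures below are hypothetical, chosen only to show the shape of the calculation, and the sketch assumes the failures are independent, which is optimistic when two fallback steps share a provider.

```typescript
// Hypothetical availabilities, for illustration only.
const dependencies: Record<string, number> = {
  model: 0.999,
  providerApi: 0.998,
  rateLimitTier: 0.995,
  retryPolicy: 0.999,
};

// Serial dependencies: the call succeeds only if every one of them works,
// so the availabilities multiply.
function serialAvailability(parts: number[]): number {
  return parts.reduce((acc, a) => acc * a, 1);
}

// Independent alternatives: the call fails only if every alternative fails,
// so the failure probabilities multiply instead.
function withFallbacks(alternatives: number[]): number {
  return 1 - alternatives.reduce((acc, a) => acc * (1 - a), 1);
}

// About 0.991: lower than the weakest single dependency (0.995).
const single = serialAvailability(Object.values(dependencies));

// Three independent steps of that quality: failure is roughly 0.009 cubed.
const chained = withFallbacks([single, single, single]);
```

The useful observation is the direction of the two products: stacking dependencies in series pushes availability down, while adding independent alternatives pushes the failure probability down.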


What changes for your business

The architecture has four pieces: an error classifier that knows the difference between rate-limit, overload, timeout, server error, and content filter; an ordered chain configuration that says "if the primary fails this way, try this next"; a per-call budget enforcer that caps tokens and wall-clock time across the whole chain; and an eval harness that grades each model on a frozen task set before it enters the chain. They live behind a single TypeScript wrapper that replaces direct SDK calls everywhere in the codebase.

Error classifier normalizing provider codes

Start with the error classifier. Different providers return different status codes for what is functionally the same condition, and the chain needs to branch on the condition, not the wire format. Anthropic returns 429 for rate-limit, 529 for overloaded, 504 for timeout, and 500 for an internal error. OpenAI returns 429 for rate-limit and 503 for overloaded. The classifier normalizes these into a small enum the chain can act on:

type FailureClass =
  | "rate_limit"        // 429 — provider's capacity for *your account*
  | "overloaded"        // 529 (Anthropic) / 503 (OpenAI) — provider-wide
  | "timeout"           // 504 or client-side abort
  | "server_error"      // 500 or other 5xx
  | "content_filter"    // refused by safety layer — terminal, do not retry
  | "invalid_request";  // 400 — your bug, do not retry

interface ClassifiedFailure {
  class: FailureClass;
  provider: "anthropic" | "openai";
  model: string;
  retryAfterSeconds?: number;   // populated from retry-after header on 429
  originalError: unknown;
}

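import Anthropic from "@anthropic-ai/sdk";

// The SDK error does not carry the model name, so the caller passes in the
// model of the step that was being attempted.
// classOf() and parseRetryAfter() are small local helpers, not SDK functions.
function classifyAnthropic(err: Anthropic.APIError, model: string): ClassifiedFailure {
  // Connection-level errors carry no HTTP status. Treat them as a timeout so
  // a network blip is retried instead of being reported as the caller's bug.
  if (err.status === undefined) return classOf("timeout", "anthropic", model, err);

  if (err.status === 429) {
    // Depending on the SDK version, headers is a plain object or a Headers
    // instance; this lookup handles both shapes.
    const headers: any = err.headers;
    const rawRetryAfter =
      typeof headers?.get === "function" ? headers.get("retry-after") : headers?.["retry-after"];
    return {
      class: "rate_limit",
      provider: "anthropic",
      model,
      retryAfterSeconds: parseRetryAfter(rawRetryAfter),
      originalError: err,
    };
  }
  if (err.status === 529) return classOf("overloaded", "anthropic", model, err);
  if (err.status === 504) return classOf("timeout", "anthropic", model, err);
  if (err.status >= 500)  return classOf("server_error", "anthropic", model, err);
  // Heuristic: a 400 whose message mentions content is treated as a refusal.
  if (err.status === 400 && /content/i.test(err.message)) {
    return classOf("content_filter", "anthropic", model, err);
  }
  return classOf("invalid_request", "anthropic", model, err);
}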
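The classifier above depends on a retry-after parser, and the text names the OpenAI status codes without showing their mapping. A minimal sketch of both follows. The function names are hypothetical, the OpenAI mapping uses only the status codes named in this article, and it deliberately ignores the error body (for example, a 429 caused by exhausted quota rather than request rate would need separate handling).

```typescript
type FailureClass =
  | "rate_limit" | "overloaded" | "timeout"
  | "server_error" | "content_filter" | "invalid_request";

// Retry-After is either a number of seconds or an HTTP date (RFC 9110).
// Returns whole seconds to wait, or undefined if the header is absent or unparseable.
function parseRetryAfter(
  value: string | null | undefined,
  nowMs: number = Date.now(),
): number | undefined {
  if (value == null || value.trim() === "") return undefined;
  const seconds = Number(value);
  if (Number.isFinite(seconds)) return Math.max(0, seconds);
  const dateMs = Date.parse(value);
  if (Number.isNaN(dateMs)) return undefined;
  return Math.max(0, Math.ceil((dateMs - nowMs) / 1000));
}

// Status-code-only mapping for OpenAI responses.
// `status` is undefined when the request never produced an HTTP response.
function classifyOpenAIStatus(status: number | undefined): FailureClass {
  if (status === undefined) return "timeout";
  if (status === 429) return "rate_limit";
  if (status === 503) return "overloaded";
  if (status === 504) return "timeout";
  if (status >= 500) return "server_error";
  return "invalid_request";
}
```

Keeping the mapping as a pure function of the status code makes it easy to unit test without mocking either SDK.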

Chain configuration as data

The chain configuration is data, not code. Each chain is an ordered list of attempts, and each attempt names a provider-model pair and a per-attempt token cap. It is data so that you can swap a chain without a deploy and so that each chain carries its eval score in version control:

interface ChainStep {
  provider: "anthropic" | "openai";
  model: string;
  maxOutputTokens: number;
  evalScoreOnFrozenSet: number;   // 0..1 — gates entry into the chain
}

interface ChainConfig {
  name: string;
  // Routing rules: failure class -> what the wrapper does next.
  // "stay" retries the same step after a wait, "next" advances to the
  // following entry in steps[], "terminal" stops and surfaces the error.
  routes: Record<FailureClass, "stay" | "next" | "terminal">;
  steps: ChainStep[];
  perCallBudget: {
    maxTotalTokens: number;
    maxWallClockMs: number;
    maxAttempts: number;
  };
}

const reasoningAgentChain: ChainConfig = {
  name: "reasoning-agent-v3",
  routes: {
    rate_limit:    "next",   // primary's pool is full → sibling/cross-provider
    overloaded:    "next",   // provider-wide outage → cross-provider
    timeout:       "stay",   // one slow request → one retry on same model
    server_error:  "next",   // sticky 500 → move on
    content_filter:"terminal",
    invalid_request:"terminal",
  },
  steps: [
    { provider: "anthropic", model: "claude-opus-4-8",   maxOutputTokens: 4000, evalScoreOnFrozenSet: 0.88 },
    { provider: "anthropic", model: "claude-sonnet-4-6", maxOutputTokens: 4000, evalScoreOnFrozenSet: 0.81 },
    { provider: "openai",    model: "gpt-4.1",           maxOutputTokens: 4000, evalScoreOnFrozenSet: 0.79 },
  ],
  perCallBudget: { maxTotalTokens: 12000, maxWallClockMs: 45000, maxAttempts: 4 },
};

Stay-vs-next routing decision

The "stay vs next" decision is the single most consequential thing in the chain config. Staying on the same step is correct when the failure is transient and local to one request, such as a timeout or a single 500. Moving to the next step is correct when the failure is a property of your account's pool or of the provider's whole infrastructure, such as a 429 with a retry-after of 14 seconds or a 529 overloaded. The retry-after header on a 429 is documented as a hard floor: earlier retries will fail. If the header says 14 seconds and your P95 latency SLO is 8 seconds, "stay" is actively the wrong answer. Waiting out the header already misses the SLO, so the request should be on a sibling pool well before the 14 seconds are up.
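One way to turn that reasoning into a rule is to compare the projected finish time against the feature's SLO. This is a sketch, not the routing logic of any particular library; `routeForRateLimit`, `sloMs`, and `typicalCallMs` are hypothetical names, and the latency estimate is something you would supply from your own metrics.

```typescript
type Route = "stay" | "next";

// Decide how to route a 429 for one feature.
// elapsedMs: time already spent on this logical request.
// typicalCallMs: assumed latency of one successful call on the same model.
function routeForRateLimit(
  retryAfterSeconds: number | undefined,
  sloMs: number,
  elapsedMs: number,
  typicalCallMs: number,
): Route {
  // No header means the wait cannot be bounded, so move on.
  if (retryAfterSeconds === undefined) return "next";
  const finishIfWeStayMs = elapsedMs + retryAfterSeconds * 1000 + typicalCallMs;
  return finishIfWeStayMs <= sloMs ? "stay" : "next";
}

// The example from the text: a 14 second retry-after against an 8 second SLO.
const realTimeChat = routeForRateLimit(14, 8_000, 500, 2_000);      // "next"
// A background job with a generous SLO can afford to wait.
const backgroundAgent = routeForRateLimit(14, 120_000, 500, 2_000); // "stay"
```

The same header value produces different routes for different features, which is the argument for keeping the route table per chain rather than global.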

Per-call budget enforcer

The wrapper is what every feature in the codebase calls. No code calls the Anthropic or OpenAI SDK directly. The wrapper enforces the per-call budget, runs the chain, and writes one log row per attempt:

async function llmCallWithChain(
  chain: ChainConfig,
  request: LLMRequest,
  ctx: { requestId: string; tenantId: string; feature: string },
): Promise<LLMResponse> {
  const callStartedAt = Date.now();
  let totalTokens = 0;
  let attemptCount = 0;
  let stepIdx = 0;
  let retriedCurrentStep = false;   // "stay" allows one retry per step

  while (true) {
    // Budget checks run before every attempt, whichever step we are on.
    if (attemptCount >= chain.perCallBudget.maxAttempts) {
      throw new ChainExhaustedError("max attempts", ctx);
    }
    if (Date.now() - callStartedAt >= chain.perCallBudget.maxWallClockMs) {
      throw new ChainExhaustedError("wall-clock budget", ctx);
    }
    if (totalTokens >= chain.perCallBudget.maxTotalTokens) {
      throw new ChainExhaustedError("token budget", ctx);
    }

    const step = chain.steps[stepIdx];
    if (!step) throw new ChainExhaustedError("end of chain", ctx);

    const attemptId = crypto.randomUUID();
    const attemptStartedAt = Date.now();   // per attempt, not per call
    attemptCount += 1;

    try {
      const result = await callProvider(step, request);
      totalTokens += result.usage.totalTokens;
      await logAttempt({
        ...ctx, attemptId, step, result, status: "ok",
        startedAt: attemptStartedAt, finishedAt: Date.now(),
      });
      return result;
    } catch (err) {
      const classified = classify(err, step);
      totalTokens += estimateTokensSpentOnError(step, request);
      await logAttempt({
        ...ctx, attemptId, step, error: classified, status: "failed",
        startedAt: attemptStartedAt, finishedAt: Date.now(),
      });

      const route = chain.routes[classified.class];
      if (route === "terminal") throw err;
      if (route === "stay" && !retriedCurrentStep) {
        retriedCurrentStep = true;
        await sleep(decideWait(classified));   // honor retry-after on 429
        continue;
      }
      // "next", or "stay" after the one allowed retry has also failed
      stepIdx += 1;
      retriedCurrentStep = false;
    }
  }
}
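The wrapper calls a `decideWait` helper that is not shown. One plausible shape for it is exponential backoff with jitter, floored by the retry-after value on a 429. The sketch below is an assumption about that helper, not its actual implementation: the name `decideWaitMs`, the base and cap values, and the injectable random source are all hypothetical.

```typescript
interface WaitInput {
  failureClass: "rate_limit" | "overloaded" | "timeout" | "server_error";
  retryAfterSeconds?: number;
}

// Returns the number of milliseconds to sleep before retrying the same step.
function decideWaitMs(
  failure: WaitInput,
  attempt: number,                      // 1 for the first retry
  random: () => number = Math.random,   // injectable so tests are deterministic
  baseMs: number = 250,
  capMs: number = 8_000,
): number {
  // Exponential backoff, capped, then "full jitter": a uniform draw in [0, backoff).
  const backoffMs = Math.min(capMs, baseMs * 2 ** (attempt - 1));
  const jitteredMs = Math.floor(random() * backoffMs);

  if (failure.failureClass === "rate_limit" && failure.retryAfterSeconds !== undefined) {
    // retry-after is a floor: never sleep less than the header says.
    return Math.max(failure.retryAfterSeconds * 1000, jitteredMs);
  }
  return jitteredMs;
}
```

Jitter matters most when many in-flight requests hit the same failure at once; without it they all retry on the same tick and recreate the burst.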

Per-attempt log keyed by logical request

The per-attempt log is the thing that makes the chain debuggable. Each row carries the logical request ID, the attempt ID, the step's provider and model, the failure class if any, the tokens, the cost, the wall-clock. The dashboard groups by logical request ID, so one customer request that hopped from Opus to Sonnet to GPT shows up as one row with three children. The aggregated view answers "how often does Opus fail and Sonnet recover" without anyone writing a one-off query during an incident.
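As an illustration of the grouping the dashboard performs, here is a sketch of a log row and a roll-up by logical request ID. The field names are assumptions based on the columns listed above; the real table schema is not shown in this article.

```typescript
interface AttemptRow {
  requestId: string;      // logical request, shared by every attempt
  attemptId: string;
  provider: "anthropic" | "openai";
  model: string;
  status: "ok" | "failed";
  failureClass?: string;
  totalTokens: number;
  costUsd: number;
  startedAt: number;      // epoch ms
  finishedAt: number;     // epoch ms
}

interface RequestSummary {
  requestId: string;
  attempts: AttemptRow[];
  totalCostUsd: number;
  servedBy: string | null;   // model of the successful attempt, if any
}

// Roll attempts up into one parent per logical request, children in log order.
function groupByRequest(rows: AttemptRow[]): RequestSummary[] {
  const byId = new Map<string, RequestSummary>();
  for (const row of rows) {
    let summary = byId.get(row.requestId);
    if (!summary) {
      summary = { requestId: row.requestId, attempts: [], totalCostUsd: 0, servedBy: null };
      byId.set(row.requestId, summary);
    }
    summary.attempts.push(row);
    summary.totalCostUsd += row.costUsd;
    if (row.status === "ok") summary.servedBy = row.model;
  }
  return Array.from(byId.values());
}
```

A request whose `servedBy` differs from the first attempt's model is a fallback recovery, which is the "Opus failed and Sonnet recovered" count the aggregated view reports.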

Eval gate on every chain entry

The eval harness is the part teams skip and then regret. Before a model enters the chain, you run a frozen set of fifty to a few hundred prompts against it, score the responses against a reference set or a model-grader rubric, and record the score. The chain config carries that score per step. When the chain config is updated, the build runs the eval against the new step and fails if the score is below the chain's documented floor. The reason this matters is that "Sonnet handled the overflow" is a sentence you want to be able to say with a number behind it, not a hope.
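The gate itself can be very small. The sketch below assumes each frozen-set prompt is graded pass or fail and that the chain documents a single score floor; `scoreFrozenSet`, `stepsBelowFloor`, and the 0.80 floor are hypothetical, while the three scores are the ones in the chain config above.

```typescript
interface GradedStep {
  model: string;
  evalScoreOnFrozenSet: number;   // 0..1
}

// Score is the fraction of frozen-set prompts graded as a pass.
function scoreFrozenSet(passed: boolean[]): number {
  if (passed.length === 0) return 0;
  return passed.filter(Boolean).length / passed.length;
}

// Returns the steps that miss the floor; an empty array means the gate passes.
// The negated comparison also rejects a NaN score.
function stepsBelowFloor(steps: GradedStep[], floor: number): GradedStep[] {
  return steps.filter((s) => !(s.evalScoreOnFrozenSet >= floor));
}

const candidateSteps: GradedStep[] = [
  { model: "claude-opus-4-8", evalScoreOnFrozenSet: 0.88 },
  { model: "claude-sonnet-4-6", evalScoreOnFrozenSet: 0.81 },
  { model: "gpt-4.1", evalScoreOnFrozenSet: 0.79 },
];

// With an assumed floor of 0.80, only the last step would block the build.
const failing = stepsBelowFloor(candidateSteps, 0.8);
```

In CI, a non-empty result is what fails the build; the list of failing steps goes into the error message so the author knows which entry to fix.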


More on this

Common failure modes

The first sharp edge is treating 429 rate-limit and 529 overloaded as the same condition. They are not. A 429 says your account's pool for this model class is full: the same Opus call from a different organization is fine, and a Sonnet call from your own account is also fine because the limits are per model class. A 529 says the whole provider is under load and any Anthropic call is at risk. The chain config has to route them differently or you waste retries on the wrong response. A naive "retry on 5xx" handler sends a 529 back to the same overloaded provider and times out the customer.

The second sharp edge is the Opus-version pool. Anthropic's documented behavior is that the Opus rate limit is a single shared total across all Opus model versions — Opus 4.8, 4.7, 4.6, 4.5, 4.1, and 4 all draw from one pool. If your chain falls back from Opus 4.8 to Opus 4.7 on a 429, you have not actually moved to a fresh pool — you are still throttled. Falling back from Opus to Sonnet is a real fallback because they are separate pools. The chain config needs to know the difference, and the easiest way is to write the per-step provider+model as an opaque key and store the pool-membership graph separately so the chain validator catches "this fallback is on the same pool" at config time.
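A config-time check for this can be sketched as follows. The pool table is an assumption: it would be maintained by hand from each provider's rate-limit documentation, and the pool names and the `samePoolFallbacks` function are hypothetical.

```typescript
interface PoolStep {
  provider: string;
  model: string;
}

// Hypothetical pool-membership table, keyed by "provider:model".
const rateLimitPools = new Map<string, string>([
  ["anthropic:claude-opus-4-8", "anthropic-opus"],
  ["anthropic:claude-opus-4-7", "anthropic-opus"],
  ["anthropic:claude-sonnet-4-6", "anthropic-sonnet"],
  ["openai:gpt-4.1", "openai-gpt-4.1"],
]);

// Flags any step whose rate-limit pool was already used by an earlier step,
// and any step whose pool is unknown. An empty result means the chain is clean.
function samePoolFallbacks(steps: PoolStep[], pools: Map<string, string>): string[] {
  const problems: string[] = [];
  const seen = new Map<string, string>();   // pool -> first step key that used it
  for (const step of steps) {
    const key = `${step.provider}:${step.model}`;
    const pool = pools.get(key);
    if (pool === undefined) {
      problems.push(`${key}: no pool recorded`);
      continue;
    }
    const earlier = seen.get(pool);
    if (earlier !== undefined) {
      problems.push(`${key}: shares pool "${pool}" with ${earlier}`);
    } else {
      seen.set(pool, key);
    }
  }
  return problems;
}
```

Checking against every earlier step, not only the previous one, also catches a chain such as Opus, Sonnet, then an older Opus version.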

The third sharp edge is the runaway agent loop. An agent calls a tool, gets a malformed response, calls the model again to recover, fails the second tool call, and calls the model a third time. Without a chain-level budget, this can hit dozens of model calls in a single user request, and each one might successfully run the fallback chain. The dollar bill on a stuck agent can pass a paying customer's monthly subscription in under a minute. The budget enforcer is the only thing that stops it, provided the budget is scoped to the whole logical request rather than reset on each model call. The token cap, the wall-clock cap, and the max-attempts cap are all needed because each one catches a different failure shape: a fast-failing agent burns attempts, a slow one burns wall-clock, and a verbose one burns tokens.
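For an agent that makes many model calls per user request, one way to get a ceiling that spans all of them is a budget object created once per request and passed to every call. This is a sketch under that assumption; `RequestBudget`, `RequestLimits`, and `BudgetExceededError` are hypothetical names and are not part of the wrapper shown earlier.

```typescript
interface RequestLimits {
  maxTokens: number;
  maxModelCalls: number;
  maxWallClockMs: number;
}

class BudgetExceededError extends Error {}

// Create one per user request and share the same instance across every
// model call the agent makes while serving that request.
class RequestBudget {
  private tokensUsed = 0;
  private modelCalls = 0;
  private readonly limits: RequestLimits;
  private readonly startedAtMs: number;

  constructor(limits: RequestLimits, startedAtMs: number = Date.now()) {
    this.limits = limits;
    this.startedAtMs = startedAtMs;
  }

  // Call before each model call; throws as soon as any one cap is reached.
  assertCanSpend(nowMs: number = Date.now()): void {
    if (this.modelCalls >= this.limits.maxModelCalls) {
      throw new BudgetExceededError("model-call cap");   // fast-failing agent
    }
    if (this.tokensUsed >= this.limits.maxTokens) {
      throw new BudgetExceededError("token cap");        // verbose agent
    }
    if (nowMs - this.startedAtMs >= this.limits.maxWallClockMs) {
      throw new BudgetExceededError("wall-clock cap");   // slow agent
    }
  }

  // Call after each model call, successful or not.
  record(tokens: number): void {
    this.tokensUsed += tokens;
    this.modelCalls += 1;
  }
}
```

Because the counters live on the shared instance, the tenth model call in a loop sees the spend of the first nine, whichever chain step served them.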

The fourth sharp edge is content-filter response handling. When a provider's safety layer rejects an input, falling through to a different provider is usually wrong. The two providers were trained on overlapping safety signals, so the second one is likely to reject the same content, and you have now sent the prompt to a second provider that never needed to see it. The right pattern is to mark content-filter responses as terminal in the chain config and surface them to the calling code as a typed error the application can handle, typically by changing the user-facing message rather than by retrying the model.

The fifth sharp edge is the eval-drift trap. A chain that was evaluated six months ago against a frozen prompt set is not the same chain today, because the provider has updated the model under the same name. Anthropic publishes model deprecations and changes; OpenAI rolls minor versions. The eval has to run on a schedule against the chain in production, not just at chain-change time, and the dashboard needs a recency indicator on the eval score so the on-call engineer reading it during an outage knows whether the number is current or stale. A stale eval score is worse than no eval score because it manufactures false confidence.
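The recency indicator can be as simple as an age in days plus a stale flag. The sketch below assumes each eval score is stored with the timestamp of the run that produced it; the `EvalRecord` shape is hypothetical, and the seven-day default is an assumption that mirrors a weekly eval schedule.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

interface EvalRecord {
  model: string;
  score: number;           // 0..1
  evaluatedAtMs: number;   // epoch ms of the eval run that produced the score
}

interface Freshness {
  ageDays: number;
  stale: boolean;
}

// Age is counted in whole days; a record is stale once it is older than maxAgeDays.
function evalFreshness(record: EvalRecord, nowMs: number, maxAgeDays: number = 7): Freshness {
  const ageDays = Math.floor((nowMs - record.evaluatedAtMs) / DAY_MS);
  return { ageDays, stale: ageDays > maxAgeDays };
}
```

Showing `ageDays` next to the score, rather than only a boolean, lets the on-call engineer judge borderline cases during an incident.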

The sixth sharp edge is the retry-after header race. On a 429, Anthropic returns a retry-after value that is a hard floor — retries before it elapses will fail. If your chain decides to "stay" on the same model after a 429, it has to sleep at least that long. If the customer's request has a tighter SLO than the retry-after value, "stay" is the wrong route and the chain config should send 429 to "next" for that feature. The same chain logic can be correct for one feature (slow background agent: stay) and wrong for another (real-time chatbot: next), which is why the chain config is per-feature data, not a global default.

What this looks like in production

At BFEAI we run the chain pattern in front of every LLM call across the seven apps. The wrapper is a single TypeScript module that every feature imports. Each chain config lives in its own file and is reviewed like any code change, but the chain itself is data — a JSON-shaped object the wrapper loads and walks. We have separate chain configs for the long-form agent (Opus → Sonnet → GPT, tolerant of latency), the short summarizer (Sonnet → Haiku → GPT, optimized for cost), and the structured-extraction endpoint (Sonnet → GPT, optimized for JSON reliability). The eval suite for each chain lives next to it and runs on every chain change plus weekly against production.

The dashboard that matters in an incident has four lines: percentage of requests served by the primary step, percentage served by each fallback step, percentage that hit "chain exhausted," and the rolling average dollar cost per logical request. In normal operation the first line is 95%+, the fallback lines are a percent or two each, and the exhausted line is essentially zero. When Anthropic has a 529 incident, the primary line drops, the cross-provider fallback line rises, and the cost-per-request line ticks up a few percent — and customer-facing features keep responding. The on-call engineer's job during that window is to read the dashboard, confirm the chain is doing what it should, and not page anyone else.

The alert that matters most is not "primary model is failing." Primary models fail, the chain handles it, and the alert would page on every Anthropic blip. The alert that pages is "chain exhausted rate > 0.1% over a 5-minute window" — which means the fallback steps are also failing, and that is a real cross-provider event that needs eyes. The second alert is "per-call dollar cost > 5x baseline over a 5-minute window" — which catches both the runaway-agent case and the case where the chain is hopping to a more expensive model for a high fraction of traffic and the cost line is moving. Both alerts come from the same per-attempt log; no separate telemetry pipeline.
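Both alert conditions can be computed from per-request outcomes derived from the attempt log. The sketch below uses the thresholds stated above (0.1% exhausted, 5x baseline cost, 5-minute window); the `RequestOutcome` shape, the function name, and the way the baseline is supplied are assumptions.

```typescript
interface RequestOutcome {
  finishedAtMs: number;
  chainExhausted: boolean;
  costUsd: number;
}

interface AlertState {
  exhaustedRate: number;
  avgCostUsd: number;
  pageOnExhausted: boolean;
  pageOnCost: boolean;
}

const FIVE_MINUTES_MS = 5 * 60 * 1000;

function evaluateAlerts(
  outcomes: RequestOutcome[],
  nowMs: number,
  baselineCostUsd: number,
  windowMs: number = FIVE_MINUTES_MS,
): AlertState {
  // Only requests that finished inside the trailing window count.
  const recent = outcomes.filter(
    (o) => o.finishedAtMs > nowMs - windowMs && o.finishedAtMs <= nowMs,
  );
  if (recent.length === 0) {
    return { exhaustedRate: 0, avgCostUsd: 0, pageOnExhausted: false, pageOnCost: false };
  }
  const exhausted = recent.filter((o) => o.chainExhausted).length;
  const totalCost = recent.reduce((sum, o) => sum + o.costUsd, 0);
  const exhaustedRate = exhausted / recent.length;
  const avgCostUsd = totalCost / recent.length;
  return {
    exhaustedRate,
    avgCostUsd,
    pageOnExhausted: exhaustedRate > 0.001,          // more than 0.1%
    pageOnCost: avgCostUsd > 5 * baselineCostUsd,    // more than 5x baseline
  };
}
```

Note that neither condition looks at primary-model failures directly, which is what keeps an ordinary provider blip from paging anyone.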

The runbook for "chain exhausted rate elevated" is short by design. Step one: which chain is exhausting — open the dashboard filter by chain name. Step two: which step is the last one being attempted — that tells you whether you ran out of providers (need a new chain step) or hit the per-call budget (need a budget review). Step three: are eval scores for the current steps stale — if yes, the recovery action is to refresh evals, not to add steps. Step four: only if the answers above do not explain it, declare an incident and engage the providers. Most of the time the answer is one of the first three and the on-call closes the page in under fifteen minutes.

One operational detail worth calling out: when a chain is updated, the change has to go through the eval gate before it ships. The chain config is data, but we review and gate it with the same rigor as production code, because a chain change with no eval carries the same risk as a model swap with no eval: it changes the behavior customers see, and you do not know how. The CI step that runs the eval against a candidate chain is the same one that runs unit tests; a chain that fails the score floor blocks the merge. The discipline is annoying for fifteen seconds and saves a postmortem.

The cost-tracking implication is the part finance cares about. Each chain hop is a real dollar that customers caused but the chain decided to spend, and you owe finance a clear answer to "what did the failover cost us last month." The per-attempt log answers it directly: group by fallback reason, sum the dollar cost, divide by the count of primary-failure events you avoided showing to customers. The number is rarely large in absolute terms — fallback events are a fraction of traffic — but having the number means the chain is not a black box on the cost side, and finance signs off on the pattern instead of asking quarterly why the OpenAI line is non-zero on a Claude-first stack.
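The calculation described above, grouping by fallback reason, summing cost, and dividing by the number of rescued requests, might look like this. The row shape is an assumption: it presumes each fallback attempt is logged with the failure class that triggered it and that primary attempts carry a null reason.

```typescript
interface FallbackRow {
  requestId: string;
  // Failure class that caused this attempt; null on the primary attempt.
  fallbackReason: "rate_limit" | "overloaded" | "timeout" | "server_error" | null;
  costUsd: number;
}

interface ReasonCost {
  costUsd: number;            // total spent on fallback attempts for this reason
  requests: number;           // distinct logical requests that fell back
  costPerRequestUsd: number;
}

function failoverCostByReason(rows: FallbackRow[]): Map<string, ReasonCost> {
  const cost = new Map<string, number>();
  const requestIds = new Map<string, Set<string>>();
  for (const row of rows) {
    if (row.fallbackReason === null) continue;   // primary attempts are not failover spend
    cost.set(row.fallbackReason, (cost.get(row.fallbackReason) ?? 0) + row.costUsd);
    const ids = requestIds.get(row.fallbackReason) ?? new Set<string>();
    ids.add(row.requestId);
    requestIds.set(row.fallbackReason, ids);
  }
  const result = new Map<string, ReasonCost>();
  cost.forEach((costUsd, reason) => {
    const requests = requestIds.get(reason)?.size ?? 0;
    result.set(reason, {
      costUsd,
      requests,
      costPerRequestUsd: requests === 0 ? 0 : costUsd / requests,
    });
  });
  return result;
}
```

Counting distinct request IDs rather than rows keeps a request that hopped twice from being counted as two rescued failures.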

What to watch in your own implementation

Open your codebase and search for direct imports of the Anthropic or OpenAI SDK. Every one of those call sites is a place the chain pattern is bypassed. The first refactor is to introduce the wrapper module and migrate the call sites one at a time, leaving the chain config trivially short — a single step that is just "call the primary, surface the error" — so the migration is purely structural and not behavior-changing. Once every call site goes through the wrapper, you can lengthen the chains without touching feature code.

Then go through your error handling and classify what you actually do on a 429, a 529, a 503, and a 504 today. If the answer is "retry three times then 500 to the user" for all of them, you have the bug this pattern solves. The fix is to wire the classifier and let the chain config decide per failure class — and to write down per-feature which features get "stay" and which get "next" on a 429, because that decision depends on the feature's latency SLO and is not a global default.

Then look at your spend telemetry. Can you answer "in the last hour, what did each LLM call cost and what was the request ID that caused it" without writing a one-off query? If not, the per-attempt log goes in before the chain — without it, the chain hides cost from you instead of exposing it. The log is small (one row per attempt), keyed by a logical request ID, and is the single source of truth for both reliability and cost analysis. Once that table exists, the chain becomes a thing you can debug and the cost becomes a thing finance can defend.

Finally, write the frozen eval set today, even if it is twenty prompts. A small eval that runs is far more useful than a comprehensive eval that does not. The score does not have to be perfect; it has to be reproducible and recent. The number you put on each chain step is what gives the on-call engineer the confidence to read the dashboard during the next Anthropic 529 window, see traffic flowing through Sonnet with its 0.81 eval score, and not page anyone. That number is what turns the fallback chain from a hope into a pattern you can ship.

Outcomes you should expect

What this delivers

  • Customer-facing AI features keep responding when Anthropic returns 529 or OpenAI returns 503, instead of bubbling a 500 to the user and burning the on-call rotation.
  • Per-call token and dollar budgets cap runaway agent loops before a single conversation costs more than a paying customer's monthly subscription.
  • A frozen eval suite runs against every model in the chain before it goes live, so the fallback is graded for quality and not just availability.
  • One TypeScript wrapper replaces direct SDK calls across the codebase, so a new model, a new provider, or a new error class is a one-file change instead of a refactor.

Primary sources

By the numbers

  • The Anthropic API returns a dedicated 529 overloaded_error when the API is temporarily overloaded, distinct from the 429 rate_limit_error returned when an account has hit a rate limit, so a fallback router can branch on the two cases independently.


  • Anthropic 429 responses include a retry-after header indicating how long to wait before the next attempt, and earlier retries will fail — meaning the wait value is a hard floor for a same-provider retry, not a suggestion.


  • Anthropic's Messages API rate limits are enforced per model class, and Opus is a single shared limit across Opus 4.8, 4.7, 4.6, 4.5, 4.1, and 4 — so an Opus 4.8 rate-limit error will also block Opus 4.7, but Sonnet 4.x has its own separate pool you can route to.


  • OpenAI returns 429 when you are sending requests too quickly and 503 when the service is overloaded, and the recommended pattern for rate-limit handling is exponential backoff with jitter, performing a short sleep and retrying the unsuccessful request.


  • OpenAI's 503 slow-down variant specifically advises reducing request rate to its original level, maintaining a consistent rate for at least 15 minutes, and then gradually increasing it — a constraint that pushes long overload windows toward cross-provider failover rather than same-provider retry.


  • Anthropic 504 timeout_error responses indicate the request timed out while processing, and the documented mitigation is streaming for long-running requests — meaning a non-streaming timeout at the wrapper is a signal to either switch to streaming on the same model or fall through to a faster sibling.


Live in production today

The same engineering, shipped in production at BFEAI.

I'm co-founder & CTO of Be Found Everywhere (BFEAI), a 7-app AI SaaS platform running today. The work I deliver for clients is the work I do every week on my own platform.

  • 7 production apps
  • 200K+ keywords generated
  • 1,500+ AI scans run
  • 7,000+ sites automated

Common questions

What buyers ask before reaching out

When should I fail over to a different provider versus retrying the same one?

Branch on the error class. A 429 rate_limit_error on Claude Opus means you have a Sonnet-class pool and an OpenAI pool that are unaffected, so cross-provider or cross-class failover is the cheap fix. A 529 overloaded_error means Anthropic's infrastructure is under pressure across the API and a same-provider retry is wasted work — go to OpenAI. A 504 timeout might just be one slow request, so a single same-model retry with a tighter token budget is fine before you fall through.

Why is per-call budget enforcement part of the fallback pattern?

An agent loop that retries on every failure with no ceiling can rack up dollars in seconds when a model gets stuck thinking. The fallback chain has to enforce a token cap and a wall-clock cap per logical call, regardless of how many providers it tried. Without that ceiling, the chain turns one overloaded provider into an infinite-spend bug, and the loudest signal you get is the next morning's invoice.

How do I know the fallback model is actually good enough to ship to customers?

Run your eval suite against each model in the chain before you put it in the chain. The fallback is a graded entry, not a fire extinguisher: if Sonnet scores 81% on your task and Opus scores 88%, you ship the chain knowing that customer requests routed to Sonnet will land in a known quality band. The chain configuration carries the eval score with it, so the on-call engineer reading a dashboard during an outage knows what the routed traffic is getting.

What goes in a typical chain — what's the ordering?

A common pattern for an agent that needs reasoning is Opus 4.8 as the primary, Sonnet 4.6 as the same-provider fallback for 429 or 504, and a GPT-class model as the cross-provider fallback for 529. The order is driven by quality first and cost second, because availability is what the chain is solving for. Cheaper sibling models go above the cross-provider step so you exhaust the primary provider's capacity before paying the operational cost of a cross-provider switch.

Won't the eval cost a fortune if I run it on every model on every change?

Not if you scope it. The eval suite that gates the chain runs against a frozen task set of fifty to a few hundred prompts, not your full production traffic. On a typical agent the cost to grade four models against a hundred-prompt frozen set is a few dollars, and you run it on a chain change, on a model upgrade, and on the scheduled weekly run, not on every code push. The discipline is to refresh the frozen set quarterly, not to expand it endlessly.

How do I track cost when one request might hit two providers?

Log the model, provider, input tokens, output tokens, and dollar cost of each attempt as its own row, keyed by the logical request ID. The cost dashboard sums by request, so a chain hop appears as one logical call with two underlying line items. You also flag each row with the fallback reason (rate_limit, overloaded, or timeout) so the next week's investigation has the trail. Without that flag the cost line silently inflates and you cannot tell whether the chain saved you or cost you.

What about content filter rejections — those aren't really availability failures, are they?

They are not, and the chain should not treat them as one. A content-filter rejection on the primary usually means a different provider will reject the same input for the same reason, so falling through wastes calls and can leak prompts across providers in a way you don't intend. The chain pattern is to surface content-filter responses as a terminal error to the caller, not to retry — and to keep a separate metric on filter rejections so you see if the rate is climbing on a particular feature.

Ready to see if this is a fit?

A 15-minute call. No deck, no slides. We talk about what you're shipping and where engineering is the bottleneck. Either way, you walk away with a senior engineer's read on your situation.