A stressed engineer at a messy desk, illuminated by monitors showing abstract red error patterns, grappling with LLM agent failures.

Production engineering patternUpdated June 2026

LLM Agent Tool-Call Retry: The Loop That Survives Production

Q: When a tool errors, should I crash the agent or send the error back to the model?

Send it back as is_error: true (Anthropic) or a status string (OpenAI). The model recovers from it.

Q: What's the difference between a transient and a permanent tool error?

Retry transient errors silently with backoff. Return permanent errors to the model right away.

Q: Why do I need a maximum tool-call loop limit?

Without a cap, a confused model can burn unbounded money. 25 calls is a typical interactive ceiling.

Q: If the agent retries a tool call, won't I double-execute side effects?

Make the tool idempotent on tool_use_id. The loop doesn't have to know.

Q: What happens if the agent process dies in the middle of a tool loop?

Persist the messages array every turn. Resume replays from the last saved state.

Q: Should I let the model retry with different parameters, or force a specific retry?

Let the model choose. Hard-coded retries assume more than you actually know.

Q: How do I keep the conversation from blowing up the context window when the agent has been looping for a while?

Summarize old tool_use/tool_result pairs. Cap on both turns and tokens for long-running agents.

Q: Does parallel tool use change any of this?

Execute all parallel calls and return all results in one user message before the next request.

An engineering pattern for LLM agent loops that survive transient failures, return errors to the model as context, stay idempotent under retry, and recover from a mid-loop crash.

Get a 15-min architecture read

The problem

The agent works in the demo. You ship it. A week later the on-call channel has three flavors of incident.

Uncaught tool exception crashes the run

The first is the crash. A tool throws — the search API returned a 503, the weather endpoint timed out, the database connection got reset — and the entire agent run terminates with an uncaught exception. The user sees a generic error message. The work the agent was halfway through is gone. Nobody knows whether the side effects that already ran were idempotent, so the support engineer spends an hour reading logs to figure out whether the user got billed twice.

Unbounded tool-call loop burns budget

The second is the infinite loop. The model decides it needs to call a tool. The tool returns a confusing result. The model tries again with slightly different parameters. The tool returns another confusing result. After forty-three iterations the model is still iterating and the bill for that single user prompt is north of twenty dollars in API tokens and downstream tool calls. By the time anyone notices, three more users have triggered the same pattern.

Double-execution on retry after ACK timeout

The third is the double-execution. A tool call to your payment API times out. The retry layer fires the same call again. The first call actually succeeded — Stripe got it, processed it, and the ACK is just slow. Now you have two charges on the customer's card. The agent does not know. The model does not know. Your accounting layer is the one that finds out, two weeks later, when finance reconciles.

Happy-path docs leave the failure surface unspecified

The common shape across all three is the same: the loop between the model and the tools is not designed for the failure modes the production environment actually exposes. The Anthropic and OpenAI documentation both describe a clean happy-path loop — the model emits a tool call, you execute it, you return the result, the model continues. That description is accurate but incomplete. It does not tell you what to do when the tool fails, what to do when the model gets stuck, what to do when the same call runs twice, or what to do when the host process dies while the loop is half-finished.

The business consequence of getting this wrong is not abstract. A crashed agent is a worse user experience than no agent. A looping agent is a direct hit on the unit economics of every feature it touches. A double-executing agent is real money paid the wrong direction, and the worst case is the customer who gets quietly billed twice and does not tell you. None of these are model problems. They are loop problems, and the loop is yours to build.

Two engineers collaborate intently, one pointing at abstract flow diagrams on a whiteboard, designing an LLM agent tool call retry pattern.

What changes for your business

The pattern has five pieces. Each one closes one of the failure modes above. They compose into a single loop that you can run against either Anthropic's tool-use API or OpenAI's function-calling API with the same structure on top.

Errors flow back as conversation context

Anthropic documents this explicitly — when a tool throws, return the error message in the tool_result content with is_error: true, and Claude incorporates the error into its next response. OpenAI's pattern is the same shape with a different surface: you return a string in the role: tool message describing what went wrong, and the model decides what to do next. In both cases the agent does not crash. The error becomes a sentence in the conversation, and the model has the full context it needs to either retry with different parameters, switch to a different tool, or surface a clean explanation to the user.

The temptation when you first hit a tool error is to catch it in the loop and crash. Resist. The model is genuinely good at recovering from errors that are described well. The Anthropic docs are explicit about this: write instructive error messages, not generic ones. "Failed" tells the model nothing. "Rate limit exceeded. Retry after 60 seconds." tells the model to wait or try a different approach. The quality of error recovery is downstream of the quality of the error messages, and that is your code, not the model's.

Transient vs permanent error split inside the tool

Transient errors — HTTP 429, HTTP 5xx, connection reset, socket timeout — are infrastructure noise. The tool itself should retry them with exponential backoff before the model ever sees them. Permanent errors — HTTP 4xx other than 429, validation failures, not-found, forbidden — are decisions for the model. Send those back immediately with is_error: true. The split matters because every tool call you bounce back to the model costs another turn, another context window read, and another round of token spend. Absorbing transient noise inside the tool is the difference between a clean conversation history and one polluted with five back-and-forths about the same connection reset.

type ToolResult =
  | { ok: true; content: string }
  | { ok: false; isError: true; content: string };

async function executeToolWithRetry(
  fn: () => Promise<string>,
  opts: { maxAttempts?: number; baseMs?: number } = {},
): Promise<ToolResult> {
  const maxAttempts = opts.maxAttempts ?? 3;
  const baseMs = opts.baseMs ?? 250;

  let lastErr: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return { ok: true, content: await fn() };
    } catch (err) {
      lastErr = err;
      if (!isTransient(err) || attempt === maxAttempts) {
        return {
          ok: false,
          isError: true,
          content: explainError(err), // instructive, not generic
        };
      }
      await sleep(baseMs * 2 ** (attempt - 1) + jitter());
    }
  }
  // Unreachable, but TypeScript wants it.
  return { ok: false, isError: true, content: explainError(lastErr) };
}

function isTransient(err: unknown): boolean {
  if (!(err instanceof HttpError)) return isNetworkError(err);
  return err.status === 429 || (err.status >= 500 && err.status < 600);
}

Idempotency keyed on tool_use_id

The tool, not the loop, owns this. Generate an idempotency key at the start of each tool invocation and pass it to any downstream API that accepts one — Stripe, payment processors, most modern REST APIs. For your own internal tools, persist a "this tool_use_id has already run" row in the same transaction as the side effect, and short-circuit on a duplicate. The tool_use.id from Anthropic's response and the tool_call.id from OpenAI's response are both unique per invocation and both ideal keys for this. The point is that if the loop retries — for any reason, including a process restart in the middle of an agent run — the second execution is a no-op that returns the original result.

async function chargeCustomer(
  toolUseId: string,
  customerId: string,
  amountCents: number,
): Promise<string> {
  return await db.transaction(async (tx) => {
    const existing = await tx.toolSideEffects.findById(toolUseId);
    if (existing) return existing.result; // already ran — return original

    const charge = await stripe.charges.create(
      { customer: customerId, amount: amountCents, currency: "usd" },
      { idempotencyKey: toolUseId },
    );

    const result = JSON.stringify({ charge_id: charge.id, status: charge.status });
    await tx.toolSideEffects.insert({
      tool_use_id: toolUseId,
      result,
      created_at: new Date(),
    });
    return result;
  });
}

Maximum tool-call depth per user prompt

Anthropic's loop exit condition is documented — the loop continues while stop_reason is tool_use and exits on end_turn, max_tokens, stop_sequence, or refusal. That is the model's exit, not yours. You also need a hard count of how many tool calls the loop has executed in this user prompt, and you need to bail when it crosses a ceiling. Twenty-five is a reasonable interactive ceiling. One hundred is reasonable for a long-running background agent. Either way, bail with a structured error that the caller can surface as "the agent took too long to complete this task" instead of letting the process drift into an unbounded loop.

Conversation persisted after every turn

If the host process dies — OOM, container restart, deploy rollover — the user's next request should resume from the last persisted state, not start over. The unit of persistence is the messages array: the system prompt, every user message, every assistant message (including its tool_use blocks), and every user message containing tool_result blocks. Persist after each turn, key it by a conversation ID, and on resume reconstruct the loop from the last saved state. The idempotency layer in the previous piece is what makes resume safe — any tool calls that already executed get short-circuited on replay.

async function runAgentLoop(
  conversationId: string,
  userMessage: string,
  opts: { maxToolCalls?: number } = {},
): Promise<string> {
  const maxToolCalls = opts.maxToolCalls ?? 25;
  let toolCallCount = 0;

  let messages = await db.conversations.loadMessages(conversationId);
  messages.push({ role: "user", content: userMessage });
  await db.conversations.saveMessages(conversationId, messages);

  while (true) {
    const response = await anthropic.messages.create({
      model: "claude-opus-4-7",
      max_tokens: 4096,
      tools: TOOL_DEFINITIONS,
      messages,
    });

    messages.push({ role: "assistant", content: response.content });
    await db.conversations.saveMessages(conversationId, messages);

    if (response.stop_reason !== "tool_use") {
      return extractFinalText(response.content);
    }

    const toolUses = response.content.filter(
      (block): block is ToolUseBlock => block.type === "tool_use",
    );
    toolCallCount += toolUses.length;
    if (toolCallCount > maxToolCalls) {
      throw new AgentBudgetExceededError(conversationId, toolCallCount);
    }

    // Parallel execution: run all tool_use blocks from this turn together.
    const toolResults = await Promise.all(
      toolUses.map(async (block) => {
        const result = await dispatchTool(block.id, block.name, block.input);
        return {
          type: "tool_result" as const,
          tool_use_id: block.id,
          content: result.content,
          is_error: result.ok ? undefined : true,
        };
      }),
    );

    // tool_result blocks must come FIRST in the content array (Anthropic spec).
    messages.push({ role: "user", content: toolResults });
    await db.conversations.saveMessages(conversationId, messages);
  }
}

The OpenAI shape is the same loop with a different surface — tool_calls instead of tool_use blocks, role: "tool" messages instead of tool_result content blocks, tool_call_id instead of tool_use_id. The structural rules are the same: every tool_call from the assistant turn needs a matching tool message in the next request, and parallel tool calls return all their results before the loop continues.

async function runOpenAiAgentLoop(
  conversationId: string,
  userMessage: string,
  opts: { maxToolCalls?: number } = {},
): Promise<string> {
  const maxToolCalls = opts.maxToolCalls ?? 25;
  let toolCallCount = 0;

  let messages = await db.conversations.loadOpenAiMessages(conversationId);
  messages.push({ role: "user", content: userMessage });
  await db.conversations.saveOpenAiMessages(conversationId, messages);

  while (true) {
    const response = await openai.chat.completions.create({
      model: "gpt-4.1",
      messages,
      tools: OPENAI_TOOL_DEFINITIONS,
    });

    const assistantMessage = response.choices[0].message;
    messages.push(assistantMessage);
    await db.conversations.saveOpenAiMessages(conversationId, messages);

    const toolCalls = assistantMessage.tool_calls ?? [];
    if (toolCalls.length === 0) {
      return assistantMessage.content ?? "";
    }

    toolCallCount += toolCalls.length;
    if (toolCallCount > maxToolCalls) {
      throw new AgentBudgetExceededError(conversationId, toolCallCount);
    }

    // Every tool_call.id needs a matching role: tool message in the same batch.
    const toolMessages = await Promise.all(
      toolCalls.map(async (call) => {
        const result = await dispatchTool(
          call.id,
          call.function.name,
          JSON.parse(call.function.arguments),
        );
        return {
          role: "tool" as const,
          tool_call_id: call.id,
          content: result.content,
        };
      }),
    );
    messages.push(...toolMessages);
    await db.conversations.saveOpenAiMessages(conversationId, messages);
  }
}

A calm CTO reviews a clean, abstract dashboard with green and blue metrics, confident in a stable LLM agent tool call retry system.

Common failure modes

The first sharp edge is the runaway retry inside the tool. A naive backoff with no jitter, no maximum attempt count, and no transient-vs-permanent split will sit on a permanent 401 for hours, waste the connection pool, and turn a single bad API key into an outage. Cap attempts. Add jitter. Refuse to retry anything that is not transient. The point of the retry is to absorb noise, not to paper over real problems.

The second is the "model retries forever with the same parameters" trap. The model emits a tool call, the call returns an error, the model emits the same call again. This usually means the error message was not instructive — "failed" gave the model no signal to vary its approach. The fix is on your side, not the model's. Write error messages that name the constraint that was violated: "Invalid date format. Use YYYY-MM-DD." or "User not found. Try search_users first to find the right ID." The model is responsive to specifics.

The third is the half-executed parallel tool batch. Anthropic returns multiple tool_use blocks in a single assistant turn when parallel tool use is enabled. If one of them throws and your loop returns a tool_result for the others but not for the failed one, the next request will reject with the "tool_use ids were found without tool_result blocks immediately after" error. The fix is to return one tool_result per tool_use without exception, even if the content is just the error message — and to do this in the order the tool_use blocks appeared.

The fourth is the resume-from-stale-state bug. The loop saves the messages array, the process restarts, the loop resumes. Between save and restart, a tool side effect already ran and committed to the database. The replay re-executes the same tool call. Without the idempotency layer in piece three this is a real double-execution. With it, the second call short-circuits, returns the original result, and the loop continues with no visible damage. This is the single most important reason idempotency lives in the tool and not the loop — the loop cannot know whether it is on a fresh run or a resume, and it does not need to.

The fifth is the tool-call budget that fires on a long-running but legitimate task. A research agent doing a literature review honestly might make ninety tool calls. A code-fixing agent traversing a large repo honestly might make four hundred. A single global cap is the wrong primitive. The right one is a per-agent-type cap, declared up front, paired with a token budget. Interactive chat agents get 25 calls and 50k tokens. Background research agents get 200 calls and 500k tokens. Code agents get whatever the workspace honestly needs. Make it a config, not a constant.

The sixth, more subtle, is the prompt-injected tool call. The model reads a tool result, the result contains text that looks like an instruction to call another tool, and the model complies. This is not a loop problem in the narrow sense, but the loop is where you defend against it — by sanitizing or escaping tool output that came from an untrusted source before it goes back into the messages array, and by restricting the tool set available to subloops that handle untrusted content. The tool-call budget helps here too: a prompt injection that bursts the loop into ten unauthorized calls hits the ceiling and aborts.

What this looks like in production

At BFEAI we run agent loops behind several user-facing features — a research agent that produces sourced briefs, a code-assist agent that operates against the user's repo, and an internal ops agent that handles support triage. The pattern above is what keeps the three of them from sharing failure modes. Each one declares its own tool-call budget at the top of the loop and persists messages to Postgres after every turn. The tool layer underneath is shared: every tool is wrapped in the same retry-with-backoff function, every side-effecting tool routes through the same idempotency table keyed on tool_use_id, and every tool returns either a clean string or an is_error: true payload with an instructive message.

The metric that matters most is not "tool error rate." Errors are expected — the model is supposed to encounter them and recover. The metric that matters is "tool errors that triggered a successful model recovery" versus "tool errors that aborted the loop." When the first number dwarfs the second, the loop is doing its job. When the second number creeps up, something has regressed — usually an error message that stopped being instructive or a tool that started throwing a new error type the model has not been trained against.

The dashboard a CTO actually wants for an agent product has four rows: median tool calls per user prompt (climbing means the model is getting confused more often), p99 tool calls per user prompt (spikes mean the loop cap is being hit), tool-error-to-model-recovery rate (declining means error messages have gotten worse), and replayed-tool-call rate (climbing means the agent process is restarting more often). Anything else is detail. Those four numbers stay flat in normal operation, and a sustained move in any of them is a real signal.

The runbook for an agent incident is short by design. Step one: pull the conversation ID and load the messages array. Step two: find the turn where things went wrong — usually the first tool_result with is_error: true that the model failed to recover from. Step three: look at the error message that was returned. In our experience the message was too generic. Fix the message, deploy, and the same class of incident stops recurring. The model is the responsive part of the system; the static parts — your error strings, your tool descriptions, your tool budgets — are where the leverage is.

The last operational detail that matters is what to log. Every tool invocation gets a structured log line with the conversation ID, the tool_use_id, the tool name, the input shape, the outcome (ok, retried, transient_fail, permanent_fail), and the latency. Every loop exit gets a log line with the exit reason (end_turn, max_tokens, refusal, budget_exceeded, error). When a user reports "the agent did something weird," that log is the answer. Without it you are reading model output and guessing, which by then has already been compacted into prose and lost the per-call detail you need to diagnose.

What to watch in your own implementation

Open your agent codebase and search for the loop. For each tool dispatch site, answer four questions. First: when the tool throws, does the exception bubble up and crash the loop, or does it become a tool_result with is_error: true? Second: does the tool itself distinguish transient from permanent errors and retry the transient ones with backoff, or does every error bounce all the way back to the model? Third: if the same tool_use_id were dispatched twice — by a retry, by a replay, by a process restart — would the side effect run once or twice? Fourth: does the loop have a hard ceiling on tool calls per user prompt, and is that ceiling enforced before the next API request goes out?

If the answer to any of these is wrong, that is the bug. Fix the side-effecting tools first — those are the ones where a double-execution costs real money. Then fix the error-message quality, because that is the single biggest lever on model recovery rate. Then add the persistence layer, because that is what lets you survive a deploy in the middle of an agent run without losing user work.

Finally, run a one-week audit on your existing agent traffic. Sample fifty conversations. For each one, count the tool calls, count the tool errors, and read what the model did after each error. The conversations where the model recovered cleanly are the baseline. The conversations where it spun, gave up, or surfaced a generic apology are the work. In our experience the work is in the error messages, not in the loop or the model. Make the messages instructive, and the recovery rate moves on the next deploy.

Outcomes you should expect

What this delivers

Tool failures stop crashing the agent run — they become a sentence in the conversation that the model reasons about and recovers from.
Transient failures are absorbed by typed retry with backoff before they ever reach the model, so the conversation history is not polluted with five attempts at the same call.
Tool calls are idempotent by construction, so a retry after an ACK timeout cannot double-execute (no double-charged customer, no duplicate row, no second email).
An agent process that dies mid-loop resumes from the last persisted turn instead of starting over, which protects partially-spent tool budgets and partially-applied side effects.
A maximum tool-call depth bounds runaway loops so a single user prompt cannot spend an unbounded amount of money on tool calls.

Primary sources

By the numbers

Anthropic's tool-use loop is keyed on stop_reason: while stop_reason is 'tool_use' the application executes the tool and continues the conversation; the loop exits on any other stop reason ('end_turn', 'max_tokens', 'stop_sequence', or 'refusal').
Source ↗
If a client tool throws an error during execution, the application returns the error message in the tool_result content along with is_error: true, and Claude incorporates that error into its next response.
Source ↗
If a tool request is invalid or missing parameters, Claude will retry 2-3 times with corrections before apologizing to the user.
Source ↗
Tool result blocks must immediately follow their corresponding tool_use blocks in the message history, and within the user message the tool_result blocks must come FIRST in the content array.
Source ↗
OpenAI function calling returns tool outputs via a message with role 'tool' and a tool_call_id that references the specific call being answered; the model may call multiple functions in a single turn unless parallel_tool_calls is set to false.
Source ↗
Anthropic recommends writing instructive tool error messages — instead of a generic 'failed', include what went wrong and what Claude should try next (e.g. 'Rate limit exceeded. Retry after 60 seconds.') so the model can recover or adapt without guessing.
Source ↗

Live in production today

The same engineering, shipped in production at BFEAI.

I'm co-founder & CTO of Be Found Everywhere (BFEAI), a 7-app AI SaaS platform running today. The work I deliver for clients is the work I do every week on my own platform.

Production apps

200K+

Keywords generated

1,500+

AI scans run

7,000+

Sites automated

Back to Rag And Llm Agents

Common questions

What buyers ask before reaching out

When a tool errors, should I crash the agent or send the error back to the model?

Send it back. Anthropic's documented pattern is to return the error message in the tool_result content with is_error: true; OpenAI's pattern is to return a string describing the failure in the tool message. In both cases the model treats the error as new context and decides what to do next — retry with different parameters, switch to a different tool, or surface a clean explanation to the user. Crashing the whole run on the first transient failure throws away that recovery capability.

What's the difference between a transient and a permanent tool error?

A transient error is one a retry might fix — HTTP 429, HTTP 5xx, connection reset, socket timeout. A permanent error is one a retry will not fix — HTTP 4xx other than 429, validation failure, not-found, forbidden. The pattern is to retry transient errors inside the tool with exponential backoff and avoid telling the model about it, and to return permanent errors to the model immediately as is_error: true so it can choose a different approach.

Why do I need a maximum tool-call loop limit?

Because without one a confused model can spin forever, and each turn costs real money in both tokens and downstream API calls. A typical limit is 25 tool calls per user prompt for an interactive agent and 100 for a long-running task. When the limit hits, the loop exits with a special stop reason and the application returns a clean error to the caller instead of letting the agent burn budget.

If the agent retries a tool call, won't I double-execute side effects?

Only if the tool implementation is not idempotent. The pattern is to generate an idempotency key at the start of each tool invocation and pass it to any downstream API that supports one (Stripe, payment processors, most modern REST APIs). For your own internal tools, persist a 'this tool_use_id has already run' row before the side effect, in the same transaction, and short-circuit on a duplicate. Idempotency lives in the tool, not the loop.

What happens if the agent process dies in the middle of a tool loop?

If you do nothing, the next request from the user starts a fresh conversation and the half-finished work is lost — including any side effects that already ran. The pattern is to persist the full conversation (messages array plus any partial tool_use blocks) after every turn, so a resume can pick up from the last persisted state. On resume, you replay the loop from there; any tool calls that already executed are short-circuited by their idempotency keys.

Should I let the model retry with different parameters, or force a specific retry?

Let the model decide. The whole point of returning the error to the conversation is that the model has the full context of what it was trying to do and can choose a better next step. Hard-coding the retry logic in your loop assumes you can predict what the model wants to try next, and you usually cannot. The exceptions are transient infrastructure errors, which the tool itself absorbs before the model ever sees them.

How do I keep the conversation from blowing up the context window when the agent has been looping for a while?

Two patterns work in production. The first is to compact older tool_use/tool_result pairs into a summary message once they're more than N turns back — the model can read the summary but does not need to re-process every parameter. The second is to set a per-loop token budget alongside the loop count cap, and exit when either is hit. Long-running agents need both; interactive agents usually only need the loop cap.

Does parallel tool use change any of this?

Yes — when the model returns multiple tool_use blocks in one assistant turn, the loop must execute all of them and return all the tool_result blocks before the next request. Anthropic's API requires the tool_result blocks to come first in the content array of the user message and to match the order of the tool_use blocks. OpenAI's parallel function calling has the same requirement: every tool_call_id from the assistant turn needs a matching role: tool message in the next request.

Ready to see if this is a fit?

A 15-minute call. No deck, no slides. We talk about what you're shipping and where engineering is the bottleneck. Either way, you walk away with a senior engineer's read on your situation.