A technical founder stares frustrated at a complex monitor display, a crumpled paper nearby, struggling with llm cost attribution per customer.
Production engineering patternUpdated

LLM Cost Attribution Per Customer: Postgres Pattern for Margin Math

An engineering pattern for turning a single OpenAI or Anthropic invoice into per-customer dollar figures, so margin math and pricing reviews stop being a guess.

The problem

You get the monthly OpenAI invoice. Then the Anthropic invoice. Each one is a single number with a few line items by model. Your CFO asks the question every CFO asks: "What did each customer cost us?" You do not have an answer. You have a total, and you have a customer list, and the gap between those two is where every margin decision lives.

Pooled-average attribution hides the heavy tail

The naive workaround is to divide the bill by active customers and call it a per-customer cost. That number is wrong on almost every product. LLM consumption is heavy-tailed — a handful of customers drive most of the token volume, and the median customer costs almost nothing. The pooled average makes the median customer look expensive and makes the heavy hitter look profitable when they are not. Pricing decisions based on the pooled number are pricing decisions made against a fiction.

Log-based math breaks on rate and cache shifts

The slightly less naive workaround is to compute usage per customer in your application logs, then back into a cost number using the public rate card. This breaks the first time a provider changes pricing mid-month, the first time you flip a model from non-cached to cached and forget to account for the discount, and the first time someone uses the same API key from a background job that has no customer attached. Within a quarter the number in your spreadsheet disagrees with the invoice by enough that nobody trusts either one, and the team falls back to "we are probably 60 percent gross margin, give or take."

Margin blind spots block pricing and contract decisions

The real failure is operational. Without per-customer cost, you cannot tell which customers are unprofitable at the current price, you cannot sanity-check a contract before sending it, and you cannot detect a runaway prompt that is silently costing thousands of dollars on a single account. The first time you find out is when a single customer's overage shows up on the next bill and someone has to call a meeting about it.

The pattern below puts a per-customer dollar figure on every billable LLM call, reflects prompt-caching discounts honestly, survives mid-month rate changes, and gives finance a single Postgres query that answers the cost-per-customer question without re-deriving anything from provider dashboards.

Two engineers collaborate intently, one at a keyboard, the other pointing at an abstract data flow on a monitor, building a postgres solution.

What changes for your business

There are three pieces. A llm_usage_events table that captures one row per LLM call with the fields a rollup needs. A llm_model_rates table that holds per-model pricing with effective dates. A nightly rollup job that joins the two and writes a per-customer-per-day cost row. The pieces are small on their own — most of the value is in the discipline of writing a row for every call and stamping the rate at call time.

Per-call usage events table

Start with the usage table. The schema is intentionally narrow: it captures the dimensions a rollup needs and nothing else. PII does not belong here — the customer_id is enough, and joining out to your customer table at query time is fine.

CREATE TABLE llm_usage_events (
  id                uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  customer_id       text NOT NULL,
  -- Free-form scope: 'chat', 'agent.summarize', 'rag.answer', etc.
  -- Lets you attribute cost not just to a customer but to a feature.
  feature           text NOT NULL,
  provider          text NOT NULL,            -- 'openai' | 'anthropic'
  model             text NOT NULL,            -- exact API model id, e.g. 'gpt-5.4'
  input_tokens      integer NOT NULL DEFAULT 0,
  cached_tokens     integer NOT NULL DEFAULT 0,   -- read from cache
  cache_write_tokens integer NOT NULL DEFAULT 0,  -- Anthropic explicit cache write
  output_tokens     integer NOT NULL DEFAULT 0,
  -- Stamped at call time so historical rows survive provider price changes.
  rate_input_usd_per_mtok        numeric(10, 4) NOT NULL,
  rate_cached_input_usd_per_mtok numeric(10, 4) NOT NULL,
  rate_cache_write_usd_per_mtok  numeric(10, 4) NOT NULL DEFAULT 0,
  rate_output_usd_per_mtok       numeric(10, 4) NOT NULL,
  -- Cost computed at call time (cents to avoid float rounding drift).
  cost_micro_usd    bigint NOT NULL,
  request_id        text,                     -- provider-assigned id when available
  occurred_at       timestamptz NOT NULL DEFAULT now()
);

CREATE INDEX llm_usage_events_customer_day
  ON llm_usage_events (customer_id, date_trunc('day', occurred_at));

CREATE INDEX llm_usage_events_day_model
  ON llm_usage_events (date_trunc('day', occurred_at), provider, model);

Rates table with effective-date ranges

The rates table is small enough to keep in version control, but storing it in Postgres lets you change rates without a deploy. The effective-date range is what makes mid-month rate changes safe — you close out the current row and insert a new one, and historical rows already carry the old rate stamped at call time.

CREATE TABLE llm_model_rates (
  provider                       text NOT NULL,
  model                          text NOT NULL,
  input_usd_per_mtok             numeric(10, 4) NOT NULL,
  cached_input_usd_per_mtok      numeric(10, 4) NOT NULL,
  cache_write_5m_usd_per_mtok    numeric(10, 4) NOT NULL DEFAULT 0,
  cache_write_1h_usd_per_mtok    numeric(10, 4) NOT NULL DEFAULT 0,
  output_usd_per_mtok            numeric(10, 4) NOT NULL,
  effective_from                 timestamptz NOT NULL,
  effective_to                   timestamptz,
  PRIMARY KEY (provider, model, effective_from)
);

-- The current-rate lookup the call-time wrapper uses.
CREATE OR REPLACE VIEW llm_model_rates_current AS
SELECT *
FROM llm_model_rates
WHERE effective_to IS NULL;

Wrapper as the only path to the provider SDKs

The TypeScript wrapper is where attribution becomes a discipline instead of a hope. The wrapper is the only path to the provider SDKs in the application. Direct imports of openai or @anthropic-ai/sdk outside the wrapper are caught in code review. The wrapper requires a customer_id and a feature name on every call, looks up the current rate, runs the request, and writes the usage row before returning.

type LlmCallArgs = {
  customer_id: string;
  feature: string;          // e.g. 'agent.summarize'
  model: string;            // exact model id, e.g. 'claude-opus-4-7'
  messages: AnthropicMessageParam[];
  cache_breakpoint?: boolean;
};

export async function callAnthropic(args: LlmCallArgs): Promise<AnthropicMessage> {
  const rate = await rates.getCurrent("anthropic", args.model);
  if (!rate) throw new Error(`No rate row for anthropic/${args.model}`);

  const response = await anthropic.messages.create({
    model: args.model,
    max_tokens: 4096,
    messages: args.messages,
    metadata: {
      // Opaque per-customer id for provider-side attribution.
      // Hashing keeps PII out, per the Messages API guidance.
      user_id: hmacCustomerId(args.customer_id),
    },
  });

  const usage = response.usage;
  const inputTokens       = usage.input_tokens ?? 0;
  const cachedTokens      = usage.cache_read_input_tokens ?? 0;
  const cacheWriteTokens  = usage.cache_creation_input_tokens ?? 0;
  const outputTokens      = usage.output_tokens ?? 0;

  // Cost is computed in micro-dollars (1e6 micro = 1 USD) to keep
  // integer math through the rollup and avoid float rounding drift.
  const costMicroUsd =
    Math.round((inputTokens       / 1_000_000) * rate.input_usd_per_mtok          * 1_000_000) +
    Math.round((cachedTokens      / 1_000_000) * rate.cached_input_usd_per_mtok   * 1_000_000) +
    Math.round((cacheWriteTokens  / 1_000_000) * rate.cache_write_5m_usd_per_mtok * 1_000_000) +
    Math.round((outputTokens      / 1_000_000) * rate.output_usd_per_mtok         * 1_000_000);

  await db.llmUsageEvents.insert({
    customer_id: args.customer_id,
    feature: args.feature,
    provider: "anthropic",
    model: args.model,
    input_tokens: inputTokens,
    cached_tokens: cachedTokens,
    cache_write_tokens: cacheWriteTokens,
    output_tokens: outputTokens,
    rate_input_usd_per_mtok:        rate.input_usd_per_mtok,
    rate_cached_input_usd_per_mtok: rate.cached_input_usd_per_mtok,
    rate_cache_write_usd_per_mtok:  rate.cache_write_5m_usd_per_mtok,
    rate_output_usd_per_mtok:       rate.output_usd_per_mtok,
    cost_micro_usd: costMicroUsd,
    request_id: response.id,
    occurred_at: new Date(),
  });

  return response;
}

Hashed provider-side identifier

The OpenAI variant is structurally identical. The cost math reads usage.prompt_tokens_details.cached_tokens for the cache-hit count and subtracts it from usage.prompt_tokens to get the fresh-input count. The provider-side attribution field is the safety_identifier parameter (the legacy user field still works on Chat Completions). Hash the customer_id before sending — the safety best-practices guidance is explicit about not sending raw usernames or email addresses.

export async function callOpenAI(args: LlmCallArgs): Promise<OpenAIChatCompletion> {
  const rate = await rates.getCurrent("openai", args.model);
  if (!rate) throw new Error(`No rate row for openai/${args.model}`);

  const response = await openai.chat.completions.create({
    model: args.model,
    messages: args.messages as OpenAIChatMessage[],
    safety_identifier: hmacCustomerId(args.customer_id),
  });

  const usage = response.usage;
  const cachedTokens = usage?.prompt_tokens_details?.cached_tokens ?? 0;
  const inputTokens  = (usage?.prompt_tokens ?? 0) - cachedTokens;
  const outputTokens = usage?.completion_tokens ?? 0;

  const costMicroUsd =
    Math.round((inputTokens  / 1_000_000) * rate.input_usd_per_mtok        * 1_000_000) +
    Math.round((cachedTokens / 1_000_000) * rate.cached_input_usd_per_mtok * 1_000_000) +
    Math.round((outputTokens / 1_000_000) * rate.output_usd_per_mtok       * 1_000_000);

  await db.llmUsageEvents.insert({
    customer_id: args.customer_id,
    feature: args.feature,
    provider: "openai",
    model: args.model,
    input_tokens: inputTokens,
    cached_tokens: cachedTokens,
    cache_write_tokens: 0,
    output_tokens: outputTokens,
    rate_input_usd_per_mtok:        rate.input_usd_per_mtok,
    rate_cached_input_usd_per_mtok: rate.cached_input_usd_per_mtok,
    rate_cache_write_usd_per_mtok:  0,
    rate_output_usd_per_mtok:       rate.output_usd_per_mtok,
    cost_micro_usd: costMicroUsd,
    request_id: response.id,
    occurred_at: new Date(),
  });

  return response;
}

Idempotent nightly rollup

The nightly rollup is the part finance cares about. It writes one row per customer per day with cost broken out by provider so the monthly margin report joins straight from a single table. The rollup is idempotent on (customer_id, day) so re-running for a historical day is safe — useful when you backfill a missed window or recompute after a rate correction.

CREATE TABLE llm_cost_daily (
  customer_id   text NOT NULL,
  day           date NOT NULL,
  provider      text NOT NULL,
  input_tokens  bigint NOT NULL,
  cached_tokens bigint NOT NULL,
  output_tokens bigint NOT NULL,
  cost_usd      numeric(12, 4) NOT NULL,
  PRIMARY KEY (customer_id, day, provider)
);

-- Idempotent UPSERT for any UTC day, given a $1 boundary.
INSERT INTO llm_cost_daily (customer_id, day, provider,
                            input_tokens, cached_tokens, output_tokens, cost_usd)
SELECT
  customer_id,
  date_trunc('day', occurred_at AT TIME ZONE 'UTC')::date AS day,
  provider,
  SUM(input_tokens)::bigint,
  SUM(cached_tokens)::bigint,
  SUM(output_tokens)::bigint,
  (SUM(cost_micro_usd) / 1000000.0)::numeric(12, 4) AS cost_usd
FROM llm_usage_events
WHERE occurred_at >= $1::timestamptz
  AND occurred_at <  ($1::timestamptz + interval '1 day')
GROUP BY customer_id, day, provider
ON CONFLICT (customer_id, day, provider) DO UPDATE
SET input_tokens  = EXCLUDED.input_tokens,
    cached_tokens = EXCLUDED.cached_tokens,
    output_tokens = EXCLUDED.output_tokens,
    cost_usd      = EXCLUDED.cost_usd;

date_trunc() on a timestamp with time zone truncates with respect to a particular zone. The Postgres docs are explicit that the optional time zone argument can pick the zone you bucket on. Use UTC for billing — it removes a class of off-by-one bugs around daylight saving and makes the math obvious in audits. If a customer-facing dashboard needs the customer's local zone, do that conversion at read time, not at rollup time.

Monthly provider reconciliation

The provider-reconciliation step is the second half of trust. Once a month, after the OpenAI and Anthropic invoices arrive, run a check job that sums cost_usd per provider for the invoiced period and compares it to the invoice total. The gap should be small — single-digit percent at most — and it should be explainable by tier discounts, batch-API discounts, or fine-tuning line items the per-call math does not see. A widening gap means an attribution leak: somewhere a call is happening outside the wrapper, or a rate row is stale, or a model id has shifted (the API quietly maps deprecated aliases) and the rates table still carries the old name.

A confident technical leader reviews a clean, color-coded dashboard on a monitor, clearly showing llm cost attribution per customer.

More on this

Common failure modes

The first sharp edge is calls that bypass the wrapper. Someone imports openai directly to prototype a feature, ships it, and the production traffic on that endpoint disappears from the attribution table. The fix is mechanical: a lint rule that bans the provider package imports outside the wrapper module, and a CI check that runs the lint on every PR. Without that, the leak grows monotonically and the reconciliation gap widens until nobody trusts the per-customer number.

The second sharp edge is the missing customer_id. Background jobs, admin tools, evaluation runs, and internal demos all call LLMs. They have no customer attached. The temptation is to default to a placeholder string, but a placeholder means real per-customer reports double-count the placeholder bucket with any customer who shares that string. Use a dedicated internal:<scope> convention — internal:eval, internal:admin, internal:backfill — and treat any row with an internal: customer_id as overhead in the margin report.

The third sharp edge is the rate change you forgot to record. A provider drops a price by 20 percent on the first of the month, you do not notice for a week, and during that week the rates table still carries the old number. The cost rows over-state actual spend for that week. The fix has two halves: a watcher that pings the rates page weekly and alerts on a delta, and a discipline that any time you see the invoice come in lower than the per-customer sum suggests, you backfill the rate row and recompute the affected days. The recompute is cheap because the rollup is idempotent.

The fourth sharp edge is model-id drift. Provider APIs accept aliases like gpt-4o that route to the current default snapshot. The snapshot can change. Your rates table has a row for gpt-4o, but the call now routes to a model with a different rate, and the wrapper records the alias as the model. The fix is to record the exact model id the response returns (response.model on both providers) in addition to the requested id, and join the rates table on the resolved id. When the alias rotates, the rates table either has a matching resolved-id row or it does not, and you find out at write time, not at month end.

The fifth sharp edge is cached_tokens being conflated with input_tokens. Both providers report the cached count separately, and the discount is meaningful — Anthropic prices cache reads at 10 percent of the base input rate, OpenAI applies a similar discount on cached input. Wrappers that sum input_tokens + cached_tokens into a single counter and apply the base rate to the whole thing overcharge cache-heavy workloads in the internal model. The schema above carries cached_tokens as its own column for this reason, and the rates table has its own column for the cached rate so the multiplication is unambiguous.

The sixth sharp edge is timezone confusion in the rollup. Daily cost rows bucketed in the application server's local time create off-by-one disagreements with the customer's billing window and with the provider invoice (which closes UTC). Pick UTC for the rollup table, convert to the display zone at the read site, and label the column day_utc in the schema if it helps future-you remember which zone the bucket is in.

What this looks like in production

At BFEAI the attribution table is the source of truth for both the internal margin dashboard and the per-customer credit consumption shown in the product. The rollup runs at 03:15 UTC against the previous UTC day. The dashboard ships finance three numbers per customer: tokens consumed, dollar cost, and a margin number that joins to the subscription line in Stripe. The margin number is the one that drives renewal pricing conversations — anything under a target threshold gets a flag and a manual review before the customer's next renewal cycle.

The reconciliation job runs after each provider invoice lands. The pass is simple: sum cost_usd for the invoiced period, compare to invoice total, log the gap. A gap inside three percent is normal — provider-side rounding, batch API discounts, and the small lag between request and response timestamp account for most of it. A gap over five percent triggers an investigation, and the investigation typically lands on one of the failure modes in the previous section: an unrouted call site, a stale rate row, or a model alias rotation we missed.

The provider-side attribution — the metadata.user_id on Anthropic, the safety_identifier on OpenAI — pays for itself the first time a customer triggers a provider safety review. Anthropic's docs frame the field as helping them detect abuse; that detection is per-identifier, and when it fires they need a stable identifier to scope the issue. Without a per-customer identifier, a single misbehaving customer can pull the whole workspace into a review. Hashing the customer_id keeps the identifier stable across requests without leaking PII — the safety guidance from both providers is explicit on this.

The other operational habit that matters is keeping the per-feature breakdown. The schema's feature column lets you ask not just "what did customer X cost us" but "what does the summarize agent cost across all customers" — which is the question that drives prompt-engineering investments. A summarize agent that costs 3 cents per call across a million calls a month is the one to optimize; a low-volume reasoning agent that costs 30 cents per call is not. The per-customer-per-feature pivot makes that obvious in a single SQL query.

What to watch in your own implementation

Open your codebase and grep for direct imports of the provider SDKs. Every import outside the wrapper module is a leak in the attribution table. Land the lint rule that bans them before you build any of the rest of this — the wrapper is the boundary, and the boundary has to be enforced mechanically or it will erode.

Then check whether your call sites all have a customer_id. Search for wrapper calls without one. Any call site missing the argument is either a background job that needs an internal: scope or a feature that lost its customer context somewhere up the stack. The second case is the dangerous one — a feature that has a real customer but lost track of it is a feature you will undercharge.

Finally, run a one-month reconciliation: sum your per-customer cost numbers, compare to the provider invoice total. The gap is the size of the attribution leak. Anything under three percent is fine. Anything larger and the priorities are clear — find the unrouted call sites, refresh the rate rows, and recompute the affected days from the idempotent rollup. The number in finance's spreadsheet stops being a guess once the gap is small, and stays that way as long as the wrapper boundary holds.

Outcomes you should expect

What this delivers

  • Every paid customer has a daily and monthly LLM cost number in Postgres, ready for margin math and pricing reviews.
  • Provider rate changes mid-month do not corrupt historical attribution because the rate that applied at call time is stamped on the usage row.
  • Prompt-caching discounts are reflected per-customer instead of disappearing into a single pooled invoice line.
  • Finance can answer 'what did customer X cost us last month' in one query, not by re-deriving the number from OpenAI and Anthropic dashboards.

Primary sources

By the numbers

  • Anthropic's prompt caching prices 5-minute cache write tokens at 1.25 times the base input token price, 1-hour cache write tokens at 2 times, and cache read tokens at 0.1 times the base input token price.

    Source ↗

  • Anthropic's Messages API exposes cache_creation_input_tokens, cache_read_input_tokens, and input_tokens fields in the usage block, and total input tokens equals the sum of those three counters.

    Source ↗

  • Anthropic's metadata.user_id field is described as an external identifier that 'should be a uuid, hash value, or other opaque identifier' and explicitly says 'Do not include any identifying information such as name, email address, or phone number.'

    Source ↗

  • OpenAI prompt caching is enabled automatically for prompts that are 1024 tokens or longer, and the response surfaces a prompt_tokens_details.cached_tokens field counting how many prompt tokens were served from cache.

    Source ↗

  • OpenAI's safety identifier guidance tells developers to 'Hash the username or email address in order to avoid sending us any identifying information' when populating the per-request user identifier.

    Source ↗

  • Postgres date_trunc() on a timestamp with time zone is documented to perform truncation 'with respect to a particular time zone' and accepts an optional time_zone argument, so daily and monthly billing windows can be aligned to a specific zone.

    Source ↗

Live in production today

The same engineering, shipped in production at BFEAI.

I'm co-founder & CTO of Be Found Everywhere (BFEAI), a 7-app AI SaaS platform running today. The work I deliver for clients is the work I do every week on my own platform.

7

Production apps

200K+

Keywords generated

1,500+

AI scans run

7,000+

Sites automated

Common questions

What buyers ask before reaching out

Why not just divide the monthly OpenAI bill by customer count?

Because the distribution is rarely even. In most LLM products one or two customers drive a disproportionate share of token volume, and a pooled per-customer average hides the customers who are unprofitable at the current price. The per-customer table lets you spot the ones whose cost exceeds their revenue before the next renewal.

How do I attribute usage to the right customer when the API call is server-to-server?

The customer_id has to be passed in from the caller and stored on the usage row at write time. You cannot derive it later from the API key, because internal jobs and admin tools often share keys. Make customer_id a required argument on every wrapper function that calls OpenAI or Anthropic, and fail fast in code review if a call site omits it.

What goes in the metadata.user_id field that Anthropic and OpenAI accept?

An opaque hash of your internal customer_id. Anthropic's docs are explicit that the value should be a UUID, hash, or opaque identifier and explicitly tell you not to send names, emails, or phone numbers. OpenAI's safety guidance says the same — hash the username or email before sending. Use a deterministic HMAC of customer_id with a server-side secret so the same customer maps to the same identifier across requests.

How do I handle provider rate changes mid-month?

Stamp the rate on the usage row at call time, not at rollup time. The rates table is small — model name, input rate, output rate, cached-read rate, effective_from, effective_to — and the call-time worker copies the current rate onto the usage_event row. When the provider changes pricing mid-month, you close out the old rate row and open a new one. Historical attribution stays correct because every row carries the rate that applied when the call ran.

Do cached tokens really need their own column?

Yes, if you want the attribution math to match the invoice. Cached input tokens are billed at a different rate from fresh input tokens — Anthropic's cache reads are 10 percent of the base input rate, and OpenAI also discounts cached input. Lumping them together overstates cost for cache-heavy workloads and understates it for cache-light ones. The schema needs three counters: input_tokens, cached_tokens, and output_tokens, plus the cache_write counter if you use Anthropic's explicit caching.

How often should the rollup job run?

Most teams settle on a nightly rollup against the previous UTC day, plus an on-demand recompute the rollup can be re-run for any historical day. The nightly cadence is enough for monthly margin math and for the billing-side worker that turns usage into invoice line items. If you also need real-time per-customer cost in an admin dashboard, build a continuous aggregate on top of the same source table — the rollup and the live view share the schema.

What about embedding calls and Whisper transcription — same table?

Same table, different model rows. The schema is generic over model name, and the rates table holds per-model rates regardless of whether the call is chat, embedding, or audio. The rollup multiplies tokens (or minutes, for audio) by the applicable rate. The only thing that changes is what you put in the input_tokens column for non-token-priced models — for Whisper it is duration in seconds, which the rates table prices per minute.

Where does this break under high write volume?

The single bottleneck is the per-call insert latency back into Postgres on the hot path. The fix is to write asynchronously: the LLM wrapper enqueues the usage row to a local buffer that flushes in batches every few seconds, and the insert happens off the request path. The buffer carries the same data the synchronous insert would have written, so the schema is unchanged. Loss tolerance is low because rows lost in the buffer become attribution gaps — back the buffer with a durable queue if your write rate justifies it.

Ready to see if this is a fit?

A 15-minute call. No deck, no slides. We talk about what you're shipping and where engineering is the bottleneck. Either way, you walk away with a senior engineer's read on your situation.