
For your stack

Updated June 2026

Multi-Tenant Architecture for AI SaaS — RAG, LLM Keys, No Leaks

Per-tenant RAG namespaces, scoped LLM keys, cost budgets that cap before the bill arrives, and prompt customization without a fork — built so Customer A's data stays out of Customer B's chatbot.


The problem

AI SaaS multi-tenancy fails in places generic SaaS multi-tenancy does not. The classic tenant-leak failure mode — a forgotten WHERE clause in a relational query — is well-understood and solvable with Row Level Security. The AI-specific failure modes are newer and the patterns to prevent them are less established. A retrieval pipeline pulls the top-k nearest neighbors from a shared vector index, and the nearest neighbors to Customer B's question turn out to include Customer A's onboarding documents. A shared LLM API key racks up a $40,000 bill in a weekend because one customer's agent went into a retry loop on a tool call, and there is no way to tell finance which customer to bill for it. A prompt customization for one enterprise tenant ends up in a forked branch that drifts from main for six months until the next major refactor erases it. An eval harness for prompt regressions trains on Customer A's queries and silently includes them in a report sent to Customer B's account manager.

Trust-class incidents at Series A scale

For a Series A AI SaaS doing real revenue across a handful of enterprise customers, any one of these incidents can end the customer relationship and then resurface in procurement reviews at every other prospect. The cost is not the engineering time to fix the bug. It is the trust rebuild and the contractual indemnity exposure your enterprise customers asked for in their DPA. The brittle version of AI multi-tenancy looks fine in code review: "we filter retrieval by tenant_id" reads like an obvious safeguard. It fails in the edge cases: a background job that re-indexes documents without scoping by tenant, a developer who tests a new chunking strategy against the production shared namespace, an agent loop that calls the LLM hundreds of times inside a single request, so a per-request rate limiter never gets a chance to stop it.

Isolation below the human-discipline layer

The architecture has to assume those edge cases will exist and prevent them at a lower layer than human discipline — which is exactly the same principle the relational multi-tenancy story already taught the industry, applied now to vector stores, LLM provider accounts, prompt configuration, and eval pipelines.


What changes for your business

A correctly architected multi-tenant AI SaaS isolates five things, each at the layer where isolation is structurally enforceable rather than logically asserted.

Per-tenant vector namespaces

First, the vector store. Each tenant gets its own Pinecone namespace, per Pinecone's documented multi-tenancy pattern: each namespace is stored separately, every query targets exactly one namespace, and a leak between tenants requires a logic bug at the API layer rather than a missed metadata filter inside a query. The namespace-per-tenant approach also wins on cost at any meaningful scale. Under Pinecone's own pricing model, querying a single tenant's 1 GB namespace costs 1 RU, whereas metadata-filtering a shared 100 GB namespace for the same tenant costs 100 RUs.
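To make that cost gap concrete, here is a small worked sketch of the arithmetic. The 1 RU per 1 GB rate is the figure quoted above; the helper name and the 10,000-queries-a-day workload are invented for illustration, and a real invoice may include minimums or other terms this does not model.

```typescript
// Rate quoted in the text: 1 read unit (RU) per 1 GB of namespace queried.
const RU_PER_GB_QUERIED = 1;

// Hypothetical helper: cost of one query against a namespace of the given size.
function queryCostInRu(namespaceSizeGb: number): number {
  return namespaceSizeGb * RU_PER_GB_QUERIED;
}

// Illustrative workload (an assumption, not a benchmark).
const tenantQueriesPerDay = 10000;

// Namespace per tenant: each query is billed on that tenant's 1 GB only.
const dedicatedRuPerDay = queryCostInRu(1) * tenantQueriesPerDay;

// Shared namespace plus metadata filter: each query is billed on all 100 GB.
const sharedRuPerDay = queryCostInRu(100) * tenantQueriesPerDay;
```

Under these assumptions the shared layout uses 100 times the read units for the same tenant's traffic, and the gap widens as the shared namespace grows.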

RLS on the relational side of retrieval

Second, the relational data behind retrieval. Source documents, chunk metadata, and tenant config live in Postgres with RLS enabled and FORCE ROW LEVEL SECURITY set on every tenant-scoped table — and pgvector similarity queries respect those RLS policies, per Supabase's own RAG-with-permissions documentation. A vector search inside Postgres only scans rows the requesting tenant owns, so the relational side of retrieval inherits the same default-deny that protects the rest of the schema.

Scoped LLM provider credentials

Third, the LLM provider account. High-value tenants get a per-tenant project key (OpenAI, which carries its own RPM, TPM, RPD, TPD, and IPM limits configurable per model) or a per-tenant Workspace (Anthropic, which supports per-workspace spend and rate limit overrides on top of the organization-level ITPM and OTPM ceilings). Smaller tenants share a key with attribution done in your own ledger. The benefit is twofold — provider-side rate limits cap one tenant's runaway agent before it eats the platform's bill, and per-tenant usage attribution stops being a spreadsheet reconciliation at month end.

Per-tenant cost budgets and rate limits

Fourth, cost budgets and rate limits. Provider-side ceilings are a backstop, not a front line. A per-tenant token bucket lives in your backend, checked synchronously before every LLM call, with input/output token counters that roll over per minute and dollar counters that roll over per billing period. A tenant that hits its budget gets a clean 429 with a "tenant budget exceeded" error rather than a confused timeout when the upstream provider rate-limits the whole org. The bucket increments on the response so cached input tokens — which Anthropic does not count toward ITPM for most models — get attributed accurately.
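A minimal sketch of that gate, assuming an in-memory store in place of Redis and a fixed one-minute window rather than a true refilling token bucket (a simpler stand-in, named plainly). Every type and method name here is hypothetical; the injected clock exists only to make the rollover testable.

```typescript
// In-memory stand-in for a Redis-backed budget store. All names are hypothetical.
interface BudgetGateDecision {
  allowed: boolean;
  reason?: "tenant_rate_limit_exceeded" | "tenant_budget_exceeded";
}

interface TenantBudgetLimits {
  inputTokensPerMinute: number;
  outputTokensPerMinute: number;
  dollarsPerBillingPeriod: number;
}

interface TenantBudgetCounters {
  windowStartMs: number; // start of the current one-minute window
  inputTokens: number;
  outputTokens: number;
  dollarsSpent: number; // cleared by resetBillingPeriod()
}

class InMemoryTenantBudgets {
  private counters = new Map<string, TenantBudgetCounters>();
  private limits: Map<string, TenantBudgetLimits>;
  private now: () => number;

  constructor(limits: Map<string, TenantBudgetLimits>, now: () => number = () => Date.now()) {
    this.limits = limits;
    this.now = now;
  }

  // Fetch a tenant's counters, rolling the per-minute window over if it has expired.
  private countersFor(tenantId: string): TenantBudgetCounters {
    const nowMs = this.now();
    let c = this.counters.get(tenantId);
    if (!c) {
      c = { windowStartMs: nowMs, inputTokens: 0, outputTokens: 0, dollarsSpent: 0 };
      this.counters.set(tenantId, c);
    }
    if (nowMs - c.windowStartMs >= 60000) {
      c.windowStartMs = nowMs;
      c.inputTokens = 0;
      c.outputTokens = 0;
    }
    return c;
  }

  // Called synchronously before every LLM call.
  check(tenantId: string, estimatedInputTokens: number): BudgetGateDecision {
    const limits = this.limits.get(tenantId);
    // Default deny: a tenant with no configured limits gets no budget.
    if (!limits) return { allowed: false, reason: "tenant_budget_exceeded" };
    const c = this.countersFor(tenantId);
    if (c.dollarsSpent >= limits.dollarsPerBillingPeriod) {
      return { allowed: false, reason: "tenant_budget_exceeded" };
    }
    const inputOver = c.inputTokens + estimatedInputTokens > limits.inputTokensPerMinute;
    const outputOver = c.outputTokens >= limits.outputTokensPerMinute;
    if (inputOver || outputOver) {
      return { allowed: false, reason: "tenant_rate_limit_exceeded" };
    }
    return { allowed: true };
  }

  // Called with the provider's reported usage once the response arrives.
  record(tenantId: string, usage: { inputTokens: number; outputTokens: number; dollars: number }): void {
    const c = this.countersFor(tenantId);
    c.inputTokens += usage.inputTokens;
    c.outputTokens += usage.outputTokens;
    c.dollarsSpent += usage.dollars;
  }

  // Called by whatever job closes out a billing period.
  resetBillingPeriod(tenantId: string): void {
    const c = this.counters.get(tenantId);
    if (c) c.dollarsSpent = 0;
  }
}
```

In production the check and the increment would need to be atomic across processes, for example with a Redis script or transaction. This sketch skips that because it runs in a single process.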

Prompt customization and per-tenant evals

Fifth, prompt customization and evals. Prompts are versioned rows in a tenant_config table, not branches in source. The application loads the per-tenant template at request time, fills the configurable slots (tenant name, custom instructions, banned topics, persona tone), and sends the assembled prompt to the LLM. Evals are stored per tenant with the same RLS protections as live data — a prompt change ships only after passing every tenant's eval set, and a regression report for Customer A is structurally incapable of containing Customer B's queries.
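The slot-filling step can be sketched roughly as follows. The `{{slot}}` syntax, the template text, and the function name are assumptions made for illustration; the four slots are the ones listed above.

```typescript
// Per-tenant values that fill the shared template. Field names mirror the
// slots named in the text; the shape itself is hypothetical.
interface TenantPromptSlots {
  tenantName: string;
  customInstructions: string;
  bannedTopics: string[];
  personaTone: string;
}

// One template body shared by every tenant (illustrative wording).
const SHARED_TEMPLATE =
  "You are the assistant for {{tenantName}}. Tone: {{personaTone}}. " +
  "Never discuss: {{bannedTopics}}. {{customInstructions}}";

function renderTenantPrompt(templateBody: string, slots: TenantPromptSlots): string {
  const values = new Map<string, string>();
  values.set("tenantName", slots.tenantName);
  values.set("customInstructions", slots.customInstructions);
  values.set("bannedTopics", slots.bannedTopics.join(", "));
  values.set("personaTone", slots.personaTone);

  // Single pass over the template: inserted values are never re-scanned, so a
  // tenant cannot smuggle a {{slot}} reference in through its own config.
  return templateBody.replace(/\{\{(\w+)\}\}/g, (_match: string, slotName: string) => {
    const value = values.get(slotName);
    if (value === undefined) {
      // Fail loudly on a typo in the template instead of sending a broken prompt.
      throw new Error("Unknown prompt slot: " + slotName);
    }
    return value;
  });
}
```

Because the template body is data, the same function serves every tenant, and a config change never touches source control.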

The outcome the founder cares about: the 3am "why is our chatbot quoting Customer A's pricing to Customer B" incident does not happen, the finance team gets per-tenant AI cost numbers without a reconciliation script, and a single runaway agent on one tenant's account stops at that tenant's budget instead of consuming the platform's monthly LLM bill.


What gets shipped for AI-specific multi-tenancy

The work for an AI SaaS at this intersection lands in a predictable shape, layered on top of the standard multi-tenant base (RLS, JWT auth, per-tenant Stripe Customer mapping). The AI-specific layers go in this order because each depends on the one before it.

A per-tenant namespace provisioning flow comes first. When a tenant is created in your control plane, a Pinecone namespace gets provisioned alongside the Postgres tenant row, with the namespace ID stored on the tenant record. All retrieval calls take the namespace ID from the authenticated tenant context — not from request input — so a malicious or buggy client cannot supply another tenant's namespace. The same flow handles deletion: a churned tenant's namespace gets dropped as a single Pinecone API call, and every vector for that tenant goes with it because nothing was shared.
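Here is a rough sketch of the "namespace comes from the authenticated context" rule. It uses an in-memory store with substring matching as a stand-in for a real vector index, so none of these names are Pinecone's API; they are hypothetical and only illustrate where the namespace is resolved.

```typescript
// Derived from a verified token by your auth middleware (hypothetical shape).
interface TenantAuthContext { tenantId: string }

// The tenant row from the control plane, carrying the provisioned namespace ID.
interface TenantDirectoryRow { id: string; vectorNamespace: string }

// Client-controlled input. Any namespace field on it is untrusted.
interface RetrievalRequestBody { query: string; namespace?: string }

// In-memory stand-in for a namespaced vector index.
class InMemoryNamespacedStore {
  private byNamespace = new Map<string, string[]>();

  upsert(namespace: string, chunks: string[]): void {
    const existing = this.byNamespace.get(namespace) || [];
    this.byNamespace.set(namespace, existing.concat(chunks));
  }

  // Substring match stands in for similarity search.
  query(namespace: string, text: string): string[] {
    const chunks = this.byNamespace.get(namespace) || [];
    const needle = text.toLowerCase();
    return chunks.filter((chunk) => chunk.toLowerCase().indexOf(needle) !== -1);
  }

  // Dropping the namespace removes every chunk for that tenant in one call.
  deleteNamespace(namespace: string): void {
    this.byNamespace.delete(namespace);
  }
}

function retrieveForTenant(
  ctx: TenantAuthContext,
  directory: Map<string, TenantDirectoryRow>,
  store: InMemoryNamespacedStore,
  body: RetrievalRequestBody
): string[] {
  const tenant = directory.get(ctx.tenantId);
  if (!tenant) throw new Error("Unknown tenant: " + ctx.tenantId);
  // body.namespace is deliberately ignored: the namespace always comes from
  // the tenant record behind the authenticated context.
  return store.query(tenant.vectorNamespace, body.query);
}
```

A request authenticated as one tenant that supplies another tenant's namespace in its body still only ever sees its own chunks.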

A per-tenant LLM credential layer ships next. The control plane provisions an OpenAI project key or Anthropic workspace per high-value tenant, stores the credential reference (not the raw key) on the tenant record, and routes LLM calls through a small adapter that loads the right credential for the current tenant. Smaller tenants share a credential pool, with usage attributed in your ledger by tenant_id and reconciled against the provider's usage API on a daily cron.
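One way the routing decision inside that adapter might look, as a sketch. The types, the `secretRef` field, and the hash-based pick from the shared pool are all assumptions. The only points taken from the text are that the tenant record holds a reference rather than a raw key, and that smaller tenants fall back to a shared pool.

```typescript
type LlmProviderName = "openai" | "anthropic";

// A pointer into your secret manager, never the raw key (hypothetical shape).
interface TenantCredentialRef {
  provider: LlmProviderName;
  secretRef: string;
  scope: "dedicated" | "shared_pool";
}

interface CredentialRoutedTenant {
  id: string;
  // Present only for tenants with their own project key or workspace.
  dedicatedCredential?: { provider: LlmProviderName; secretRef: string };
}

function credentialRefForTenant(
  tenant: CredentialRoutedTenant,
  sharedPool: TenantCredentialRef[]
): TenantCredentialRef {
  if (tenant.dedicatedCredential) {
    return {
      provider: tenant.dedicatedCredential.provider,
      secretRef: tenant.dedicatedCredential.secretRef,
      scope: "dedicated",
    };
  }
  if (sharedPool.length === 0) {
    throw new Error("No shared credentials configured");
  }
  // Stable pick: the same tenant always maps to the same shared credential,
  // which keeps daily reconciliation against provider usage simpler.
  let hash = 0;
  for (let i = 0; i < tenant.id.length; i++) {
    hash = (hash * 31 + tenant.id.charCodeAt(i)) % 1000003;
  }
  return sharedPool[hash % sharedPool.length];
}
```

Callers can branch on `scope`: a `shared_pool` result is the signal that attribution has to come from your own ledger, because the provider's usage view will not separate those tenants.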

The cost-and-rate budget layer ships next, sitting in front of the LLM adapter. A Redis-backed token bucket per tenant tracks RPM, ITPM, and OTPM. A separate per-tenant dollar counter rolls over per billing period and is checked before each call. The 429 response carries a structured error code (tenant_budget_exceeded, tenant_rate_limit_exceeded) that downstream clients can surface to the customer cleanly rather than treating as a generic upstream failure.
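The structured 429 might be shaped something like this. The two error codes come from the text above; the response envelope, the messages, and the optional `Retry-After` header are illustrative assumptions, not a prescribed contract.

```typescript
type TenantLimitCode = "tenant_budget_exceeded" | "tenant_rate_limit_exceeded";

// Hypothetical response envelope; adapt to your API's existing error format.
interface TenantLimitHttpResponse {
  status: number;
  headers: { [name: string]: string };
  body: { error: { code: TenantLimitCode; message: string; tenantId: string } };
}

function toTenantLimitResponse(
  tenantId: string,
  code: TenantLimitCode,
  retryAfterSeconds?: number
): TenantLimitHttpResponse {
  const headers: { [name: string]: string } = { "Content-Type": "application/json" };
  // Only a rate limit has a meaningful short retry hint; a spent budget
  // usually does not, so callers simply omit the argument in that case.
  if (retryAfterSeconds !== undefined) {
    headers["Retry-After"] = String(Math.ceil(retryAfterSeconds));
  }
  const message =
    code === "tenant_budget_exceeded"
      ? "Tenant budget exceeded for the current billing period."
      : "Tenant rate limit exceeded; retry shortly.";
  return { status: 429, headers: headers, body: { error: { code: code, message: message, tenantId: tenantId } } };
}
```

Clients can switch on `error.code` to show "you have used this month's AI budget" instead of a generic "something went wrong".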

Prompt customization ships as a tenant_config table with versioned prompt rows and a small templating layer that fills the slots. The control plane lets a customer-success engineer adjust a tenant's prompt template without a deploy and roll it forward or back atomically. The eval harness — the last piece — stores per-tenant eval sets in the same RLS-protected table, runs candidate prompts against each tenant's eval set as a CI gate, and emits per-tenant pass/fail reports that respect the same tenant boundaries as live retrieval.

// LLM call goes through a tenant-aware adapter — not directly to the SDK.
// The adapter loads the tenant's credential, checks budgets, calls the
// provider, then increments the ledger on the response.

async function tenantAwareCompletion(tenantId: string, request: LlmRequest) {
  // 1. Budget gate (cheap — Redis check)
  const budget = await budgets.check(tenantId, request.estimatedTokens);
  if (!budget.allowed) {
    throw new TenantBudgetExceededError(tenantId, budget.reason);
  }

  // 2. Load tenant credential (per-project key for OpenAI, per-workspace
  //    auth for Anthropic, or shared pool with ledger attribution).
  const credential = await credentials.forTenant(tenantId);

  // 3. Load per-tenant prompt template — versioned, configurable, not forked.
  const prompt = await prompts.render(tenantId, request.template, request.vars);

  // 4. Provider call. Cached input tokens (Anthropic) get attributed
  //    separately because they do not count against ITPM for most models.
  const response = await provider.complete(credential, prompt, request.options);

  // 5. Ledger — per-tenant attribution survives even if the provider's
  //    own usage view groups everything under one org-level account.
  await ledger.record(tenantId, response.usage);

  return response;
}

-- Per-tenant prompt templates live as versioned config rows, not in source.

create table public.tenant_prompts (
  id uuid primary key default gen_random_uuid(),
  tenant_id uuid not null references public.tenants(id) on delete cascade,
  template_name text not null,
  version int not null,
  body text not null,
  created_at timestamptz not null default now(),
  unique (tenant_id, template_name, version)
);

alter table public.tenant_prompts enable row level security;
alter table public.tenant_prompts force  row level security;

create policy tenant_isolation_select on public.tenant_prompts
  for select using (
    tenant_id = ((auth.jwt() -> 'app_metadata') ->> 'tenant_id')::uuid
  );

Proof this pattern lands

BoostFrame Enterprise AI (BFEAI) runs seven production AI applications on the multi-tenant architecture described above, including the per-tenant credential routing, the budget-and-rate layer, and the versioned-prompt-per-tenant pattern — the same stack that has generated 200K+ AI-assisted keywords and run 1,500+ AI scans for paying customers across the suite without a documented cross-tenant retrieval incident. The Pinecone namespace and Supabase RLS pieces transfer one-for-one to a customer engagement; the provider-side per-project-key or per-workspace work is the part that gets sized against your existing OpenAI or Anthropic account structure. The author is Bill Fackelman, co-founder and CTO of BoostFrame Enterprise AI.

Outcomes you should expect

What this delivers

Customer A's documents stop surfacing in Customer B's retrieval — the 'why is our chatbot quoting Customer A's pricing to Customer B' incident becomes structurally impossible, not just unlikely.
Per-tenant LLM cost is a number you can see and cap before month-end, instead of a single OpenAI invoice your finance team has to split with a spreadsheet.
Prompt customization per tenant ships as configuration, not a fork — a customer's tone, system prompt, or banned topics live in tenant config, not in branched code.
Eval harnesses run per tenant so a prompt change can be validated against Customer A's real traffic without leaking Customer A's data into Customer B's regression suite.

Industry data

By the numbers

Pinecone's multi-tenancy guidance is to use a serverless index with one namespace per tenant — because each namespace is stored separately, this provides physical isolation of data between tenants and every data plane operation targets exactly one namespace, so a query from Tenant A cannot accidentally return Tenant B's vectors.
Pinecone's namespace-per-tenant pattern is also a cost-control mechanism — query cost is based on namespace size at 1 RU per 1 GB, so querying one tenant's 1 GB namespace costs 1 RU while metadata-filtering a single shared 100 GB namespace for the same tenant would cost 100 RUs.
Anthropic's Claude API enforces rate limits at the organization level and supports per-Workspace spend and rate limit overrides, with separate ITPM and OTPM ceilings per model — which is the mechanism that lets a multi-tenant SaaS cap one customer's AI usage without throttling the rest of the platform.
OpenAI's project-scoped API keys carry their own model-usage permissions and rate limits (RPM, TPM, RPD, TPD, IPM) configurable per model per project, which is how a multi-tenant AI product can attribute spend and enforce ceilings on a per-tenant basis instead of sharing one global organization key.
Supabase's RAG-with-permissions guidance shows that pgvector similarity queries continue to respect Row Level Security policies — when an authenticated user runs a vector search over a table protected by RLS, only embeddings the user has access to are scanned and returned, which makes RLS the enforcement boundary for per-tenant retrieval, not application code.

Live in production today

The same engineering, shipped in production at BFEAI.

I'm co-founder & CTO of Be Found Everywhere (BFEAI), a 7-app AI SaaS platform running today. The work I deliver for clients is the work I do every week on my own platform.

7 production apps
200K+ keywords generated
1,500+ AI scans run
7,000+ sites automated

Common questions

What buyers ask before reaching out

How do you keep one tenant's documents out of another tenant's RAG retrieval?

Two layers, because one is not enough for AI workloads. In the vector store, every tenant gets its own namespace — Pinecone's documented multi-tenancy pattern, where each namespace is stored separately and every query targets exactly one namespace, so a query from Tenant A cannot physically reach Tenant B's vectors. In the relational store that holds chunk metadata and source documents, Row Level Security on Postgres enforces tenant scope at the database layer — and pgvector similarity queries respect those RLS policies, so even a vector search inside the database only scans rows the requesting tenant owns. The application code still passes tenant_id explicitly, but the leak class of bug becomes structurally impossible rather than relying on every engineer to remember the filter.

Do we need per-tenant API keys for OpenAI or Anthropic, or can we use one shared key?

One shared org-level key is the path most teams start on, and it works until the first customer asks for a usage breakdown or the first customer's runaway agent consumes the month's budget in a weekend. The cleaner architecture is per-tenant project keys on OpenAI (which carry their own RPM, TPM, RPD, TPD, and IPM limits configurable per model) and a workspace per high-value tenant on Anthropic (which supports per-workspace spend and rate limit overrides). For smaller tenants you can group them under a shared workspace with attribution done in your own ledger. The decision is less about isolation and more about where you want the cost-attribution and rate-limit ceiling to live — on the provider, or in your code.

How do you stop one tenant's AI usage from blowing up the whole platform's bill?

A per-tenant cost budget and rate limit, enforced in your backend before the call reaches the provider. The pattern is a token bucket per tenant for requests and a rolling token counter per tenant for input/output tokens, both checked synchronously before issuing the LLM call. Provider-side limits are the second backstop — Anthropic's per-workspace ITPM and OTPM ceilings and OpenAI's per-project TPM limits catch what your own bucket misses. The tenant ledger increments on the response so cached input tokens (which Anthropic does not count toward ITPM for most models) get attributed accurately. The customer sees a 429 from your API with a clear 'tenant budget exceeded' error rather than a confused timeout when the upstream provider rate-limits the whole org.
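A sketch of the ledger side, recording cached input tokens separately from uncached ones. It assumes the provider reports the two counts as disjoint fields; if your provider folds cached tokens into its input total, subtract them before recording. The class and field names are hypothetical.

```typescript
// Usage for one call. Assumption: inputTokens excludes cachedInputTokens.
interface LedgerUsage {
  inputTokens: number; // uncached input tokens
  cachedInputTokens: number; // input tokens served from the provider's prompt cache
  outputTokens: number;
}

interface TenantLedgerTotals extends LedgerUsage {
  calls: number;
}

class InMemoryUsageLedger {
  private totals = new Map<string, TenantLedgerTotals>();

  // Called once per response, with the usage the provider reported.
  record(tenantId: string, usage: LedgerUsage): void {
    const t = this.totalsFor(tenantId);
    this.totals.set(tenantId, {
      calls: t.calls + 1,
      inputTokens: t.inputTokens + usage.inputTokens,
      cachedInputTokens: t.cachedInputTokens + usage.cachedInputTokens,
      outputTokens: t.outputTokens + usage.outputTokens,
    });
  }

  totalsFor(tenantId: string): TenantLedgerTotals {
    return this.totals.get(tenantId) || { calls: 0, inputTokens: 0, cachedInputTokens: 0, outputTokens: 0 };
  }

  // Input tokens that count toward a per-minute input limit, on the stated
  // assumption that cached reads are exempt from that limit.
  rateLimitedInputTokens(tenantId: string): number {
    return this.totalsFor(tenantId).inputTokens;
  }

  // Sum across tenants, for reconciling against the provider's org-level view.
  platformTotals(): LedgerUsage {
    const sum: LedgerUsage = { inputTokens: 0, cachedInputTokens: 0, outputTokens: 0 };
    this.totals.forEach((t) => {
      sum.inputTokens += t.inputTokens;
      sum.cachedInputTokens += t.cachedInputTokens;
      sum.outputTokens += t.outputTokens;
    });
    return sum;
  }
}
```

Keeping the cached count as its own column means a tenant with a long, heavily cached system prompt is neither over-throttled nor billed as if every token were fresh.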

How does prompt customization per tenant work without forking the codebase?

Prompts live in a versioned tenant-config table, not in source. The application loads the prompt template for the current tenant at request time, fills the variable slots (tenant name, custom instructions, banned topics, persona tone), and sends the assembled prompt to the LLM. The base prompt skeleton is the same across all tenants — what changes is the fillable config. This keeps the codebase singular while letting each tenant feel like the product was built for them. Versioning matters because you will want to roll a config change forward and back without a deploy, and because the eval harness needs to pin the exact prompt version that produced a given output.

What does a per-tenant eval harness look like and why does it matter?

An eval harness is a stored set of input prompts plus expected outputs (or scoring rubrics) that a CI job runs against a candidate prompt version. Per-tenant means each tenant has its own eval set, typically derived from their real usage with PII scrubbed and consent obtained, and a prompt change ships only after passing every tenant's eval. The part that matters most is that the eval data stays scoped: Customer A's eval inputs are visible only to the harness runs that target Customer A's tenant. The same RLS policies and namespace boundaries that protect retrieval protect the eval store, so a prompt regression test cannot accidentally leak Customer A's queries into Customer B's regression report.
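As a rough illustration of the CI gate, the sketch below runs a candidate prompt version against each tenant's eval set and emits one report per tenant. The "output must include this substring" rubric, the runner signature, and all the names are assumptions. A real harness would call the LLM through the tenant-aware adapter and likely use richer scoring.

```typescript
// One eval case with a deliberately simple rubric (an assumption).
interface TenantEvalCase { input: string; mustInclude: string }

// Reports carry counts only, never eval inputs or outputs, so a report for
// one tenant has nothing in it that came from another tenant's queries.
interface TenantEvalReport { tenantId: string; promptVersion: number; passed: number; failed: number }

// Hypothetical stand-in for "render this tenant's prompt at this version and call the LLM".
type CandidatePromptRunner = (tenantId: string, promptVersion: number, input: string) => string;

function runTenantEvalGate(
  evalSets: Map<string, TenantEvalCase[]>,
  promptVersion: number,
  runCandidate: CandidatePromptRunner
): { ship: boolean; reports: TenantEvalReport[] } {
  const reports: TenantEvalReport[] = [];
  evalSets.forEach((cases, tenantId) => {
    let passed = 0;
    cases.forEach((evalCase) => {
      // Each case runs under the tenant that owns it.
      const output = runCandidate(tenantId, promptVersion, evalCase.input);
      if (output.indexOf(evalCase.mustInclude) !== -1) passed += 1;
    });
    reports.push({ tenantId: tenantId, promptVersion: promptVersion, passed: passed, failed: cases.length - passed });
  });
  // Ship only if at least one tenant was evaluated and no tenant has a failure.
  const ship = reports.length > 0 && reports.every((r) => r.failed === 0);
  return { ship: ship, reports: reports };
}
```

The gate refuses to ship when there are no eval sets at all, on the view that "nothing was tested" should not read as "everything passed".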

Can we use one Pinecone index with metadata filtering instead of one namespace per tenant?

You can, and Pinecone's own docs describe why most multi-tenant AI products do not. Each namespace is stored separately, so namespace-per-tenant is a physical isolation boundary — a leak requires a logic bug at the API layer, not a missed filter in a query. Metadata filtering is a logical filter in a shared namespace, and the leak surface is larger. Cost-wise, namespace-per-tenant is also cheaper at any meaningful scale — querying one tenant's 1 GB namespace costs 1 RU, while metadata-filtering a single shared 100 GB namespace for the same tenant costs 100 RUs per the documented pricing model. The tradeoff used to be operational complexity, but with namespaces-per-tenant supported up to the million-namespace scale on Standard and Enterprise plans, the operational argument has largely evaporated.

What happens to the architecture when a tenant churns and asks for data deletion?

Deletion becomes a small set of well-defined operations instead of an archaeology project. The Pinecone namespace for that tenant is deleted as a single API call, and every vector for that tenant goes with it because nothing was shared. The Postgres rows are deleted via the existing tenant_id-scoped DELETE. If the deletion script runs under a role that is subject to RLS, with a DELETE policy scoped to the tenant, the database blocks it from touching another tenant's data even when the script has a bug; a role that bypasses RLS (a service role, for example) gets no such protection and depends on the tenant_id predicate being correct. The provider-side project key (OpenAI) or workspace (Anthropic) is revoked, and the tenant-config row carrying their custom prompts is removed. The eval harness drops the tenant's eval set with the same delete. The contractual deletion deadline (often 30 days under enterprise DPAs) becomes a script you can prove ran, not a quarterly cleanup project.
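The "script you can prove ran" idea can be sketched as an ordered list of steps that produces a receipt. The step names mirror the operations described above. The step bodies are left to the caller because they depend on your vector store, database, and provider accounts, and every name here is hypothetical.

```typescript
interface TenantDeletionStep {
  name: string;
  run: (tenantId: string) => void;
}

// What you keep as evidence: which steps finished, and where it stopped if it failed.
interface TenantDeletionReceipt {
  tenantId: string;
  completedSteps: string[];
  failedStep?: string;
  error?: string;
}

function offboardTenant(tenantId: string, steps: TenantDeletionStep[]): TenantDeletionReceipt {
  const completedSteps: string[] = [];
  for (let i = 0; i < steps.length; i++) {
    const step = steps[i];
    try {
      step.run(tenantId);
      completedSteps.push(step.name);
    } catch (err) {
      // Stop at the first failure so the receipt shows exactly what remains.
      return {
        tenantId: tenantId,
        completedSteps: completedSteps,
        failedStep: step.name,
        error: err instanceof Error ? err.message : String(err),
      };
    }
  }
  return { tenantId: tenantId, completedSteps: completedSteps };
}

// Suggested step order, matching the text (names are illustrative).
const DELETION_STEP_ORDER = [
  "delete_vector_namespace",
  "delete_postgres_rows",
  "revoke_provider_credential",
  "delete_prompt_config",
  "delete_eval_set",
];
```

Each step should be safe to re-run, so a failed offboarding can be retried from the top without special-casing the steps that already succeeded.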

How long does this typically take to stand up on an existing AI SaaS?

For an AI SaaS already running on a shared Pinecone index and a single provider key, retrofitting per-tenant namespaces plus per-tenant key or workspace plus the cost/rate budget layer is typically a few weeks of focused work. The slow variable is the vector migration — re-embedding is sometimes required, sometimes not, depending on whether the existing index has clean tenant_id metadata or whether tenant attribution has to be recovered from logs. The prompt-config and eval-harness pieces are usually smaller than the data migration and can ship in parallel.

Ready to see if this is a fit?

A 15-minute call. No deck, no slides. We talk about what you're shipping and where engineering is the bottleneck. Either way, you walk away with a senior engineer's read on your situation.