
Production engineering pattern · Updated June 2026

RAG Namespace Isolation: Stop Cross-Tenant Retrieval Leaks

Q: Why is a tenant_id metadata filter not enough for RAG isolation?

It is a convention applied per query. One missing filter leaks the corpus. Use namespace or schema isolation instead.

Q: Namespace per tenant or index per tenant in Pinecone?

Namespace per tenant by default. Use index per tenant only when a tenant needs a different embedding model, metric, region, or compliance regime.

Q: What about schema per tenant in pgvector?

It works well up to a few hundred tenants. Past that, a shared schema with RLS and tenant_id partitioning scales better.

Q: Can I run one shared base index plus per-tenant overlays?

It sounds clean and leaks under load. Use one isolated partition per tenant; duplicate shared content into each at write time if needed.

Q: How do I delete a tenant's vectors when they cancel?

One delete-namespace or drop-schema. Filter-based deletes are slow, partial, and hard to defend.

Q: Does namespace isolation hurt retrieval quality?

No, and it often improves recall because another tenant's documents stop crowding the top-k.

Q: What about embedding-time tenant tagging — is that enough?

Tag at write, partition at storage, enforce at retrieval. Tagging alone leaks; partitioning is the boundary.

Q: How do I test that retrieval is actually isolated?

Negative test: query as tenant A for content only tenant B has and assert zero cross-tenant rows. Run it in CI.

Q: Where does embedding cost fit into this decision?

Embedding cost is per-tenant either way. Query cost scales with the queried partition under namespaces, with the full corpus under filter-based isolation.

An engineering pattern for keeping every tenant's documents invisible to every other tenant's retrieval — at the index, not at the filter clause one engineer forgets.


The problem

The failure mode that ends a B2B RAG product is the support ticket that opens with a screenshot. Customer A asks the assistant a question about their own internal policy and the answer cites a paragraph from Customer B's onboarding handbook. The citation is verbatim. The chunk is real. Customer A's compliance team forwards it to their security contact, who forwards it to yours, and the next thirty days are spent explaining how the retrieval layer mixed two corpora.

Shared index plus forgotten filter

The root cause is usually the same shape. Documents from every tenant were embedded into one shared index. The retriever was supposed to pass a tenant_id filter on every call. One code path forgot — a debug endpoint, a background re-ranker, an evaluation harness, a Slack-bot wrapper that someone shipped in a hackathon — and that path returned top-k chunks across the union of every tenant's corpus. The LLM then quoted from those chunks because it is a language model, not an access-control system. The model is not lying; it is faithfully repeating what the retriever handed it.

Filter-based isolation is honor-system

The reason this keeps happening is that filter-based isolation is honor-system. Every retrieve call is one missing argument away from a leak. The code reviewer cannot see the leak by reading the diff — they can only see it by knowing the convention. Pinecone's own multi-tenancy guidance is blunt about this tradeoff: when isolation matters, metadata filtering is the wrong primitive because queries scan the entire namespace regardless of the filter, so the cost is full-corpus and the isolation is whatever the calling code remembered to add. The structural fix is to make the boundary live at the storage layer, where the retriever cannot bypass it even if the developer wants to.
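The difference between the two primitives can be sketched with in-memory stand-ins. The classes below are hypothetical and are not the Pinecone API; they only illustrate that an optional filter lets a forgotten argument compile and leak, while a store that requires a partition name has no unscoped query to call.

```typescript
// In-memory stand-ins for a vector store. Hypothetical types, not a real client.
type Chunk = { id: string; tenantId: string; text: string };

// Shared index: isolation depends on every caller passing the filter.
class SharedIndex {
  private chunks: Chunk[] = [];

  upsert(chunk: Chunk): void {
    this.chunks.push(chunk);
  }

  // The filter is optional, so a call site that forgets it still type-checks.
  query(filter?: { tenantId: string }): Chunk[] {
    return filter
      ? this.chunks.filter((c) => c.tenantId === filter.tenantId)
      : [...this.chunks];
  }
}

// Partitioned index: a query cannot be expressed without naming a partition.
class PartitionedIndex {
  private partitions = new Map<string, Chunk[]>();

  upsert(chunk: Chunk): void {
    const existing = this.partitions.get(chunk.tenantId) ?? [];
    existing.push(chunk);
    this.partitions.set(chunk.tenantId, existing);
  }

  query(tenantId: string): Chunk[] {
    return [...(this.partitions.get(tenantId) ?? [])];
  }
}

const seed: Chunk[] = [
  { id: "a1", tenantId: "tenant_a", text: "Tenant A policy" },
  { id: "b1", tenantId: "tenant_b", text: "Tenant B handbook" },
];

const shared = new SharedIndex();
const partitioned = new PartitionedIndex();
for (const chunk of seed) {
  shared.upsert(chunk);
  partitioned.upsert(chunk);
}

// A call site that forgot the filter sees both tenants' chunks.
const leaked = shared.query();
// The partitioned store can only return the named tenant's chunks.
const scoped = partitioned.query("tenant_a");
```

The point of the sketch is the method signature: the review question changes from "did every caller remember the filter?" to "where did the partition name come from?".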

Trust tax after a single chunk leak

The business cost of a single cross-tenant retrieval is more than one angry customer. SOC 2 auditors flag it on every renewal for the next three years. Enterprise prospects' security reviews add a multi-week deep-dive into your RAG isolation story. Sales cycles lengthen because every prospect's general counsel now wants written assurance that "Customer A's docs cannot end up in Customer B's prompt," and your only honest answer is "we filter on every query" — which is the same answer the leaking company gave the week before their incident. The cleanup is not the apology email. It is the trust tax you pay until you can show structurally that the mix-up cannot happen.

Symptoms surface long after the architectural choice

What makes this hard is that the symptom (a chunk from the wrong tenant) is months downstream of the architectural decision (we will isolate by metadata filter to keep things simple). By the time the bug shows up, the corpus is large, the retriever has a dozen call sites, the re-ranker has its own retrieval step, and the migration to real isolation is a full quarter of work. The pattern below gets it right from the start, and the migration playbook at the end of this page is the shortest path out if you are already past that point.


What changes for your business

The architecture has three layers, and the leak comes from any one of them being wrong. Storage has to partition by tenant — one Pinecone namespace per tenant, or one pgvector schema per tenant, depending on the vector store. Ingestion has to write each tenant's vectors into the tenant's own partition, and every embedding has to carry the tenant_id as belt-and-suspenders metadata. Retrieval has to construct the namespace or schema from the authenticated session — not from a request parameter — so a client cannot ask "give me top-k from namespace X" where X is a different tenant.

Namespace-per-tenant in Pinecone

Start with the storage layer in Pinecone. The documented and supported pattern is one namespace per tenant. Pinecone's own multi-tenancy guide states that in the serverless architecture each namespace is stored separately, which gives you physical isolation between tenants. Reads and writes target one namespace at a time, so one tenant's traffic does not contend with another's. Query cost is sized by the namespace you hit, so querying Tenant A's namespace in a thousand-tenant index costs the same as querying a single-tenant index that holds only Tenant A's data. Pinecone says it can accommodate million-scale namespaces on the Standard and Enterprise plans, and asks applications that need more than 100,000 namespaces to contact Support.

The Pinecone retriever looks like this. Notice that the namespace comes from the session, not from the caller:

import { Pinecone } from "@pinecone-database/pinecone";

const pc = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
const index = pc.index("tenant-corpus");

// The namespace is derived from the authenticated session, not from
// request input. A client cannot ask to query a different tenant's
// namespace because they cannot supply the namespace argument.
// AuthSession and embedQuery are application code defined elsewhere.
export async function retrieveForSession(
  session: AuthSession,
  query: string,
  topK = 8,
) {
  const tenantId = session.tenantId; // verified server-side from JWT
  if (!tenantId) throw new Error("no tenant_id on session");

  const queryEmbedding = await embedQuery(query);

  return index.namespace(tenantId).query({
    vector: queryEmbedding,
    topK,
    includeMetadata: true,
  });
}

Ingestion writes into the tenant partition

The matching ingestion path writes every chunk into the tenant's namespace at upsert time. The chunk's metadata also carries the tenant_id as a defense-in-depth tag — that way, if you ever need to migrate a tenant out of one namespace or audit which vectors belong to whom, the answer is in the metadata as well as the namespace.

export async function ingestDocument(
  tenantId: string,
  doc: { id: string; chunks: Array<{ id: string; text: string }> },
) {
  const vectors = await Promise.all(
    doc.chunks.map(async (chunk) => ({
      id: `${doc.id}:${chunk.id}`,
      values: await embedDocument(chunk.text),
      metadata: {
        // Belt-and-suspenders: the tenant_id is on the row even though
        // the namespace is also the tenant_id. If we ever export, audit,
        // or migrate, the metadata is self-describing.
        tenant_id: tenantId,
        doc_id: doc.id,
        chunk_id: chunk.id,
      },
    })),
  );

  // Writes go to the tenant's namespace by name. A bug in the chunker
  // cannot scatter chunks across tenants because the namespace argument
  // is the partition.
  await index.namespace(tenantId).upsert(vectors);
}

Schema-per-tenant in pgvector

On the pgvector side the equivalent pattern is schema per tenant for small to mid tenant counts. Each tenant gets a schema; each schema has its own documents table with a vector column and its own HNSW or IVFFlat index. The pgvector index choice matters here: HNSW has a better speed-recall tradeoff at the cost of slower builds and more memory, while IVFFlat builds faster and uses less memory but gives a worse speed-recall tradeoff. Per-tenant schemas let you make that choice per tenant if some customers have very different corpus sizes.

-- One schema per tenant. The connection sets the search_path so
-- queries land in the right one without any code path having to
-- name it explicitly.
CREATE SCHEMA tenant_a;
CREATE SCHEMA tenant_b;

-- Same shape inside each schema. Migrations apply to every tenant.
CREATE TABLE tenant_a.documents (
  id           uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  doc_id       text NOT NULL,
  chunk_id     text NOT NULL,
  content      text NOT NULL,
  -- OpenAI text-embedding-3-small returns 1536-dim vectors. Match the
  -- dimension to the embedding model. pgvector rejects a vector of the
  -- wrong length, but not one from a different model of the same length.
  embedding    vector(1536) NOT NULL,
  created_at   timestamptz NOT NULL DEFAULT now()
);

-- HNSW index for the cosine-distance retrieval path. Build once per
-- tenant; the index lives inside the tenant schema so a slow build
-- for tenant_b does not block reads on tenant_a.
CREATE INDEX documents_embedding_hnsw_idx
  ON tenant_a.documents
  USING hnsw (embedding vector_cosine_ops);

Search_path set from the verified session

The connection layer is what makes this safe at the retriever. When a request lands, the API layer authenticates the session, resolves the tenant_id, and issues SET LOCAL search_path = tenant_a, public at the start of the transaction. From that point on, every query the retriever issues against the unqualified table name documents resolves to tenant_a.documents. If the retriever forgets to set the search_path, the query fails because documents does not exist in the default search_path. That guarantee depends on never creating a documents table in public: Postgres silently skips a search_path entry that does not exist, so a shared public.documents would become the fallback. With that rule in place the boundary enforces itself, and there is no "fall through to a shared table" path that the retriever can accidentally use.

import { sql } from "drizzle-orm";

export async function retrieveFromTenantSchema(
  session: AuthSession,
  query: string,
  topK = 8,
) {
  const tenantId = session.tenantId;
  if (!tenantId) throw new Error("no tenant_id on session");

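  const queryEmbedding = await embedQuery(query); // 1536-dim from text-embedding-3-small
  // pgvector parses the text form '[0.1,0.2,...]'. Bind the vector as one
  // string parameter so the array is not expanded into a parameter list.
  const queryVector = JSON.stringify(queryEmbedding);

  return db.transaction(async (tx) => {
    // Scope this transaction's name resolution to the tenant schema.
    // Postgres skips search_path entries that do not exist, so a missing
    // or misspelled schema errors loudly only while public has no
    // documents table of its own. Keep public free of one.
    // escapeIdent is an application helper that must validate and quote
    // the schema name; SET cannot take it as a bound parameter.
    await tx.execute(
      sql.raw(`SET LOCAL search_path = ${escapeIdent(`tenant_${tenantId}`)}, public`),
    );

    return tx.execute(sql`
      SELECT id, doc_id, chunk_id, content,
             1 - (embedding <=> ${queryVector}::vector) AS similarity
      FROM documents
      ORDER BY embedding <=> ${queryVector}::vector
      LIMIT ${topK};
    `);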
  });
}
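The snippet above calls escapeIdent without defining it. A minimal sketch of what such a helper could look like follows; the allowlist pattern and the searchPathStatement wrapper are assumptions for illustration, not part of any library. The idea is to reject anything that is not a plain lowercase identifier before it reaches the SET statement, since the schema name is interpolated into raw SQL.

```typescript
// Hypothetical helper: an allowlist-based identifier check for schema names.
// Lowercase letters, digits, and underscores only, at most 63 characters
// (the Postgres identifier length limit).
const SAFE_IDENTIFIER = /^[a-z_][a-z0-9_]{0,62}$/;

function escapeIdent(name: string): string {
  if (!SAFE_IDENTIFIER.test(name)) {
    throw new Error(`refusing unsafe schema name: ${JSON.stringify(name)}`);
  }
  // The allowlist already excludes quotes, so wrapping in double quotes is safe.
  return `"${name}"`;
}

// Builds the statement used at the start of the tenant-scoped transaction.
function searchPathStatement(tenantId: string): string {
  return `SET LOCAL search_path = ${escapeIdent(`tenant_${tenantId}`)}, public`;
}
```

Failing closed on an unexpected tenant id is deliberate here: a rejected request is a better outcome than a search_path built from a string nobody validated.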


Why filter-based isolation is the dangerous shortcut

Filter-based isolation is the path most teams take first, because it looks like the smallest change. One shared index, one shared table, a tenant_id column or metadata field, and a WHERE tenant_id = $1 (or its Pinecone metadata-filter equivalent) on every retrieval. It works in development. It works for the first six months in production. Then it breaks in three predictable ways.

First, the retriever surface grows past the discipline that protects it. You start with one retrieval call site. By the time you ship hybrid retrieval, a re-ranker, a multi-step agent, a debug endpoint, an evaluation harness, an offline indexer, and a Slack bot, you have a dozen places that issue retrieve calls. The filter has to be passed in every one of them, in the right shape, from the right session context. One missing argument leaks. The Postgres equivalent is the missing WHERE tenant_id = $1 clause that ends multi-tenant SaaS companies, and the RAG version is the same bug with vectors instead of rows.

Second, the cost model punishes you. Pinecone's documentation states that filter-based queries scan the entire namespace regardless of the filter — meaning a tenant who has uploaded 200 documents pays the scan cost of every other tenant's documents in the same index too. As the index grows, the per-query cost grows with it, and the cost is spread across tenants by total corpus size, not by what each tenant actually queries. Your largest tenants subsidize your smallest ones, and your smallest tenants pay for data they cannot see. The math gets worse, not better, as you grow.

Third, the deletion story falls apart when a tenant cancels. With namespace isolation, "delete every vector for Customer X" is one operation that completes against a single namespace. With filter-based isolation, it is a scan across the full index, deleting matching rows in batches, holding locks (or eventual-consistency drift on serverless), and racing against new writes. Privacy regulations want a verifiable boundary — "the data lived in partition X, partition X is gone" — and a filter sweep is hard to defend because it is hard to prove completeness.
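To put rough numbers on the cost point above, here is a small arithmetic sketch. The tenant sizes are made up for illustration, and "vectors in scope" is a stand-in for whatever unit the vendor bills on; none of this reflects actual Pinecone pricing.

```typescript
// Illustrative corpus sizes only. Not real pricing or billing units.
const vectorsPerTenant: Record<string, number> = {
  tenant_small: 200,
  tenant_mid: 20_000,
  tenant_large: 2_000_000,
};

const totalVectors = Object.values(vectorsPerTenant).reduce((sum, n) => sum + n, 0);

// Shared index with a metadata filter: every query is sized by the whole corpus,
// whichever tenant issued it.
function vectorsInScopeWithFilter(_tenantId: string): number {
  return totalVectors;
}

// Namespace per tenant: a query is sized by that tenant's partition only.
function vectorsInScopeWithNamespace(tenantId: string): number {
  return vectorsPerTenant[tenantId] ?? 0;
}

const smallWithFilter = vectorsInScopeWithFilter("tenant_small");
const smallWithNamespace = vectorsInScopeWithNamespace("tenant_small");

// How much larger the small tenant's query scope is under filter-based isolation.
const overheadFactor = smallWithFilter / smallWithNamespace;
```

With these assumed sizes the small tenant's query is sized by 2,020,200 vectors under the filter model and by 200 under namespaces, and the gap widens every time another tenant onboards.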

The "shared base index plus per-tenant overlay" variant is the sophisticated trap. The idea is appealing: one big shared corpus of common content (product docs, public FAQs, the company knowledge base) plus a small per-tenant overlay (the customer's own documents). At query time you retrieve from both and merge. In practice the retriever has to do two queries and merge top-k by score across two embedding spaces that have drifted in slightly different ways. The merge is one of the trickiest pieces of code in your stack, every change to the base index potentially affects every tenant's retrieval, and the isolation boundary lives in the merge logic — which is exactly the place that loses its discipline when you ship a re-ranker or a hybrid retriever. The pattern that survives is to duplicate shared public content into every tenant's namespace at write time. Storage is cheap; isolation discipline is expensive.
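The write-time duplication described above can be sketched as a fan-out over tenant partitions. The Map-backed store and the function names are hypothetical; a real implementation would call the vector client's upsert once per namespace and would also need to fan out updates and deletes of the shared document.

```typescript
type SharedDoc = { id: string; text: string };
type StoredChunk = { id: string; text: string; source: "shared" | "tenant" };

// In-memory stand-in for a namespaced vector store: namespace -> chunks by id.
const namespaces = new Map<string, Map<string, StoredChunk>>();

function upsertInto(namespace: string, chunk: StoredChunk): void {
  const partition = namespaces.get(namespace) ?? new Map<string, StoredChunk>();
  partition.set(chunk.id, chunk);
  namespaces.set(namespace, partition);
}

// Copy one shared document into every tenant's partition at write time,
// so retrieval never has to merge results from a second index.
function publishSharedDoc(tenantIds: string[], doc: SharedDoc): void {
  for (const tenantId of tenantIds) {
    upsertInto(tenantId, { id: `shared:${doc.id}`, text: doc.text, source: "shared" });
  }
}

// Retrieval stays a single-partition read.
function chunksFor(tenantId: string): StoredChunk[] {
  const partition = namespaces.get(tenantId);
  return partition ? Array.from(partition.values()) : [];
}

upsertInto("tenant_a", { id: "doc-a-1:1", text: "Tenant A runbook", source: "tenant" });
publishSharedDoc(["tenant_a", "tenant_b"], { id: "faq-1", text: "Public FAQ entry" });
```

The `shared:` id prefix is one way to keep duplicated content identifiable, so a later update to the shared document can overwrite the same id in each partition.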

Embedding-time tenant tagging is necessary but not sufficient

A lot of teams reach for embedding-time tagging as the fix and stop there. The pattern is: at ingestion, include the tenant_id in the chunk's metadata; at retrieval, filter by tenant_id; done. The metadata is real, the filter is real, but the isolation is back to honor-system. The boundary is still in the retriever's filter clause.

The reason to tag at write time is not isolation — it is the operational migration story. Tagged vectors are auditable. You can sweep the index, count vectors per tenant_id, validate the count against the tenant's document table, and detect drift before a customer does. You can repartition a tenant out of a shared index into its own namespace because the metadata tells you which vectors belong to whom. You can verify that a deleted tenant has no leftover vectors by querying for tenant_id = $deleted_id and asserting zero rows. The tag is the audit primitive.

Layer all three: tag at write, partition at storage, enforce at retrieval by constructing the namespace from the authenticated session. Tagging without partitioning leaves the boundary in the retriever. Partitioning without tagging leaves you without an audit primitive. Enforcing only at retrieval leaves the ingestion path free to scatter chunks across tenants. All three together is the only configuration that holds up to growth and to the security review.
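One concrete use of the tag as an audit primitive is to check that it agrees with the partition. The sketch below assumes you can export a snapshot of ids and metadata per namespace; the snapshot shape and function name are hypothetical.

```typescript
type TaggedVector = { id: string; metadata: { tenant_id?: string } };
type Misplaced = { namespace: string; id: string; taggedAs: string | undefined };

// Returns every vector whose tenant_id tag disagrees with the namespace it lives in.
// A missing tag is reported too, because an untagged vector cannot be audited.
function findMisplacedVectors(partitions: Record<string, TaggedVector[]>): Misplaced[] {
  const misplaced: Misplaced[] = [];
  for (const namespace of Object.keys(partitions)) {
    const vectors = partitions[namespace] ?? [];
    for (const vector of vectors) {
      const taggedAs = vector.metadata.tenant_id;
      if (taggedAs !== namespace) {
        misplaced.push({ namespace, id: vector.id, taggedAs });
      }
    }
  }
  return misplaced;
}

const snapshot: Record<string, TaggedVector[]> = {
  tenant_a: [
    { id: "doc-a-1:1", metadata: { tenant_id: "tenant_a" } },
    // Written to the wrong partition by a buggy ingestion path.
    { id: "doc-b-9:4", metadata: { tenant_id: "tenant_b" } },
  ],
  tenant_b: [{ id: "doc-b-1:1", metadata: { tenant_id: "tenant_b" } }],
};

const report = findMisplacedVectors(snapshot);
```

An empty report is the expected steady state; anything else means an ingestion path wrote outside its tenant's partition.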

Common failure modes

The first sharp edge is the retriever that accepts a namespace as a request parameter. The endpoint looks like POST /retrieve { tenantId, query } and the handler passes tenantId straight through to the Pinecone client. The bug is that the client cannot be trusted to send their own tenant_id: a malicious or curious caller sends a different one, and the retriever happily returns top-k from that namespace. The fix is to refuse tenant_id as input. Resolve it server-side from the verified session, fail closed if it is missing, and treat the namespace argument as a derived value, not a passed-in one.
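A minimal sketch of that fix, assuming a session object that has already been verified server-side. The resolveNamespace name and the list of forbidden keys are illustrative choices, not an established API.

```typescript
type AuthSession = { tenantId?: string };

// Keys a client might use to try to pick a partition. Hypothetical list.
const FORBIDDEN_KEYS = ["tenantId", "tenant_id", "namespace"];

// Derives the namespace from the verified session only. The request body is
// inspected solely to reject attempts to name a tenant.
function resolveNamespace(session: AuthSession, body: Record<string, unknown>): string {
  for (const key of FORBIDDEN_KEYS) {
    if (Object.prototype.hasOwnProperty.call(body, key)) {
      throw new Error(`"${key}" is not accepted as request input`);
    }
  }
  if (!session.tenantId) {
    throw new Error("no tenant_id on session");
  }
  return session.tenantId;
}
```

Rejecting the request, instead of silently ignoring the extra field, also surfaces clients that still send a tenant id so they can be fixed.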

The second is the offline indexer that batch-embeds documents into the wrong namespace. A background job reads from documents_to_embed, computes embeddings, and writes them to Pinecone. Somewhere in the loop the namespace argument is missing or stale — perhaps the job iterates documents but only resolves the namespace once at the top, perhaps it falls back to a default namespace when the tenant lookup fails. The fix is to derive the namespace from the document row itself, fail the batch if any row's tenant_id is null, and avoid having a "default" namespace that the job can fall back to.
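The per-row derivation can be sketched as a grouping step that runs before any write. The PendingRow shape is an assumption standing in for a row of documents_to_embed.

```typescript
// Hypothetical shape of a row waiting to be embedded.
type PendingRow = { docId: string; tenantId: string | null; text: string };

// Groups a batch by the tenant_id on each row. There is no default namespace:
// a single row without a tenant_id fails the whole batch before anything is written.
function groupRowsByNamespace(rows: PendingRow[]): Map<string, PendingRow[]> {
  const groups = new Map<string, PendingRow[]>();
  for (const row of rows) {
    if (!row.tenantId) {
      throw new Error(`row ${row.docId} has no tenant_id; failing the whole batch`);
    }
    const group = groups.get(row.tenantId) ?? [];
    group.push(row);
    groups.set(row.tenantId, group);
  }
  return groups;
}
```

The job would then issue one upsert per map entry, using the map key as the namespace, so the namespace can never be staler than the rows it was computed from.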

The third is the evaluation harness that retrieves across tenants for "quality testing" and accidentally ships in a code path that production hits. Test harnesses tend to bypass session checks because they are running offline against fixtures. If the same retriever module is used in production, the bypass path can become a production path on a refactor. The fix is to keep the evaluation harness in a separate module that cannot be imported from production code, with a lint rule or import boundary that enforces the separation. The audited entry point that production uses resolves the namespace from a verified session, with no other allowed source.

The fourth is embedding-model drift between tenants. A tenant who onboarded a year ago has their corpus embedded with text-embedding-3-small at 1536 dimensions. A tenant who onboarded last week has their corpus embedded with the new larger model at 3072 dimensions. The vectors are not compatible. A query vector of the wrong dimension is rejected by the store, which is at least loud. The quieter failure is two models that share a dimension: the query runs and returns nonsense, because the vectors come from different embedding spaces. The fix is to pin the embedding model per index (or per namespace when the models share a dimension) and to migrate the whole corpus when you upgrade, not to mix models within a partition.
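One way to make the pinning explicit is a small registry that records which model each partition was embedded with and checks every write and query against it. The registry, its Map storage, and the function names are assumptions for illustration; in practice the spec would live next to the tenant record.

```typescript
type EmbeddingSpec = { model: string; dimension: number };

// Which model each partition was embedded with. A Map stands in for a table.
const partitionSpecs = new Map<string, EmbeddingSpec>();

// Pins a partition to one model. Re-pinning to a different model is refused,
// because changing models means re-embedding the whole corpus.
function pinPartition(partition: string, spec: EmbeddingSpec): void {
  const existing = partitionSpecs.get(partition);
  if (existing && (existing.model !== spec.model || existing.dimension !== spec.dimension)) {
    throw new Error(`${partition} is already pinned to ${existing.model}`);
  }
  partitionSpecs.set(partition, spec);
}

// Call before every upsert or query against the partition.
function assertCompatible(partition: string, model: string, vector: number[]): void {
  const spec = partitionSpecs.get(partition);
  if (!spec) {
    throw new Error(`no embedding spec pinned for ${partition}`);
  }
  if (spec.model !== model) {
    throw new Error(`${partition} expects ${spec.model}, got ${model}`);
  }
  if (spec.dimension !== vector.length) {
    throw new Error(`${partition} expects ${spec.dimension} dims, got ${vector.length}`);
  }
}

pinPartition("tenant_a", { model: "text-embedding-3-small", dimension: 1536 });
```

Checking the model name as well as the length is what catches the same-dimension case, which the vector store itself cannot detect.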

The fifth is the cancelled-tenant ghost vectors. A tenant cancels, the application marks them as deleted in the relational database, but the vector deletion job either does not exist or fails halfway through and is not retried. Six months later, the orphan vectors are still in the index. They are not queried because the application no longer issues retrievals for that tenant — but they are still on disk, still being billed for, and still discoverable if anyone runs a maintenance query that walks the corpus. The fix is a cancellation pipeline that issues the delete-namespace (or drop-schema) operation, verifies it succeeded by querying for any remaining vectors with that tenant_id, and writes an audit row recording the deletion. Make the verification the gate that closes the cancellation ticket, not the firing of the delete.
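The delete, verify, then record sequence can be sketched as follows. The TenantVectorStore interface is hypothetical and synchronous to keep the example short; a real vector client is asynchronous and its method names will differ.

```typescript
type AuditRow = { tenantId: string; deletedAt: string; verifiedRemaining: number };

// Hypothetical store interface for illustration only.
interface TenantVectorStore {
  deleteNamespace(namespace: string): void;
  // Counts vectors tagged with tenantId across every partition.
  countTaggedWith(tenantId: string): number;
}

class InMemoryStore implements TenantVectorStore {
  private partitions: Map<string, Array<{ tenant_id: string }>>;

  constructor(partitions: Map<string, Array<{ tenant_id: string }>>) {
    this.partitions = partitions;
  }

  deleteNamespace(namespace: string): void {
    this.partitions.delete(namespace);
  }

  countTaggedWith(tenantId: string): number {
    let count = 0;
    this.partitions.forEach((vectors) => {
      count += vectors.filter((v) => v.tenant_id === tenantId).length;
    });
    return count;
  }
}

// The audit row is written only after verification passes, so the presence of
// a row means "deleted and checked", not "delete was attempted".
function cancelTenantVectors(
  store: TenantVectorStore,
  tenantId: string,
  auditLog: AuditRow[],
): AuditRow {
  store.deleteNamespace(tenantId);
  const remaining = store.countTaggedWith(tenantId);
  if (remaining !== 0) {
    throw new Error(`${remaining} vectors still tagged ${tenantId}; cancellation stays open`);
  }
  const row: AuditRow = { tenantId, deletedAt: new Date().toISOString(), verifiedRemaining: 0 };
  auditLog.push(row);
  return row;
}
```

Because the verification counts by tag across all partitions, it also catches vectors that an earlier ingestion bug wrote into someone else's namespace.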

The sixth is the prompt-injection-induced retrieval bypass. A user supplies input that the application embeds into a retrieval query, and the input contains text that influences the retriever — for example, telling the LLM "first, retrieve documents from the public corpus" in an architecture that has both private and public indexes. The retriever obeys because the LLM is interpreting the input as instructions, and the public corpus turns out to contain other tenants' onboarding docs that were misclassified as public. The fix is the same as for the parameter-injection case: the retriever picks the namespace from the session, not from anything the LLM or the user can influence. Prompt input goes into the embedding, not into the routing decision.

How to test this

The test that matters is the negative test. Seed two tenants with different documents. Embed each tenant's corpus into the right partition. From a session authenticated as Tenant A, query the retriever with a string that only Tenant B's documents contain. Assert that the returned chunks contain zero rows from Tenant B. Run it in CI. The test is fast — single-digit seconds — and it catches the regression the day someone adds a new retriever code path that forgets to scope the query.

import { describe, expect, it, beforeAll } from "vitest";

describe("RAG retrieval: tenant isolation", () => {
  beforeAll(async () => {
    await ingestDocument("tenant_a", {
      id: "doc-a-1",
      chunks: [{ id: "1", text: "Project Aurora ships in Q3 to GovCloud." }],
    });
    await ingestDocument("tenant_b", {
      id: "doc-b-1",
      chunks: [{ id: "1", text: "Project Borealis ships in Q4 to commercial." }],
    });
  });

  it("tenant A cannot retrieve tenant B's chunk by content", async () => {
    const session = { tenantId: "tenant_a" } as AuthSession;
    const result = await retrieveForSession(session, "Project Borealis Q4", 8);

    const leakedFromB = (result.matches ?? []).filter(
      (m) => m.metadata?.tenant_id === "tenant_b",
    );
    expect(leakedFromB).toEqual([]);
  });

  it("tenant A retrieving their own content still works", async () => {
    const session = { tenantId: "tenant_a" } as AuthSession;
    const result = await retrieveForSession(session, "Project Aurora Q3", 8);
    const ids = (result.matches ?? []).map((m) => m.id);
    expect(ids).toContain("doc-a-1:1");
  });

  it("retriever fails closed when session has no tenant_id", async () => {
    const session = {} as AuthSession;
    await expect(retrieveForSession(session, "anything", 8)).rejects.toThrow();
  });
});

Three properties make this test useful where a happy-path test is not. It exercises the same retriever module the application uses, so any divergence between the isolation pattern and the runtime shows up here. It asserts the silent-zero-rows behavior — that is what a real cross-tenant query looks like when isolation is working, and it is also what an attacker probe looks like, so the test maps to the threat. And it covers the fail-closed case where the session is broken: a retriever that returns chunks instead of erroring when the tenant_id is missing has a worse failure mode than one that errors.

What to watch in your own implementation

Open the codebase and search for every call to the vector store's query method — index.query, index.namespace(...).query, documents.embedding <=> ..., whatever shape your client uses. For each one, answer two questions. Where does the namespace or schema argument come from? Is that source a verified server-side session, or is it a request parameter, a function argument from another module, or a default that falls back to "all tenants" if the lookup fails? Any retriever where the answer is not "verified server-side session" is the next leak, and the fix is to refactor it into the audited retrieve helper that the rest of the codebase uses.

Next, audit the ingestion paths. Every code path that writes vectors — the synchronous indexer, the batch backfill job, the manual re-embed script, the import-from-customer-export pipeline — has to derive the namespace from the document being ingested, not from a job-level default. Search for index.upsert and INSERT INTO documents and verify the partition is per-row, not per-batch. If the batch has a single namespace argument and iterates documents inside it, one wrong namespace on a batch corrupts every document in that batch.

Then run a one-shot audit query against the vector store. For Pinecone, list namespaces, count vectors per namespace, and cross-check against the tenant table: every tenant should have a namespace, and every namespace should map to a known tenant. Orphan namespaces are usually cancelled tenants whose deletes did not finish. Tenants with no namespace are usually tenants whose ingestion is broken. For pgvector, the equivalent is a SELECT schema_name FROM information_schema.schemata joined to the tenant table, plus a per-schema SELECT COUNT(*) FROM documents. Either way, you want a single sheet that says "every tenant has the partition they should and nothing else."
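The cross-check itself is a set comparison. The sketch below assumes you have already fetched the tenant ids from the relational database and a namespace-to-count map from the vector store; both inputs and the function name are hypothetical.

```typescript
type ReconcileReport = {
  orphanNamespaces: string[]; // in the vector store, not in the tenant table
  tenantsWithoutNamespace: string[]; // in the tenant table, no partition at all
  emptyNamespaces: string[]; // known tenant, partition exists, zero vectors
};

function reconcilePartitions(
  tenantIds: string[],
  vectorCountByNamespace: Record<string, number>,
): ReconcileReport {
  const known = new Set(tenantIds);
  const hasNamespace = (id: string): boolean =>
    Object.prototype.hasOwnProperty.call(vectorCountByNamespace, id);

  return {
    orphanNamespaces: Object.keys(vectorCountByNamespace).filter((ns) => !known.has(ns)),
    tenantsWithoutNamespace: tenantIds.filter((id) => !hasNamespace(id)),
    emptyNamespaces: tenantIds.filter(
      (id) => hasNamespace(id) && vectorCountByNamespace[id] === 0,
    ),
  };
}

const auditReport = reconcilePartitions(
  ["tenant_a", "tenant_b", "tenant_c"],
  { tenant_a: 1200, tenant_b: 0, tenant_cancelled: 450 },
);
```

In this made-up run, tenant_cancelled is an orphan to delete, tenant_c has no partition and needs an ingestion check, and tenant_b has a partition with nothing in it.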

Finally, write the negative integration test if you do not already have one. Two seeded tenants, a cross-tenant retrieval, an assertion that the result contains zero rows from the other tenant. The test takes an hour to write and runs in seconds, and it is the single test that proves the isolation pattern is doing what the architecture diagram says it does. It is also the test that catches the regression when someone adds a re-ranker, a hybrid retriever, an agent loop, or any of the other retrieval surfaces that grow into the codebase over the year after launch.

What this looks like in production

At BFEAI the RAG architecture uses namespace-per-customer in the managed vector store, with a small relational table mapping customer_id to the namespace name. Every retrieval is initiated from an authenticated session that resolves the customer_id server-side, and the namespace argument to the vector client is derived from that resolved value — there is no code path in production retrieval where the namespace comes from request input. Ingestion writes into the customer's namespace by name and stamps the customer_id into vector metadata as a redundant audit primitive. Cancellation runs a delete-namespace operation and then queries the metadata index for any vectors still tagged with the cancelled customer_id, treating a non-zero count as a paging alert.

The audit story that this enables is what makes it worth the up-front work. When a prospect's security team asks where Customer A's vector data lives, the answer is a namespace name, not a filter clause. When a SOC 2 reviewer asks how cancellation works, the answer is a single operation against a named partition plus a verification query, not a multi-step sweep across a shared index. When an engineer ships a new retrieval surface — a Slack bot, a new agent tool, a debug endpoint — the audited retrieve helper is the obvious entry point, and the lint rule that bans direct calls to the vector client outside that helper catches the cases where someone takes a shortcut. The boundary is structural, the audit primitive is metadata, and the operational discipline is "use the helper." That combination is what survives the next year of feature work without turning into the screenshot in a support ticket.

Outcomes you should expect

What this delivers

A retrieval call with the wrong tenant context returns zero documents instead of leaking another customer's chunks into the prompt.
Embedding writes physically land in the tenant's own partition, so an indexing bug cannot scatter Tenant A's documents into Tenant B's retrieval surface.
Per-tenant deletes complete in one operation against one namespace, instead of a scan-and-filter across a shared index.
A SOC 2 reviewer can answer 'where does Customer A's vector data live' by name, not by a filter clause that has to be re-proven on every code change.

Primary sources

By the numbers

In Pinecone's serverless architecture each namespace is stored separately, so using namespaces provides physical isolation of data between tenants and customers.
Within a Pinecone index records are partitioned into namespaces, and all upserts, queries, and other data operations always target one namespace.
Pinecone documents that metadata-filter-based multitenancy queries scan the entire namespace regardless of filters, so you pay for scanning every tenant's data even though results are filtered to one tenant.
Pinecone's metadata filter $in and $nin operators are limited to 10,000 values, which bounds how many tenant IDs a single query can include.
On the Standard and Enterprise plans Pinecone can accommodate million-scale namespaces and beyond, and applications that need more than 100,000 namespaces should contact Support.
pgvector's HNSW index has better query performance than IVFFlat in terms of speed-recall tradeoff but has slower build times and uses more memory.
OpenAI's text-embedding-3-small returns 1536-dimensional vectors and text-embedding-3-large returns 3072-dimensional vectors, with a maximum input of 8192 tokens for both.

Live in production today

The same engineering, shipped in production at BFEAI.

I'm co-founder & CTO of Be Found Everywhere (BFEAI), a 7-app AI SaaS platform running today. The work I deliver for clients is the work I do every week on my own platform.

Production apps

200K+ keywords generated

1,500+ AI scans run

7,000+ sites automated

Common questions

What buyers ask before reaching out

Why is a tenant_id metadata filter not enough for RAG isolation?

Because the filter is a query-time convention applied on top of a shared corpus. The vectors still live together in one index, and any code path that forgets the filter — a debug script, a misconfigured retriever, a future engineer copying the search call — returns rows from every tenant. The fix is structural: put each tenant's vectors in a separate namespace or schema so the isolation is enforced by the storage layer, not by every caller remembering to add a clause.

Namespace per tenant or index per tenant in Pinecone?

Namespace per tenant is the documented default and scales to millions of tenants on the right plan. Index per tenant is for the edge cases where a tenant needs a different embedding model, a different metric, or a different region for compliance reasons. The cost story also pushes you toward namespaces: query cost is sized by the namespace you hit, so a thousand-tenant index with namespaces is cheaper to query than a thousand separate indexes that each carry their own baseline cost.

What about schema per tenant in pgvector?

Schema per tenant works well below a few hundred tenants and gives you physical separation plus per-tenant migrations and backups. Past that point the operational cost — one set of indexes per tenant, one set of stats per tenant, schema migrations multiplied by tenant count — starts to dominate. Most teams I work with stay on schema-per-tenant until the schema list crosses a couple hundred, then move to a shared schema with RLS plus a tenant_id partition key for the largest tables.

Can I run one shared base index plus per-tenant overlays?

It sounds clean and breaks badly in practice. Retrieval has to merge two result sets and re-rank them, the embedding model can drift between the base and overlay corpora, deletions in the base index silently affect every tenant, and the isolation boundary is enforced at query time by the retriever code — exactly the place that leaks. The pattern that actually works is one isolated partition per tenant, with shared public content duplicated into each tenant's partition at write time if you need it there.

How do I delete a tenant's vectors when they cancel?

With namespace per tenant, one delete-namespace operation removes every vector. With schema per tenant, one DROP SCHEMA ... CASCADE does the same (without CASCADE, Postgres refuses to drop a schema that still contains tables). With shared-index plus filter, you have to iterate the corpus and delete by tenant_id, which is slow, lock-prone, and easy to leave half-finished. Pick the isolation pattern that makes the deletion a single operation, both for performance and because privacy regulations like GDPR want a verifiable boundary, not 'we ran a filter and trusted it.'

Does namespace isolation hurt retrieval quality?

Only if you needed cross-tenant retrieval in the first place — which most B2B SaaS does not. Each tenant's queries hit their own vectors, ranked against their own corpus, with their own embedding history. If anything, retrieval quality goes up because the candidate pool is the tenant's documents instead of a mixed corpus where another tenant's chunks compete for the top-k slots.

What about embedding-time tenant tagging — is that enough?

It is necessary but not sufficient. Tagging at write time gives you the metadata to audit and to repartition later. It does not give you isolation if those vectors all land in the same shared partition, because retrieval still scans the union and depends on filter discipline. Tag at write, partition at storage, enforce at retrieval — all three layers.

How do I test that retrieval is actually isolated?

The same way you test RLS in Postgres: a negative test. Seed two tenants with different documents, call the retriever as tenant A asking for a string only tenant B's documents contain, and assert the top-k contains zero rows from tenant B. Run it in CI. The test is fast, and it catches the regression the day someone adds a new retriever code path that forgot to pass the namespace argument.

Where does embedding cost fit into this decision?

Embedding cost is per-tenant in any of these patterns — each tenant's documents get embedded once at ingestion. What changes with isolation pattern is query cost. Namespace-isolated queries hit only the target tenant's vectors, so cost scales with that tenant's corpus. Shared-index-with-filter queries scan the full corpus regardless of the filter, so cost scales with the total — meaning your biggest tenant subsidizes the smallest ones, and your smallest tenants pay for the data they cannot see.
