RAG & LLM Agents for Telehealth SaaS — HIPAA-Safe Clinician Drafts

Cited clinician drafts that save real minutes per encounter, on PHI-safe RAG with BAA-eligible model routing and an audit log compliance can read on its own.

The problem

A clinician-facing AI feature in a telehealth product fails in places generic SaaS AI features do not. Four failures cluster together because teams ship the chat-with-your-docs demo first and bolt HIPAA on after.

PHI in the embedding store

A developer wires up a vector store, indexes the chart corpus to make retrieval "smart," and now patient names, diagnosis codes, and clinical narrative have been sent to an embedding provider the company has no BAA with. The embedding step is where the PHI boundary gets crossed silently, weeks before anyone notices.

Drafts without citations

Clinicians will not act on a draft they cannot verify, so a confident summary without source links is a draft they re-derive by hand, and the time savings disappear. The product saves minutes on paper and adds them back in the review step.

Partial-answer rendering

An agent in the middle of a tool-call loop times out, the UI renders what it had, and a clinician sees a note that looks complete but is missing the medication reconciliation step. A half-rendered draft that looks finished is worse than no draft at all.

Unbounded model routing

A single deploy ships PHI to whichever model the team prefers that week, and nobody can prove which provider saw which patient on which date when the compliance review comes. The audit answer at six months is a search through git history, not a query against an audit log.

All four trace back to that demo-first build. The fix is not a settings checkbox. It is a different shape of architecture: one that draws the PHI boundary at the SDK call site, splits retrieval into a non-PHI reference corpus and per-tenant patient stores, requires every clinician-facing claim to carry a citation, refuses to render a partial answer, and logs every LLM call to the standard a Security Rule auditor expects to see. Most of that work is invisible to the clinician. What they see is a draft that saves them roughly ten minutes per encounter and a source link under every claim. What the platform sees is a feature that does not turn into a breach notification.

What changes for your business

A telehealth-aware RAG and agent stack treats PHI as routed data, not contextual data.

Typed BAA-routed SDK wrapper

The SDK wrapper that fronts every LLM call carries a typed argument for which BAA surface the call is allowed to hit, and the router rejects any PHI-bearing call to a non-eligible provider at the wrapper layer rather than at review time. The misroute becomes a compile-time error or a wrapper-level rejection rather than a code-review concern.
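
A minimal sketch of that wrapper-level check, in Python. The names `DataClass`, `Route`, `MisrouteError`, and `check_route` are hypothetical and exist only for illustration. Python rejects the call at runtime; a statically typed language or a type checker could move the same rule earlier.

```python
from dataclasses import dataclass
from enum import Enum


class DataClass(Enum):
    REFERENCE = "reference"  # non-PHI corpus content
    PHI = "phi"              # patient-specific content


@dataclass(frozen=True)
class Route:
    provider: str       # hypothetical provider label
    model: str          # hypothetical model label
    baa_eligible: bool  # True only for surfaces covered by a signed BAA


class MisrouteError(Exception):
    """Raised when a PHI-bearing call targets a non-BAA route."""


def check_route(data_class: DataClass, route: Route) -> Route:
    # Reject before any SDK call is made, so PHI never leaves the process.
    if data_class is DataClass.PHI and not route.baa_eligible:
        raise MisrouteError(
            f"PHI call blocked: {route.provider}/{route.model} is not BAA-eligible"
        )
    return route
```

Every LLM call would pass through `check_route` first, so a reference-corpus call can use any route while a PHI call only succeeds on an eligible one.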

Dual-store retrieval boundary

The retrieval layer splits into two stores with different rules. The first is a shared, non-PHI reference corpus (clinical guidelines, drug references, payer policy, internal SOPs) that any tenant can query, indexed with any embedding provider you have a commercial agreement with. The second is a per-tenant patient store that lives under row-level isolation, only embeds when the tenant has explicitly authorized it, and only routes through models under a signed BAA. The two stores do not join inside the same retrieval call, which is what keeps the minimum-necessary boundary visible in code instead of in a policy document.
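
The boundary can be sketched with two in-memory stores. This is an illustration under stated assumptions, not a production design: `Chunk`, `ReferenceStore`, and `PatientStore` are hypothetical names, keyword matching stands in for vector similarity, and a real build would enforce isolation in the database rather than in a Python dict.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    text: str


def _match(chunks, query):
    # Keyword match stands in for vector similarity in this sketch.
    q = query.lower()
    return [c for c in chunks if q in c.text.lower()]


class ReferenceStore:
    """Shared non-PHI corpus; any tenant may query it."""

    def __init__(self, chunks):
        self._chunks = list(chunks)

    def search(self, query):
        return _match(self._chunks, query)


class PatientStore:
    """Per-tenant PHI store; every call is scoped to one tenant."""

    def __init__(self):
        self._chunks = {}         # tenant_id -> list of Chunk
        self._authorized = set()  # tenants that opted in to embedding

    def authorize(self, tenant_id):
        self._authorized.add(tenant_id)

    def add(self, tenant_id, chunk):
        self._require(tenant_id)
        self._chunks.setdefault(tenant_id, []).append(chunk)

    def search(self, tenant_id, query):
        self._require(tenant_id)
        return _match(self._chunks.get(tenant_id, []), query)

    def _require(self, tenant_id):
        if tenant_id not in self._authorized:
            raise PermissionError(f"tenant {tenant_id} has not authorized PHI retrieval")
```

There is deliberately no method that queries both stores at once; a caller who needs both makes two separate calls and the PHI one is always tenant-scoped.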

Clean-terminal-state clinician copilot

The clinician-copilot loop on top of that retrieval treats partial answers as a defect, not a degraded state. Every tool call is wrapped in a typed retry with a max-depth, a per-call timeout, and a clean terminal-state requirement; the UI renders a draft only when the loop closes cleanly, and surfaces a 'draft unavailable, fall back to manual' state when it does not.
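
One way to sketch the clean-terminal-state rule in Python. `TransientToolError`, `DraftResult`, and `run_steps` are hypothetical names, and the sketch assumes each tool enforces its own per-call timeout and signals it by raising `TimeoutError`.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


class TransientToolError(Exception):
    """A failure worth retrying (network blip, malformed tool response)."""


@dataclass(frozen=True)
class DraftResult:
    status: str                     # "complete" or "unavailable"
    sections: Tuple[str, ...] = ()  # populated only when status is "complete"
    reason: Optional[str] = None


def run_steps(
    steps: Sequence[Callable[[], str]],
    max_retries: int = 2,
    backoff_s: float = 0.5,
) -> DraftResult:
    sections = []
    for step in steps:
        for attempt in range(max_retries + 1):
            try:
                sections.append(step())
                break  # this step succeeded; move on to the next one
            except (TransientToolError, TimeoutError):
                if attempt == max_retries:
                    # Retries exhausted: discard everything gathered so far.
                    return DraftResult("unavailable", reason="retries exhausted")
                time.sleep(backoff_s * (2 ** attempt))
            except Exception as exc:
                # Hard failure: no retry, and no partial draft either.
                return DraftResult("unavailable", reason=f"hard failure: {type(exc).__name__}")
    return DraftResult("complete", tuple(sections))
```

The UI would render only when `status == "complete"`; an `"unavailable"` result never carries sections, so a half-finished note has nothing to render from.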

Citation-backed drafts

Every claim in the rendered draft carries a citation — backed by Anthropic's citations feature or an equivalent custom layer on OpenAI — that resolves to the exact span in the underlying source document, so the clinician's review is a span-by-span confirm rather than a re-read.
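
A rough sketch of a pre-render citation check. The `Citation` and `Claim` shapes here are illustrative and are not any provider's response schema; the assumption is that each citation has already been mapped to a document ID, a character range, and the quoted text.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class Citation:
    doc_id: str
    start: int       # character offset into the source document
    end: int         # exclusive end offset
    cited_text: str  # the span the model claims to quote


@dataclass(frozen=True)
class Claim:
    text: str
    citations: Tuple[Citation, ...]


def verify_claims(claims: Sequence[Claim], sources: Dict[str, str]) -> List[str]:
    """Return a list of problems; an empty list means the draft may render."""
    problems = []
    for claim in claims:
        if not claim.citations:
            problems.append(f"uncited: {claim.text}")
            continue
        for c in claim.citations:
            doc = sources.get(c.doc_id)
            # The quoted span must match the source exactly at the given offsets.
            if doc is None or doc[c.start:c.end] != c.cited_text:
                problems.append(f"unresolved span: {claim.text}")
    return problems
```

A draft with any reported problem would be held back rather than shown with a missing or broken source link.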

Append-only audit log keyed to PHI

Every LLM call that touched PHI lands in an append-only audit log carrying provider, model, request and response hashes, token counts, citation IDs, tool calls made, latency, and a BAA-route flag, which is what lets the compliance officer answer 'what did the model say to this clinician about this patient on this date' six months later without filing a ticket with engineering.
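
The record shape could look something like the following. This is a sketch: the field names, `AuditLog`, and `record_call` are hypothetical, and an in-memory list only imitates append-only behavior that a real system would enforce in the storage layer.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass(frozen=True)  # frozen: a record cannot be edited after creation
class AuditRecord:
    user_id: str
    timestamp: str
    action: str
    object_id: str
    provider: str
    model: str
    request_hash: str
    response_hash: str
    input_tokens: int
    output_tokens: int
    citation_ids: tuple
    tool_calls: tuple
    latency_ms: int
    baa_route: bool


class AuditLog:
    """Exposes append and read only; there is no update or delete."""

    def __init__(self):
        self._records = []

    def append(self, record: AuditRecord) -> None:
        self._records.append(record)

    def for_object(self, object_id: str):
        return [r for r in self._records if r.object_id == object_id]


def record_call(log, *, user_id, object_id, provider, model, request, response,
                input_tokens, output_tokens, citation_ids=(), tool_calls=(),
                latency_ms=0, baa_route=False) -> AuditRecord:
    # Store hashes, not raw text, so the log itself does not hold PHI.
    rec = AuditRecord(
        user_id=user_id,
        timestamp=datetime.now(timezone.utc).isoformat(),
        action="llm_call",
        object_id=object_id,
        provider=provider,
        model=model,
        request_hash=sha256_hex(request),
        response_hash=sha256_hex(response),
        input_tokens=input_tokens,
        output_tokens=output_tokens,
        citation_ids=tuple(citation_ids),
        tool_calls=tuple(tool_calls),
        latency_ms=latency_ms,
        baa_route=baa_route,
    )
    log.append(rec)
    return rec
```

With records keyed by `object_id`, the six-months-later question becomes a lookup such as `log.for_object("patient-42")` filtered by user and date.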

The outcome for the business is that the AI feature becomes the part of the product that drives clinician retention rather than the part the team is nervous about. Drafts save real minutes per encounter: independent academic work has documented roughly sixteen minutes of documentation time saved per eight hours of patient care for clinicians who use ambient AI tooling on more than half of visits. The platform avoids owning a HIPAA incident because PHI structurally cannot reach a non-BAA route. And the audit log is something compliance can read on its own, which goes a long way toward getting a clinical AI feature through its first security review.

More on this

What gets shipped for telehealth-specific RAG and agents

A telehealth-aware build leaves your repository with the layer between the LLM SDK and your clinical product, shaped to the specific failure modes the vertical introduces. Concretely:

  • A typed model router with a per-call argument for which BAA surface the request is allowed to hit. A PHI-bearing call to a non-eligible route is a compile-time error or a runtime rejection at the wrapper, not a code review concern. Feature flags (web search, model variants, structured-output modes) that fall outside the BAA are blocked at the same layer.
  • A dual-store retrieval layer. A shared non-PHI reference corpus — clinical guidelines, drug references, payer policy, internal SOPs — gets embedded once and queried across tenants. A per-tenant patient store lives under row-level isolation on pgvector or per-tenant namespaces on Pinecone, only embeds chart data when the tenant has explicitly authorized the embedding under a BAA-eligible provider, and does not join the reference corpus inside the same retrieval call.
  • A clinician-copilot agent loop with a max-depth, per-call timeouts, typed error contracts, and a hard rule that partial answers do not render. A transient failure retries with backoff; a hard failure surfaces a 'draft unavailable' state. The clinician sees a complete draft or no draft, not a half-draft that looks complete.
  • A citation layer — Anthropic's citations feature for Claude-routed calls, an equivalent quote-extraction layer for OpenAI-routed calls — that returns exact cited spans per claim. The clinician UI renders the underlying source paragraph inline under each cited claim, so review is a confirm-or-reject per span rather than a re-derivation of the whole note.
  • An append-only LLM audit log with the fields a Security Rule reviewer expects: user ID, timestamp, action, object touched, plus LLM-specific provider, model, request hash, response hash, token counts, citation IDs returned, tool calls made, latency, and BAA-route flag. The log is queryable on its own, retains for the duration your compliance program requires, and is the artifact the security review reads.
  • An eval harness that runs a frozen clinical-question set on every prompt or model change, fails the deploy on cross-tenant retrieval, fails on a PHI-bearing call routing to a non-eligible provider, and reports the cost and behavior delta of the change before it ships. The eval is what turns a prompt change from a leap of faith into a measured deploy.
  • Runbooks for the failure modes that actually happen in production telehealth: a model deprecation on the BAA-eligible surface, vector index drift after re-embedding the reference corpus, a tenant whose token usage suddenly grows tenfold, a tool that starts returning malformed JSON after a vendor upgrade, a citation feature that regresses on a new model version.
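
The two deploy-blocking checks from the eval harness item above can be sketched as a pure function over recorded eval traces. `EvalTrace` and `find_violations` are hypothetical names, and the sketch assumes the harness already records which tenants' data each retrieval returned and which route each call took.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class EvalTrace:
    case_id: str
    tenant_id: str                         # tenant the eval case ran as
    retrieved_tenant_ids: Tuple[str, ...]  # owners of every retrieved chunk
    carries_phi: bool
    baa_route: bool


def find_violations(traces: Sequence[EvalTrace]) -> List[str]:
    violations = []
    for t in traces:
        # Any retrieved chunk owned by another tenant is a cross-tenant leak.
        leaked = sorted(set(t.retrieved_tenant_ids) - {t.tenant_id})
        if leaked:
            violations.append(f"{t.case_id}: cross-tenant retrieval from {', '.join(leaked)}")
        if t.carries_phi and not t.baa_route:
            violations.append(f"{t.case_id}: PHI routed to non-eligible provider")
    return violations
```

A CI step would run the frozen question set, pass the traces to `find_violations`, and exit non-zero if the returned list is not empty.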

What clinical product leaders ask first

Founders and clinical product leaders evaluating this work tend to ask the same questions:

  • "Will the clinician actually use it?" Only if the draft cites its sources and never renders a partial answer; the build enforces both structurally.
  • "Can we use ChatGPT or Claude with PHI?" Yes, on the BAA-eligible surfaces with the right configuration, enforced at the SDK wrapper rather than by review.
  • "How do we prove no PHI left the boundary?" The audit log shows every call's route flag, and the eval harness fails the build if a PHI-bearing path can resolve to a non-eligible provider.
  • "What happens when the model gets deprecated?" The wrapper isolates provider and model so the swap is a config change, and the eval harness reports the behavior delta before the swap ships.
  • "Will my engineers be able to maintain this?" Yes, because the code lives in your repository and follows your existing conventions, and the runbooks cover the specific 2am failure modes the vertical introduces.

Proof this pattern lands

BoostFrame Engineering AI runs seven production applications on the same RAG, agent-orchestration, and audit-logging stack described above. That stack orchestrates six engines (Claude, ChatGPT, Gemini, Perplexity, AI Overview, AI Mode) and carries a production load that has generated 200K+ AI-assisted keywords, run 1,500+ AI scans, and automated work for 7,000+ customer sites. BFEAI is not a telehealth product, and we do not pretend it is. What transfers is the architecture: the typed model router, the dual-store retrieval boundary, the citation-aware agent loop, the append-only audit log, the eval harness that fails the deploy on a class of bug rather than a specific bug. The telehealth-specific work (the BAA routing surface, the PHI-vs-reference store split, the clinical citation UI, the audit log schema your security team needs to read) is the part we architect against your existing compliance regime and your specific chart data shape, not something we bring in pre-built. The author is Bill Fackelman, co-founder and CTO of BoostFrame Engineering AI.

Outcomes you should expect

What this delivers

  • Clinicians get a draft note or chart summary that saves roughly 10 minutes per encounter without the platform owning a HIPAA incident.
  • Every clinician-facing claim links to its source document, so the review step is a fast confirm rather than a re-read of the chart.
  • PHI stays out of embedding contexts and system prompts unless the tenant has explicitly authorized it under a BAA-eligible model route.
  • Every LLM call that touches PHI lands in an append-only audit log a compliance officer can read on their own without filing a ticket.

Industry data

By the numbers

  • Anthropic's BAA is only available to customers using HIPAA-eligible services and requires specific configuration limits — for example, the BAA does not apply to the web search functionality — which means telehealth teams must route PHI-bearing calls only through the eligible API surface.

  • OpenAI's documentation states that Web Search with live internet access is not HIPAA eligible and is not covered by a BAA, while the offline cache-only mode is BAA-eligible when used with a Zero Data Retention project — so the BAA boundary cuts through individual tool configurations, not just whole APIs.

  • The HIPAA Security Rule at 45 CFR 164.312(b) requires covered entities to implement audit controls — hardware, software, and procedural mechanisms that record and examine activity in information systems containing or using electronic PHI — which extends to every LLM call that touches PHI.

  • HIPAA's minimum necessary standard at 45 CFR 164.502(b) limits any use, disclosure, or request of PHI to the smallest amount reasonably needed to accomplish the purpose, which constrains what can land in an embedding model, a system prompt, or a retrieved context window.

  • Anthropic's citations feature returns exact cited_text spans with character or page indices into the source document, so a clinician-facing draft can be wired to surface the underlying source for every claim — and cited_text does not count toward output tokens.

  • A quasi-experimental study of ambient AI scribes across 1,800 clinicians at five academic medical centers from 2023 to 2025 found roughly 16 minutes of documentation time saved and 13 fewer minutes in the EHR per eight hours of patient care, with savings concentrated among clinicians who used the tool on more than half of visits.

Live in production today

The same engineering, shipped in production at BFEAI.

I'm co-founder & CTO of Be Found Everywhere (BFEAI), a 7-app AI SaaS platform running today. The work I deliver for clients is the work I do every week on my own platform.

  • 7 production apps
  • 200K+ keywords generated
  • 1,500+ AI scans run
  • 7,000+ sites automated

Common questions

What buyers ask before reaching out

Can a telehealth SaaS use Claude or GPT for clinical drafts at all?

Yes, but only on the BAA-eligible surface and only with the right configuration. Anthropic's BAA covers HIPAA-eligible services with specific feature restrictions — for example, the BAA does not apply to web search. OpenAI's BAA likewise covers most API endpoints but not features like live web search. The architecture decision is routing every PHI-bearing call through the eligible surface, blocking the ineligible features at the SDK wrapper layer, and proving in audit that no PHI ever left the boundary. The model itself is the easy part; the routing and the proof are the work.

What does PHI-safe RAG actually mean in practice?

It means the embedding model, the vector store, and the retrieval context window only see PHI when the tenant has explicitly authorized that route under a BAA-eligible provider — and a different, non-PHI corpus (clinical guidelines, drug references, payer policy) gets indexed in a separate namespace that any tenant can query. Patient-specific retrieval lives on a per-tenant, per-patient scope with row-level isolation at the store; reference retrieval lives in a shared, non-PHI namespace. The two do not join inside the same retrieval call, which keeps the minimum-necessary boundary visible in code.

Why do clinicians demand citations on every claim?

Because a draft without sources is a draft they have to re-derive themselves, which removes the time savings entirely. Anthropic's citations feature returns exact cited_text spans with index ranges into the source document, so the UI can render the underlying paragraph from the chart or the guideline for every clinician-facing claim. The review becomes a confirm-or-reject on each cited span instead of a full re-read. The cost is also favorable — Anthropic's citations feature does not bill cited_text toward output tokens.

How do you handle a tool call that fails partway through a clinician's draft?

You do not show a partial answer mid-encounter. The agent loop wraps every tool call in a typed retry with a max-depth and a per-call timeout, and the UI only renders a draft when the loop reaches a clean terminal state. A transient failure (network, timeout, malformed tool response) triggers a retry with backoff; a hard failure surfaces an explicit 'draft unavailable, fall back to manual' state rather than a half-rendered note. The clinician's worst experience is a missing draft, not a wrong draft that looks complete.

What goes in the audit log for an LLM call?

The fields a Security Rule reviewer typically expects an audit control to capture for ePHI access (user ID, timestamp, source, action, and the object touched), plus LLM-specific fields: provider, model, request hash, response hash, token counts, citation IDs returned, tool calls made, latency, and whether the call hit a BAA-eligible route. The log is append-only and is retained for as long as your compliance program requires. The point of the log is not just compliance evidence; it is what lets you answer 'what did the model say to this clinician about this patient on this date' six months later when a review comes in.

How do you decide which model to route a call to when PHI is involved?

Routing is a typed decision based on the data the call carries. A retrieval that pulls from the shared non-PHI reference corpus can route to any provider you have a commercial agreement with. A retrieval that pulls patient-specific PHI must route to a model under a signed BAA — currently a narrow set including specific Anthropic and OpenAI API surfaces with the right configuration. The router rejects PHI-bearing calls to non-eligible providers at the wrapper layer, so the mistake of sending a patient note to a non-BAA endpoint becomes impossible by construction rather than guarded by review.

How long does this kind of build take?

For a telehealth SaaS with a working clinical product and an existing BAA-eligible model account, the PHI-safe RAG plus clinician-copilot agent loop plus audit log is typically a 4 to 8 week build, scoped to one or two clinician-facing surfaces. Most of the time goes into the dual-store retrieval boundary, the citation rendering UI, and the audit log schema — not into LLM prompts. The slow variable is usually how clean the chart data is on your side; the LLM surface area is small once the boundary is drawn.

Ready to see if this is a fit?

A 15-minute call. No deck, no slides. We talk about what you're shipping and where engineering is the bottleneck. Either way, you walk away with a senior engineer's read on your situation.