Joshua Fields — Software Engineer & Creative Technologist

Building Real-Time Voice AI Applications with LiveKit and FastAPI

A practical architecture for production voice interactions

Why voice systems feel easy in demos and hard in production

When I work on real-time AI products, voice is usually where teams discover the difference between a good demo and a stable system. The demo often has one clean interaction, one happy path, and no load. Production has packet jitter, user interruptions, reconnects, flaky speech recognition, and all the little timing issues that make a conversation feel robotic if you do not design for them up front.

This write-up explains how I think about building real-time voice AI applications with LiveKit and FastAPI in a way that can actually ship. It is less about one framework trick and more about architecture decisions: where state lives, where latency accumulates, how retries work, and what to observe before users tell you something feels off.

Reference architecture at a glance

A practical voice stack has a few clear layers:

Client: browser or mobile app that captures mic input, streams audio, and plays synthesized responses.
Voice session layer: session identity, auth tokens, connection lifecycle, and per-user context.
LiveKit room: low-latency media transport and participant coordination.
STT pipeline: speech-to-text with partial and final transcripts.
LLM orchestration: prompt construction, tool calls, memory policy, and response shaping.
TTS pipeline: text-to-speech chunks streamed back to the user.
Backend APIs: FastAPI services handling session state, business actions, and persistence.
Observability: metrics, traces, logs, and replay signals for latency and failure analysis.

I try to keep each layer independently testable. If the orchestration logic can only run when a full audio room is active, debugging becomes painful fast.

Client and session boundaries

The client should do as little product logic as possible. It captures audio, handles UI state, and relays events. It should not decide authorization or business outcomes. For every voice session, I generate a short-lived token on the backend, scoped to a single room and participant role. That keeps room access controlled and avoids broad credentials leaking into the frontend.
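As a rough illustration of a short-lived, room-scoped token, here is a standard-library sketch that signs an HS256 JWT. In a real service the LiveKit server SDK would mint the token; the function names, the 300-second default, and the layout of the "video" grant are assumptions made for this example and should be checked against the LiveKit documentation.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data: bytes) -> str:
    """Base64url without padding, as used in JWTs."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _sign(signing_input: str, secret: str) -> str:
    digest = hmac.new(secret.encode(), signing_input.encode("ascii"), hashlib.sha256).digest()
    return _b64url(digest)


def mint_room_token(api_key, api_secret, identity, room,
                    ttl_seconds=300, can_publish=True, now=None):
    """Sign a token that is valid for one room, one identity, and a short window."""
    issued_at = int(time.time() if now is None else now)
    header = {"alg": "HS256", "typ": "JWT"}
    claims = {
        "iss": api_key,
        "sub": identity,
        "nbf": issued_at,
        "exp": issued_at + ttl_seconds,
        # Assumed grant layout: access limited to a single named room.
        "video": {
            "room": room,
            "roomJoin": True,
            "canPublish": can_publish,
            "canSubscribe": True,
        },
    }
    parts = [_b64url(json.dumps(p, separators=(",", ":")).encode()) for p in (header, claims)]
    signing_input = ".".join(parts)
    return signing_input + "." + _sign(signing_input, api_secret)


def verify_room_token(token, api_secret, now=None):
    """Check signature and expiry, then return the claims."""
    signing_input, _, signature = token.rpartition(".")
    if not hmac.compare_digest(_sign(signing_input, api_secret), signature):
        raise ValueError("bad signature")
    payload = signing_input.split(".")[1]
    claims = json.loads(base64.urlsafe_b64decode(payload + "=" * (-len(payload) % 4)))
    current = time.time() if now is None else now
    if current >= claims["exp"]:
        raise ValueError("token expired")
    return claims
```

The point of the sketch is the scoping: the secret never leaves the backend, and a leaked token is only useful for one room and only until it expires.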

I also keep a server-side session record keyed by a stable session id. That record tracks user id (if authenticated), room id, started timestamp, current mode, and latest orchestration state. When a user reconnects, the backend can recover context without guessing from client memory.
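A minimal sketch of such a session record, using an in-memory dictionary where a real deployment would use Redis or a database. The class names, the "listening" default mode, and the reconnect counter are hypothetical choices for illustration.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class VoiceSession:
    session_id: str
    room_id: str
    user_id: Optional[str] = None          # None for anonymous sessions
    started_at: float = field(default_factory=time.time)
    mode: str = "listening"                # hypothetical mode name
    orchestration_state: dict = field(default_factory=dict)
    reconnect_count: int = 0


class SessionStore:
    """In-memory stand-in for a durable session store."""

    def __init__(self) -> None:
        self._sessions: Dict[str, VoiceSession] = {}

    def create(self, room_id: str, user_id: Optional[str] = None) -> VoiceSession:
        session = VoiceSession(session_id=uuid.uuid4().hex, room_id=room_id, user_id=user_id)
        self._sessions[session.session_id] = session
        return session

    def resume(self, session_id: str) -> VoiceSession:
        """Return the server-side record on reconnect instead of trusting client memory."""
        session = self._sessions.get(session_id)
        if session is None:
            raise KeyError(f"unknown session: {session_id}")
        session.reconnect_count += 1
        return session
```

On reconnect the client only needs to present its session id; mode and orchestration state come back from the server copy.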

LiveKit room design decisions

LiveKit gives you the real-time substrate, but you still need conventions. I prefer one room per active user session for most assistant experiences. It keeps event scopes clear and minimizes accidental cross-talk. If your use case requires multiple participants (agent + user + supervisor), define explicit participant metadata and role handling early.

Two patterns help a lot:

Use data channels for control events (interrupt, mute, handoff, system status), not just media streams.
Treat room events as first-class telemetry. Join/leave, track publish/unpublish, reconnect, and bitrate drops are all product signals, not just infra logs.

That event data becomes crucial when someone says, "the assistant started speaking over me" and you need to know whether it was model latency, VAD timing, or a reconnect edge case.
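One way to keep control events disciplined is a small typed envelope that is serialized to bytes before it goes over a data channel. The sketch below covers only the encoding and decoding; the event names mirror the list above, while the field names and the sequence number are assumptions, and the actual publish call is left to whichever LiveKit SDK the client uses.

```python
import json
from dataclasses import dataclass, field
from enum import Enum


class ControlType(str, Enum):
    INTERRUPT = "interrupt"
    MUTE = "mute"
    HANDOFF = "handoff"
    SYSTEM_STATUS = "system_status"


@dataclass(frozen=True)
class ControlEvent:
    type: ControlType
    session_id: str
    seq: int                       # lets the receiver drop stale or duplicate events
    payload: dict = field(default_factory=dict)


def encode_event(event: ControlEvent) -> bytes:
    body = {
        "type": event.type.value,
        "session_id": event.session_id,
        "seq": event.seq,
        "payload": event.payload,
    }
    return json.dumps(body, separators=(",", ":")).encode("utf-8")


def decode_event(raw: bytes) -> ControlEvent:
    """Parse a control message; unknown event types raise ValueError."""
    body = json.loads(raw.decode("utf-8"))
    return ControlEvent(
        type=ControlType(body["type"]),
        session_id=body["session_id"],
        seq=int(body["seq"]),
        payload=body.get("payload", {}),
    )
```

Because every control message carries a session id and a sequence number, the same records can be written straight into the telemetry timeline.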

STT: partials, finals, and confidence handling

Speech-to-text should emit partial transcripts quickly for responsiveness, but downstream business logic should usually wait for final segments, or for segments that clear a confidence threshold. If you run every partial through your orchestration loop, you create race conditions and noisy model calls.

I normally implement a transcript assembler with explicit states:

partial: render to UI, do not commit to durable context.
final: append to conversation history and trigger orchestration.
revised final: patch prior segment if provider corrects recognition.

This makes transcript behavior deterministic and easier to test. It also avoids subtle bugs where the assistant answers a phrase the user never actually said in the final transcript.
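The three states can be sketched as a small assembler. The event kinds and the segment id are assumptions about what an STT provider sends; real providers differ in how they label corrections. In this sketch a revised final patches history without re-triggering orchestration, which is one possible policy rather than the only one.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TranscriptAssembler:
    # segment_id -> committed text; dicts keep insertion order in Python 3.7+
    committed: Dict[str, str] = field(default_factory=dict)
    pending_partial: str = ""

    def on_event(self, kind: str, segment_id: str, text: str) -> bool:
        """Apply one STT event. Returns True when orchestration should run."""
        if kind == "partial":
            # UI-only: never written to durable history.
            self.pending_partial = text
            return False
        if kind == "final":
            self.committed[segment_id] = text
            self.pending_partial = ""
            return True
        if kind == "revised_final":
            if segment_id not in self.committed:
                raise KeyError(f"revision for unknown segment: {segment_id}")
            # Patch in place so the segment keeps its position in history.
            self.committed[segment_id] = text
            return False
        raise ValueError(f"unknown transcript event kind: {kind}")

    def history(self) -> List[str]:
        return list(self.committed.values())
```

Because partials never reach `committed`, the orchestrator can only ever answer text that appears in the final transcript.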

LLM orchestration in FastAPI

For orchestration, I usually expose a FastAPI endpoint or event handler that receives normalized transcript events and returns structured actions rather than raw prose. The action envelope might include:

assistant text
tool calls
state updates
UI directives (for example, ask follow-up, confirm action, escalate)

When teams skip this structure, the orchestration layer turns into prompt glue and ad-hoc branching. I prefer strict schemas, even if the model is flexible. With schema-first orchestration, you can validate outputs, reject malformed actions, and retry safely without duplicating side effects.

This is where FastAPI shines for me: clear request models, async handling, and a straightforward way to compose tool integrations while keeping the contract explicit.
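To make the envelope concrete, here is a dependency-free sketch of schema-first validation. In a FastAPI service these would normally be Pydantic models on the request and response; plain dataclasses are used here only to keep the example self-contained. The field names and the set of UI directives are assumptions based on the list above.

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple

ALLOWED_FIELDS = {"assistant_text", "tool_calls", "state_updates", "ui_directive"}
ALLOWED_DIRECTIVES = {"ask_follow_up", "confirm_action", "escalate"}


@dataclass(frozen=True)
class ToolCall:
    name: str
    arguments: dict


@dataclass(frozen=True)
class ActionEnvelope:
    assistant_text: str
    tool_calls: Tuple[ToolCall, ...] = ()
    state_updates: dict = field(default_factory=dict)
    ui_directive: Optional[str] = None


def parse_envelope(raw: object) -> ActionEnvelope:
    """Validate model output; raise ValueError so the caller can reject or retry."""
    if not isinstance(raw, dict):
        raise ValueError("envelope must be a JSON object")
    unknown = set(raw) - ALLOWED_FIELDS
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    text = raw.get("assistant_text")
    if not isinstance(text, str):
        raise ValueError("assistant_text must be a string")
    items = raw.get("tool_calls", [])
    if not isinstance(items, list):
        raise ValueError("tool_calls must be a list")
    calls = []
    for item in items:
        if (not isinstance(item, dict)
                or not isinstance(item.get("name"), str)
                or not isinstance(item.get("arguments"), dict)):
            raise ValueError("malformed tool call")
        calls.append(ToolCall(item["name"], item["arguments"]))
    state = raw.get("state_updates", {})
    if not isinstance(state, dict):
        raise ValueError("state_updates must be an object")
    directive = raw.get("ui_directive")
    if directive is not None and (not isinstance(directive, str)
                                  or directive not in ALLOWED_DIRECTIVES):
        raise ValueError(f"unsupported ui_directive: {directive!r}")
    return ActionEnvelope(text, tuple(calls), state, directive)
```

Validation happens before any tool runs, so a malformed action can be rejected and the model call retried without side effects.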

Latency budgets and interruption behavior

Voice UX is mostly latency UX. If a response arrives late, users interrupt. If interruption handling is weak, trust drops quickly. I set a practical latency budget per turn and break it down by stage: STT, orchestration, tool calls, and TTS startup. Once you have per-stage timings, you can see which stage dominates the turn and where optimization will pay off first.
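A per-stage budget can be tracked with a small timer like the one below. The budget numbers are placeholders chosen for the example, not recommendations, and the injectable clock exists only so the timer can be tested without sleeping.

```python
import time
from contextlib import contextmanager

# Hypothetical per-stage budget in milliseconds.
BUDGET_MS = {"stt": 300, "orchestration": 500, "tool_calls": 400, "tts_startup": 300}


class TurnTimer:
    """Collects wall-clock time per stage for a single conversational turn."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self.stage_ms = {}

    @contextmanager
    def stage(self, name):
        start = self._clock()
        try:
            yield
        finally:
            elapsed_ms = (self._clock() - start) * 1000.0
            # Accumulate, since a stage such as tool_calls may run more than once.
            self.stage_ms[name] = self.stage_ms.get(name, 0.0) + elapsed_ms

    def total_ms(self):
        return sum(self.stage_ms.values())

    def over_budget(self, budget):
        """Return {stage: milliseconds over budget} for stages that exceeded it."""
        return {
            name: ms - budget[name]
            for name, ms in self.stage_ms.items()
            if name in budget and ms > budget[name]
        }
```

Usage is a matter of wrapping each stage, for example `with timer.stage("stt"): ...`, and emitting `timer.stage_ms` with the turn's log record.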

Interruption should be supported as a first-class control flow:

client sends interrupt event immediately
current TTS stream is cancelled
orchestrator marks prior response as interrupted
next turn continues from clean state

Without explicit cancellation semantics, your system often continues generating text in the background and then leaks stale context into the next turn.
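The cancellation step can be sketched with asyncio tasks. The sleep stands in for synthesizing and sending an audio chunk, and the class and status names are hypothetical; the part that matters is that an interrupt cancels the in-flight task and records exactly what was spoken before the cut.

```python
import asyncio


class TurnPlayback:
    """Streams TTS chunks for one turn and supports explicit interruption."""

    def __init__(self):
        self.sent_chunks = []      # what the user actually heard
        self.status = "idle"
        self._task = None

    async def _stream(self, chunks, delay):
        for chunk in chunks:
            await asyncio.sleep(delay)   # placeholder for synthesis + network send
            self.sent_chunks.append(chunk)
        self.status = "completed"

    def start(self, chunks, delay=0.01):
        """Must be called from inside a running event loop."""
        self.status = "speaking"
        self._task = asyncio.create_task(self._stream(chunks, delay))

    async def interrupt(self):
        """Cancel the in-flight stream. Returns True if something was cancelled."""
        if self._task is None or self._task.done():
            return False
        self._task.cancel()
        try:
            await self._task
        except asyncio.CancelledError:
            pass                         # expected: the stream was cut short
        self.status = "interrupted"
        return True


async def demo():
    playback = TurnPlayback()
    playback.start(["Sure, ", "I can ", "book ", "that."], delay=0.2)
    await asyncio.sleep(0.05)            # user barges in almost immediately
    await playback.interrupt()
    return playback
```

After `asyncio.run(demo())`, the playback is marked interrupted and `sent_chunks` holds only the audio that went out, which is what should be written into conversation history for the next turn.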

Retries and idempotency

Retries are inevitable with external STT, LLM, and TTS providers. The critical point is making retries safe. I attach idempotency keys to orchestration turns and tool executions, then persist turn state transitions: received, processing, completed, failed, cancelled.

If a timeout occurs and the client retries, the backend can return existing results or resume from a known step instead of replaying side effects. This matters a lot when tool calls trigger real actions like booking, messaging, or account changes.
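A minimal sketch of idempotent execution with explicit turn states. It is single-process and in-memory, so it illustrates the state machine rather than a production store; the key format and the allowed transitions (including retry from failed) are assumptions.

```python
from enum import Enum


class TurnState(str, Enum):
    RECEIVED = "received"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"


# Assumed transition table: failed turns may be retried, terminal states may not.
ALLOWED = {
    TurnState.RECEIVED: {TurnState.PROCESSING, TurnState.CANCELLED},
    TurnState.PROCESSING: {TurnState.COMPLETED, TurnState.FAILED, TurnState.CANCELLED},
    TurnState.FAILED: {TurnState.PROCESSING},
    TurnState.COMPLETED: set(),
    TurnState.CANCELLED: set(),
}


class TurnStore:
    def __init__(self):
        self._turns = {}

    def state(self, key):
        return self._turns[key]["state"]

    def _transition(self, record, new_state):
        if new_state not in ALLOWED[record["state"]]:
            raise RuntimeError(f"illegal transition {record['state'].value} -> {new_state.value}")
        record["state"] = new_state

    def run_once(self, key, action):
        """Run action at most once per key; replay the stored result on retries."""
        record = self._turns.get(key)
        if record is not None and record["state"] is TurnState.COMPLETED:
            return record["result"]      # retry after success: no new side effects
        if record is None:
            record = {"state": TurnState.RECEIVED, "result": None}
            self._turns[key] = record
        self._transition(record, TurnState.PROCESSING)
        try:
            result = action()
        except Exception:
            self._transition(record, TurnState.FAILED)
            raise
        record["result"] = result
        self._transition(record, TurnState.COMPLETED)
        return result
```

A client retry that arrives after a timeout reuses the same key, for example `store.run_once("turn-17:book_table", book)`, and gets the original booking back instead of creating a second one.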

Observability that actually helps

For voice systems, aggregate uptime is not enough. I track:

end-to-end turn latency percentiles
time to first transcript token
time to first audio byte from TTS
interrupt rate per session
provider error codes by stage
reconnect frequency and average recovery time

I also keep structured logs keyed by session id and turn id, so I can reconstruct a problematic interaction quickly. If you can replay timeline events (not necessarily raw audio), debugging gets much faster, and you avoid the privacy exposure of storing and replaying recordings.
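As an illustration of both ideas, the sketch below emits JSON log lines keyed by session and turn, rebuilds a turn's timeline from them, and computes latency percentiles with the standard library. The field names are assumptions, and a real system would ship these lines to a log pipeline instead of keeping them in a list.

```python
import json
import statistics


def log_event(session_id, turn_id, stage, ts_ms, **fields):
    """Serialize one timeline event as a JSON log line."""
    record = {"session_id": session_id, "turn_id": turn_id,
              "stage": stage, "ts_ms": ts_ms, **fields}
    return json.dumps(record, sort_keys=True)


def timeline(lines, session_id, turn_id):
    """Rebuild the ordered event timeline for one turn from raw log lines."""
    events = (json.loads(line) for line in lines)
    selected = [e for e in events
                if e["session_id"] == session_id and e["turn_id"] == turn_id]
    return sorted(selected, key=lambda e: e["ts_ms"])


def latency_percentiles(samples_ms):
    """p50/p95/p99 of turn latency; needs at least two samples."""
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

Tracking p95 and p99 next to the median matters here because a voice product is judged by its slow turns, not its typical ones.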

Deployment and scaling notes

On deployment, I treat the voice path as a latency-sensitive service tier. Keep orchestration workers close to your media region where possible, and avoid unnecessary synchronous hops. FastAPI services running behind autoscaling can work well, but cold starts and noisy neighbors still matter for interactive voice.

I generally separate concerns into at least two deployables: API/session control and orchestration workers. This gives flexibility to scale orchestration independently when usage spikes. If you rely on Kubernetes, add readiness checks that validate downstream dependency health rather than only process liveness.
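A dependency-aware readiness check might look like the following. Each probe is a hypothetical async callable that returns a truthy value when its provider is reachable; the timeout value is a placeholder. A FastAPI readiness route would call this and answer with a 503 when the first return value is false.

```python
import asyncio


async def check_readiness(probes, timeout_s=1.0):
    """Run all dependency probes concurrently.

    probes maps a dependency name to a zero-argument async callable.
    Returns (ready, per_dependency_results).
    """
    async def run(name, probe):
        try:
            ok = await asyncio.wait_for(probe(), timeout=timeout_s)
            return name, bool(ok)
        except Exception:
            # Timeouts and connection errors both count as "not ready".
            return name, False

    pairs = await asyncio.gather(*(run(name, probe) for name, probe in probes.items()))
    results = dict(pairs)
    return all(results.values()), results
```

Running probes concurrently with a per-probe timeout keeps the check itself fast, so a slow provider marks the pod unready without stalling the readiness endpoint.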

Security and privacy baseline

Voice products can capture sensitive data by accident. I try to apply a conservative baseline:

short retention for raw transcripts unless explicitly needed
redaction for known sensitive fields in logs
scoped credentials for room tokens and provider APIs
clear user controls for muting, consent, and deletion

Even if your first release is small, setting these boundaries early prevents painful cleanup when the product grows.
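Log redaction can start as simply as the sketch below, which masks known sensitive keys and scrubs a couple of patterns from free text before a record is written. The key names and regular expressions are illustrative assumptions and are deliberately loose; a production system would need a reviewed pattern set.

```python
import re

# Hypothetical keys whose values are always masked.
SENSITIVE_KEYS = {"email", "phone", "ssn"}

# Illustrative patterns for values that show up inside free text.
_PATTERNS = [
    ("EMAIL", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("CARD", re.compile(r"\b\d(?:[ -]?\d){12,15}\b")),   # 13 to 16 digits
]


def redact_text(text):
    """Replace pattern matches in free text with labeled placeholders."""
    for label, pattern in _PATTERNS:
        text = pattern.sub(f"[REDACTED_{label}]", text)
    return text


def redact_record(record):
    """Return a copy of a log record that is safe to write."""
    clean = {}
    for key, value in record.items():
        if key in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, str):
            clean[key] = redact_text(value)
        elif isinstance(value, dict):
            clean[key] = redact_record(value)   # nested payloads get the same treatment
        else:
            clean[key] = value
    return clean
```

Applying this at the logging boundary, not at each call site, means new log statements are covered by default.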

Closing thoughts

Building real-time voice AI applications is not only an LLM problem. It is a systems problem spanning networking, state management, retries, and product interaction design. LiveKit and FastAPI make an effective foundation, but the quality comes from how you define boundaries and failure behavior.

For me, the winning pattern is simple: predictable contracts, explicit state, tight latency feedback loops, and observability from day one. That keeps the experience feeling conversational while still operating like production software.