Voice AI Architecture: Why WebRTC is Becoming the Standard
Every major voice AI platform is converging on WebRTC. Here's why, and what it means for developers.
Something interesting is happening in voice AI infrastructure. If you've been building conversational AI applications, you've probably noticed: every major platform is converging on WebRTC.
ElevenLabs added WebRTC support to their Conversational AI platform. OpenAI's Realtime API uses WebRTC as its recommended transport. LiveKit built their entire Agents framework around it. Pipecat, the open-source voice AI framework, ships with WebRTC transports. Even newer entrants like LLMRTC are WebRTC-native from day one.
This isn't a coincidence. It's the industry recognizing what telephony engineers have known for decades: real-time voice is a solved problem, and the solution is WebRTC.
The Latency Imperative
Human conversation operates on tight timing constraints. Research shows the ideal turn-taking delay is around 200ms. Delays over 500ms feel unnatural. Beyond 800ms, conversations start to break down—users repeat themselves, talk over the AI, or abandon the interaction entirely.
A typical voice AI pipeline looks like this:
User Speech → STT (200ms) → LLM (300ms) → TTS (200ms) → Response

That's already 700ms before you add network latency. Every millisecond in your transport layer eats into your latency budget.
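That budget arithmetic is worth making explicit. A tiny sketch, using the illustrative stage figures above (not measurements):

```typescript
// Illustrative per-stage latencies from the pipeline above, in ms.
const stages = { stt: 200, llm: 300, tts: 200 };

// ~800ms is roughly where conversations start to break down.
function networkHeadroomMs(budgetMs: number): number {
  const pipelineMs = Object.values(stages).reduce((a, b) => a + b, 0);
  return budgetMs - pipelineMs; // what's left for the transport layer
}

console.log(networkHeadroomMs(800)); // 100ms left for the entire network path
```

With only ~100ms of headroom, a transport that adds even one retransmission round-trip can blow the budget on its own.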
WebRTC was designed for exactly this problem. It's not just "low latency"—it's optimized for real-time media in ways that HTTP and WebSockets simply aren't.
Why WebRTC Wins for Voice AI
1. Browser-Native Audio Processing
WebRTC includes battle-tested audio processing that's been refined across billions of video calls:
- Acoustic Echo Cancellation (AEC): Removes the AI's voice from the microphone input, preventing feedback loops
- Automatic Gain Control (AGC): Normalizes volume levels across different microphones and environments
- Noise Suppression: Filters background noise without degrading speech quality
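In browsers, these three stages are toggled through standard `getUserMedia` audio constraints. A minimal sketch:

```typescript
// Standard MediaTrackConstraints fields; major browsers default these to
// true for microphone capture, but being explicit documents the intent.
const audioConstraints = {
  echoCancellation: true,  // AEC: strip far-end audio from the mic signal
  autoGainControl: true,   // AGC: normalize input volume
  noiseSuppression: true,  // filter steady-state background noise
};

// In a browser context:
// const stream = await navigator.mediaDevices.getUserMedia({ audio: audioConstraints });
```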
Building this yourself is months of work. With WebRTC, it's automatic. ElevenLabs explicitly calls this out—their WebRTC integration unlocks "best-in-class echo cancellation and background noise removal" that wasn't possible with their WebSocket-based approach.
2. Designed for Unreliable Networks
WebRTC uses UDP with custom reliability layers, making intelligent tradeoffs that HTTP can't:
- Prioritizes low latency over perfect delivery
- Implements jitter buffers to smooth out network variance
- Handles packet loss gracefully for audio (interpolation vs. retransmission)
- Adapts bitrate dynamically based on network conditions
For voice, a slightly degraded audio frame delivered on time beats a perfect frame delivered late.
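The jitter buffers mentioned above are sized from a running jitter estimate. RFC 3550 defines it as an exponential filter over transit-time variation, which is only a few lines:

```typescript
// RFC 3550 interarrival jitter: J += (|D| - J) / 16, where D is the
// difference in relative transit time between consecutive packets.
function updateJitter(prev: number, transitDeltaMs: number): number {
  return prev + (Math.abs(transitDeltaMs) - prev) / 16;
}

// Feed in observed transit-time deltas (ms, sign indicates early/late).
// The 1/16 gain makes the estimate converge slowly, so the playout
// buffer doesn't react to one-off spikes.
let jitter = 0;
for (const d of [5, -3, 8, 2]) jitter = updateJitter(jitter, d);
```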
3. Zero Plugin Installation
WebRTC works natively in every modern browser. No plugins, no downloads, no user friction. For voice AI applications targeting end users, this is table stakes.
4. Peer-to-Peer Capability
While most voice AI architectures use a server-mediated model, WebRTC's P2P capability opens interesting possibilities for edge-deployed models and reduced infrastructure costs.
The Standard Architecture
Despite different branding, most voice AI platforms have converged on a similar architecture:
```
┌──────────────────────────────────────────────────┐
│                     Browser                      │
│    ┌─────────┐                    ┌─────────┐    │
│    │   Mic   │                    │ Speaker │    │
│    └────┬────┘                    └────▲────┘    │
│         │                              │         │
│         ▼                              │         │
│  ┌────────────────────────────────────────────┐  │
│  │           WebRTC Audio Pipeline            │  │
│  │     (AEC, AGC, Noise Suppression, VAD)     │  │
│  └─────────────────────┬──────────────────────┘  │
└────────────────────────┼─────────────────────────┘
                         │ WebRTC
                         ▼
┌──────────────────────────────────────────────────┐
│                 Voice AI Server                  │
│                                                  │
│  ┌─────────┐    ┌─────────┐    ┌─────────┐       │
│  │   STT   │ ─▶ │   LLM   │ ─▶ │   TTS   │       │
│  └─────────┘    └─────────┘    └─────────┘       │
│                                                  │
│  ┌────────────────────────────────────────────┐  │
│  │       Turn Detection / Interruption        │  │
│  └────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────┘
```

The key components:
- WebRTC Transport: Handles audio streaming with built-in echo cancellation
- Voice Activity Detection (VAD): Determines when the user is speaking
- Speech-to-Text (STT): Streaming transcription (Deepgram, AssemblyAI, Whisper)
- LLM: Generates responses (GPT-4, Claude, Llama)
- Text-to-Speech (TTS): Synthesizes audio response (ElevenLabs, Cartesia, PlayHT)
- Turn Detection: Determines when to start/stop speaking, handles interruptions
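Stripped of vendor branding, the core loop is small. A sketch with hypothetical interfaces (every name here is illustrative, not any framework's real API):

```typescript
// Hypothetical stage interfaces; real frameworks stream partial results
// through each stage instead of awaiting complete outputs.
interface STT { transcribe(audio: Uint8Array): Promise<string>; }
interface LLM { complete(text: string): Promise<string>; }
interface TTS { synthesize(text: string): Promise<Uint8Array>; }

async function handleTurn(
  audio: Uint8Array, stt: STT, llm: LLM, tts: TTS,
): Promise<Uint8Array> {
  const transcript = await stt.transcribe(audio); // user speech → text
  const reply = await llm.complete(transcript);   // text → response text
  return tts.synthesize(reply);                   // response text → audio
}
```

The sequential awaits are exactly why streaming matters: production stacks overlap these stages, starting TTS on the first LLM sentence rather than waiting for the last.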
The Key Players
LiveKit Agents
LiveKit has emerged as the infrastructure layer for production voice AI. Their Agents framework treats AI as a first-class WebRTC participant—the agent joins the same "room" as the user with full access to session state.
```python
from livekit.agents import AutoSubscribe, JobContext
from livekit.agents.voice_assistant import VoiceAssistant
from livekit.plugins import cartesia, deepgram, openai, silero

async def entrypoint(ctx: JobContext):
    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)
    assistant = VoiceAssistant(
        vad=silero.VAD.load(),
        stt=deepgram.STT(),
        llm=openai.LLM(),
        tts=cartesia.TTS(),
    )
    assistant.start(ctx.room)
```

Their worker-job architecture provides session isolation and horizontal scaling—if one agent crashes, others are unaffected.
ElevenLabs Conversational AI
ElevenLabs recently added WebRTC as an alternative to their WebSocket transport:
```typescript
import { useConversation } from '@elevenlabs/react';

const conversation = useConversation();

await conversation.startSession({
  agentId: 'your-agent-id',
  connectionType: 'webrtc', // vs 'websocket'
});
```

The difference is immediate: WebRTC provides superior echo cancellation that eliminates audio feedback issues common in speaker-to-microphone setups.
OpenAI Realtime API
OpenAI's Realtime API supports WebRTC natively. From their documentation:
> In most cases, use the WebRTC API for real-time audio streaming. WebRTC is designed to minimize delay, making it more suitable for audio and video communication where low latency is critical.
Their benchmarks show first partial text responses in 150-250ms and first audible synthesized phonemes in 220-400ms with WebRTC transport—fast enough for natural-feeling dialogue.
Pipecat
Pipecat, built by Daily, is the open-source option. It's framework-agnostic and supports multiple transports:
```python
from pipecat.transports.network.small_webrtc import SmallWebRTCTransport

transport = SmallWebRTCTransport(
    host="0.0.0.0",
    port=7860,
)
```

Their SmallWebRTC transport is fully P2P with end-to-end encryption—great for privacy-sensitive applications.
LLMRTC
LLMRTC is a newer TypeScript-native option that's provider-agnostic:
```typescript
import { VoiceAgent } from '@llmrtc/core';

const agent = new VoiceAgent({
  stt: { provider: 'deepgram' },
  llm: { provider: 'openai', model: 'gpt-4o' },
  tts: { provider: 'elevenlabs' },
});

await agent.connect();
```

It supports running entirely locally with Ollama, Faster-Whisper, and Piper—no cloud dependencies.
The Observability Gap
Here's the challenge: WebRTC's strengths create observability blind spots.
WebRTC enables peer-to-peer interactions where media often doesn't flow through servers you control. All media traffic is encrypted with SRTP, limiting packet inspection. Traditional monitoring tools weren't built for real-time media sessions with sub-second quality requirements.
Voice AI compounds this complexity. You're not just monitoring a WebRTC session—you're monitoring a pipeline where latency can originate from:
- Network conditions (jitter, packet loss)
- STT service response time
- LLM token generation
- TTS synthesis
- Turn detection decisions
Without proper instrumentation, debugging becomes guesswork. Was that awkward pause caused by network jitter or LLM latency? Did the interruption fail because of VAD sensitivity or audio echo leakage?
The metrics that matter for voice AI observability:
| Metric | Why It Matters |
|---|---|
| Time to First Byte (audio) | User perception of responsiveness |
| End-to-end latency | Total user-speech to AI-speech time |
| Interruption success rate | Natural conversation feel |
| STT/LLM/TTS breakdown | Identify pipeline bottlenecks |
| Audio quality (MOS) | User experience quality |
| Turn detection accuracy | Conversation flow |
WebRTC's getStats() API provides client-side metrics, but correlating these with server-side pipeline telemetry requires purpose-built tooling.
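The client side of that correlation can start simple. A sketch that pulls the audio-relevant fields out of a `getStats()` report (field names follow the W3C WebRTC Statistics spec; flattening the report into a plain array is the only assumption):

```typescript
// Minimal shape of the stats entries we care about.
type StatsEntry = {
  type: string;
  kind?: string;
  jitter?: number;      // seconds, per the stats spec
  packetsLost?: number;
};

function inboundAudioQuality(
  report: StatsEntry[],
): { jitterMs: number; packetsLost: number } | null {
  const rtp = report.find((e) => e.type === "inbound-rtp" && e.kind === "audio");
  if (!rtp) return null;
  return { jitterMs: (rtp.jitter ?? 0) * 1000, packetsLost: rtp.packetsLost ?? 0 };
}

// In a browser: inboundAudioQuality([...(await pc.getStats()).values()]);
```

Sampling this on an interval and shipping it alongside your pipeline's STT/LLM/TTS timings is the cheapest way to tell network-induced pauses from model-induced ones.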
What This Means for Developers
If you're building voice AI applications today:
- Start with WebRTC: Don't fight the convergence. Every major platform supports it, and the audio processing benefits are significant.
- Choose your abstraction level:
- Want full control? Use Pipecat or LLMRTC
- Want managed infrastructure? Use LiveKit
- Want turnkey? Use ElevenLabs or OpenAI Realtime directly
- Instrument from day one: Voice AI is latency-sensitive enough that you need visibility into every pipeline stage. Build observability in early.
- Test with real audio conditions: Synthetic tests don't catch echo cancellation issues, background noise handling, or interruption edge cases. Test with actual microphones and speakers.
The WebRTC convergence is good news for developers—it means portable skills, interoperable tools, and a shared understanding of best practices. The stack is maturing, and production-grade voice AI is now achievable without building everything from scratch.