AI latency is the time between the end of a user's input and the start of the AI's response. In voice AI, this is measured from when you finish speaking to when the first audio byte plays back. Latency is the most important usability metric in conversational AI — it determines whether talking to AI feels like a real conversation or like waiting on hold.
End-to-end AI latency has four components: (1) STT latency — the time to transcribe your speech; (2) network latency — round-trip time to API servers; (3) LLM inference latency — time for the model to generate a response; (4) TTS latency — time for audio synthesis to begin streaming. Modern systems optimise all four in parallel using streaming architectures.
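The four components can be sketched as a simple latency budget. The millisecond values below are illustrative assumptions drawn from the ranges discussed in this article, not measurements:

```python
# Hypothetical latency budget for a streamed voice pipeline.
# Each value is the time-to-first-output of that stage, since streaming
# puts only first outputs (not full completions) on the critical path.
BUDGET_MS = {
    "stt": 100,      # first partial transcript from streaming STT
    "network": 50,   # round trip to API servers
    "llm": 300,      # first token from the LLM
    "tts": 150,      # first audio chunk from TTS
}

def end_to_end_ms(budget: dict[str, int]) -> int:
    """Sum the per-stage contributions on the critical path."""
    return sum(budget.values())

print(end_to_end_ms(BUDGET_MS))  # 600
```

With these assumed numbers the budget lands at 600ms — the upper end of the 400-600ms range cited below, and a useful way to see which stage dominates.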
Lucy OS1 targets sub-500ms end-to-end latency through a fully streamed pipeline. Deepgram returns transcripts in real time, GPT-4o-mini starts inferring before you have finished speaking, and Cartesia begins audio delivery within 150ms of the first token. Total latency in good network conditions: 400-600ms.
Deepgram nova-3 returns streaming partial transcripts within 100ms of audio input, enabling the LLM to begin processing before speech ends.
GPT-4o-mini generates first tokens in 200-400ms. Token streaming means TTS can begin before the full response is generated.
Cartesia Sonic-2 delivers first audio within 150-200ms of receiving text. The brain perceives a conversation start, not a wait.
Physical distance to API servers adds 20-80ms for most users. Edge infrastructure and CDN-hosted TTS help minimise this component.
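The overlap between these stages is the core of a streaming architecture: each stage consumes the previous stage's output as it arrives rather than waiting for completion. A minimal sketch of that pipelining with `asyncio` — the stage names and string payloads are illustrative stand-ins, not the actual Deepgram, GPT-4o-mini, or Cartesia APIs:

```python
import asyncio

async def stt(audio_chunks):
    # Emit partial transcripts as audio arrives, not after speech ends.
    async for chunk in audio_chunks:
        yield f"partial:{chunk}"

async def llm(transcripts):
    # Begin generating as soon as the first partial transcript arrives.
    async for text in transcripts:
        yield f"token-for({text})"

async def tts(tokens):
    # Synthesise each chunk immediately; the first audio plays while
    # later tokens are still being generated upstream.
    async for tok in tokens:
        yield f"audio({tok})"

async def main():
    async def mic():  # stand-in for a live microphone stream
        for chunk in ["hel", "lo"]:
            yield chunk
    return [a async for a in tts(llm(stt(mic())))]

print(asyncio.run(main()))
```

Because the three stages are chained as async generators, audio for the first chunk can be emitted before the last chunk has even entered STT — the same overlap that keeps the real pipeline's stages off each other's critical path.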
What is a good latency for voice AI?
Under 500ms end-to-end is generally considered the threshold for 'real-time' conversation. Below 300ms is exceptional. Above 1000ms breaks conversational flow for most users.
Why does ChatGPT voice mode feel slow sometimes?
ChatGPT's voice mode uses server-side turn detection, which adds up to 500ms of silence before processing begins. It also uses a single-model pipeline rather than an optimised three-layer architecture.
Does model size affect latency?
Smaller models (GPT-4o-mini, Claude Haiku) generate tokens 2-3x faster than frontier models. For conversational use, the difference in output quality is minimal but the latency difference is significant.
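The back-of-envelope arithmetic behind that claim — the token rates here are assumed round numbers for illustration, not benchmarks of any specific model:

```python
# Time to generate a full reply at a given generation speed.
def reply_seconds(tokens: int, tokens_per_sec: float) -> float:
    return tokens / tokens_per_sec

# A typical short conversational reply (~60 tokens), at an assumed
# 120 tok/s for a small model vs 40 tok/s for a frontier model:
small = reply_seconds(60, 120.0)     # 0.5 s
frontier = reply_seconds(60, 40.0)   # 1.5 s
print(small, frontier)
```

At a 3x speed difference, the gap on a short reply is a full second — easily the difference between "real-time" and "waiting".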
How does audio streaming reduce perceived latency?
Without streaming, users wait for the entire response to be synthesised before any audio plays. With streaming, the first audio chunk plays within 200ms and subsequent chunks arrive while earlier ones are playing — making the total wait imperceptible.
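The effect on perceived latency can be expressed directly. A sketch with illustrative numbers (200ms to first chunk, 3s of total synthesis for a long reply — both assumptions):

```python
def perceived_wait_ms(first_chunk_ms: int, total_synthesis_ms: int,
                      streaming: bool) -> int:
    # With streaming, the user hears audio at the first chunk; later
    # chunks arrive while earlier ones play, hiding synthesis time.
    # Without it, nothing plays until synthesis fully completes.
    return first_chunk_ms if streaming else total_synthesis_ms

print(perceived_wait_ms(200, 3000, streaming=True))   # 200
print(perceived_wait_ms(200, 3000, streaming=False))  # 3000
```

Same total work, a 15x difference in how long the silence lasts before the user hears anything.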
Lucy OS1 puts these concepts to work in a real, streaming voice AI pipeline — Deepgram STT, GPT-4o-mini, and Cartesia TTS delivering natural voice conversation.
Start talking to Lucy →