AI latency is the time between the end of a user's input and the start of the AI's response. In voice AI, this is measured from when you finish speaking to when the first audio byte plays back. Latency is the most important usability metric in conversational AI — it determines whether talking to AI feels like a real conversation or like waiting on hold.
End-to-end AI latency has four components: (1) STT latency — the time to transcribe your speech; (2) network latency — round-trip time to API servers; (3) LLM inference latency — time for the model to generate a response; (4) TTS latency — time for audio synthesis to begin streaming. Modern systems optimise all four in parallel using streaming architectures.
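The four components can be sketched as a simple latency budget. The millisecond values below are illustrative assumptions drawn from the ranges discussed in this article, not measurements:

```python
# Hypothetical latency budget for a streamed voice pipeline.
# Each value is the time-to-first-output of that stage, since streaming
# puts only first outputs (not full completions) on the critical path.
BUDGET_MS = {
    "stt": 100,      # first partial transcript from streaming STT
    "network": 50,   # round trip to API servers
    "llm": 300,      # first token from the LLM
    "tts": 150,      # first audio chunk from TTS
}

def end_to_end_ms(budget: dict[str, int]) -> int:
    """Sum the per-stage contributions on the critical path."""
    return sum(budget.values())

print(end_to_end_ms(BUDGET_MS))  # 600
```

With these assumed numbers the budget lands at 600ms — the upper end of the 400-600ms range cited below, and a useful way to see which stage dominates.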
Lucy OS1 targets sub-500ms end-to-end latency through a fully streamed pipeline. Deepgram returns transcripts in real time, GPT-4o-mini starts inferring before you have finished speaking, and Cartesia begins audio delivery within 150ms of the first token. Total latency in good network conditions: 400-600ms.
Deepgram nova-3 returns streaming partial transcripts within 100ms of audio input, enabling the LLM to begin processing before speech ends.
GPT-4o-mini generates first tokens in 200-400ms. Token streaming means TTS can begin before the full response is generated.
Cartesia Sonic-2 delivers first audio within 150-200ms of receiving text. The brain perceives a conversation start, not a wait.
Physical distance to API servers adds 20-80ms for most users. Edge infrastructure and CDN-hosted TTS help minimise this component.
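The overlap between these stages is the core of a streaming architecture: each stage consumes the previous stage's output as it arrives rather than waiting for completion. A minimal sketch of that pipelining with `asyncio` — the stage names and string payloads are illustrative stand-ins, not the actual Deepgram, GPT-4o-mini, or Cartesia APIs:

```python
import asyncio

async def stt(audio_chunks):
    # Emit partial transcripts as audio arrives, not after speech ends.
    async for chunk in audio_chunks:
        yield f"partial:{chunk}"

async def llm(transcripts):
    # Begin generating as soon as the first partial transcript arrives.
    async for text in transcripts:
        yield f"token-for({text})"

async def tts(tokens):
    # Synthesise each chunk immediately; the first audio plays while
    # later tokens are still being generated upstream.
    async for tok in tokens:
        yield f"audio({tok})"

async def main():
    async def mic():  # stand-in for a live microphone stream
        for chunk in ["hel", "lo"]:
            yield chunk
    return [a async for a in tts(llm(stt(mic())))]

print(asyncio.run(main()))
```

Because the three stages are chained as async generators, audio for the first chunk can be emitted before the last chunk has even entered STT — the same overlap that keeps the real pipeline's stages off each other's critical path.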
What is a good latency for voice AI?
Under 500ms end-to-end is generally considered the threshold for 'real-time' conversation. Below 300ms is exceptional. Above 1000ms breaks conversational flow for most users.
Why does ChatGPT voice mode feel slow sometimes?
ChatGPT's voice mode uses server-side turn detection, which adds up to 500ms of silence before processing begins. It also uses a single-model pipeline rather than an optimised three-layer architecture.
Does model size affect latency?
Smaller models (GPT-4o-mini, Claude Haiku) generate tokens 2-3x faster than frontier models. For conversational use, the difference in output quality is minimal but the latency difference is significant.
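The back-of-envelope arithmetic behind that claim — the token rates here are assumed round numbers for illustration, not benchmarks of any specific model:

```python
# Time to generate a full reply at a given generation speed.
def reply_seconds(tokens: int, tokens_per_sec: float) -> float:
    return tokens / tokens_per_sec

# A typical short conversational reply (~60 tokens), at an assumed
# 120 tok/s for a small model vs 40 tok/s for a frontier model:
small = reply_seconds(60, 120.0)     # 0.5 s
frontier = reply_seconds(60, 40.0)   # 1.5 s
print(small, frontier)
```

At a 3x speed difference, the gap on a short reply is a full second — easily the difference between "real-time" and "waiting".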
How does audio streaming reduce perceived latency?
Without streaming, users wait for the entire response to be synthesised before any audio plays. With streaming, the first audio chunk plays within 200ms and subsequent chunks arrive while earlier ones are playing — making the total wait imperceptible.
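The effect on perceived latency can be expressed directly. A sketch with illustrative numbers (200ms to first chunk, 3s of total synthesis for a long reply — both assumptions):

```python
def perceived_wait_ms(first_chunk_ms: int, total_synthesis_ms: int,
                      streaming: bool) -> int:
    # With streaming, the user hears audio at the first chunk; later
    # chunks arrive while earlier ones play, hiding synthesis time.
    # Without it, nothing plays until synthesis fully completes.
    return first_chunk_ms if streaming else total_synthesis_ms

print(perceived_wait_ms(200, 3000, streaming=True))   # 200
print(perceived_wait_ms(200, 3000, streaming=False))  # 3000
```

Same total work, a 15x difference in how long the silence lasts before the user hears anything.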
Lucy OS1 puts these concepts to work in a real, streaming voice AI pipeline — Deepgram STT, GPT-4o-mini, and Cartesia TTS delivering natural voice conversation.
Start talking to Lucy →