Voice activity detection (VAD) is the process of automatically detecting when a person starts and stops speaking in an audio stream. In voice AI, VAD drives turn detection: deciding when the user has finished speaking, so the audio can be sent to the STT system and the AI's response triggered.
VAD is a deceptively important component of conversational AI. A VAD that triggers too early cuts the user off mid-sentence; one that triggers too late adds lag to every conversation turn. The threshold is typically set as a silence duration after speech ends (e.g., 800-1200ms of silence signals turn completion). Modern neural VAD systems also distinguish speech from background noise, music, and environmental sounds.
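The tradeoff can be made concrete with a little arithmetic. This is a sketch, not any production system's logic; the segment timings and the `turn_end_time` helper are hypothetical:

```python
def turn_end_time(speech_segments, silence_threshold_ms):
    """Given (start_ms, end_ms) speech segments, return the time at which
    an endpointing rule with the given silence threshold declares turn end."""
    for current, nxt in zip(speech_segments, speech_segments[1:]):
        gap = nxt[0] - current[1]
        if gap >= silence_threshold_ms:
            # Turn end fires mid-pause: the following segment gets cut off.
            return current[1] + silence_threshold_ms
    # No pause was long enough; turn ends after the final segment.
    return speech_segments[-1][1] + silence_threshold_ms

# A user speaks for 2s, pauses 500ms mid-thought, then speaks 1s more.
segments = [(0, 2000), (2500, 3500)]

# A 300ms threshold fires during the mid-thought pause: the user is cut off.
print(turn_end_time(segments, 300))   # 2300

# An 800ms threshold rides out the pause and fires after the real end,
# at the cost of 800ms of added latency on every single turn.
print(turn_end_time(segments, 800))   # 4300
```

Whatever threshold is chosen, every turn pays that much latency, which is why tuning it per use case matters.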
Lucy OS1 uses Deepgram's built-in VAD with an utterance-end threshold tuned for conversational use. The endpointing is set at 350ms — fast enough to feel responsive but long enough not to cut off natural mid-thought pauses.
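As a sketch, a Deepgram live-transcription configuration along these lines would carry that endpointing setting. The parameter names follow Deepgram's streaming API, but the exact option set shown here is an assumption for illustration, not Lucy OS1's actual configuration:

```python
# Hypothetical streaming STT options in the shape of Deepgram's
# live-transcription parameters. Illustrative only.
stt_options = {
    "model": "nova-2",         # assumed model choice
    "vad_events": True,        # emit speech-start events from Deepgram's VAD
    "endpointing": 350,        # ms of silence before the utterance is finalized
    "interim_results": True,   # stream partial transcripts while speech continues
}

assert stt_options["endpointing"] == 350
```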
The silence threshold is the duration of silence after speech that triggers VAD. Lower values mean faster response but a higher risk of cutting the user off. Typical values are 300-1200ms.
Simple VAD detects drops in audio energy (volume). Neural VAD uses ML models to distinguish speech from noise, music, and other sounds — far more accurate in real environments.
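A minimal energy-based detector is just a threshold on per-frame RMS energy. This is a sketch with an arbitrary threshold; a neural VAD (e.g., Silero) would replace the comparison with a model inference:

```python
import math

def is_speech(frame, threshold=0.02):
    """Classify one audio frame (float samples in [-1, 1]) as speech by
    RMS energy. This simple energy approach cannot tell speech apart from
    music or loud noise, which is exactly why neural VADs exist."""
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return rms > threshold

quiet = [0.001] * 160    # ~10ms of near-silence at 16kHz
loud = [0.1, -0.1] * 80  # a crude loud signal of the same length

print(is_speech(quiet))  # False
print(is_speech(loud))   # True
```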
The process of detecting the end of a speech utterance to pass it to the STT system. Endpointing is the application-level implementation of VAD.
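Endpointing can be sketched as a small state machine over frame-level VAD decisions. The class and frame size here are hypothetical; real systems typically run this on 10-30ms frames:

```python
class Endpointer:
    """Turns per-frame speech/silence decisions into an end-of-utterance
    signal once enough consecutive silent frames follow speech."""

    def __init__(self, silence_ms=800, frame_ms=20):
        self.silence_frames_needed = silence_ms // frame_ms
        self.in_speech = False
        self.silent_run = 0

    def push(self, frame_is_speech):
        """Feed one frame decision; return True when the utterance ends."""
        if frame_is_speech:
            self.in_speech = True
            self.silent_run = 0
            return False
        if not self.in_speech:
            return False  # leading silence before any speech at all
        self.silent_run += 1
        if self.silent_run >= self.silence_frames_needed:
            self.in_speech = False  # reset for the next utterance
            self.silent_run = 0
            return True
        return False

ep = Endpointer(silence_ms=800, frame_ms=20)   # 40 silent frames to fire
decisions = [True] * 50 + [False] * 45         # 1s of speech, 900ms of silence
ends = [i for i, d in enumerate(decisions) if ep.push(d)]
print(ends)  # [89]: fires 800ms (40 frames) into the silence
```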
Detecting when a user starts speaking while the AI is still talking — allowing the user to interrupt. Essential for natural conversation and one of the harder VAD problems.
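Barge-in handling can be sketched as wiring the VAD's speech-start event to a TTS cancel. The interfaces below are hypothetical, and a real pipeline also needs echo cancellation so the AI's own voice doesn't trigger the VAD:

```python
class BargeInController:
    """When the user starts speaking during TTS playback, stop the TTS so
    the new input can be processed. `tts` is any object with is_playing()
    and cancel(); both interfaces here are made up for illustration."""

    def __init__(self, tts):
        self.tts = tts
        self.interrupted = False

    def on_speech_started(self):
        # The VAD fired while the assistant is mid-response: barge-in.
        if self.tts.is_playing():
            self.tts.cancel()
            self.interrupted = True

class FakeTTS:
    """Stand-in playback object for the sketch."""
    def __init__(self):
        self.playing = True
    def is_playing(self):
        return self.playing
    def cancel(self):
        self.playing = False

tts = FakeTTS()
ctrl = BargeInController(tts)
ctrl.on_speech_started()      # user speaks while TTS is playing
print(ctrl.interrupted)       # True
print(tts.playing)            # False: playback was cancelled
```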
Why does my AI sometimes cut me off?
This is a VAD false positive: the system detected a pause long enough to trigger turn end while you were still mid-thought. The fix is raising the silence threshold, which trades off against response latency.
Can I interrupt the AI while it's speaking?
If the system supports barge-in detection, yes. Lucy OS1 allows you to speak while Lucy is responding, which cancels the current TTS and begins processing your new input.
How does VAD handle noisy environments?
Neural VAD models are trained to distinguish speech from noise. In very loud environments, accuracy degrades — the system may fire on loud background sounds or miss quiet speech.
Lucy OS1 puts these concepts to work in a real, streaming voice AI pipeline — Deepgram STT, GPT-4o-mini, and Cartesia TTS delivering natural voice conversation.
Start talking to Lucy →