Voice activity detection (VAD) is the process of automatically detecting when a person starts and stops speaking in an audio stream. In voice AI, VAD drives turn detection: deciding when the user has finished speaking, so the audio can be sent to the STT system and the AI's response triggered.
VAD is a deceptively important component of conversational AI. A VAD that triggers too early cuts the user off mid-sentence; one that triggers too late adds lag to every conversation turn. The threshold is typically set as a silence duration after speech ends (e.g., 800-1200ms of silence signals turn completion). Modern neural VAD systems also distinguish speech from background noise, music, and environmental sounds.
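The tradeoff can be made concrete with a little arithmetic. This is a sketch, not any production system's logic; the segment timings and the `turn_end_time` helper are hypothetical:

```python
def turn_end_time(speech_segments, silence_threshold_ms):
    """Given (start_ms, end_ms) speech segments, return the time at which
    an endpointing rule with the given silence threshold declares turn end."""
    for current, nxt in zip(speech_segments, speech_segments[1:]):
        gap = nxt[0] - current[1]
        if gap >= silence_threshold_ms:
            # Turn end fires mid-pause: the following segment gets cut off.
            return current[1] + silence_threshold_ms
    # No pause was long enough; turn ends after the final segment.
    return speech_segments[-1][1] + silence_threshold_ms

# A user speaks for 2s, pauses 500ms mid-thought, then speaks 1s more.
segments = [(0, 2000), (2500, 3500)]

# A 300ms threshold fires during the mid-thought pause: the user is cut off.
print(turn_end_time(segments, 300))   # 2300

# An 800ms threshold rides out the pause and fires after the real end,
# at the cost of 800ms of added latency on every single turn.
print(turn_end_time(segments, 800))   # 4300
```

Whatever threshold is chosen, every turn pays that much latency, which is why tuning it per use case matters.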
Lucy OS1 uses Deepgram's built-in VAD with an utterance-end threshold tuned for conversational use. The endpointing is set at 350ms — fast enough to feel responsive but long enough not to cut off natural mid-thought pauses.
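As a sketch, a Deepgram live-transcription configuration along these lines would carry that endpointing setting. The parameter names follow Deepgram's streaming API, but the exact option set shown here is an assumption for illustration, not Lucy OS1's actual configuration:

```python
# Hypothetical streaming STT options in the shape of Deepgram's
# live-transcription parameters. Illustrative only.
stt_options = {
    "model": "nova-2",         # assumed model choice
    "vad_events": True,        # emit speech-start events from Deepgram's VAD
    "endpointing": 350,        # ms of silence before the utterance is finalized
    "interim_results": True,   # stream partial transcripts while speech continues
}

assert stt_options["endpointing"] == 350
```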
The silence threshold is the duration of silence after speech that triggers VAD. Lower values mean faster response but a higher risk of cutting the user off. Typical values are 300-1200ms.
Simple VAD detects drops in audio energy (volume). Neural VAD uses ML models to distinguish speech from noise, music, and other sounds — far more accurate in real environments.
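A minimal energy-based detector is just a threshold on per-frame RMS energy. This is a sketch with an arbitrary threshold; a neural VAD (e.g., Silero) would replace the comparison with a model inference:

```python
import math

def is_speech(frame, threshold=0.02):
    """Classify one audio frame (float samples in [-1, 1]) as speech by
    RMS energy. This simple energy approach cannot tell speech apart from
    music or loud noise, which is exactly why neural VADs exist."""
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return rms > threshold

quiet = [0.001] * 160    # ~10ms of near-silence at 16kHz
loud = [0.1, -0.1] * 80  # a crude loud signal of the same length

print(is_speech(quiet))  # False
print(is_speech(loud))   # True
```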
The process of detecting the end of a speech utterance to pass it to the STT system. Endpointing is the application-level implementation of VAD.
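Endpointing can be sketched as a small state machine over frame-level VAD decisions. The class and frame size here are hypothetical; real systems typically run this on 10-30ms frames:

```python
class Endpointer:
    """Turns per-frame speech/silence decisions into an end-of-utterance
    signal once enough consecutive silent frames follow speech."""

    def __init__(self, silence_ms=800, frame_ms=20):
        self.silence_frames_needed = silence_ms // frame_ms
        self.in_speech = False
        self.silent_run = 0

    def push(self, frame_is_speech):
        """Feed one frame decision; return True when the utterance ends."""
        if frame_is_speech:
            self.in_speech = True
            self.silent_run = 0
            return False
        if not self.in_speech:
            return False  # leading silence before any speech at all
        self.silent_run += 1
        if self.silent_run >= self.silence_frames_needed:
            self.in_speech = False  # reset for the next utterance
            self.silent_run = 0
            return True
        return False

ep = Endpointer(silence_ms=800, frame_ms=20)   # 40 silent frames to fire
decisions = [True] * 50 + [False] * 45         # 1s of speech, 900ms of silence
ends = [i for i, d in enumerate(decisions) if ep.push(d)]
print(ends)  # [89]: fires 800ms (40 frames) into the silence
```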
Detecting when a user starts speaking while the AI is still talking — allowing the user to interrupt. Essential for natural conversation and one of the harder VAD problems.
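Barge-in handling can be sketched as wiring the VAD's speech-start event to a TTS cancel. The interfaces below are hypothetical, and a real pipeline also needs echo cancellation so the AI's own voice doesn't trigger the VAD:

```python
class BargeInController:
    """When the user starts speaking during TTS playback, stop the TTS so
    the new input can be processed. `tts` is any object with is_playing()
    and cancel(); both interfaces here are made up for illustration."""

    def __init__(self, tts):
        self.tts = tts
        self.interrupted = False

    def on_speech_started(self):
        # The VAD fired while the assistant is mid-response: barge-in.
        if self.tts.is_playing():
            self.tts.cancel()
            self.interrupted = True

class FakeTTS:
    """Stand-in playback object for the sketch."""
    def __init__(self):
        self.playing = True
    def is_playing(self):
        return self.playing
    def cancel(self):
        self.playing = False

tts = FakeTTS()
ctrl = BargeInController(tts)
ctrl.on_speech_started()      # user speaks while TTS is playing
print(ctrl.interrupted)       # True
print(tts.playing)            # False: playback was cancelled
```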
Why does my AI sometimes cut me off?
This is a VAD false positive: the system detected a pause long enough to trigger turn end while you were still mid-thought. The fix is raising the silence threshold, which trades off against response latency.
Can I interrupt the AI while it's speaking?
If the system supports barge-in detection, yes. Lucy OS1 allows you to speak while Lucy is responding, which cancels the current TTS and begins processing your new input.
How does VAD handle noisy environments?
Neural VAD models are trained to distinguish speech from noise. In very loud environments, accuracy degrades — the system may fire on loud background sounds or miss quiet speech.
Lucy OS1 puts these concepts to work in a real, streaming voice AI pipeline — Deepgram STT, GPT-4o-mini, and Cartesia TTS delivering natural voice conversation.
Start talking to Lucy →