Speech-to-text (STT) is the technology that converts spoken audio into written text. It is the first layer in any voice AI system — before a language model can respond, it needs to read your words as text.
STT systems work by analysing acoustic signals — the raw sound waves from your microphone — and matching phonemes (sound units) to word sequences. Modern deep-learning STT systems like Deepgram nova-3 and Whisper process audio in near real time, returning transcripts faster than a human could type them. Latency in STT is critical: a 500 ms transcription delay adds a full second of perceived lag to a conversation.
Lucy OS1 uses Deepgram nova-3 for speech recognition — one of the lowest-latency, highest-accuracy STT systems available. It processes audio at 16kHz with streaming results, so Lucy starts responding while you are still speaking rather than waiting for a silence gap.
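To make the 16 kHz streaming setup concrete, here is a minimal sketch of how audio is typically chunked into fixed-size frames before being fed to a streaming recognizer. The 20 ms frame length and the `frames` helper are illustrative assumptions, not Lucy OS1's actual implementation.

```python
# Sketch: chunking 16 kHz, 16-bit mono PCM into fixed frames for a
# streaming STT client. Frame length is an assumed (common) value.

SAMPLE_RATE = 16_000      # samples per second (16 kHz, as used by Lucy OS1)
FRAME_MS = 20             # a common streaming frame length (assumption)
SAMPLES_PER_FRAME = SAMPLE_RATE * FRAME_MS // 1000   # 320 samples
BYTES_PER_FRAME = SAMPLES_PER_FRAME * 2              # 16-bit PCM = 2 bytes/sample

def frames(pcm: bytes):
    """Yield fixed-size frames from a 16-bit mono PCM buffer."""
    for start in range(0, len(pcm) - BYTES_PER_FRAME + 1, BYTES_PER_FRAME):
        yield pcm[start:start + BYTES_PER_FRAME]

# One second of audio yields 50 frames of 20 ms each.
one_second = bytes(SAMPLE_RATE * 2)
print(sum(1 for _ in frames(one_second)))  # 50
```

Each frame is sent to the recognizer as soon as it is filled, which is what lets partial transcripts arrive while the speaker is still mid-sentence.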
Streaming vs. batch: Batch STT waits for you to finish speaking before processing. Streaming STT returns partial results in real time, making conversations feel natural rather than halting.
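The difference can be sketched with a simulated recognizer. `fake_partials` below stands in for a real streaming STT interface; all names here are illustrative assumptions, not a real SDK.

```python
# Sketch of batch vs. streaming consumption of STT results.
from typing import Iterator, Tuple, List

def fake_partials() -> Iterator[Tuple[str, bool]]:
    """Yield (transcript, is_final) pairs the way a streaming STT would."""
    yield ("turn on", False)
    yield ("turn on the", False)
    yield ("turn on the lights", True)   # final result

def batch_transcribe() -> str:
    # Batch mode: ignore partials, surface only the final transcript.
    return next(t for t, final in fake_partials() if final)

def streaming_transcribe() -> List[str]:
    # Streaming mode: surface every partial so the app can react early.
    return [t for t, _ in fake_partials()]

print(batch_transcribe())      # "turn on the lights"
print(streaming_transcribe())  # partials arrive before the final result
```

A voice assistant consuming the streaming variant can begin interpreting "turn on the" before the utterance is complete, which is where the perceived responsiveness comes from.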
Word error rate (WER): The standard accuracy metric for STT. A WER of 5% means 5 errors per 100 words. Leading systems achieve 2–5% WER on clean audio.
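WER is conventionally computed as word-level edit distance (substitutions + insertions + deletions) divided by the number of words in the reference. A self-contained sketch of that standard definition:

```python
# Word error rate via word-level Levenshtein distance.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution in a four-word reference -> 25% WER.
print(wer("turn on the lights", "turn off the lights"))  # 0.25
```

Note that a hypothesis can have more errors than the reference has words, so WER can exceed 100%.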
Voice activity detection (VAD): STT systems use VAD to detect when you have stopped speaking. A VAD that triggers too early cuts you off; one that triggers too late adds lag. Tuning VAD thresholds is key to natural conversation.
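The early/late trade-off can be illustrated with a minimal energy-threshold VAD: a frame counts as speech when its RMS energy exceeds a threshold, and the turn ends only after a "hangover" of consecutive quiet frames. The threshold and hangover values below are illustrative assumptions to tune, not production settings.

```python
# Minimal energy-threshold VAD sketch with a hangover period.
from typing import List, Optional

def rms(frame: List[int]) -> float:
    """Root-mean-square energy of one frame of PCM samples."""
    return (sum(s * s for s in frame) / len(frame)) ** 0.5

def end_of_speech(frames: List[List[int]], threshold: float = 100.0,
                  hangover: int = 3) -> Optional[int]:
    """Index of the frame where the utterance is judged finished,
    or None if speech never ends within the buffer."""
    quiet = 0
    speaking = False
    for i, frame in enumerate(frames):
        if rms(frame) >= threshold:
            speaking, quiet = True, 0   # still talking: reset the counter
        elif speaking:
            quiet += 1
            if quiet >= hangover:       # enough silence: end of turn
                return i
    return None

loud = [[2000] * 160]   # a "speech" frame
soft = [[10] * 160]     # a "silence" frame
print(end_of_speech(loud * 5 + soft * 5))  # 7
```

Lowering `hangover` ends turns sooner (snappier but cut-off-prone); raising it tolerates pauses at the cost of added lag, which is exactly the tuning trade-off described above.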
Language detection: Advanced STT systems automatically detect which language is being spoken and switch transcription models accordingly — enabling true multilingual conversation.
What is the most accurate speech-to-text model in 2026?
Deepgram nova-3 and OpenAI Whisper Large v3 are among the most accurate for English. Deepgram has the edge on real-time latency; Whisper has strong multilingual coverage.
How does background noise affect speech-to-text accuracy?
Noise significantly reduces accuracy. Most STT systems perform well in quiet environments but degrade in cars, open offices, or outdoors. Noise cancellation at the microphone level helps considerably.
Can speech-to-text transcribe multiple speakers?
Yes — this is called speaker diarization. Systems like Deepgram can label which speaker said what, useful for meeting transcription. Lucy OS1 does not use diarization since it is designed for single-user conversations.
What is real-time vs asynchronous speech-to-text?
Real-time STT transcribes as you speak, enabling live conversation. Asynchronous STT processes a recorded file after the fact — useful for meeting notes but not for live AI voice conversation.
Is speech-to-text the same as voice recognition?
They are often used interchangeably, but technically speech recognition converts audio to text while voice recognition identifies whose voice it is (speaker identification). Lucy OS1 uses speech recognition, not speaker identification.
Lucy OS1 puts these concepts to work in a real, streaming voice AI pipeline — Deepgram STT, GPT-4o-mini, and Cartesia TTS delivering natural voice conversation.
Start talking to Lucy →