Speech-to-text (STT) is the technology that converts spoken audio into written text. It is the first layer in any voice AI system — before a language model can respond, it needs to read your words as text.
STT systems work by analysing acoustic signals — the raw sound waves from your microphone — and matching phonemes (sound units) to word sequences. Modern deep-learning STT systems like Deepgram nova-3 and Whisper process audio in near real time, returning transcripts faster than a human could type them. Latency in STT is critical: a 500 ms transcription delay adds a full second of perceived lag to a conversation.
Lucy OS1 uses Deepgram nova-3 for speech recognition — one of the lowest-latency, highest-accuracy STT systems available. It processes audio at 16kHz with streaming results, so Lucy starts responding while you are still speaking rather than waiting for a silence gap.
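To make the 16 kHz streaming setup concrete, here is a minimal sketch of how audio is typically chunked into fixed-size frames before being fed to a streaming recognizer. The 20 ms frame length and the `frames` helper are illustrative assumptions, not Lucy OS1's actual implementation.

```python
# Sketch: chunking 16 kHz, 16-bit mono PCM into fixed frames for a
# streaming STT client. Frame length is an assumed (common) value.

SAMPLE_RATE = 16_000      # samples per second (16 kHz, as used by Lucy OS1)
FRAME_MS = 20             # a common streaming frame length (assumption)
SAMPLES_PER_FRAME = SAMPLE_RATE * FRAME_MS // 1000   # 320 samples
BYTES_PER_FRAME = SAMPLES_PER_FRAME * 2              # 16-bit PCM = 2 bytes/sample

def frames(pcm: bytes):
    """Yield fixed-size frames from a 16-bit mono PCM buffer."""
    for start in range(0, len(pcm) - BYTES_PER_FRAME + 1, BYTES_PER_FRAME):
        yield pcm[start:start + BYTES_PER_FRAME]

# One second of audio yields 50 frames of 20 ms each.
one_second = bytes(SAMPLE_RATE * 2)
print(sum(1 for _ in frames(one_second)))  # 50
```

Each frame is sent to the recognizer as soon as it is filled, which is what lets partial transcripts arrive while the speaker is still mid-sentence.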
Streaming vs. batch: Batch STT waits for you to finish speaking before processing. Streaming STT returns partial results in real time, making conversations feel natural rather than halting.
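The difference can be sketched with a simulated recognizer. `fake_partials` below stands in for a real streaming STT interface; all names here are illustrative assumptions, not a real SDK.

```python
# Sketch of batch vs. streaming consumption of STT results.
from typing import Iterator, Tuple, List

def fake_partials() -> Iterator[Tuple[str, bool]]:
    """Yield (transcript, is_final) pairs the way a streaming STT would."""
    yield ("turn on", False)
    yield ("turn on the", False)
    yield ("turn on the lights", True)   # final result

def batch_transcribe() -> str:
    # Batch mode: ignore partials, surface only the final transcript.
    return next(t for t, final in fake_partials() if final)

def streaming_transcribe() -> List[str]:
    # Streaming mode: surface every partial so the app can react early.
    return [t for t, _ in fake_partials()]

print(batch_transcribe())      # "turn on the lights"
print(streaming_transcribe())  # partials arrive before the final result
```

A voice assistant consuming the streaming variant can begin interpreting "turn on the" before the utterance is complete, which is where the perceived responsiveness comes from.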
Word error rate (WER): The standard accuracy metric for STT. A WER of 5% means 5 errors per 100 words. Leading systems achieve 2–5% WER on clean audio.
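WER is conventionally computed as word-level edit distance (substitutions + insertions + deletions) divided by the number of words in the reference. A self-contained sketch of that standard definition:

```python
# Word error rate via word-level Levenshtein distance.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution in a four-word reference -> 25% WER.
print(wer("turn on the lights", "turn off the lights"))  # 0.25
```

Note that a hypothesis can have more errors than the reference has words, so WER can exceed 100%.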
Voice activity detection (VAD): STT systems use VAD to detect when you have stopped speaking. A VAD that triggers too early cuts you off; one that triggers too late adds lag. Tuning VAD thresholds is key to natural conversation.
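The early/late trade-off can be illustrated with a minimal energy-threshold VAD: a frame counts as speech when its RMS energy exceeds a threshold, and the turn ends only after a "hangover" of consecutive quiet frames. The threshold and hangover values below are illustrative assumptions to tune, not production settings.

```python
# Minimal energy-threshold VAD sketch with a hangover period.
from typing import List, Optional

def rms(frame: List[int]) -> float:
    """Root-mean-square energy of one frame of PCM samples."""
    return (sum(s * s for s in frame) / len(frame)) ** 0.5

def end_of_speech(frames: List[List[int]], threshold: float = 100.0,
                  hangover: int = 3) -> Optional[int]:
    """Index of the frame where the utterance is judged finished,
    or None if speech never ends within the buffer."""
    quiet = 0
    speaking = False
    for i, frame in enumerate(frames):
        if rms(frame) >= threshold:
            speaking, quiet = True, 0   # still talking: reset the counter
        elif speaking:
            quiet += 1
            if quiet >= hangover:       # enough silence: end of turn
                return i
    return None

loud = [[2000] * 160]   # a "speech" frame
soft = [[10] * 160]     # a "silence" frame
print(end_of_speech(loud * 5 + soft * 5))  # 7
```

Lowering `hangover` ends turns sooner (snappier but cut-off-prone); raising it tolerates pauses at the cost of added lag, which is exactly the tuning trade-off described above.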
Language detection: Advanced STT systems automatically detect which language is being spoken and switch transcription models accordingly — enabling true multilingual conversation.
What is the most accurate speech-to-text model in 2026?
Deepgram nova-3 and OpenAI Whisper Large v3 are among the most accurate for English. Deepgram has the edge on real-time latency; Whisper has strong multilingual coverage.
How does background noise affect speech-to-text accuracy?
Noise significantly reduces accuracy. Most STT systems perform well in quiet environments but degrade in cars, open offices, or outdoors. Noise cancellation at the microphone level helps considerably.
Can speech-to-text transcribe multiple speakers?
Yes — this is called speaker diarization. Systems like Deepgram can label which speaker said what, useful for meeting transcription. Lucy OS1 does not use diarization since it is designed for single-user conversations.
What is real-time vs asynchronous speech-to-text?
Real-time STT transcribes as you speak, enabling live conversation. Asynchronous STT processes a recorded file after the fact — useful for meeting notes but not for live AI voice conversation.
Is speech-to-text the same as voice recognition?
They are often used interchangeably, but technically speech recognition converts audio to text while voice recognition identifies whose voice it is (speaker identification). Lucy OS1 uses speech recognition, not speaker identification.
Lucy OS1 puts these concepts to work in a real, streaming voice AI pipeline — Deepgram STT, GPT-4o-mini, and Cartesia TTS delivering natural voice conversation.
Start talking to Lucy →