Text-to-speech (TTS) is the technology that converts written text into spoken audio. In voice AI, TTS is the final layer — the AI generates a text response, and TTS turns it into a voice you can hear.
Early TTS systems concatenated pre-recorded phonemes and sounded robotic. Modern neural TTS systems are trained on thousands of hours of human speech and, in many cases, produce output indistinguishable from a real person. Quality is judged along several dimensions: naturalness (does it sound human?), prosody (does it stress words correctly?), and latency (how long before audio starts playing?).
Lucy OS1 uses Cartesia's Sonic-2 neural TTS model with the 'Cathy' voice — warm, natural, and conversational. The voice streams in chunks so the first audio plays within 200ms of the LLM finishing its response, creating a seamless back-and-forth conversation.
Neural TTS models generate audio from scratch using deep learning; they sound natural and handle novel sentences well. Concatenative TTS stitches together pre-recorded audio clips; it is faster but sounds robotic.
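The concatenative approach can be sketched in a few lines. The `INVENTORY` of per-word clips below is a toy stand-in (sine bursts in place of real recordings), not how any production system stores audio; it just shows the stitch-and-play idea, and why unseen words are a problem.

```python
import numpy as np

# Toy concatenative synthesis: each word maps to a pre-recorded clip
# (here, short sine bursts standing in for real recordings).
SAMPLE_RATE = 16_000

def _clip(freq_hz: float, dur_s: float = 0.2) -> np.ndarray:
    t = np.linspace(0, dur_s, int(SAMPLE_RATE * dur_s), endpoint=False)
    return np.sin(2 * np.pi * freq_hz * t).astype(np.float32)

# Hypothetical "recorded" inventory keyed by word.
INVENTORY = {
    "hello": _clip(220.0),
    "world": _clip(330.0),
}

def concatenative_tts(text: str) -> np.ndarray:
    """Stitch stored clips together; unseen words fall back to silence."""
    silence = np.zeros(int(SAMPLE_RATE * 0.05), dtype=np.float32)
    pieces = []
    for word in text.lower().split():
        pieces.append(INVENTORY.get(word, silence))
        pieces.append(silence)  # small gap between words
    return np.concatenate(pieces) if pieces else silence

audio = concatenative_tts("hello world")
print(len(audio) / SAMPLE_RATE)  # total duration in seconds: 0.5
```

The fallback-to-silence branch is exactly where concatenative systems break down on novel words, which is the gap neural models close.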
Prosody is the rhythm, stress, and intonation of speech. Good TTS knows to emphasise 'not' in 'that is not correct' and to rise at the end of questions.
Streaming TTS begins playing audio before the entire response is synthesised. This reduces perceived latency from 2-3 seconds to under 500ms in real-time systems.
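The effect can be illustrated with a simulated streaming synthesiser. The `synthesize_chunks` generator below is a hypothetical stand-in, not Cartesia's or any provider's actual API: the point is only that the listener gets the first chunk long before the full utterance is ready.

```python
import time
from typing import Iterator

def synthesize_chunks(text: str, chunk_ms: int = 100) -> Iterator[bytes]:
    """Stand-in for a streaming TTS API: yields audio chunks as they
    are synthesised instead of waiting for the whole utterance."""
    n_chunks = max(1, len(text) // 10)
    for _ in range(n_chunks):
        time.sleep(0.02)  # simulated per-chunk synthesis cost
        yield b"\x00" * (16_000 * 2 * chunk_ms // 1000)  # 16 kHz, 16-bit PCM

start = time.monotonic()
first_audio_latency = None
total_bytes = 0
for chunk in synthesize_chunks("Streaming begins playback before synthesis finishes."):
    if first_audio_latency is None:
        first_audio_latency = time.monotonic() - start  # time to first chunk
    total_bytes += len(chunk)  # in a real client, this chunk would be played
total = time.monotonic() - start
print(f"first audio after {first_audio_latency*1000:.0f} ms, "
      f"full synthesis after {total*1000:.0f} ms")
```

A real client would hand each chunk to an audio device as it arrives; perceived latency is the time to the first chunk, not the time to the last.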
Some TTS systems can clone a speaker's voice from a short sample. Cartesia supports voice cloning for custom deployments, though Lucy OS1 uses a fixed, professionally designed voice.
What is the most natural-sounding text-to-speech in 2026?
ElevenLabs and Cartesia are widely considered the most natural-sounding neural TTS providers. Cartesia's Sonic model has particularly low latency, making it ideal for real-time voice AI.
How does TTS latency affect AI conversations?
TTS latency is the delay between the AI finishing text generation and the first audio byte playing. In conversation, humans notice gaps above 300ms. Cartesia Sonic streaming achieves first-audio latency under 200ms.
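To see why streaming matters for the 300ms threshold, consider a naive turn in which each pipeline stage runs strictly in sequence. The per-stage numbers below are illustrative assumptions, not measured figures for any product:

```python
# Hypothetical per-stage latency budget (illustrative numbers, not
# measured values) for one turn of a voice pipeline run sequentially.
budget_ms = {
    "stt_final_transcript": 150,
    "llm_first_token": 250,
    "tts_first_audio": 200,
}
gap_ms = sum(budget_ms.values())
print(gap_ms)          # 600
print(gap_ms <= 300)   # False: run sequentially, the gap is noticeable
```

Because the sequential sum blows the budget, real-time systems overlap the stages: TTS starts synthesising the first sentence while the LLM is still generating the rest.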
Can TTS models speak in multiple languages?
Yes. Modern TTS models like Cartesia Sonic-2 support multiple languages including English, French, Spanish, German, Italian, Portuguese, Japanese, Chinese, and more.
Why does AI voice still sometimes sound unnatural?
The most common artifacts are: incorrect stress on homographs (words spelled the same but pronounced differently), flat intonation on long sentences, and mishandled punctuation. These are active areas of research.
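Homograph handling is easy to demonstrate: "read" is /riːd/ or /rɛd/ depending on context. The keyword rule below is a deliberately crude sketch; real TTS front-ends use part-of-speech taggers and context models rather than a cue list like this.

```python
# Toy homograph disambiguation: "read" is pronounced /riːd/ (present)
# or /rɛd/ (past). The cue-word rule is illustrative only.
PAST_CUES = {"yesterday", "already", "last", "ago"}

def pronounce_read(sentence: str) -> str:
    words = set(sentence.lower().rstrip(".").split())
    if PAST_CUES & words:
        return "/rɛd/"   # past tense
    return "/riːd/"      # present tense

print(pronounce_read("I read the report yesterday"))   # /rɛd/
print(pronounce_read("I read a chapter every night"))  # /riːd/
```

Sentences without an explicit cue ("I read the report") are genuinely ambiguous even for this richer machinery, which is why homographs remain a common artifact.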
Is text-to-speech the same as voice synthesis?
Mostly. The terms are often used interchangeably, but voice synthesis is the broader term; TTS refers specifically to converting text input into audio output.
Lucy OS1 puts these concepts to work in a real, streaming voice AI pipeline — Deepgram STT, GPT-4o-mini, and Cartesia TTS delivering natural voice conversation.
Start talking to Lucy →