Concept·AI Models & Capabilities·Added 1 month ago

Text-to-speech

Also known as: TTS, speech synthesis, AI voice generation, AI narration

Converting written text into spoken audio using AI. Modern TTS produces speech that sounds natural and emotionally aware, not robotic. Used for voiceovers, audiobooks, podcast narration, accessibility tools, voice agents, and any product that needs to talk.

Text-to-speech has existed for decades, but the AI-era version is qualitatively different from the robotic-sounding output of earlier systems. Modern models such as ElevenLabs' Eleven v3 or Multilingual v2 produce speech with nuanced intonation, appropriate pacing, and emotional inflection cued by the text, and they handle complex pronunciation, accents, and multilingual content. Neural approaches began displacing the older concatenative and parametric synthesis methods in the late 2010s (WaveNet in 2016, Tacotron in 2017); the jump to today's expressive quality in commercial products came around 2022 to 2023.

The key parameters builders work with are model choice (a quality vs. latency tradeoff), voice selection (from prebuilt libraries of thousands of voices or custom-cloned voices), speed and stability settings, and language. For real-time conversational applications, latency matters enormously: a model that takes 75ms to produce the first audio chunk (time to first audio, the speech equivalent of time to first token) feels live, while one that takes 500ms feels laggy. ElevenLabs Flash v2.5, PlayHT, and OpenAI's TTS models are among the main options at the low-latency end.
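One way to compare models on the latency figure described above is to time how long a streaming response takes to yield its first audio chunk. The sketch below is illustrative only: `fake_tts_stream` is a hypothetical stand-in for a provider's streaming response, not any vendor's API, and the helper works on any iterable of byte chunks.

```python
import time
from typing import Iterable, Iterator, Tuple


def time_to_first_chunk(stream: Iterable[bytes]) -> Tuple[float, bytes]:
    """Return (seconds until the first non-empty chunk arrived, that chunk)."""
    start = time.perf_counter()
    for chunk in stream:
        if chunk:  # skip empty keep-alive chunks, if the stream sends any
            return time.perf_counter() - start, chunk
    raise ValueError("stream ended without producing audio")


def fake_tts_stream(first_chunk_delay_s: float, n_chunks: int = 3) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS response (assumption, not a real API)."""
    time.sleep(first_chunk_delay_s)  # simulates model and network delay
    for _ in range(n_chunks):
        yield b"\x00" * 320  # placeholder bytes standing in for audio data


if __name__ == "__main__":
    for label, delay in [("fast model", 0.075), ("slow model", 0.5)]:
        latency_s, _ = time_to_first_chunk(fake_tts_stream(delay))
        print(f"{label}: first audio after {latency_s * 1000:.0f} ms")
```

In a real measurement the clock should start when the request is sent, so that network round-trip time is included alongside model latency.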

TTS is now embedded in a huge range of products: customer service voice bots, e-learning narration, accessibility tools that read content aloud, podcast generation tools, video voiceover workflows, and game character dialogue. Pricing is typically per character generated, which makes costs easy to estimate at scale. The main creative constraint is that TTS is still a text-in, audio-out pipeline: if you want full musical production or a highly emotional performance, you're in voice cloning or AI music generation territory.
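Because pricing is usually per character, a rough budget is simple arithmetic. The sketch below assumes a flat rate per 1,000 characters; the rate used in the example is a made-up placeholder rather than any provider's actual price, and real billing may count characters differently (for example, whitespace or markup).

```python
from typing import Iterable, Tuple


def estimate_tts_cost(texts: Iterable[str], price_per_1k_chars: float) -> Tuple[int, float]:
    """Return (total characters, estimated cost) under a flat per-character rate."""
    total_chars = sum(len(text) for text in texts)
    return total_chars, total_chars / 1000 * price_per_1k_chars


if __name__ == "__main__":
    # Hypothetical workload: 500 narration scripts of 1,200 characters each.
    scripts = ["x" * 1200] * 500
    hypothetical_rate = 0.30  # placeholder price per 1,000 characters, not a real quote
    chars, cost = estimate_tts_cost(scripts, hypothetical_rate)
    print(f"{chars} characters, estimated cost {cost:.2f}")
```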

This definition is AI-generated and refreshed weekly. It may contain inaccuracies. Use your own judgment, especially for production decisions.
