Concept·AI Models & Capabilities·Added 1 day ago

Voice cloning

Also known as: AI voice cloning, voice replication, voice synthesis, digital voice twin

Training an AI model to reproduce a specific person's voice from audio samples. The result is a synthetic version of that voice that can speak any text. It is used for content localization, brand voice, accessibility, and entertainment, and it is also one of the most actively misused AI capabilities.

Voice cloning takes text-to-speech a step further: instead of a generic synthesized voice, you replicate a specific person's vocal identity. The process involves providing audio samples of the target voice (anywhere from a minute for quick clones to 30+ minutes for high-fidelity professional clones), which the model uses to extract vocal characteristics like timbre, pitch, accent, speaking rhythm, and emotional expression. The resulting model can speak arbitrary new text in that voice.

ElevenLabs offers two tiers: Instant Voice Cloning (quick, from about one minute of audio) and Professional Voice Cloning (higher fidelity, requires 30+ minutes of high-quality audio). Both require explicit consent verification. OpenAI, Resemble AI, and others offer similar capabilities through APIs. The technology is now good enough that short-form clones can fool listeners in many contexts, and Professional clones are used in commercial production, for example to dub films into multiple languages while preserving the original actor's voice.
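As a rough illustration of how the two tiers map to the amount of audio available, here is a minimal Python sketch. The function name, the tier labels, and the exact cut-offs are assumptions made for this example: the thresholds simply restate the "about one minute" and "30+ minutes" figures above and are not taken from any provider's API.

```python
# Hypothetical thresholds, restating the approximate figures from the text.
INSTANT_MIN_SECONDS = 60            # "about one minute" (approximate)
PROFESSIONAL_MIN_SECONDS = 30 * 60  # "30+ minutes"


def suggest_clone_tier(total_audio_seconds: float) -> str:
    """Return a hypothetical tier label for the given amount of clean audio."""
    if total_audio_seconds < 0:
        raise ValueError("audio duration cannot be negative")
    if total_audio_seconds >= PROFESSIONAL_MIN_SECONDS:
        return "professional"
    if total_audio_seconds >= INSTANT_MIN_SECONDS:
        return "instant"
    return "insufficient"


if __name__ == "__main__":
    for seconds in (45, 90, 45 * 60):
        print(seconds, suggest_clone_tier(seconds))
```

Duration is only one input. Audio quality matters as well, so a real pipeline would likely also check noise, clipping, and whether only one speaker is present before accepting the samples.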

Voice cloning is also the most obvious misuse vector among AI audio capabilities. Deepfake audio impersonating public figures, executives, or family members in fraud schemes is a documented problem. Responsible providers require consent verification and deploy audio classifiers to detect cloned voices. Builders integrating voice cloning into products should design consent flows carefully and be aware that regulations around synthetic media are evolving in many jurisdictions.
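One way to think about a consent flow is as a gate that every cloning request must pass before any audio reaches the model. The Python sketch below is illustrative only: ConsentRecord, may_clone, and their fields are hypothetical names invented for this example, and how consent is actually verified (for instance a recorded spoken statement) is left out of scope.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ConsentRecord:
    """Hypothetical record of a speaker's consent to have their voice cloned."""
    speaker_id: str
    verified: bool                         # set by a separate verification step
    expires_at: Optional[datetime] = None  # timezone-aware; None means no expiry


def may_clone(speaker_id: str,
              consent: Optional[ConsentRecord],
              now: Optional[datetime] = None) -> bool:
    """Allow cloning only with verified, unexpired consent from this speaker."""
    if consent is None or not consent.verified:
        return False
    if consent.speaker_id != speaker_id:
        return False  # consent was given by someone else
    current = now or datetime.now(timezone.utc)
    return consent.expires_at is None or current < consent.expires_at
```

The gate denies by default: a missing record, an unverified record, a mismatched speaker, or an expired record all block the request. Storing an expiry lets a product re-confirm consent periodically, which may help as rules on synthetic media change.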

This definition is AI-generated and refreshed weekly. It may contain inaccuracies. Use your own judgment, especially for production decisions.
Related terms
Text-to-speech · ElevenLabs · AI avatar · Music generation