Voice cloning
Also known as: AI voice cloning, voice replication, voice synthesis, digital voice twin
Voice cloning takes text-to-speech a step further: instead of a generic synthesized voice, you replicate a specific person's vocal identity. The process involves providing audio samples of the target voice (anywhere from a minute for quick clones to 30+ minutes for high-fidelity professional clones), which the model uses to extract vocal characteristics like timbre, pitch, accent, speaking rhythm, and emotional expression. The resulting model can speak arbitrary new text in that voice.
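To make "extracting vocal characteristics" a little more concrete, here is a minimal sketch that estimates one of the listed characteristics, pitch, from raw samples using a simple autocorrelation peak. This is an illustration only: production cloning models typically learn much richer representations of a voice than a single number, and nothing here reflects any provider's actual pipeline. The function name `estimate_pitch_hz`, the 80 to 400 Hz search range, and the synthetic test tone are all assumptions made for the example.

```python
import math


def estimate_pitch_hz(samples, sample_rate, fmin=80.0, fmax=400.0):
    """Estimate the fundamental frequency from the strongest autocorrelation peak.

    Hypothetical helper for illustration; the fmin/fmax search range is an
    assumed rough span for speech, not a value taken from any cloning system.
    """
    min_lag = int(sample_rate / fmax)  # shortest period considered
    max_lag = int(sample_rate / fmin)  # longest period considered
    n = len(samples)
    if n <= max_lag:
        raise ValueError("need more samples than the longest lag")

    def autocorr(lag):
        # Similarity between the signal and a copy of itself shifted by `lag`.
        return sum(samples[i] * samples[i + lag] for i in range(n - lag))

    # The lag with the highest self-similarity approximates one pitch period.
    best_lag = max(range(min_lag, max_lag + 1), key=autocorr)
    return sample_rate / best_lag


# Synthetic stand-in for a voiced sound: a pure 200 Hz tone at 16 kHz.
sample_rate = 16_000
tone = [math.sin(2 * math.pi * 200.0 * i / sample_rate) for i in range(2048)]
pitch = estimate_pitch_hz(tone, sample_rate)
```

Real speech is far messier than a pure tone, so a single-frame estimate like this would normally be computed over many short frames and smoothed.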
ElevenLabs offers two tiers: Instant Voice Cloning (quick, from about one minute of audio) and Professional Voice Cloning (higher fidelity, requires 30+ minutes of high-quality audio). Both require explicit consent verification. OpenAI, Resemble AI, and others offer similar capabilities through APIs. The technology is now good enough that even quick clones made from short samples can fool listeners in many contexts, and Professional clones are used in commercial production, for example to dub films into multiple languages while preserving the original actor's voice.
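As a rough illustration of how the two tiers map to the amount of audio on hand, the sketch below totals the duration of a set of WAV files and suggests a tier. The thresholds simply restate the approximate figures above (about one minute, 30+ minutes); they are not limits enforced by any provider, and the function names and tier labels are hypothetical. It checks quantity only and says nothing about audio quality.

```python
import wave
from typing import Iterable, Optional

# Assumed thresholds, restating the approximate figures in the text.
INSTANT_MIN_SECONDS = 60             # "about one minute"
PROFESSIONAL_MIN_SECONDS = 30 * 60   # "30+ minutes"


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file using the standard library."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def suggest_tier(total_seconds: float) -> Optional[str]:
    """Map total sample duration to a tier label, or None if too short."""
    if total_seconds >= PROFESSIONAL_MIN_SECONDS:
        return "professional"
    if total_seconds >= INSTANT_MIN_SECONDS:
        return "instant"
    return None


def suggest_tier_for_files(paths: Iterable[str]) -> Optional[str]:
    """Sum the durations of several WAV files and suggest a tier."""
    return suggest_tier(sum(wav_duration_seconds(p) for p in paths))
```

For example, `suggest_tier(90)` returns `"instant"`, while `suggest_tier(45)` returns `None` because the sample falls short of the one-minute figure.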
Voice cloning is also the most obvious misuse vector among AI audio capabilities. Deepfake audio impersonating public figures, executives, or family members in fraud schemes is a documented problem. Responsible providers require consent verification and deploy audio classifiers to detect cloned voices. Builders integrating voice cloning into products should design consent flows carefully and be aware that regulations around synthetic media are evolving in many jurisdictions.
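One way to think about a consent flow is as a gate that every cloning request must pass. The sketch below is a minimal, hypothetical design, not any provider's verification process: the `ConsentRecord` fields, the idea of scoping consent to named uses, and the `may_clone` function are all assumptions made for illustration. Actual requirements depend on the provider and on applicable law.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class ConsentRecord:
    """Hypothetical record of a speaker's consent to have their voice cloned."""
    speaker_id: str
    granted_at: datetime
    expires_at: datetime
    allowed_uses: FrozenSet[str]
    revoked_at: Optional[datetime] = None


def may_clone(record: Optional[ConsentRecord], speaker_id: str,
              use: str, now: datetime) -> bool:
    """Allow cloning only with matching, unexpired, unrevoked, in-scope consent."""
    if record is None or record.speaker_id != speaker_id:
        return False  # no consent on file for this speaker
    if record.revoked_at is not None and record.revoked_at <= now:
        return False  # the speaker withdrew consent
    if not (record.granted_at <= now < record.expires_at):
        return False  # outside the consent window
    return use in record.allowed_uses  # consent is scoped to specific uses


now = datetime(2025, 1, 15, tzinfo=timezone.utc)
record = ConsentRecord(
    speaker_id="speaker-123",
    granted_at=now - timedelta(days=1),
    expires_at=now + timedelta(days=365),
    allowed_uses=frozenset({"dubbing"}),
)
```

With this record, a request to use the voice for "dubbing" passes the gate, while a request for any other use, for a different speaker, or after expiry or revocation is refused. Denying by default keeps a missing or stale record from turning into implicit permission.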