uncloseai.
XTTS v2
Self-host only. This engine is not available on the public endpoint; clone the repo to use it.
What It Does
This is the high-fidelity voice cloning engine. Give it a 6-second audio sample of anyone's voice, and it will speak in that voice across 16 languages. It automatically detects the language of your input text.
Built on the Coqui TTS project, one of the most important open-source TTS efforts, abandoned when the company behind it shut down. We rescued it. The raccoons dug deep for this one.
Best for audiobooks, character voices, personalized assistants, or any time you want the output to sound like a specific person. Requires a GPU with about 4GB of VRAM.
Example
Once self-hosted and enabled, it works through the same OpenAI-compatible API:
from openai import OpenAI

client = OpenAI(
    api_key="not-needed",
    base_url="http://localhost:8000/v1"
)

# HD cloning uses the tts-1-hd model
client.audio.speech.create(
    model="tts-1-hd",
    voice="my_voice",
    input="This voice was cloned from a six-second recording. The original Coqui project may be gone, but the raccoons kept it alive."
).stream_to_file("hd-cloned.mp3")
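For non-Python clients or debugging, the same call is just an HTTP POST. Here is a minimal stdlib sketch; the /v1/audio/speech route and JSON field names follow the OpenAI speech API, and build_speech_request is an illustrative helper, not part of the project:

```python
import json
import urllib.request

def build_speech_request(base_url, model, voice, text):
    """Build an OpenAI-compatible /audio/speech request (not yet sent)."""
    payload = json.dumps({"model": model, "voice": voice, "input": text}).encode()
    return urllib.request.Request(
        f"{base_url}/audio/speech",
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer not-needed",  # the server ignores the key
        },
    )

req = build_speech_request(
    "http://localhost:8000/v1", "tts-1-hd", "my_voice",
    "Hello from a cloned voice.",
)
# urllib.request.urlopen(req) would return the audio bytes to write to disk;
# it is left unsent here so the sketch runs without a live server.
```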
Clone Your Own Voice
# Record 10 seconds of clean audio
ffmpeg -f alsa -i default -ac 1 -ar 22050 -t 10 -y my_voice.wav
# Clean up background noise
ffmpeg -i my_voice.wav \
-af "highpass=f=200, lowpass=f=3000, afftdn=nf=25" \
-ac 1 -ar 22050 my_voice_clean.wav
# Copy to voices directory
cp my_voice_clean.wav ~/uncloseai-speech/voices/me.wav
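XTTS expects a mono 22050 Hz reference clip with at least roughly six seconds of audio. A quick sanity check of the recording, using only the Python standard library (check_reference_wav is an illustrative helper, and the thresholds mirror the numbers above):

```python
import wave

def check_reference_wav(path, min_seconds=6.0):
    """Report problems that would make a clip unusable as a cloning reference."""
    with wave.open(path, "rb") as w:
        channels = w.getnchannels()
        rate = w.getframerate()
        frames = w.getnframes()
    problems = []
    if channels != 1:
        problems.append(f"expected mono, got {channels} channels")
    if rate != 22050:
        problems.append(f"expected 22050 Hz, got {rate} Hz")
    if frames / rate < min_seconds:
        problems.append(f"only {frames / rate:.1f}s of audio, need {min_seconds}s")
    return problems  # an empty list means the clip looks usable

# check_reference_wav("my_voice_clean.wav") returns [] when the clip is usable
```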
Then add it to config/voice_to_speaker.yaml:
tts-1-hd:
  my_voice:
    model: xtts
    speaker: voices/me.wav
    language: en
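A typo in this file (a wrong path or an unsupported language code) only surfaces at request time, so it can be worth checking an entry before restarting. A hedged sketch: validate_voice_entry is an illustrative helper, and the language set below is the one commonly listed for XTTS v2; adjust it if your build differs.

```python
from pathlib import Path

# Language codes commonly listed for XTTS v2 (assumption; verify against your build)
XTTS_LANGS = {"en", "es", "fr", "de", "it", "pt", "pl", "tr",
              "ru", "nl", "cs", "ar", "zh-cn", "hu", "ko", "ja"}

def validate_voice_entry(entry, voices_root="."):
    """Sanity-check one voice entry from voice_to_speaker.yaml."""
    errors = []
    if entry.get("model") != "xtts":
        errors.append("model should be 'xtts' for this engine")
    if entry.get("language") not in XTTS_LANGS:
        errors.append(f"unrecognized language code: {entry.get('language')!r}")
    if not Path(voices_root, entry.get("speaker", "")).is_file():
        errors.append(f"speaker file not found: {entry.get('speaker')!r}")
    return errors  # an empty list means the entry looks usable

entry = {"model": "xtts", "speaker": "voices/me.wav", "language": "en"}
```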
Technical Details
- Languages: 16 with automatic detection
- Voice cloning: From 6-second audio samples
- Hardware: GPU required (~4GB VRAM)
- Upstream: XTTS v2 / Coqui TTS (community-maintained)
Self-Hosting
git clone https://git.unturf.com/engineering/unturf/uncloseai-speech.git
cd uncloseai-speech
make deploy
make voices-xtts
Uncomment the tts-1-hd section in voice_to_speaker.default.yaml and restart.