Overview
XTTSService provides multilingual speech synthesis with voice cloning through a locally hosted streaming server. The service supports real-time streaming and custom voice cloning, using Coqui’s XTTS-v2 model for cross-lingual text-to-speech.
XTTS API Reference
Pipecat’s API methods for XTTS integration
Example Implementation
Complete example with voice cloning
XTTS Repository
Official XTTS streaming server repository
Voice Cloning
Learn about custom voice training
Installation
XTTS requires a running streaming server. Start the server using Docker.
Prerequisites
XTTS Server Setup
Before using XTTSService, you need:
- Docker Environment: Set up Docker with GPU support for optimal performance
- XTTS Server: Run the XTTS streaming server container
- Voice Models: Configure voice models and cloning samples as needed
Required Configuration
- Server URL: Configure the XTTS server endpoint (default: http://localhost:8000)
- Voice Selection: Set up voice models or voice cloning samples
GPU acceleration is recommended for optimal performance; the server needs CUDA support for best results.
Configuration
XTTSService
ID of the studio speaker to use for synthesis.
Base URL of the XTTS streaming server (e.g. http://localhost:8000).
An aiohttp session for HTTP requests to the XTTS server.
Language for synthesis. Supports Czech, German, English, Spanish, French, Hindi, Hungarian, Italian, Japanese, Korean, Dutch, Polish, Portuguese, Russian, Turkish, and Chinese.
Output audio sample rate in Hz. When None, uses the pipeline’s configured sample rate. Audio is automatically resampled from XTTS’s native 24kHz output.
Usage
Basic Setup
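A minimal sketch of basic setup. It assumes the class is importable from pipecat.services.xtts.tts and accepts keyword arguments named voice_id, base_url, and aiohttp_session; these names are inferred from the parameter descriptions above and should be checked against your installed Pipecat version. The speaker name is only an example, and xtts_settings and create_tts are hypothetical helpers for this sketch.

```python
# Assumed values: adjust to your server and to one of its studio speakers.
XTTS_BASE_URL = "http://localhost:8000"
XTTS_VOICE_ID = "Claribel Dervla"


def xtts_settings(voice_id: str = XTTS_VOICE_ID, base_url: str = XTTS_BASE_URL) -> dict:
    """Collect the keyword arguments used to build the service."""
    # Strip a trailing slash so endpoint paths can be appended cleanly.
    return {"voice_id": voice_id, "base_url": base_url.rstrip("/")}


def create_tts(session):
    """Build the service on top of an existing aiohttp session."""
    # Imported lazily so the settings helper works without the optional dependency.
    from pipecat.services.xtts.tts import XTTSService  # assumed import path

    return XTTSService(aiohttp_session=session, **xtts_settings())


async def main() -> None:
    import aiohttp

    # Keep the session open for as long as the service is in use.
    async with aiohttp.ClientSession() as session:
        tts = create_tts(session)
        # Build and run your pipeline with `tts` here.
        print(f"XTTS service ready: {tts}")
```

Run it with asyncio.run(main()) once the XTTS server is up.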
With Language Configuration
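A sketch of selecting a synthesis language, under the same assumptions as any Pipecat example here: the import paths, the language keyword argument, and the Language.ES enum member are assumptions to verify against your installed version. The language table mirrors the list in the Configuration section using ISO 639-1 codes; base_language and create_spanish_tts are hypothetical helpers.

```python
# Languages listed in the Configuration section, keyed by ISO 639-1 code.
SUPPORTED_LANGUAGES = {
    "cs": "Czech", "de": "German", "en": "English", "es": "Spanish",
    "fr": "French", "hi": "Hindi", "hu": "Hungarian", "it": "Italian",
    "ja": "Japanese", "ko": "Korean", "nl": "Dutch", "pl": "Polish",
    "pt": "Portuguese", "ru": "Russian", "tr": "Turkish", "zh": "Chinese",
}


def base_language(code: str) -> str:
    """Reduce a tag such as 'es-MX' to 'es' and reject unlisted languages."""
    base = code.lower().replace("_", "-").split("-")[0]
    if base not in SUPPORTED_LANGUAGES:
        raise ValueError(f"XTTS does not list language {code!r}")
    return base


def create_spanish_tts(session, voice_id: str, base_url: str = "http://localhost:8000"):
    """Build a Spanish-speaking service on an existing aiohttp session."""
    # Imported lazily so the language helper works without Pipecat installed.
    from pipecat.services.xtts.tts import XTTSService  # assumed import path
    from pipecat.transcriptions.language import Language  # assumed enum location

    base_language("es")  # fail fast if the language is not in the listed set
    return XTTSService(
        voice_id=voice_id,
        base_url=base_url,
        aiohttp_session=session,
        language=Language.ES,  # assumed keyword and enum member
    )
```

Checking the language code before constructing the service keeps an unsupported choice from surfacing later as a server-side error.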
Notes
- Local server required: XTTS requires a locally running streaming server (via Docker). The service connects to this server over HTTP.
- Studio speakers: On startup, the service fetches available “studio speakers” from the server’s /studio_speakers endpoint. The voice_id must match one of these speakers.
- Audio resampling: XTTS natively outputs audio at 24kHz. The service automatically resamples to match the pipeline’s configured sample rate.
- GPU recommended: The XTTS server performs best with CUDA-enabled GPU acceleration. CPU inference is significantly slower.
- No API key required: XTTS runs locally, so no external API credentials are needed.
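The studio speaker note above can be checked by hand before starting a pipeline. The sketch below uses only the standard library and assumes the /studio_speakers endpoint returns a JSON object keyed by speaker name; that response shape is an assumption, and fetch_studio_speakers and check_voice_id are hypothetical helpers, not part of Pipecat.

```python
import json
from urllib.request import urlopen


def fetch_studio_speakers(base_url: str, timeout: float = 5.0) -> dict:
    """GET {base_url}/studio_speakers and decode the JSON body."""
    url = f"{base_url.rstrip('/')}/studio_speakers"
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def check_voice_id(speakers: dict, voice_id: str) -> None:
    """Raise ValueError if voice_id is not one of the server's speakers."""
    if voice_id not in speakers:
        available = ", ".join(sorted(speakers)) or "(none)"
        raise ValueError(f"Unknown voice_id {voice_id!r}; server offers: {available}")
```

For example, check_voice_id(fetch_studio_speakers("http://localhost:8000"), "Claribel Dervla") raises a descriptive error if the server does not offer that speaker, and returns quietly otherwise.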