Coqui, the XTTS maintainer, has shut down. XTTS may not receive future updates or support.

Overview

XTTSService provides multilingual speech synthesis with voice cloning through a locally hosted streaming server. The service streams audio in real time and supports cloned custom voices, using Coqui’s XTTS-v2 model for cross-lingual text-to-speech.

Installation

XTTS requires a running streaming server. Start the server using Docker:
docker run --gpus=all -e COQUI_TOS_AGREED=1 --rm -p 8000:80 \
  ghcr.io/coqui-ai/xtts-streaming-server:latest-cuda121

Prerequisites

XTTS Server Setup

Before using XTTSService, you need:
  1. Docker Environment: Set up Docker with GPU support for optimal performance
  2. XTTS Server: Run the XTTS streaming server container
  3. Voice Models: Configure voice models and cloning samples as needed

Required Configuration

  • Server URL: Configure the XTTS server endpoint (default: http://localhost:8000)
  • Voice Selection: Set up voice models or voice cloning samples
GPU acceleration is recommended: the server performs best on a CUDA-capable GPU.
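Before constructing the service, it can help to confirm that the server answers at the configured URL. The following is a minimal sketch using only the Python standard library. The `XTTS_BASE_URL` environment variable and the helper names (`resolve_base_url`, `studio_speakers_url`, `server_is_reachable`) are illustrative assumptions, not part of Pipecat or the XTTS server; the check simply requests the /studio_speakers endpoint and expects a JSON response.

```python
import json
import os
import urllib.error
import urllib.request

DEFAULT_BASE_URL = "http://localhost:8000"  # default server URL from this page


def resolve_base_url(env=None):
    """Read the server URL from XTTS_BASE_URL (hypothetical variable), else use the default."""
    env = os.environ if env is None else env
    return env.get("XTTS_BASE_URL", DEFAULT_BASE_URL).rstrip("/")


def studio_speakers_url(base_url):
    """Build the URL of the /studio_speakers endpoint for a given base URL."""
    return base_url.rstrip("/") + "/studio_speakers"


def server_is_reachable(base_url, timeout=5.0):
    """Return True if /studio_speakers responds with HTTP 200 and parseable JSON."""
    try:
        with urllib.request.urlopen(studio_speakers_url(base_url), timeout=timeout) as resp:
            json.load(resp)  # raises ValueError if the body is not JSON
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        return False
```

A caller might run `server_is_reachable(resolve_base_url())` once at startup and report a clear error if it returns False, rather than failing later inside the pipeline.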

Configuration

XTTSService

  • voice_id (str, required): ID of the studio speaker to use for synthesis.
  • base_url (str, required): Base URL of the XTTS streaming server (e.g. http://localhost:8000).
  • aiohttp_session (aiohttp.ClientSession, required): An aiohttp session for HTTP requests to the XTTS server.
  • language (Language, default: Language.EN): Language for synthesis. Supports Czech, German, English, Spanish, French, Hindi, Hungarian, Italian, Japanese, Korean, Dutch, Polish, Portuguese, Russian, Turkish, and Chinese.
  • sample_rate (int, default: None): Output audio sample rate in Hz. When None, uses the pipeline’s configured sample rate. Audio is automatically resampled from XTTS’s native 24 kHz output.
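To make the effect of sample_rate concrete, the sketch below shows how the number of samples changes when audio is converted from the native 24 kHz rate to another rate. The linear-interpolation resampler is a deliberately naive illustration with no anti-aliasing filter. It is not the resampler Pipecat uses, and the function names are hypothetical.

```python
XTTS_NATIVE_RATE = 24_000  # Hz, the native XTTS output rate stated on this page


def resampled_length(num_samples, target_rate, source_rate=XTTS_NATIVE_RATE):
    """Number of samples after converting num_samples from source_rate to target_rate."""
    return num_samples * target_rate // source_rate


def resample_linear(samples, target_rate, source_rate=XTTS_NATIVE_RATE):
    """Naive linear-interpolation resampler (illustration only, no anti-aliasing)."""
    out_len = resampled_length(len(samples), target_rate, source_rate)
    if out_len == 0:
        return []
    step = source_rate / target_rate  # source samples advanced per output sample
    out = []
    for i in range(out_len):
        pos = i * step
        left = int(pos)
        right = min(left + 1, len(samples) - 1)  # clamp at the last input sample
        frac = pos - left
        out.append(samples[left] * (1.0 - frac) + samples[right] * frac)
    return out
```

For example, one second of native audio (24,000 samples) becomes 16,000 samples at a 16 kHz pipeline rate, so downstream buffers sized in samples need to account for the configured rate rather than the native one.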

Usage

Basic Setup

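import asyncio

import aiohttp
from pipecat.services.xtts import XTTSService


async def main():
    async with aiohttp.ClientSession() as session:
        tts = XTTSService(
            voice_id="Ana Florence",  # must match a studio speaker on the server
            base_url="http://localhost:8000",
            aiohttp_session=session,
        )
        # Use `tts` in a pipeline here, while the session is still open.


asyncio.run(main())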

With Language Configuration

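import asyncio

import aiohttp
from pipecat.services.xtts import XTTSService
from pipecat.transcriptions.language import Language


async def main():
    async with aiohttp.ClientSession() as session:
        tts = XTTSService(
            voice_id="Ana Florence",  # must match a studio speaker on the server
            base_url="http://localhost:8000",
            aiohttp_session=session,
            language=Language.ES,  # synthesize Spanish instead of the default English
        )
        # Use `tts` in a pipeline here, while the session is still open.


asyncio.run(main())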
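
When the language comes from user input or configuration, a quick check against the languages listed in the Configuration section can catch unsupported values early. This is a minimal sketch under stated assumptions: the table uses ISO 639-1 codes for the listed languages, and `base_language_code` and `is_supported` are hypothetical helpers. How Pipecat maps Language values to the codes the server expects is not shown here.

```python
# ISO 639-1 codes for the languages listed in the Configuration section.
SUPPORTED_LANGUAGES = {
    "cs": "Czech", "de": "German", "en": "English", "es": "Spanish",
    "fr": "French", "hi": "Hindi", "hu": "Hungarian", "it": "Italian",
    "ja": "Japanese", "ko": "Korean", "nl": "Dutch", "pl": "Polish",
    "pt": "Portuguese", "ru": "Russian", "tr": "Turkish", "zh": "Chinese",
}


def base_language_code(code):
    """Reduce a tag such as 'es-MX' or 'pt_BR' to its lowercase base code."""
    return code.replace("_", "-").split("-")[0].lower()


def is_supported(code):
    """Return True if the base language of `code` appears in the table above."""
    return base_language_code(code) in SUPPORTED_LANGUAGES
```

Regional variants are reduced to their base language here, on the assumption that a variant without its own voice falls back to the base language.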

Notes

  • Local server required: XTTS requires a locally running streaming server (via Docker). The service connects to this server over HTTP.
  • Studio speakers: On startup, the service fetches available “studio speakers” from the server’s /studio_speakers endpoint. The voice_id must match one of these speakers.
  • Audio resampling: XTTS natively outputs audio at 24kHz. The service automatically resamples to match the pipeline’s configured sample rate.
  • GPU recommended: The XTTS server performs best with CUDA-enabled GPU acceleration. CPU inference is significantly slower.
  • No API key required: XTTS runs locally, so no external API credentials are needed.
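
Because voice_id must match a studio speaker, a mismatch is easier to diagnose when the error lists the names the server actually offers. The sketch below assumes the /studio_speakers response has been parsed into a dict keyed by speaker name; that response shape and the `require_speaker` helper are assumptions for illustration, not a documented Pipecat API.

```python
def require_speaker(speakers, voice_id):
    """Return the entry for voice_id, or raise ValueError naming the available speakers."""
    if voice_id not in speakers:
        available = ", ".join(sorted(speakers)) or "(none)"
        raise ValueError(f"Unknown voice_id {voice_id!r}. Available: {available}")
    return speakers[voice_id]


# Placeholder data standing in for a parsed /studio_speakers response.
example_speakers = {"Ana Florence": {}, "Another Speaker": {}}
entry = require_speaker(example_speakers, "Ana Florence")
```

Matching is exact and case-sensitive in this sketch, so "ana florence" would be rejected with a message listing the valid names.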