Overview
Cartesia provides high-quality text-to-speech synthesis with two service implementations: CartesiaTTSService (WebSocket-based) for real-time streaming with word timestamps, and CartesiaHttpTTSService (HTTP-based) for simpler batch synthesis. CartesiaTTSService is recommended for interactive applications requiring low latency and interruption handling.
Cartesia TTS API Reference
Pipecat’s API methods for Cartesia TTS integration
Example Implementation
Complete example with interruption handling
Cartesia Documentation
Official Cartesia API documentation and features
Voice Library
Browse and test available voices
Installation
To use Cartesia services, install Pipecat with the Cartesia extra, for example with pip install "pipecat-ai[cartesia]".
Prerequisites
Cartesia Account Setup
Before using Cartesia TTS services, you need:
- Cartesia Account: Sign up at Cartesia
- API Key: Generate an API key from your account dashboard
- Voice Selection: Choose voice IDs from the voice library
Required Environment Variables
- CARTESIA_API_KEY: Your Cartesia API key for authentication
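A minimal sketch of reading the key at startup and failing early when it is missing. The helper name load_cartesia_api_key is hypothetical and not part of Pipecat; only the CARTESIA_API_KEY variable name comes from this page.

```python
import os


def load_cartesia_api_key(env=None):
    """Return the Cartesia API key from the environment, or raise if it is missing.

    `env` defaults to os.environ; passing a plain dict makes the helper easy to test.
    This helper is illustrative only and is not part of Pipecat.
    """
    env = os.environ if env is None else env
    key = env.get("CARTESIA_API_KEY", "").strip()
    if not key:
        raise RuntimeError("CARTESIA_API_KEY is not set")
    return key
```

Failing at startup tends to be easier to debug than an authentication error that surfaces only when the first synthesis request is sent.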
Configuration
CartesiaTTSService
Cartesia API key for authentication.
ID of the voice to use for synthesis.
TTS model to use.
API version string for Cartesia service.
WebSocket endpoint URL.
Output audio sample rate in Hz. When None, uses the pipeline’s configured sample rate.
Audio encoding format.
Audio container format.
Buffer text until sentence boundaries before sending to Cartesia. Produces more natural-sounding speech.
Runtime-configurable voice and generation settings. See InputParams below.
CartesiaHttpTTSService
The HTTP service accepts similar parameters to the WebSocket service, with these differences:
HTTP API base URL (instead of url for WebSocket).
API version for HTTP service.
aggregate_sentences.
InputParams
Voice and generation settings that can be set at initialization via the params constructor argument, or changed at runtime via UpdateSettingsFrame.
| Parameter | Type | Default | Description |
|---|---|---|---|
| language | Language | Language.EN | Language code for synthesis. |
| speed | Literal["slow", "normal", "fast"] | None | Voice speed control for non-Sonic-3 models. |
| emotion | list[str] | [] | List of emotion controls for non-Sonic-3 models. Deprecated in v0.0.68. |
| generation_config | GenerationConfig | None | Generation configuration for Sonic-3 models. See below. |
| pronunciation_dict_id | str | None | ID of the pronunciation dictionary for custom pronunciations. |
GenerationConfig (Sonic-3)
Configuration for Sonic-3 generation parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| volume | float | None | Volume multiplier. Valid range: [0.5, 2.0]. |
| speed | float | None | Speed multiplier. Valid range: [0.6, 1.5]. |
| emotion | str | None | Emotion string to guide tone (e.g., "neutral", "angry", "excited"). Over 60 emotions supported. |
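The documented ranges can be checked before a value is sent. The sketch below uses a hypothetical stand-in class, GenerationConfigSketch, that mirrors the three fields in the table; it is not Pipecat's GenerationConfig, and whether the library or the Cartesia API performs its own validation is not stated here.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class GenerationConfigSketch:
    """Hypothetical stand-in mirroring the documented fields; not Pipecat's class."""

    volume: Optional[float] = None
    speed: Optional[float] = None
    emotion: Optional[str] = None

    def validate(self) -> None:
        # Ranges are taken from the table above.
        if self.volume is not None and not 0.5 <= self.volume <= 2.0:
            raise ValueError(f"volume {self.volume} is outside [0.5, 2.0]")
        if self.speed is not None and not 0.6 <= self.speed <= 1.5:
            raise ValueError(f"speed {self.speed} is outside [0.6, 1.5]")

    def to_payload(self) -> dict:
        """Validate, then return only the fields that were explicitly set."""
        self.validate()
        return {k: v for k, v in asdict(self).items() if v is not None}
```

Leaving unset fields out of the payload matches the table's None defaults, where an unset field means no override is requested.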
Usage
Basic Setup
With Sonic-3 Generation Config
HTTP Service
Customizing Speech
CartesiaTTSService provides a set of helper methods for implementing Cartesia-specific customizations, meant to be used as part of text transformers. These include methods for spelling out text, expressing emotion, inserting pauses, and adjusting volume and speech rate. See the Text Transformers for TTS section in the Text-to-Speech guide for usage examples.
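As a rough illustration of the text-transformer idea, the sketch below wraps order codes in a spell tag so they are read character by character. The local spell function is a stand-in for the SPELL helper, the <spell> tag format is an assumption, and the "AB-1234" code pattern is invented for the example. How a transformer is registered with the service is covered in the Text-to-Speech guide and is not shown here.

```python
import re


def spell(text: str) -> str:
    """Stand-in for the SPELL helper; the <spell> tag format is an assumption."""
    return f"<spell>{text}</spell>"


# Matches codes such as "AB-1234" (a hypothetical format used only for illustration).
ORDER_CODE = re.compile(r"\b[A-Z]{2}-\d{4}\b")


def spell_order_codes(text: str) -> str:
    """Wrap every order code in a spell tag and leave the rest of the text untouched."""
    return ORDER_CODE.sub(lambda match: spell(match.group(0)), text)
```

Keeping the transformer a pure function from string to string makes it straightforward to unit test without a live TTS connection.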
SPELL(text: str) -> str:
A convenience method to wrap text in Cartesia’s spell tag for spelling out text character by character.
EMOTION_TAG(emotion: CartesiaEmotion) -> str:
A convenience method to create an emotion tag for expressing emotions in speech.
PAUSE_TAG(seconds: float) -> str:
A convenience method to create Cartesia’s SSML tag for inserting pauses in speech.
VOLUME_TAG(volume: float) -> str:
A convenience method to create Cartesia’s SSML volume tag for dynamically adjusting speech volume in situ.
SPEED_TAG(speed: float) -> str:
A convenience method to create Cartesia’s SSML speed tag for dynamically adjusting the speech rate in situ.
Notes
- WebSocket vs HTTP: The WebSocket service supports word-level timestamps, audio context management, and interruption handling, making it better for interactive conversations. The HTTP service is simpler but lacks these features.
- Sentence aggregation: Enabled by default. Buffering until sentence boundaries produces more natural speech with minimal latency impact. Disable with aggregate_sentences=False if you need word-by-word streaming.
- Connection timeout: Cartesia WebSocket connections time out after 5 minutes of inactivity (no keepalive mechanism is available). The service automatically reconnects when needed.
- CJK language support: For Chinese, Japanese, and Korean, the service combines individual characters from timestamp messages into meaningful word units.
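To make the sentence aggregation note concrete, here is a simplified sketch of buffering streamed text until a sentence boundary. It is not Pipecat's implementation: the SentenceBuffer class is hypothetical, and the boundary rule (terminal punctuation followed by whitespace) ignores abbreviations and other edge cases.

```python
import re
from typing import List

# Whitespace that follows sentence-ending punctuation (simplified boundary rule).
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


class SentenceBuffer:
    """Hypothetical helper that collects text chunks and releases complete sentences."""

    def __init__(self) -> None:
        self._pending = ""

    def feed(self, chunk: str) -> List[str]:
        """Add a text chunk and return any sentences that are now complete."""
        self._pending += chunk
        parts = _SENTENCE_END.split(self._pending)
        # The last part may be an unfinished sentence, so it stays buffered.
        self._pending = parts.pop()
        return [part for part in parts if part]

    def flush(self) -> str:
        """Return whatever text is still buffered, for example at end of turn."""
        remaining, self._pending = self._pending.strip(), ""
        return remaining
```

Feeding "Hello the", then "re. How are", then " you?" releases "Hello there." on the second call, and flush() returns "How are you?". A sentence is released only once the text after it arrives, which is the small latency cost the note refers to.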
Event Handlers
Cartesia TTS supports the standard service connection events:

| Event | Description |
|---|---|
| on_connected | Connected to Cartesia WebSocket |
| on_disconnected | Disconnected from Cartesia WebSocket |
| on_connection_error | WebSocket connection error occurred |
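The sketch below shows the decorator-style registration pattern for these events using a small stub so that it runs without Pipecat installed. StubService and its emit method are hypothetical. The handler signatures, with the service as first argument and an error value for on_connection_error, are assumptions to check against the API reference.

```python
import asyncio
from collections import defaultdict


class StubService:
    """Hypothetical stand-in for a TTS service; it only models handler registration."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def event_handler(self, name):
        # Returns a decorator that records the coroutine under the event name.
        def decorator(func):
            self._handlers[name].append(func)
            return func

        return decorator

    async def emit(self, name, *args):
        # Call every handler registered for this event, passing the service first.
        for handler in self._handlers[name]:
            await handler(self, *args)


tts = StubService()
events = []


@tts.event_handler("on_connected")
async def on_connected(service):
    events.append("connected")


@tts.event_handler("on_connection_error")
async def on_connection_error(service, error):
    events.append(f"error: {error}")


async def demo():
    await tts.emit("on_connected")
    await tts.emit("on_connection_error", "timeout")
    return list(events)
```

Running asyncio.run(demo()) drives both handlers in order. With the real service, the events are emitted by the WebSocket connection lifecycle rather than by calling emit directly.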