Overview

Cartesia provides high-quality text-to-speech synthesis with two service implementations: CartesiaTTSService (WebSocket-based) for real-time streaming with word timestamps, and CartesiaHttpTTSService (HTTP-based) for simpler batch synthesis. CartesiaTTSService is recommended for interactive applications requiring low latency and interruption handling.

Installation

To use Cartesia services, install the required dependencies:
pip install "pipecat-ai[cartesia]"

Prerequisites

Cartesia Account Setup

Before using Cartesia TTS services, you need:
  1. Cartesia Account: Sign up at Cartesia
  2. API Key: Generate an API key from your account dashboard
  3. Voice Selection: Choose voice IDs from the voice library

Required Environment Variables

  • CARTESIA_API_KEY: Your Cartesia API key for authentication
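
A missing key otherwise surfaces only when the service first tries to authenticate. The following is a minimal sketch of failing fast at startup, using only the standard library. The helper name `require_env` is hypothetical and not part of Pipecat.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        # Treat an unset and an empty variable the same way.
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

Calling `require_env("CARTESIA_API_KEY")` before constructing the service gives a readable error instead of passing `None` as the key.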

Configuration

CartesiaTTSService

api_key
str
required
Cartesia API key for authentication.
voice_id
str
required
ID of the voice to use for synthesis.
model
str
default:"sonic-3"
TTS model to use.
cartesia_version
str
default:"2025-04-16"
API version string for Cartesia service.
url
str
default:"wss://api.cartesia.ai/tts/websocket"
WebSocket endpoint URL.
sample_rate
int
default:None
Output audio sample rate in Hz. When None, uses the pipeline’s configured sample rate.
encoding
str
default:"pcm_s16le"
Audio encoding format.
container
str
default:"raw"
Audio container format.
aggregate_sentences
bool
default:True
Buffer text until sentence boundaries before sending to Cartesia. Produces more natural-sounding speech.
params
InputParams
default:None
Runtime-configurable voice and generation settings. See InputParams below.
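
To make the aggregate_sentences behavior concrete, here is a rough sketch of buffering streamed text until a sentence boundary. It is an illustration only, not Pipecat's actual aggregator. The function name `iter_sentences` and the boundary rule (`.`, `!`, or `?` followed by whitespace) are assumptions.

```python
import re
from typing import Iterable, Iterator

# Assumption: a sentence ends at ., !, or ? followed by whitespace.
# Requiring the whitespace avoids splitting "3.14" or a chunk that ends mid-number.
_SENTENCE_END = re.compile(r"[.!?]\s+")


def iter_sentences(chunks: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences from a stream of arbitrary text chunks."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        while True:
            match = _SENTENCE_END.search(buffer)
            if match is None:
                break  # no complete sentence yet, keep buffering
            yield buffer[: match.end()].strip()
            buffer = buffer[match.end():]
    if buffer.strip():
        # Flush whatever remains when the stream ends.
        yield buffer.strip()
```

With this rule the last sentence is only emitted when the stream ends, which trades a little latency for fewer false splits.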

CartesiaHttpTTSService

The HTTP service accepts similar parameters to the WebSocket service, with these differences:
base_url
str
default:"https://api.cartesia.ai"
HTTP API base URL (instead of url for WebSocket).
cartesia_version
str
default:"2024-11-13"
API version for HTTP service.
The HTTP service does not accept aggregate_sentences.
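
The difference in accepted arguments can be sketched as a small helper that builds constructor keyword arguments for either service. The helper name `tts_kwargs` and the `kind` strings are hypothetical. Only the rule that the HTTP service does not accept aggregate_sentences comes from the text above.

```python
from typing import Any, Dict


def tts_kwargs(
    kind: str, api_key: str, voice_id: str, aggregate_sentences: bool = True
) -> Dict[str, Any]:
    """Build keyword arguments for the "websocket" or "http" service."""
    if kind not in ("websocket", "http"):
        raise ValueError(f"unknown service kind: {kind!r}")
    kwargs: Dict[str, Any] = {"api_key": api_key, "voice_id": voice_id}
    if kind == "websocket":
        # Only the WebSocket service takes this argument.
        kwargs["aggregate_sentences"] = aggregate_sentences
    return kwargs
```

The result can be unpacked into the matching constructor, for example `CartesiaHttpTTSService(**tts_kwargs("http", key, voice))`.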

InputParams

Voice and generation settings that can be set at initialization via the params constructor argument, or changed at runtime via UpdateSettingsFrame.
Parameter              Type                               Default      Description
language               Language                           Language.EN  Language code for synthesis.
speed                  Literal["slow", "normal", "fast"]  None         Voice speed control for non-Sonic-3 models.
emotion                list[str]                          []           List of emotion controls for non-Sonic-3 models. Deprecated in v0.0.68.
generation_config      GenerationConfig                   None         Generation configuration for Sonic-3 models. See below.
pronunciation_dict_id  str                                None         ID of the pronunciation dictionary for custom pronunciations.

GenerationConfig (Sonic-3)

Configuration for Sonic-3 generation parameters:
Parameter  Type   Default  Description
volume     float  None     Volume multiplier. Valid range: [0.5, 2.0].
speed      float  None     Speed multiplier. Valid range: [0.6, 1.5].
emotion    str    None     Emotion string to guide tone (e.g., "neutral", "angry", "excited"). Over 60 emotions supported.
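
Since volume and speed have documented valid ranges, it can help to check user-supplied values before building a GenerationConfig. This is a minimal sketch. The helper `check_range` and the constant names are hypothetical, and whether GenerationConfig validates these ranges itself is not stated here.

```python
from typing import Optional, Tuple

# Ranges as documented for Sonic-3 generation parameters.
VOLUME_RANGE: Tuple[float, float] = (0.5, 2.0)
SPEED_RANGE: Tuple[float, float] = (0.6, 1.5)


def check_range(
    name: str, value: Optional[float], bounds: Tuple[float, float]
) -> Optional[float]:
    """Return value unchanged if it is None or within bounds, else raise."""
    if value is None:
        return None  # None means "use the service default"
    low, high = bounds
    if not low <= value <= high:
        raise ValueError(f"{name}={value} is outside [{low}, {high}]")
    return value
```

For example, `check_range("speed", 1.1, SPEED_RANGE)` returns `1.1`, while a volume of `3.0` raises `ValueError`.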

Usage

Basic Setup

import os

from pipecat.services.cartesia import CartesiaTTSService

tts = CartesiaTTSService(
    api_key=os.getenv("CARTESIA_API_KEY"),
    voice_id="your-voice-id",
)

With Sonic-3 Generation Config

import os

from pipecat.services.cartesia.tts import CartesiaTTSService, GenerationConfig

tts = CartesiaTTSService(
    api_key=os.getenv("CARTESIA_API_KEY"),
    voice_id="your-voice-id",
    model="sonic-3",
    params=CartesiaTTSService.InputParams(
        generation_config=GenerationConfig(
            speed=1.1,
            emotion="excited",
        ),
    ),
)

HTTP Service

import os

from pipecat.services.cartesia import CartesiaHttpTTSService

tts = CartesiaHttpTTSService(
    api_key=os.getenv("CARTESIA_API_KEY"),
    voice_id="your-voice-id",
)

Customizing Speech

CartesiaTTSService provides helper methods for Cartesia-specific customizations, meant to be used inside text transformers. They cover spelling out text, expressing emotion, inserting pauses, and adjusting volume and speech rate. See the Text Transformers for TTS section in the Text-to-Speech guide for usage examples.

SPELL(text: str) -> str:

A convenience method to wrap text in Cartesia’s spell tag for spelling out text character by character.
# Text transformers for TTS
# This will insert Cartesia's spell tags around the provided text.
async def spell_out_text(text: str, type: str) -> str:
    return CartesiaTTSService.SPELL(text)

tts = CartesiaTTSService(
    api_key=os.getenv("CARTESIA_API_KEY"),
    voice_id="your-voice-id",
    text_transforms=[
        ("phone_number", spell_out_text),
    ],
)

EMOTION_TAG(emotion: CartesiaEmotion) -> str:

A convenience method to create an emotion tag for expressing emotions in speech.
# Text transformers for TTS
# This will insert Cartesia's sarcasm tag in front of any sentence that is just "whatever".
async def maybe_insert_sarcasm(text: str, type: str) -> str:
    if text.strip(".!").lower() == "whatever":
        return (
            CartesiaTTSService.EMOTION_TAG(CartesiaEmotion.SARCASM)
            + text
            + CartesiaTTSService.EMOTION_TAG(CartesiaEmotion.NEUTRAL)
        )
    return text

tts = CartesiaTTSService(
    api_key=os.getenv("CARTESIA_API_KEY"),
    voice_id="your-voice-id",
    text_transforms=[
        ("sentence", maybe_insert_sarcasm),
    ],
)

PAUSE_TAG(seconds: float) -> str:

A convenience method to create Cartesia’s SSML tag for inserting pauses in speech.
# Text transformers for TTS
# This will insert a one second pause after questions.
async def pause_after_questions(text: str, type: str) -> str:
    # rstrip() so a trailing space after the question mark does not hide it
    if text.rstrip().endswith("?"):
        return f"{text}{CartesiaTTSService.PAUSE_TAG(1.0)}"
    return text

tts = CartesiaTTSService(
    api_key=os.getenv("CARTESIA_API_KEY"),
    voice_id="your-voice-id",
    text_transforms=[
        ("sentence", pause_after_questions), # Only apply to sentence aggregations
    ],
)

VOLUME_TAG(volume: float) -> str:

A convenience method to create Cartesia’s SSML volume tag for dynamically adjusting speech volume in situ.
# Text transformers for TTS
# This will increase the volume for any full text aggregation that is in all caps.
async def maybe_say_it_loud(text: str, type: str) -> str:
    # isupper() is False for text without letters (e.g. "123"), unlike text.upper() == text
    if text.isupper():
        return f"{CartesiaTTSService.VOLUME_TAG(2.0)}{text}{CartesiaTTSService.VOLUME_TAG(1.0)}"
    return text

tts = CartesiaTTSService(
    api_key=os.getenv("CARTESIA_API_KEY"),
    voice_id="your-voice-id",
    text_transforms=[
        ("*", maybe_say_it_loud), # Apply to all text
    ],
)

SPEED_TAG(speed: float) -> str:

A convenience method to create Cartesia’s SSML speed tag for dynamically adjusting the speech rate in situ.
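import re

# Text transformers for TTS
# This will make the word "slow" always be spoken more slowly.
async def slow_down_slow_words(text: str, type: str) -> str:
    slowed = f"{CartesiaTTSService.SPEED_TAG(0.6)}slow{CartesiaTTSService.SPEED_TAG(1.0)}"
    # \b matches whole words only, so "slowly" and "slower" are left untouched.
    # The lambda keeps re.sub from interpreting backslashes in the replacement.
    return re.sub(r"\bslow\b", lambda _: slowed, text)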

tts = CartesiaTTSService(
    api_key=os.getenv("CARTESIA_API_KEY"),
    voice_id="your-voice-id",
    text_transforms=[
        ("*", slow_down_slow_words), # Apply to all text
    ],
)

Notes

  • WebSocket vs HTTP: The WebSocket service supports word-level timestamps, audio context management, and interruption handling, making it better for interactive conversations. The HTTP service is simpler but lacks these features.
  • Sentence aggregation: Enabled by default. Buffering until sentence boundaries produces more natural speech with minimal latency impact. Disable with aggregate_sentences=False if you need word-by-word streaming.
  • Connection timeout: Cartesia WebSocket connections time out after 5 minutes of inactivity (no keepalive mechanism is available). The service automatically reconnects when needed.
  • CJK language support: For Chinese, Japanese, and Korean, the service combines individual characters from timestamp messages into meaningful word units.
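
The service handles reconnection for you, so none of this is needed in normal use. Purely to illustrate the 5-minute inactivity rule, here is a sketch of tracking idle time for a connection you manage yourself. The class `IdleTracker` is hypothetical and does not reflect how Pipecat implements reconnection.

```python
import time
from typing import Callable

IDLE_TIMEOUT_SECS = 5 * 60  # inactivity limit stated in the notes above


class IdleTracker:
    """Tracks the time since the last activity on a connection."""

    def __init__(
        self,
        timeout: float = IDLE_TIMEOUT_SECS,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._timeout = timeout
        self._clock = clock  # injectable so tests do not need to sleep
        self._last_activity = clock()

    def touch(self) -> None:
        """Record activity, e.g. after sending text or receiving audio."""
        self._last_activity = self._clock()

    def is_stale(self) -> bool:
        """True once the connection has been idle for the full timeout."""
        return self._clock() - self._last_activity >= self._timeout
```

A caller would check `is_stale()` before sending and reconnect first if it returns `True`.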

Event Handlers

Cartesia TTS supports the standard service connection events:
Event                Description
on_connected         Connected to Cartesia WebSocket
on_disconnected      Disconnected from Cartesia WebSocket
on_connection_error  WebSocket connection error occurred
@tts.event_handler("on_connected")
async def on_connected(service):
    print("Connected to Cartesia")