Cartesia

Overview

Cartesia provides high-quality text-to-speech synthesis with two service implementations: CartesiaTTSService (WebSocket-based) for real-time streaming with word timestamps, and CartesiaHttpTTSService (HTTP-based) for simpler batch synthesis. CartesiaTTSService is recommended for interactive applications requiring low latency and interruption handling.

Cartesia TTS API Reference: Pipecat’s API methods for Cartesia TTS integration
Example Implementation: Complete example with interruption handling
Cartesia Documentation: Official Cartesia API documentation and features
Voice Library: Browse and test available voices

Installation

To use Cartesia services, install the required dependencies:

pip install "pipecat-ai[cartesia]"

Prerequisites

Cartesia Account Setup

Before using Cartesia TTS services, you need:

Cartesia Account: Sign up at Cartesia
API Key: Generate an API key from your account dashboard
Voice Selection: Choose voice IDs from the voice library

Required Environment Variables

CARTESIA_API_KEY: Your Cartesia API key for authentication
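
A minimal sketch of reading the key at startup, so that a missing variable fails early instead of at connection time. The helper name `load_cartesia_api_key` is hypothetical and not part of Pipecat; only the variable name `CARTESIA_API_KEY` comes from this page.

```python
import os


def load_cartesia_api_key(env=None) -> str:
    """Return the Cartesia API key, or raise if it is unset or blank.

    Hypothetical helper for illustration; not part of Pipecat.
    """
    # Allow a plain dict to be passed in, which makes the helper easy to test.
    env = os.environ if env is None else env
    key = env.get("CARTESIA_API_KEY", "").strip()
    if not key:
        raise RuntimeError("CARTESIA_API_KEY is not set")
    return key
```

The returned value can then be passed as `api_key` in place of a bare `os.getenv(...)` call, which would silently yield `None` when the variable is missing.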

Configuration

CartesiaTTSService

Parameter	Type	Default	Description
`api_key`	`str`	required	Cartesia API key for authentication.
`voice_id`	`str`	required	ID of the voice to use for synthesis.
`model`	`str`	`"sonic-3"`	TTS model to use.
`cartesia_version`	`str`	`"2025-04-16"`	API version string for Cartesia service.
`url`	`str`	`"wss://api.cartesia.ai/tts/websocket"`	WebSocket endpoint URL.
`sample_rate`	`int`	`None`	Output audio sample rate in Hz. When `None`, uses the pipeline’s configured sample rate.
`encoding`	`str`	`"pcm_s16le"`	Audio encoding format.
`container`	`str`	`"raw"`	Audio container format.
`aggregate_sentences`	`bool`	`True`	Buffer text until sentence boundaries before sending to Cartesia. Produces more natural-sounding speech.
`params`	`InputParams`	`None`	Runtime-configurable voice and generation settings. See InputParams below.
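
The `sample_rate` fallback can be pictured with a small sketch. The function `resolve_sample_rate` is a hypothetical name used only to illustrate the documented rule; it is not Pipecat's internal code.

```python
from typing import Optional


def resolve_sample_rate(service_rate: Optional[int], pipeline_rate: int) -> int:
    """Pick the effective output rate in Hz.

    Hypothetical helper: mirrors the documented rule that sample_rate=None
    defers to the pipeline's configured rate.
    """
    return pipeline_rate if service_rate is None else service_rate
```

Leaving `sample_rate` unset is therefore the simplest way to keep the TTS output aligned with the rest of the pipeline.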

CartesiaHttpTTSService

The HTTP service accepts similar parameters to the WebSocket service, with these differences:

Parameter	Type	Default	Description
`base_url`	`str`	`"https://api.cartesia.ai"`	HTTP API base URL (instead of `url` for WebSocket).
`cartesia_version`	`str`	`"2024-11-13"`	API version for HTTP service.

The HTTP service does not accept aggregate_sentences.

InputParams

Voice and generation settings that can be set at initialization via the params constructor argument, or changed at runtime via UpdateSettingsFrame.

Parameter	Type	Default	Description
`language`	`Language`	`Language.EN`	Language code for synthesis.
`speed`	`Literal["slow", "normal", "fast"]`	`None`	Voice speed control for non-Sonic-3 models.
`emotion`	`list[str]`	`[]`	List of emotion controls for non-Sonic-3 models. Deprecated in v0.0.68.
`generation_config`	`GenerationConfig`	`None`	Generation configuration for Sonic-3 models. See below.
`pronunciation_dict_id`	`str`	`None`	ID of the pronunciation dictionary for custom pronunciations.

GenerationConfig (Sonic-3)

Configuration for Sonic-3 generation parameters:

Parameter	Type	Default	Description
`volume`	`float`	`None`	Volume multiplier. Valid range: [0.5, 2.0].
`speed`	`float`	`None`	Speed multiplier. Valid range: [0.6, 1.5].
`emotion`	`str`	`None`	Emotion string to guide tone (e.g., `"neutral"`, `"angry"`, `"excited"`). Over 60 emotions supported.

Usage

Basic Setup

import os

from pipecat.services.cartesia import CartesiaTTSService

# Read the key from the environment instead of hard-coding it.
tts = CartesiaTTSService(
    api_key=os.getenv("CARTESIA_API_KEY"),
    voice_id="your-voice-id",
)

With Sonic-3 Generation Config

import os

from pipecat.services.cartesia.tts import CartesiaTTSService, GenerationConfig

tts = CartesiaTTSService(
    api_key=os.getenv("CARTESIA_API_KEY"),
    voice_id="your-voice-id",
    model="sonic-3",
    params=CartesiaTTSService.InputParams(
        # Sonic-3 settings: speed must stay within [0.6, 1.5].
        generation_config=GenerationConfig(
            speed=1.1,
            emotion="excited",
        ),
    ),
)

HTTP Service

import os

from pipecat.services.cartesia import CartesiaHttpTTSService

tts = CartesiaHttpTTSService(
    api_key=os.getenv("CARTESIA_API_KEY"),
    voice_id="your-voice-id",
)

Customizing Speech

CartesiaTTSService provides a set of helper methods for Cartesia-specific customizations, meant to be used inside text transformers. They cover spelling out text, expressing emotion, inserting pauses, and adjusting volume and speech rate. See the Text Transformers for TTS section in the Text-to-Speech guide for usage examples.

SPELL(text: str) -> str:

A convenience method to wrap text in Cartesia’s spell tag for spelling out text character by character.

# Text transformers for TTS
# This will insert Cartesia's spell tags around the provided text.
async def spell_out_text(text: str, type: str) -> str:
    return CartesiaTTSService.SPELL(text)

tts = CartesiaTTSService(
    api_key=os.getenv("CARTESIA_API_KEY"),
    voice_id="your-voice-id",
    text_transforms=[
        ("phone_number", spell_out_text), # Only apply to phone_number aggregations
    ],
)

EMOTION_TAG(emotion: CartesiaEmotion) -> str:

A convenience method to create an emotion tag for expressing emotions in speech.

# Text transformers for TTS
# This will wrap any sentence that is just "whatever" in Cartesia's sarcasm tag,
# then switch back to neutral for the text that follows.
# Assumes CartesiaEmotion has been imported alongside CartesiaTTSService.
async def maybe_insert_sarcasm(text: str, type: str) -> str:
    if text.strip(" .!").lower() == "whatever":
        sarcasm = CartesiaTTSService.EMOTION_TAG(CartesiaEmotion.SARCASM)
        neutral = CartesiaTTSService.EMOTION_TAG(CartesiaEmotion.NEUTRAL)
        return f"{sarcasm}{text}{neutral}"
    return text

tts = CartesiaTTSService(
    api_key=os.getenv("CARTESIA_API_KEY"),
    voice_id="your-voice-id",
    text_transforms=[
        ("sentence", maybe_insert_sarcasm),
    ],
)

PAUSE_TAG(seconds: float) -> str:

A convenience method to create Cartesia’s SSML tag for inserting pauses in speech.

# Text transformers for TTS
# This will insert a one second pause after questions.
async def pause_after_questions(text: str, type: str) -> str:
    # rstrip() so trailing whitespace does not hide the question mark.
    if text.rstrip().endswith("?"):
        return f"{text}{CartesiaTTSService.PAUSE_TAG(1.0)}"
    return text

tts = CartesiaTTSService(
    api_key=os.getenv("CARTESIA_API_KEY"),
    voice_id="your-voice-id",
    text_transforms=[
        ("sentence", pause_after_questions), # Only apply to sentence aggregations
    ],
)

VOLUME_TAG(volume: float) -> str:

A convenience method to create Cartesia’s SSML volume tag for dynamically adjusting speech volume in situ.

# Text transformers for TTS
# This will increase the volume for any full text aggregation that is in all caps.
async def maybe_say_it_loud(text: str, type: str) -> str:
    # isupper() requires at least one letter, so digits or punctuation alone do not match.
    if text.isupper():
        return f"{CartesiaTTSService.VOLUME_TAG(2.0)}{text}{CartesiaTTSService.VOLUME_TAG(1.0)}"
    return text

tts = CartesiaTTSService(
    api_key=os.getenv("CARTESIA_API_KEY"),
    voice_id="your-voice-id",
    text_transforms=[
        ("*", maybe_say_it_loud), # Apply to all text
    ],
)

SPEED_TAG(speed: float) -> str:

A convenience method to create Cartesia’s SSML speed tag for dynamically adjusting the speech rate in situ.

# Text transformers for TTS
# This will make the standalone word "slow" always be spoken more slowly.
import re

async def slow_down_slow_words(text: str, type: str) -> str:
    slowed = f"{CartesiaTTSService.SPEED_TAG(0.6)}slow{CartesiaTTSService.SPEED_TAG(1.0)}"
    # \b limits the match to the whole word, so "slowly" is left untouched.
    return re.sub(r"\bslow\b", lambda _match: slowed, text)

tts = CartesiaTTSService(
    api_key=os.getenv("CARTESIA_API_KEY"),
    voice_id="your-voice-id",
    text_transforms=[
        ("*", slow_down_slow_words), # Apply to all text
    ],
)
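
The examples above register transforms as `(aggregation type, function)` pairs, with `"*"` matching all text. The sketch below shows one way such pairs could be dispatched, using plain Python only. `apply_transforms` is a hypothetical stand-in written for illustration; Pipecat's own dispatch logic may differ in ordering and matching rules.

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple

# A transform receives the text and its aggregation type and returns new text.
Transform = Callable[[str, str], Awaitable[str]]


async def apply_transforms(
    text: str, agg_type: str, transforms: List[Tuple[str, Transform]]
) -> str:
    """Run every transform whose key matches agg_type, in list order.

    Hypothetical stand-in for illustration; not Pipecat's implementation.
    """
    for key, fn in transforms:
        if key == "*" or key == agg_type:
            text = await fn(text, agg_type)
    return text


async def shout(text: str, type: str) -> str:
    return text.upper()


async def add_marker(text: str, type: str) -> str:
    return f"[sentence] {text}"


if __name__ == "__main__":
    transforms = [("*", shout), ("sentence", add_marker)]
    print(asyncio.run(apply_transforms("hello", "sentence", transforms)))
```

Testing transforms in isolation like this, before attaching them to a service, makes it easier to confirm that each one only touches the text it is meant to.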

Notes

WebSocket vs HTTP: The WebSocket service supports word-level timestamps, audio context management, and interruption handling, making it better for interactive conversations. The HTTP service is simpler but lacks these features.
Sentence aggregation: Enabled by default. Buffering until sentence boundaries produces more natural speech with minimal latency impact. Disable with aggregate_sentences=False if you need word-by-word streaming.
Connection timeout: Cartesia WebSocket connections time out after 5 minutes of inactivity (no keepalive mechanism is available). The service automatically reconnects when needed.
CJK language support: For Chinese, Japanese, and Korean, the service combines individual characters from timestamp messages into meaningful word units.
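
To make the sentence aggregation note concrete, here is a simplified sketch of buffering streamed text until a sentence boundary is seen. The class `SentenceBuffer` and its boundary rule (`.`, `!`, or `?` followed by whitespace) are assumptions made for this example; Pipecat's actual aggregator handles more cases, such as abbreviations.

```python
import re
from typing import List, Tuple

# Simplified boundary: ., !, or ? followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_complete_sentences(buffer: str) -> Tuple[List[str], str]:
    """Return (complete sentences, remaining partial text)."""
    parts = _SENTENCE_END.split(buffer)
    # The last part has no confirmed boundary after it yet, so keep buffering it.
    return parts[:-1], parts[-1]


class SentenceBuffer:
    """Accumulates streamed text chunks and releases whole sentences."""

    def __init__(self) -> None:
        self._pending = ""

    def push(self, chunk: str) -> List[str]:
        """Add a chunk and return any sentences that are now complete."""
        self._pending += chunk
        sentences, self._pending = split_complete_sentences(self._pending)
        return sentences

    def flush(self) -> str:
        """Return whatever is left, for example at the end of a response."""
        rest, self._pending = self._pending.strip(), ""
        return rest
```

In this sketch a sentence is released only once text after its terminator arrives, which is the latency cost that sentence aggregation trades for more natural prosody.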

Event Handlers

Cartesia TTS supports the standard service connection events:

Event	Description
`on_connected`	Connected to Cartesia WebSocket
`on_disconnected`	Disconnected from Cartesia WebSocket
`on_connection_error`	WebSocket connection error occurred

@tts.event_handler("on_connected")
async def on_connected(service):
    print("Connected to Cartesia")
