Overview

Azure Cognitive Services provides high-quality text-to-speech synthesis with two service implementations: AzureTTSService (WebSocket-based) for real-time streaming with low latency, and AzureHttpTTSService (HTTP-based) for batch synthesis. AzureTTSService is recommended for interactive applications requiring streaming capabilities.

Installation

To use Azure services, install the required dependencies:
pip install "pipecat-ai[azure]"

Prerequisites

Azure Account Setup

Before using Azure TTS services, you need:
  1. Azure Account: Sign up at Azure Portal
  2. Speech Service: Create a Speech resource in your Azure subscription
  3. API Key and Region: Get your subscription key and service region
  4. Voice Selection: Choose from available voices in the Voice Gallery

Required Environment Variables

  • AZURE_SPEECH_API_KEY: Your Azure Speech service API key
  • AZURE_SPEECH_REGION: Your Azure Speech service region (e.g., "eastus")
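For local development, these can be exported in your shell before starting the application. The values below are placeholders; substitute your own Speech resource credentials:

```shell
# Replace the placeholder values with your own Speech resource credentials
export AZURE_SPEECH_API_KEY="your-subscription-key"
export AZURE_SPEECH_REGION="eastus"
```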

Configuration

AzureTTSService

api_key
str
required
Azure Cognitive Services subscription key.
region
str
required
Azure region identifier (e.g., "eastus", "westus2").
voice
str
default:"en-US-SaraNeural"
Voice name to use for synthesis.
sample_rate
int
default:"None"
Output audio sample rate in Hz. When None, uses the pipeline’s configured sample rate.
aggregate_sentences
bool
default:"True"
Whether to aggregate sentences before synthesis.
params
InputParams
default:"None"
Runtime-configurable voice and synthesis settings. See InputParams below.

AzureHttpTTSService

The HTTP service accepts the same parameters as the streaming service except aggregate_sentences:
api_key
str
required
Azure Cognitive Services subscription key.
region
str
required
Azure region identifier.
voice
str
default:"en-US-SaraNeural"
Voice name to use for synthesis.
sample_rate
int
default:"None"
Output audio sample rate in Hz.
params
InputParams
default:"None"
Voice and synthesis parameters. See InputParams below.

InputParams

Voice and synthesis settings shared by both service variants. Can be set at initialization via the params constructor argument, or changed at runtime via UpdateSettingsFrame.
emphasis
str
default:"None"
Emphasis level for speech ("strong", "moderate", "reduced").
language
Language
default:"Language.EN_US"
Language for synthesis.
pitch
str
default:"None"
Voice pitch adjustment (e.g., "+10%", "-5Hz", "high").
rate
str
default:"None"
Speech rate adjustment (e.g., "1.0", "1.25", "slow", "fast").
role
str
default:"None"
Voice role for expression (e.g., "YoungAdultFemale").
style
str
default:"None"
Speaking style (e.g., "cheerful", "sad", "excited").
style_degree
str
default:"None"
Intensity of the speaking style (0.01 to 2.0).
volume
str
default:"None"
Volume level (e.g., "+20%", "loud", "x-soft").

Usage

Basic Setup

import os

from pipecat.services.azure import AzureTTSService

tts = AzureTTSService(
    api_key=os.getenv("AZURE_SPEECH_API_KEY"),
    region=os.getenv("AZURE_SPEECH_REGION"),
    voice="en-US-SaraNeural",
)

With Voice Customization

import os

from pipecat.services.azure import AzureTTSService
from pipecat.transcriptions.language import Language

tts = AzureTTSService(
    api_key=os.getenv("AZURE_SPEECH_API_KEY"),
    region="eastus",
    voice="en-US-JennyMultilingualNeural",
    params=AzureTTSService.InputParams(
        language=Language.EN_US,
        style="cheerful",
        style_degree="1.5",
        rate="1.1",
    ),
)

HTTP Service

import os

from pipecat.services.azure import AzureHttpTTSService

tts = AzureHttpTTSService(
    api_key=os.getenv("AZURE_SPEECH_API_KEY"),
    region=os.getenv("AZURE_SPEECH_REGION"),
    voice="en-US-SaraNeural",
)

Notes

  • Streaming vs HTTP: The streaming service (AzureTTSService) provides word-level timestamps and lower latency, making it better for interactive conversations. The HTTP service is simpler but returns the complete audio at once.
  • SSML support: Both services automatically construct SSML from the InputParams settings. Special characters in text are automatically escaped.
  • Word timestamps: AzureTTSService supports word-level timestamps for synchronized text display. CJK languages receive special handling to merge individual characters into meaningful word units.
  • 8kHz workaround: At 8kHz sample rates, Azure’s reported audio duration may not match word boundary offsets. The service uses word boundary offsets for timing in this case.
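The SSML construction and escaping described in the notes can be sketched as follows. This is an illustrative approximation, not the services' actual implementation; build_ssml and its parameters are hypothetical, but the character escaping mirrors the behavior described above:

```python
from xml.sax.saxutils import escape


def build_ssml(text, voice="en-US-SaraNeural", style=None, rate=None):
    """Illustrative sketch of SSML assembly from voice settings."""
    # Special characters in the text are escaped so they can't break the XML.
    inner = escape(text)
    if rate:
        inner = f'<prosody rate="{rate}">{inner}</prosody>'
    if style:
        inner = f'<mstts:express-as style="{style}">{inner}</mstts:express-as>'
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">'
        f'<voice name="{voice}">{inner}</voice></speak>'
    )


ssml = build_ssml("AT&T says <hi>", style="cheerful", rate="1.1")
```

Text like "AT&T says <hi>" becomes "AT&amp;T says &lt;hi&gt;" inside the generated document, wrapped in prosody and express-as elements for the configured rate and style.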