Overview
Azure Cognitive Services provides high-quality text-to-speech synthesis through two service implementations: AzureTTSService (WebSocket-based) for real-time streaming with low latency, and AzureHttpTTSService (HTTP-based) for batch synthesis. AzureTTSService is recommended for interactive applications that require streaming.
Azure TTS API Reference
Pipecat’s API methods for Azure TTS integration
Example Implementation
Complete example with streaming synthesis
Azure Speech Documentation
Official Azure Speech Services documentation
Voice Gallery
Browse available voices and languages
Installation
To use Azure services, install the required Azure dependencies for Pipecat.
Prerequisites
Azure Account Setup
Before using Azure TTS services, you need:
- Azure Account: Sign up at Azure Portal
- Speech Service: Create a Speech resource in your Azure subscription
- API Key and Region: Get your subscription key and service region
- Voice Selection: Choose from available voices in the Voice Gallery
Required Environment Variables
- AZURE_SPEECH_API_KEY: Your Azure Speech service API key
- AZURE_SPEECH_REGION: Your Azure Speech service region (e.g., "eastus")
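Taken together, setup typically looks like the following. The pip extra name and the placeholder credential values are assumptions; verify the extra against Pipecat's installation docs and substitute your own key and region.

```shell
# Install Pipecat with the Azure extra (extra name assumed; check Pipecat docs)
pip install "pipecat-ai[azure]"

# Configure credentials for your Speech resource
export AZURE_SPEECH_API_KEY="your-subscription-key"
export AZURE_SPEECH_REGION="eastus"
```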
Configuration
AzureTTSService
- api_key (str, required): Azure Cognitive Services subscription key.
- region (str, required): Azure region identifier (e.g., "eastus", "westus2").
- voice (str): Voice name to use for synthesis.
- sample_rate (int, optional): Output audio sample rate in Hz. When None, uses the pipeline's configured sample rate.
- aggregate_sentences (bool): Whether to aggregate sentences before synthesis.
- params (InputParams, optional): Runtime-configurable voice and synthesis settings. See InputParams below.
AzureHttpTTSService
The HTTP service accepts the same parameters as the streaming service except aggregate_sentences:
- api_key (str, required): Azure Cognitive Services subscription key.
- region (str, required): Azure region identifier.
- voice (str): Voice name to use for synthesis.
- sample_rate (int, optional): Output audio sample rate in Hz.
- params (InputParams, optional): Voice and synthesis parameters. See InputParams below.
InputParams
Voice and synthesis settings shared by both service variants. These can be set at initialization via the params constructor argument, or changed at runtime via an UpdateSettingsFrame.
| Parameter | Type | Default | Description |
|---|---|---|---|
| emphasis | str | None | Emphasis level for speech ("strong", "moderate", "reduced"). |
| language | Language | Language.EN_US | Language for synthesis. |
| pitch | str | None | Voice pitch adjustment (e.g., "+10%", "-5Hz", "high"). |
| rate | str | None | Speech rate adjustment (e.g., "1.0", "1.25", "slow", "fast"). |
| role | str | None | Voice role for expression (e.g., "YoungAdultFemale"). |
| style | str | None | Speaking style (e.g., "cheerful", "sad", "excited"). |
| style_degree | str | None | Intensity of the speaking style (0.01 to 2.0). |
| volume | str | None | Volume level (e.g., "+20%", "loud", "x-soft"). |
Usage
Basic Setup
With Voice Customization
HTTP Service
Notes
- Streaming vs HTTP: The streaming service (AzureTTSService) provides word-level timestamps and lower latency, making it better for interactive conversations. The HTTP service is simpler but returns the complete audio at once.
- SSML support: Both services automatically construct SSML from the InputParams settings. Special characters in text are automatically escaped.
- Word timestamps: AzureTTSService supports word-level timestamps for synchronized text display. CJK languages receive special handling to merge individual characters into meaningful word units.
- 8kHz workaround: At 8 kHz sample rates, Azure's reported audio duration may not match word boundary offsets; the service uses word boundary offsets for timing in this case.
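To illustrate the SSML construction and escaping described above, here is a standalone sketch. This is not Pipecat's internal implementation; the function name and the subset of settings handled are hypothetical, but the escaping via xml.sax.saxutils and the Azure SSML element names match Azure's documented format.

```python
from typing import Optional
from xml.sax.saxutils import escape


def build_ssml(
    text: str,
    voice: str,
    style: Optional[str] = None,
    rate: Optional[str] = None,
) -> str:
    """Assemble an Azure-style SSML document from a few voice settings.

    Illustrative only: a real implementation would cover all InputParams fields.
    """
    # Escape &, <, > so arbitrary text cannot break the XML structure.
    inner = escape(text)
    if rate:
        inner = f'<prosody rate="{rate}">{inner}</prosody>'
    if style:
        inner = f'<mstts:express-as style="{style}">{inner}</mstts:express-as>'
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">'
        f'<voice name="{voice}">{inner}</voice></speak>'
    )


ssml = build_ssml("Fish & chips", "en-US-JennyNeural", style="cheerful", rate="1.25")
# The ampersand is escaped to &amp; and the style wraps the text in mstts:express-as.
```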