Overview
GladiaSTTService provides real-time speech recognition using Gladia’s WebSocket API. It supports 99+ languages, custom vocabulary, real-time translation, sentiment analysis, and audio pre-processing options such as audio enhancement.
Installation
To use Gladia services, install the required dependency:
Prerequisites
Gladia Account Setup
Before using Gladia STT services, you need:
- Gladia Account: Sign up at Gladia
- API Key: Generate an API key from your account dashboard
- Region Selection: Choose your preferred region (EU-West or US-West)
Required Environment Variables
- GLADIA_API_KEY: Your Gladia API key for authentication
- GLADIA_REGION: Your preferred region (optional; defaults to “eu-west”)
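If you load these variables in your own startup code before constructing the service, a small helper keeps the default-region rule in one place. This is a minimal sketch, assuming you read the environment yourself and pass the values on; the helper name `load_gladia_settings` is hypothetical, and the "us-west" spelling is an assumption by analogy with the documented "eu-west" default.

```python
import os
from typing import Mapping, Optional, Tuple

# Region identifiers: "eu-west" is the documented default; "us-west" is assumed.
SUPPORTED_REGIONS = {"eu-west", "us-west"}


def load_gladia_settings(env: Optional[Mapping[str, str]] = None) -> Tuple[str, str]:
    """Read GLADIA_API_KEY (required) and GLADIA_REGION (optional, default "eu-west")."""
    env = os.environ if env is None else env
    api_key = env.get("GLADIA_API_KEY")
    if not api_key:
        raise RuntimeError("GLADIA_API_KEY is not set")
    # An unset or empty GLADIA_REGION falls back to the documented default.
    region = env.get("GLADIA_REGION") or "eu-west"
    if region not in SUPPORTED_REGIONS:
        raise ValueError(f"Unsupported GLADIA_REGION: {region!r}")
    return api_key, region


# Example with an explicit mapping instead of the real process environment.
settings = load_gladia_settings({"GLADIA_API_KEY": "test-key"})  # ('test-key', 'eu-west')
```

Failing fast on a missing key surfaces configuration mistakes at startup rather than at the first connection attempt.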
Configuration
GladiaSTTService
Gladia API key for authentication.
Region used to process audio. Defaults to "eu-west" when None.
Gladia API URL for session initialization.
Minimum confidence threshold for transcriptions (0.0-1.0). Deprecated: the value is ignored and no confidence threshold is applied.
Audio sample rate in Hz. When None, uses the pipeline’s configured sample rate.
Model to use for transcription.
Additional configuration parameters for the Gladia service. See GladiaInputParams below.
Maximum size of the audio buffer in bytes (default: 20 MB).
Whether the bot should be interrupted when Gladia VAD detects user speech.
Expected P99 latency, in seconds, from the end of speech to the final transcript. Override this to match your deployment.
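To get a feel for the default audio buffer cap, it helps to convert bytes into seconds of audio. The sketch below assumes "20MB" means 20 * 1024 * 1024 bytes and that the buffer holds uncompressed PCM at the GladiaInputParams defaults (16-bit, mono); the function name `buffer_seconds` is hypothetical.

```python
def buffer_seconds(max_bytes: int, sample_rate: int, bit_depth: int = 16, channels: int = 1) -> float:
    """Seconds of raw PCM audio that fit in max_bytes."""
    bytes_per_second = sample_rate * (bit_depth // 8) * channels
    return max_bytes / bytes_per_second


DEFAULT_MAX_BUFFER = 20 * 1024 * 1024  # assumed meaning of "20MB"

# 16 kHz, 16-bit, mono audio is 32,000 bytes per second.
seconds_at_16k = buffer_seconds(DEFAULT_MAX_BUFFER, 16000)
print(round(seconds_at_16k, 2))  # 655.36
```

At 16 kHz the default cap holds roughly 11 minutes of audio; doubling the sample rate halves that.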
GladiaInputParams
Parameters passed via the params constructor argument. Import directly:
| Parameter | Type | Default | Description |
|---|---|---|---|
| encoding | str | "wav/pcm" | Audio encoding format. |
| bit_depth | int | 16 | Audio bit depth. |
| channels | int | 1 | Number of audio channels. |
| custom_metadata | Dict[str, Any] | None | Additional metadata to include with requests. |
| endpointing | float | None | Silence duration in seconds to mark end of speech. |
| maximum_duration_without_endpointing | int | 5 | Maximum utterance duration (seconds) without silence. |
| language | Language | None | Language code for transcription. Deprecated; use language_config instead. |
| language_config | LanguageConfig | None | Detailed language configuration with code switching support. |
| pre_processing | PreProcessingConfig | None | Audio pre-processing options (audio enhancer, speech threshold). |
| realtime_processing | RealtimeProcessingConfig | None | Real-time processing features (custom vocabulary, translation, NER, sentiment). |
| messages_config | MessagesConfig | None | WebSocket message filtering options. |
| enable_vad | bool | False | Enable Gladia VAD for end-of-utterance detection. Do not combine with another VAD in the pipeline. |
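One way to picture how the flat parameters in the table could end up in the session-initialization request is to serialize them and drop anything left unset. This is a sketch only: `GladiaParamsSketch`, `build_init_payload`, the flat JSON shape, and the placeholder model name are all hypothetical, and the nested config objects, `language`, and `enable_vad` are left out. Check Gladia's API reference for the exact request schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict, Optional


@dataclass
class GladiaParamsSketch:
    """Hypothetical stand-in mirroring the flat fields of the GladiaInputParams table."""

    encoding: str = "wav/pcm"
    bit_depth: int = 16
    channels: int = 1
    custom_metadata: Optional[Dict[str, Any]] = None
    endpointing: Optional[float] = None
    maximum_duration_without_endpointing: int = 5


def build_init_payload(params: GladiaParamsSketch, sample_rate: int, model: str) -> str:
    """Serialize params into a JSON body for the session-initialization POST (sketch)."""
    # In this sketch, options left as None are omitted from the body entirely.
    body = {k: v for k, v in asdict(params).items() if v is not None}
    # Sample rate and model are constructor arguments, not part of params.
    body["sample_rate"] = sample_rate
    body["model"] = model
    return json.dumps(body, sort_keys=True)


payload = json.loads(build_init_payload(GladiaParamsSketch(endpointing=0.3), 16000, "example-model"))
```

Omitting unset fields keeps the request minimal; whether the real service sends explicit defaults instead is not stated here.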
Usage
Basic Setup
With Language Configuration
With Real-time Processing
Notes
- Session-based connection: Gladia uses a two-step connection process: first an HTTP POST to initialize a session, then a WebSocket connection to the returned session URL. The session URL and ID are managed automatically.
- Audio buffering: The service buffers audio data locally and sends it when connected. If the connection drops and reconnects, buffered audio is automatically re-sent to minimize transcript gaps.
- Keepalive: Empty audio chunks are sent periodically to keep the Gladia connection alive (keepalive interval: 5s, timeout: 20s).
- Built-in VAD: Set enable_vad=True in GladiaInputParams to use Gladia’s server-side VAD, which emits UserStartedSpeakingFrame and UserStoppedSpeakingFrame. When using this, do not enable another VAD in your pipeline.
- Translation: Gladia supports real-time translation to multiple target languages. Translation results are pushed as TranslationFrames.
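The session-based connection note describes a two-step handshake. The sketch below expresses that flow with injected transport functions so it can run without network access. The callables `post_json` and `connect_ws`, the example URLs, and the assumption that the POST response carries "id" and "url" keys are all hypothetical; a real client would use an HTTP library and authenticate with the API key.

```python
from typing import Any, Callable, Dict, Tuple


def open_session(
    post_json: Callable[[str, Dict[str, Any]], Dict[str, Any]],
    connect_ws: Callable[[str], Any],
    init_url: str,
    settings: Dict[str, Any],
) -> Tuple[str, Any]:
    """Two-step connect: HTTP POST to create a session, then a WebSocket to its URL."""
    # Step 1: initialize the session; the response is assumed to carry "id" and "url".
    response = post_json(init_url, settings)
    session_id, session_url = response["id"], response["url"]
    # Step 2: connect to the session-specific WebSocket URL returned in step 1.
    websocket = connect_ws(session_url)
    return session_id, websocket


# Fake transports that record the call order instead of touching the network.
calls = []


def fake_post(url: str, body: Dict[str, Any]) -> Dict[str, Any]:
    calls.append(("POST", url))
    return {"id": "sess-123", "url": "wss://example.invalid/sess-123"}


def fake_connect(url: str) -> str:
    calls.append(("WS", url))
    return f"socket:{url}"


session_id, ws = open_session(fake_post, fake_connect, "https://example.invalid/init", {"encoding": "wav/pcm"})
```

Injecting the transports keeps the ordering guarantee (POST before WebSocket) testable on its own.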
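The audio buffering note says buffered audio is re-sent after a reconnect. A bounded buffer with a replay method captures the idea. This is a sketch under stated assumptions, not Pipecat's implementation: the class name `ReplayBuffer`, the policy of dropping the oldest bytes once the cap is exceeded, and the alignment to whole 16-bit samples are all choices made for illustration.

```python
from typing import Callable


class ReplayBuffer:
    """Hypothetical bounded audio buffer that can be re-sent after a reconnect."""

    def __init__(self, max_size: int = 20 * 1024 * 1024, frame_bytes: int = 2):
        self.max_size = max_size
        self.frame_bytes = frame_bytes  # 2 bytes = one 16-bit mono sample
        self._buf = bytearray()

    def append(self, chunk: bytes) -> None:
        self._buf.extend(chunk)
        overflow = len(self._buf) - self.max_size
        if overflow > 0:
            # Drop the oldest bytes, rounded up to whole samples so PCM stays aligned.
            overflow += (-overflow) % self.frame_bytes
            del self._buf[:overflow]

    def replay(self, send: Callable[[bytes], None]) -> int:
        """Re-send everything still buffered, e.g. right after reconnecting."""
        data = bytes(self._buf)
        if data:
            send(data)
        return len(data)


buffer = ReplayBuffer(max_size=4)
buffer.append(b"\x01\x02\x03")
buffer.append(b"\x04\x05\x06")  # exceeds the cap, so the two oldest bytes are dropped
resent = []
count = buffer.replay(resent.append)
```

Rounding the trim up to a sample boundary avoids shifting every later sample by one byte, which would turn 16-bit PCM into noise.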
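The keepalive note mentions empty audio chunks sent on a 5s interval. The asyncio sketch below sends an empty chunk each time an interval passes without being told to stop. The function name `keepalive_loop` and the injected `send` coroutine are hypothetical, and the 20s timeout is not modeled because the note does not say what happens when it expires.

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple


async def keepalive_loop(
    send: Callable[[bytes], Awaitable[None]],
    stop: asyncio.Event,
    interval: float = 5.0,
) -> int:
    """Send an empty audio chunk every `interval` seconds until `stop` is set."""
    sent = 0
    while not stop.is_set():
        try:
            # Wake early if stop is set; otherwise time out after one interval.
            await asyncio.wait_for(stop.wait(), timeout=interval)
        except asyncio.TimeoutError:
            await send(b"")
            sent += 1
    return sent


async def demo() -> Tuple[int, List[bytes]]:
    stop = asyncio.Event()
    chunks: List[bytes] = []

    async def fake_send(chunk: bytes) -> None:
        chunks.append(chunk)
        if len(chunks) == 3:
            stop.set()  # end the demo after three keepalives

    count = await keepalive_loop(fake_send, stop, interval=0.01)
    return count, chunks


result = asyncio.run(demo())
```

Waiting on the stop event with a timeout, rather than a plain sleep, lets shutdown interrupt the loop immediately instead of after up to one full interval.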
Event Handlers
Gladia STT supports the standard service connection events:
| Event | Description |
|---|---|
| on_connected | Connected to Gladia WebSocket |
| on_disconnected | Disconnected from Gladia WebSocket |
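Pipecat examples typically register async handlers for these events with an `event_handler` decorator on the service object; check the Pipecat API reference for the exact call. The toy class below is a self-contained stand-in, not Pipecat's implementation. It shows the dispatch pattern: handlers are stored per event name and awaited in registration order, each receiving the service instance. `ToyEventSource` and `emit` are hypothetical names.

```python
import asyncio
from collections import defaultdict
from typing import Awaitable, Callable, DefaultDict, List

Handler = Callable[["ToyEventSource"], Awaitable[None]]


class ToyEventSource:
    """Hypothetical stand-in for a service that emits connection events."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def event_handler(self, name: str) -> Callable[[Handler], Handler]:
        # Decorator that registers an async handler under an event name.
        def register(fn: Handler) -> Handler:
            self._handlers[name].append(fn)
            return fn

        return register

    async def emit(self, name: str) -> None:
        # Await every handler for this event, in registration order.
        for fn in self._handlers[name]:
            await fn(self)


stt = ToyEventSource()
log: List[str] = []


@stt.event_handler("on_connected")
async def on_connected(service: ToyEventSource) -> None:
    log.append("connected")


@stt.event_handler("on_disconnected")
async def on_disconnected(service: ToyEventSource) -> None:
    log.append("disconnected")


async def simulate() -> None:
    await stt.emit("on_connected")
    await stt.emit("on_disconnected")


asyncio.run(simulate())
```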