Overview
SonioxSTTService provides real-time speech-to-text transcription using Soniox’s WebSocket API. It supports over 60 languages, custom context, and multiple languages within the same conversation, with advanced features for accurate multilingual transcription.
By default, Soniox uses the stt-rt-v4 model with vad_force_turn_endpoint=True, which disables Soniox’s native turn detection and relies on Pipecat’s local VAD to finalize transcripts. This configuration significantly reduces the time to final segment (~250ms median). Pipecat enables smart-turn detection by default using LocalSmartTurnAnalyzerV3. To use Soniox’s native turn detection instead, set vad_force_turn_endpoint=False.
Soniox STT API Reference
Pipecat’s API methods for Soniox STT integration
Example Implementation
Complete example with interruption handling
Soniox Documentation
Official Soniox documentation and features
Soniox Console
Access multilingual models and API keys
Installation
To use Soniox services, install the required dependencies, for example `pip install "pipecat-ai[soniox]"`.

Prerequisites
Soniox Account Setup
Before using Soniox STT services, you need:

- Soniox Account: Sign up at Soniox Console
- API Key: Generate an API key from your console dashboard
- Language Selection: Choose from 60+ supported languages and models
Required Environment Variables
SONIOX_API_KEY: Your Soniox API key for authentication
Configuration
SonioxSTTService
Constructor parameters:

- `api_key`: Soniox API key for authentication.
- `url`: Soniox WebSocket API URL.
- `sample_rate`: Audio sample rate in Hz. When `None`, uses the pipeline’s configured sample rate.
- `params`: Configuration parameters for model, language, and features. See SonioxInputParams below.
- `vad_force_turn_endpoint`: Listen to VADUserStoppedSpeakingFrame to send a finalize message to Soniox. When enabled, Pipecat’s local VAD triggers transcript finalization. When disabled, Soniox detects the end of speech natively.

SonioxInputParams

Settings that can be set at initialization via the `params` constructor argument.
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | "stt-rt-v4" | Model to use for transcription. |
| audio_format | str | "pcm_s16le" | Audio format for transcription. |
| num_channels | int | 1 | Number of audio channels. |
| language_hints | list[Language] | None | Language hints for transcription. Helps the model prioritize expected languages. |
| language_hints_strict | bool | None | If true, strictly enforce language hints (only transcribe in provided languages). |
| context | SonioxContextObject \| str | None | Customization for transcription. String for models with context_version 1, SonioxContextObject for context_version 2 (stt-rt-v3-preview and higher). |
| enable_speaker_diarization | bool | False | Enable speaker diarization. Tokens are annotated with speaker IDs. |
| enable_language_identification | bool | False | Enable language identification. Tokens are annotated with language IDs. |
| client_reference_id | str | None | Client reference ID for transcription tracking. |
Usage
Basic Setup
With Language Hints and Context
With Context Object (v3+ models)
With Soniox Native Turn Detection
Notes
- Turn finalization: By default (vad_force_turn_endpoint=True), when Pipecat’s VAD detects the user has stopped speaking, a finalize message is sent to Soniox to get the final transcript immediately. This significantly reduces latency.
- Keepalive: The service automatically sends protocol-level keepalive messages to maintain the WebSocket connection.
- Context versions: Use a string for context with older models (context_version 1) and SonioxContextObject for newer models (stt-rt-v3-preview and higher, context_version 2). See the Soniox context documentation for details.
Event Handlers
Soniox STT supports the standard service connection events:

| Event | Description |
|---|---|
| on_connected | Connected to Soniox WebSocket |
| on_disconnected | Disconnected from Soniox WebSocket |