Overview
OpenAI provides two STT service implementations:

- `OpenAISTTService`: VAD-segmented speech recognition using OpenAI's transcription API (HTTP-based), supporting GPT-4o transcription and Whisper models
- `OpenAIRealtimeSTTService`: real-time streaming speech-to-text using OpenAI's Realtime API WebSocket transcription sessions, with support for local VAD and server-side VAD modes
OpenAI STT API Reference
Pipecat’s API methods for OpenAI STT integration
Example Implementation
Complete example with OpenAI ecosystem integration
OpenAI Documentation
Official OpenAI transcription documentation and features
OpenAI Platform
Access API keys and transcription models
Installation
To use OpenAI services, install the required dependency:

Prerequisites
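A sketch of the install, assuming Pipecat's extras-based packaging (package `pipecat-ai` with an `openai` extra):

```shell
# Installs Pipecat with its OpenAI dependencies (extras name assumed)
pip install "pipecat-ai[openai]"
```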
OpenAI Account Setup
Before using OpenAI STT services, you need:
- OpenAI Account: Sign up at the OpenAI Platform
- API Key: Generate an API key from your account dashboard
- Model Access: Ensure access to Whisper and GPT-4o transcription models
Required Environment Variables
OPENAI_API_KEY: Your OpenAI API key for authentication
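For example, in your shell environment (placeholder value shown):

```shell
# Placeholder value; use the API key generated from your OpenAI dashboard
export OPENAI_API_KEY="your-api-key"
```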
Configuration
OpenAISTTService
Uses VAD-based audio segmentation with HTTP transcription requests. It records speech segments detected by local VAD and sends them to OpenAI's transcription API. Configurable options:

- Transcription model to use. Options include `"gpt-4o-transcribe"`, `"gpt-4o-mini-transcribe"`, and `"whisper-1"`.
- OpenAI API key. Falls back to the `OPENAI_API_KEY` environment variable.
- API base URL. Override for custom or proxied deployments.
- Language of the audio input.
- Optional text to guide the model's style or continue a previous segment.
- Sampling temperature between 0 and 1. Lower values produce more deterministic results.
OpenAIRealtimeSTTService
Provides real-time streaming speech-to-text using OpenAI's Realtime API WebSocket transcription sessions. Audio is streamed continuously over a WebSocket connection for lower latency than HTTP-based transcription. Configurable options:

- OpenAI API key for authentication.
- Transcription model. Supported values are `"gpt-4o-transcribe"` and `"gpt-4o-mini-transcribe"`.
- WebSocket base URL for the Realtime API.
- Language of the audio input.
- Optional prompt text to guide transcription style or provide keyword hints.
- Server-side VAD configuration. Defaults to `False` (disabled), which relies on a local VAD processor in the pipeline. Pass `None` to use server defaults (`server_vad`), or a dict with custom settings (e.g. `{"type": "server_vad", "threshold": 0.5}`).
- Noise reduction mode: `"near_field"` for close microphones, `"far_field"` for distant microphones, or `None` to disable.
- Whether to interrupt bot output when speech is detected by server-side VAD. Only applies when turn detection is enabled.
Usage
OpenAISTTService
OpenAIRealtimeSTTService with Local VAD
OpenAIRealtimeSTTService with Server-Side VAD
Notes
- Local VAD vs server-side VAD: `OpenAIRealtimeSTTService` defaults to local VAD mode (`turn_detection=False`), where a local VAD processor in the pipeline controls when audio is committed for transcription. Set `turn_detection=None` for server-side VAD, but do not use a separate VAD processor in the pipeline in that mode.
- Automatic resampling: `OpenAIRealtimeSTTService` automatically resamples audio to 24 kHz as required by the Realtime API, regardless of the pipeline's sample rate.
- Segmented vs streaming: `OpenAISTTService` processes complete audio segments (after VAD detects silence) via HTTP; `OpenAIRealtimeSTTService` streams audio continuously over WebSocket for lower latency.
- Interim transcriptions: `OpenAIRealtimeSTTService` produces interim transcriptions via delta events, while `OpenAISTTService` only produces final transcriptions.
Event Handlers
`OpenAIRealtimeSTTService` supports the standard service connection events:
| Event | Description |
|---|---|
| `on_connected` | Connected to OpenAI Realtime WebSocket |
| `on_disconnected` | Disconnected from OpenAI Realtime WebSocket |
`OpenAISTTService` uses HTTP requests and does not have WebSocket connection events.