Overview
OpenAIRealtimeLLMService provides real-time, multimodal conversation capabilities using OpenAI's Realtime API. It supports speech-to-speech interactions with integrated LLM processing, function calling, and advanced conversation management, all with low latency.
OpenAI Realtime API Reference
Pipecat’s API methods for OpenAI Realtime integration
Example Implementation
Complete OpenAI Realtime conversation example
OpenAI Documentation
Official OpenAI Realtime API documentation
OpenAI Platform
Access Realtime models and manage API keys
Installation
To use OpenAI Realtime services, install the required dependencies:

Prerequisites
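Assuming the service ships with the pipecat-ai package's openai extra (verify against your Pipecat version), a typical install is:

```shell
pip install "pipecat-ai[openai]"
```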
OpenAI Account Setup
Before using OpenAI Realtime services, you need:

- OpenAI Account: Sign up at OpenAI Platform
- API Key: Generate an OpenAI API key from your account dashboard
- Model Access: Ensure access to GPT-4o Realtime models
- Usage Limits: Configure appropriate usage limits and billing
Required Environment Variables
OPENAI_API_KEY: Your OpenAI API key for authentication
Key Features
- Real-time Speech-to-Speech: Direct audio processing with minimal latency
- Advanced Turn Detection: Multiple voice activity detection options including semantic detection
- Function Calling: Seamless support for external functions and APIs
- Voice Options: Multiple voice personalities and speaking styles
- Conversation Management: Intelligent context handling and conversation flow control
Configuration
OpenAIRealtimeLLMService
OpenAI API key for authentication.
OpenAI Realtime model name. This is a connection-level parameter set via the WebSocket URL and cannot be changed during the session.
WebSocket base URL for the Realtime API. Override for custom or proxied deployments.
Configuration properties for the realtime session. These are session-level settings that can be updated during the session (except for voice and model). See SessionProperties below.
Whether to start with audio input paused. Useful when you want to control when audio processing begins.
Whether to start with video input paused.
Detail level for video processing. Can be "auto", "low", or "high". "auto" lets the model decide, "low" is faster and uses fewer tokens, and "high" provides more detail.

SessionProperties
Session-level configuration passed via the session_properties constructor argument. These settings can be updated during the session using LLMUpdateSettingsFrame.
| Parameter | Type | Default | Description |
|---|---|---|---|
| output_modalities | List[Literal["text", "audio"]] | None | Modalities the model can respond with. The API supports single-modality responses: either ["text"] or ["audio"]. |
| instructions | str | None | System instructions for the assistant. |
| audio | AudioConfiguration | None | Configuration for input and output audio (format, transcription, turn detection, voice, speed). |
| tools | List[Dict] | None | Available function tools for the assistant. |
| tool_choice | Literal["auto", "none", "required"] | None | Tool usage strategy. |
| max_output_tokens | int \| Literal["inf"] | None | Maximum tokens in response, or "inf" for unlimited. |
| tracing | Literal["auto"] \| Dict | None | Configuration options for tracing. |
AudioConfiguration
The audio field in SessionProperties accepts an AudioConfiguration with input and output sub-configurations:
AudioInput (audio.input):
| Parameter | Type | Default | Description |
|---|---|---|---|
| format | AudioFormat | None | Input audio format (PCMAudioFormat, PCMUAudioFormat, or PCMAAudioFormat). |
| transcription | InputAudioTranscription | None | Transcription settings: model (e.g. "gpt-4o-transcribe"), language, and prompt. |
| noise_reduction | InputAudioNoiseReduction | None | Noise reduction type: "near_field" or "far_field". |
| turn_detection | TurnDetection \| SemanticTurnDetection \| bool | None | Turn detection config, or False to disable server-side turn detection. |
AudioOutput (audio.output):
| Parameter | Type | Default | Description |
|---|---|---|---|
| format | AudioFormat | None | Output audio format. |
| voice | str | None | Voice the model uses to respond (e.g. "alloy", "echo", "shimmer"). |
| speed | float | None | Speed of the model's spoken response. |
TurnDetection
Server-side VAD configuration via TurnDetection:
| Parameter | Type | Default | Description |
|---|---|---|---|
| type | Literal["server_vad"] | "server_vad" | Detection type. |
| threshold | float | 0.5 | Voice activity detection threshold (0.0-1.0). |
| prefix_padding_ms | int | 300 | Padding before speech starts, in milliseconds. |
| silence_duration_ms | int | 500 | Silence duration to detect speech end, in milliseconds. |
Use SemanticTurnDetection for semantic-based detection:
| Parameter | Type | Default | Description |
|---|---|---|---|
| type | Literal["semantic_vad"] | "semantic_vad" | Detection type. |
| eagerness | Literal["low", "medium", "high", "auto"] | None | Turn detection eagerness level. |
| create_response | bool | None | Whether to automatically create responses on turn detection. |
| interrupt_response | bool | None | Whether to interrupt ongoing responses on turn detection. |
Usage
Basic Setup
With Session Configuration
With Disabled Turn Detection (Manual Control)
Updating Settings at Runtime
Notes
- Model is connection-level: The model parameter is set via the WebSocket URL at connection time and cannot be changed during a session.
- Output modalities are single-mode: The API supports either ["text"] or ["audio"] output, not both simultaneously.
- Turn detection options: Use TurnDetection for traditional VAD, SemanticTurnDetection for AI-based turn detection, or False to disable server-side detection and manage turns manually.
- Audio output format: The service outputs 24kHz PCM audio by default.
- Video support: Video frames can be sent to the model for multimodal input. Control the detail level with video_frame_detail and pause/resume with set_video_input_paused().
- Transcription frames: User speech transcription frames are always emitted upstream when input audio transcription is configured.
Event Handlers
| Event | Description |
|---|---|
| on_conversation_item_created | Called when a new conversation item is created in the session |
| on_conversation_item_updated | Called when a conversation item is updated or completed |