Overview

OpenAIRealtimeLLMService provides real-time, multimodal conversation capabilities using OpenAI's Realtime API. It supports speech-to-speech interaction with integrated LLM processing, function calling, and advanced conversation management, all with low-latency responses.

Installation

To use OpenAI Realtime services, install the required dependencies:
pip install "pipecat-ai[openai]"

Prerequisites

OpenAI Account Setup

Before using OpenAI Realtime services, you need:
  1. OpenAI Account: Sign up at OpenAI Platform
  2. API Key: Generate an OpenAI API key from your account dashboard
  3. Model Access: Ensure access to GPT-4o Realtime models
  4. Usage Limits: Configure appropriate usage limits and billing

Required Environment Variables

  • OPENAI_API_KEY: Your OpenAI API key for authentication
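For local development, the key is typically exported in the shell before starting the bot (the key value below is a placeholder):

```shell
# Set the API key for the current shell session (placeholder value)
export OPENAI_API_KEY="sk-your-key-here"
```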

Key Features

  • Real-time Speech-to-Speech: Direct audio processing with minimal latency
  • Advanced Turn Detection: Multiple voice activity detection options including semantic detection
  • Function Calling: Seamless support for external functions and APIs
  • Voice Options: Multiple voice personalities and speaking styles
  • Conversation Management: Intelligent context handling and conversation flow control

Configuration

OpenAIRealtimeLLMService

| Parameter | Type | Default | Description |
|---|---|---|---|
| api_key | str | required | OpenAI API key for authentication. |
| model | str | "gpt-realtime" | OpenAI Realtime model name. This is a connection-level parameter set via the WebSocket URL and cannot be changed during the session. |
| base_url | str | "wss://api.openai.com/v1/realtime" | WebSocket base URL for the Realtime API. Override for custom or proxied deployments. |
| session_properties | SessionProperties | None | Configuration properties for the realtime session. These are session-level settings that can be updated during the session (except for voice and model). See SessionProperties below. |
| start_audio_paused | bool | False | Whether to start with audio input paused. Useful when you want to control when audio processing begins. |
| start_video_paused | bool | False | Whether to start with video input paused. |
| video_frame_detail | str | "auto" | Detail level for video processing: "auto" lets the model decide, "low" is faster and uses fewer tokens, "high" provides more detail. |

SessionProperties

Session-level configuration passed via the session_properties constructor argument. These settings can be updated during the session using LLMUpdateSettingsFrame.
| Parameter | Type | Default | Description |
|---|---|---|---|
| output_modalities | List[Literal["text", "audio"]] | None | Modalities the model can respond with. The API supports single-modality responses: either ["text"] or ["audio"]. |
| instructions | str | None | System instructions for the assistant. |
| audio | AudioConfiguration | None | Configuration for input and output audio (format, transcription, turn detection, voice, speed). |
| tools | List[Dict] | None | Available function tools for the assistant. |
| tool_choice | Literal["auto", "none", "required"] | None | Tool usage strategy. |
| max_output_tokens | int \| Literal["inf"] | None | Maximum tokens in the response, or "inf" for unlimited. |
| tracing | Literal["auto"] \| Dict | None | Configuration options for tracing. |
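The tools field takes plain function-tool dictionaries. As an illustrative sketch (the exact schema is defined by OpenAI's Realtime API, and the get_weather function here is hypothetical), a tool entry looks like:

```python
# A hypothetical function tool to pass via SessionProperties(tools=[...]).
# Realtime API function tools are flat dicts with a JSON Schema "parameters" field.
get_weather_tool = {
    "type": "function",
    "name": "get_weather",
    "description": "Look up the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name, e.g. 'Berlin'"},
        },
        "required": ["city"],
    },
}
```

A list of such dicts would be passed as tools=[get_weather_tool], with tool_choice="auto" letting the model decide when to call it.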

AudioConfiguration

The audio field in SessionProperties accepts an AudioConfiguration with input and output sub-configurations.

AudioInput (audio.input):

| Parameter | Type | Default | Description |
|---|---|---|---|
| format | AudioFormat | None | Input audio format (PCMAudioFormat, PCMUAudioFormat, or PCMAAudioFormat). |
| transcription | InputAudioTranscription | None | Transcription settings: model (e.g. "gpt-4o-transcribe"), language, and prompt. |
| noise_reduction | InputAudioNoiseReduction | None | Noise reduction type: "near_field" or "far_field". |
| turn_detection | TurnDetection \| SemanticTurnDetection \| bool | None | Turn detection config, or False to disable server-side turn detection. |

AudioOutput (audio.output):

| Parameter | Type | Default | Description |
|---|---|---|---|
| format | AudioFormat | None | Output audio format. |
| voice | str | None | Voice the model uses to respond (e.g. "alloy", "echo", "shimmer"). |
| speed | float | None | Speed of the model's spoken response. |

TurnDetection

Server-side VAD configuration via TurnDetection:

| Parameter | Type | Default | Description |
|---|---|---|---|
| type | Literal["server_vad"] | "server_vad" | Detection type. |
| threshold | float | 0.5 | Voice activity detection threshold (0.0-1.0). |
| prefix_padding_ms | int | 300 | Padding before speech starts, in milliseconds. |
| silence_duration_ms | int | 500 | Silence duration to detect the end of speech, in milliseconds. |

Alternatively, use SemanticTurnDetection for semantic-based detection:

| Parameter | Type | Default | Description |
|---|---|---|---|
| type | Literal["semantic_vad"] | "semantic_vad" | Detection type. |
| eagerness | Literal["low", "medium", "high", "auto"] | None | Turn detection eagerness level. |
| create_response | bool | None | Whether to automatically create responses on turn detection. |
| interrupt_response | bool | None | Whether to interrupt ongoing responses on turn detection. |
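To build intuition for how threshold and silence_duration_ms interact, here is a toy, pure-Python end-of-turn detector over per-frame energy values. It only illustrates server_vad semantics; it is not the actual server implementation:

```python
def detect_turn_end(energies, threshold=0.5, frame_ms=100, silence_duration_ms=500):
    """Return the index of the frame where a turn ends, or None.

    A turn ends once speech (energy >= threshold) has been observed and
    energy then stays below threshold for silence_duration_ms.
    """
    frames_needed = silence_duration_ms // frame_ms  # consecutive silent frames
    speaking = False
    silent_run = 0
    for i, energy in enumerate(energies):
        if energy >= threshold:
            speaking = True
            silent_run = 0
        elif speaking:
            silent_run += 1
            if silent_run >= frames_needed:
                return i
    return None

# Speech for 3 frames, then silence: the turn ends once 500 ms (5 frames) of
# silence have accumulated.
energies = [0.7, 0.8, 0.6, 0.1, 0.1, 0.1, 0.1, 0.1]
print(detect_turn_end(energies))  # → 7
```

Raising threshold makes the detector less sensitive to quiet speech, while raising silence_duration_ms tolerates longer pauses before ending the turn.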

Usage

Basic Setup

import os
from pipecat.services.openai.realtime import OpenAIRealtimeLLMService

llm = OpenAIRealtimeLLMService(
    api_key=os.getenv("OPENAI_API_KEY"),
    model="gpt-4o-realtime-preview",
)

With Session Configuration

import os
from pipecat.services.openai.realtime import OpenAIRealtimeLLMService
from pipecat.services.openai.realtime.events import (
    SessionProperties,
    AudioConfiguration,
    AudioInput,
    AudioOutput,
    InputAudioTranscription,
    SemanticTurnDetection,
)

session_properties = SessionProperties(
    instructions="You are a helpful assistant.",
    audio=AudioConfiguration(
        input=AudioInput(
            transcription=InputAudioTranscription(model="gpt-4o-transcribe"),
            turn_detection=SemanticTurnDetection(eagerness="medium"),
        ),
        output=AudioOutput(
            voice="alloy",
            speed=1.0,
        ),
    ),
    max_output_tokens=4096,
)

llm = OpenAIRealtimeLLMService(
    api_key=os.getenv("OPENAI_API_KEY"),
    model="gpt-4o-realtime-preview",
    session_properties=session_properties,
)

With Disabled Turn Detection (Manual Control)

import os
from pipecat.services.openai.realtime import OpenAIRealtimeLLMService
from pipecat.services.openai.realtime.events import (
    SessionProperties,
    AudioConfiguration,
    AudioInput,
)

session_properties = SessionProperties(
    audio=AudioConfiguration(
        input=AudioInput(
            turn_detection=False,
        ),
    ),
)

llm = OpenAIRealtimeLLMService(
    api_key=os.getenv("OPENAI_API_KEY"),
    model="gpt-4o-realtime-preview",
    session_properties=session_properties,
)

Updating Settings at Runtime

from pipecat.frames.frames import LLMUpdateSettingsFrame

await task.queue_frame(
    LLMUpdateSettingsFrame(
        settings={
            "instructions": "Now speak in Spanish.",
            "max_output_tokens": 2048,
        }
    )
)

Notes

  • Model is connection-level: The model parameter is set via the WebSocket URL at connection time and cannot be changed during a session.
  • Output modalities are single-mode: The API supports either ["text"] or ["audio"] output, not both simultaneously.
  • Turn detection options: Use TurnDetection for traditional VAD, SemanticTurnDetection for AI-based turn detection, or False to disable server-side detection and manage turns manually.
  • Audio output format: The service outputs 24kHz PCM audio by default.
  • Video support: Video frames can be sent to the model for multimodal input. Control the detail level with video_frame_detail and pause/resume with set_video_input_paused().
  • Transcription frames: User speech transcription frames are always emitted upstream when input audio transcription is configured.
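The 24kHz PCM default has direct buffer-sizing implications. Assuming 16-bit mono samples (an assumption made here for illustration), the arithmetic works out as:

```python
# Bytes per second of 24 kHz, 16-bit (2-byte), mono PCM audio.
SAMPLE_RATE_HZ = 24_000
BYTES_PER_SAMPLE = 2  # 16-bit PCM
CHANNELS = 1

bytes_per_second = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS
bytes_per_20ms_frame = bytes_per_second * 20 // 1000

print(bytes_per_second)      # → 48000 bytes/s
print(bytes_per_20ms_frame)  # → 960 bytes per 20 ms frame
```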

Event Handlers

| Event | Description |
|---|---|
| on_conversation_item_created | Called when a new conversation item is created in the session |
| on_conversation_item_updated | Called when a conversation item is updated or completed |

@llm.event_handler("on_conversation_item_created")
async def on_item_created(service, item_id, item):
    print(f"New conversation item: {item_id}")

@llm.event_handler("on_conversation_item_updated")
async def on_item_updated(service, item_id, item):
    print(f"Conversation item updated: {item_id}")