
Overview

OpenAI provides two STT service implementations:
  • OpenAISTTService for VAD-segmented speech recognition using OpenAI’s transcription API (HTTP-based), supporting GPT-4o transcription and Whisper models
  • OpenAIRealtimeSTTService for real-time streaming speech-to-text using OpenAI’s Realtime API WebSocket transcription sessions, with support for local VAD and server-side VAD modes

Installation

To use OpenAI services, install the required dependency:
pip install "pipecat-ai[openai]"

Prerequisites

OpenAI Account Setup

Before using OpenAI STT services, you need:
  1. OpenAI Account: Sign up at OpenAI Platform
  2. API Key: Generate an API key from your account dashboard
  3. Model Access: Ensure access to Whisper and GPT-4o transcription models

Required Environment Variables

  • OPENAI_API_KEY: Your OpenAI API key for authentication

Configuration

OpenAISTTService

Uses VAD-based audio segmentation with HTTP transcription requests. It records speech segments detected by local VAD and sends them to OpenAI’s transcription API.
  • model (str, default "gpt-4o-transcribe"): Transcription model to use. Options include "gpt-4o-transcribe", "gpt-4o-mini-transcribe", and "whisper-1".
  • api_key (str, default None): OpenAI API key. Falls back to the OPENAI_API_KEY environment variable.
  • base_url (str, default None): API base URL. Override for custom or proxied deployments.
  • language (Language, default Language.EN): Language of the audio input.
  • prompt (str, default None): Optional text to guide the model's style or continue a previous segment.
  • temperature (float, default None): Sampling temperature between 0 and 1. Lower values produce more deterministic results.
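The optional parameters above can be combined in a single constructor call. A hedged sketch (the Language import path assumes pipecat.transcriptions.language; the language choice, prompt text, and temperature value are illustrative, not recommendations):

```python
import os

from pipecat.services.openai.stt import OpenAISTTService
from pipecat.transcriptions.language import Language

# Whisper transcription of French audio, with keyword hints and
# deterministic sampling (all values illustrative).
stt = OpenAISTTService(
    api_key=os.getenv("OPENAI_API_KEY"),
    model="whisper-1",
    language=Language.FR,
    prompt="Pipecat, WebRTC, VAD",
    temperature=0.0,
)
```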

OpenAIRealtimeSTTService

Provides real-time streaming speech-to-text using OpenAI’s Realtime API WebSocket transcription sessions. Audio is streamed continuously over a WebSocket connection for lower latency compared to HTTP-based transcription.
  • api_key (str, required): OpenAI API key for authentication.
  • model (str, default "gpt-4o-transcribe"): Transcription model. Supported values are "gpt-4o-transcribe" and "gpt-4o-mini-transcribe".
  • base_url (str, default "wss://api.openai.com/v1/realtime"): WebSocket base URL for the Realtime API.
  • language (Language, default Language.EN): Language of the audio input.
  • prompt (str, default None): Optional prompt text to guide transcription style or provide keyword hints.
  • turn_detection (dict | Literal[False], default False): Server-side VAD configuration. Defaults to False (disabled), which relies on a local VAD processor in the pipeline. Pass None to use server defaults (server_vad), or a dict with custom settings (e.g. {"type": "server_vad", "threshold": 0.5}).
  • noise_reduction (str, default None): Noise reduction mode. "near_field" for close microphones, "far_field" for distant microphones, or None to disable.
  • should_interrupt (bool, default True): Whether to interrupt bot output when speech is detected by server-side VAD. Only applies when turn detection is enabled.
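The three accepted turn_detection values can be read as a small mapping onto the session configuration sent to the Realtime API. A rough illustration of the documented semantics (a hypothetical helper, not Pipecat's internal implementation; the payload shape follows the parameter description above):

```python
from typing import Literal, Optional, Union


def turn_detection_payload(
    turn_detection: Union[dict, None, Literal[False]] = False,
) -> Optional[dict]:
    """Map the turn_detection parameter to a server-side VAD setting
    (illustrative only)."""
    if turn_detection is False:
        # Local VAD mode (the default): server-side detection disabled;
        # a local VAD processor in the pipeline commits audio instead.
        return None
    if turn_detection is None:
        # Server defaults: plain server_vad with no overrides.
        return {"type": "server_vad"}
    # A dict of custom settings is forwarded as-is.
    return turn_detection
```

For example, `turn_detection_payload({"type": "server_vad", "threshold": 0.5})` forwards the custom settings unchanged, while the default `False` yields no server-side VAD configuration at all.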

Usage

OpenAISTTService

import os

from pipecat.services.openai.stt import OpenAISTTService

stt = OpenAISTTService(
    api_key=os.getenv("OPENAI_API_KEY"),
    model="gpt-4o-transcribe",
)

OpenAIRealtimeSTTService with Local VAD

import os

from pipecat.services.openai.stt import OpenAIRealtimeSTTService

# Local VAD mode (default) - use with a VAD processor in the pipeline
stt = OpenAIRealtimeSTTService(
    api_key=os.getenv("OPENAI_API_KEY"),
    model="gpt-4o-transcribe",
    noise_reduction="near_field",
)

OpenAIRealtimeSTTService with Server-Side VAD

import os

from pipecat.services.openai.stt import OpenAIRealtimeSTTService

# Server-side VAD mode - do NOT use a separate VAD processor
stt = OpenAIRealtimeSTTService(
    api_key=os.getenv("OPENAI_API_KEY"),
    model="gpt-4o-transcribe",
    turn_detection=None,  # Enable server-side VAD
)
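To tune server-side VAD sensitivity, pass a dict instead of None. A sketch using the example settings from the parameter reference above (the threshold value is illustrative; consult the Realtime API schema for the full set of supported fields):

```python
import os

from pipecat.services.openai.stt import OpenAIRealtimeSTTService

# Server-side VAD with a custom sensitivity threshold -
# as with turn_detection=None, do NOT add a separate VAD processor
stt = OpenAIRealtimeSTTService(
    api_key=os.getenv("OPENAI_API_KEY"),
    model="gpt-4o-transcribe",
    turn_detection={"type": "server_vad", "threshold": 0.5},
)
```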

Notes

  • Local VAD vs Server-side VAD: OpenAIRealtimeSTTService defaults to local VAD mode (turn_detection=False), where a local VAD processor in the pipeline controls when audio is committed for transcription. Set turn_detection=None for server-side VAD, but do not use a separate VAD processor in the pipeline in that mode.
  • Automatic resampling: OpenAIRealtimeSTTService automatically resamples audio to 24 kHz as required by the Realtime API, regardless of the pipeline’s sample rate.
  • Segmented vs streaming: OpenAISTTService processes complete audio segments (after VAD detects silence) via HTTP. OpenAIRealtimeSTTService streams audio continuously over WebSocket for lower latency.
  • Interim transcriptions: OpenAIRealtimeSTTService produces interim transcriptions via delta events, while OpenAISTTService only produces final transcriptions.
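The 24 kHz resampling note means, for instance, that a 16 kHz pipeline yields 3 output samples for every 2 input samples. Pipecat handles this internally; a minimal linear-interpolation sketch of the idea (illustrative only, not Pipecat's actual resampler, which would use a proper anti-aliasing filter):

```python
def resample_linear(samples: list[float], src_rate: int, dst_rate: int) -> list[float]:
    """Resample audio by linear interpolation (illustrative only)."""
    if src_rate == dst_rate:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate  # fractional index into the source
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


# One second of 16 kHz audio becomes one second of 24 kHz audio.
one_second_16k = [0.0] * 16000
resampled = resample_linear(one_second_16k, 16000, 24000)
```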

Event Handlers

OpenAIRealtimeSTTService supports the standard service connection events:
  • on_connected: Connected to the OpenAI Realtime WebSocket
  • on_disconnected: Disconnected from the OpenAI Realtime WebSocket
@stt.event_handler("on_connected")
async def on_connected(service):
    print("Connected to OpenAI Realtime STT")
OpenAISTTService uses HTTP requests and does not have WebSocket connection events.