Overview
WhisperSTTService provides offline speech recognition using OpenAI's Whisper models running locally. It supports multiple model sizes and hardware acceleration options, including CPU, CUDA, and Apple Silicon (MLX), for privacy-focused transcription without external API calls.
- Whisper STT API Reference: Pipecat's API methods for Whisper STT integration
- Standard Whisper Example: Complete example with standard Whisper
- Whisper Documentation: OpenAI's Whisper research paper and model details
- MLX Whisper Example: Apple Silicon optimized example
Installation
Choose your installation based on your hardware:

- Standard Whisper (CPU/CUDA)
- MLX Whisper (Apple Silicon)
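The commands below are a sketch of the two install paths; the extra names follow Pipecat's convention for optional dependencies, so confirm them against your Pipecat version:

```shell
# Standard Whisper (Faster Whisper on CPU or CUDA)
pip install "pipecat-ai[whisper]"

# MLX Whisper (Apple Silicon only)
pip install "pipecat-ai[mlx-whisper]"
```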
Prerequisites
Local Model Setup
Before using Whisper STT services, you need:

- Model Selection: Choose an appropriate Whisper model size (tiny, base, small, medium, large)
- Hardware Configuration: Set up CPU, CUDA, or Apple Silicon acceleration
- Storage Space: Ensure sufficient disk space for model downloads
Configuration Options
- Model Size: Balance between accuracy and performance based on your hardware
- Hardware Acceleration: Configure CUDA for NVIDIA GPUs or MLX for Apple Silicon
- Language Support: Whisper supports 99+ languages out of the box
Configuration
WhisperSTTService
Uses Faster Whisper for efficient local transcription on CPU or CUDA devices.

- Model: Whisper model to use. Can be a `Model` enum value or a string. Available models: `TINY`, `BASE`, `SMALL`, `MEDIUM`, `LARGE` (large-v3), `LARGE_V3_TURBO`, `DISTIL_LARGE_V2`, `DISTIL_MEDIUM_EN` (English-only).
- Device: Device for inference. Options: `"cpu"`, `"cuda"`, or `"auto"` (auto-detect).
- Compute type: Compute type for inference. Options include `"default"`, `"int8"`, `"int8_float16"`, `"float16"`, etc.
- No-speech threshold (`no_speech_prob`): Probability threshold for filtering out non-speech segments. Segments with a no-speech probability above this value are excluded.
- Language: Default language for transcription.
WhisperSTTServiceMLX
Optimized for Apple Silicon using MLX Whisper. Models are loaded on demand.

- Model: MLX Whisper model to use. Can be an `MLXModel` enum value or a string. Available models: `TINY`, `MEDIUM`, `LARGE_V3`, `LARGE_V3_TURBO`, `DISTIL_LARGE_V3`, `LARGE_V3_TURBO_Q4` (quantized).
- No-speech threshold (`no_speech_prob`): Probability threshold for filtering out non-speech segments.
- Language: Default language for transcription.
- Temperature: Sampling temperature. Lower values produce more deterministic results.
Usage
Basic Faster Whisper Setup
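A minimal construction sketch based on the parameters described above; the import path and the `no_speech_prob` value shown are assumptions, so verify them against your Pipecat version:

```python
# Sketch: module path may differ across Pipecat versions.
from pipecat.services.whisper.stt import Model, WhisperSTTService

# Runs Faster Whisper locally; "auto" selects CUDA when available,
# otherwise falls back to CPU.
stt = WhisperSTTService(
    model=Model.BASE,
    device="auto",
    no_speech_prob=0.4,  # illustrative threshold, not a documented default
)
```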
With CUDA Acceleration
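For NVIDIA GPUs, pin the device to `"cuda"` and pick a half-precision compute type; as above, the import path is an assumption:

```python
from pipecat.services.whisper.stt import Model, WhisperSTTService

stt = WhisperSTTService(
    model=Model.LARGE,       # large-v3
    device="cuda",
    compute_type="float16",  # half precision keeps VRAM usage down
)
```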
With Custom Language
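To set a default transcription language, pass a `Language` enum value (the import paths here are assumptions to verify against your Pipecat version):

```python
from pipecat.services.whisper.stt import Model, WhisperSTTService
from pipecat.transcriptions.language import Language

stt = WhisperSTTService(
    model=Model.MEDIUM,
    language=Language.FR,  # transcribe French by default
)
```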
MLX Whisper on Apple Silicon
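On Apple Silicon, the MLX variant takes the same shape; this sketch assumes `WhisperSTTServiceMLX` and `MLXModel` live in the same module as the standard service:

```python
from pipecat.services.whisper.stt import MLXModel, WhisperSTTServiceMLX

stt = WhisperSTTServiceMLX(
    model=MLXModel.LARGE_V3_TURBO_Q4,  # quantized: lower memory footprint
    temperature=0.0,                   # deterministic decoding
)
```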
Notes
- First run downloads: If the selected model hasn’t been downloaded previously, the first run will download it from the Hugging Face model hub. This may take significant time depending on model size.
- Segmented transcription: Both `WhisperSTTService` and `WhisperSTTServiceMLX` extend `SegmentedSTTService`, meaning they process complete audio segments after VAD detects that the user has stopped speaking.
- No-speech filtering: The `no_speech_prob` threshold helps filter out hallucinations. Increase it to be more permissive; decrease it to filter more aggressively.
- MLX quantization: The `LARGE_V3_TURBO_Q4` model provides reduced memory usage with minimal quality loss on Apple Silicon.
- Language support: Whisper supports 99+ languages. Use the `Language` enum for type-safe language selection.
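The no-speech filtering described above can be sketched in plain Python. The segment shape below (a dict with a `no_speech_prob` field) is an illustrative stand-in for Whisper's per-segment output, not Pipecat's internal representation:

```python
def filter_segments(segments, threshold=0.4):
    """Drop segments Whisper judges to be non-speech.

    A segment survives only if its no-speech probability is at or below
    the threshold: raising the threshold keeps more segments (more
    permissive), lowering it filters more aggressively.
    """
    return [s for s in segments if s["no_speech_prob"] <= threshold]


segments = [
    {"text": "Hello there.", "no_speech_prob": 0.05},
    {"text": "(breathing)", "no_speech_prob": 0.92},
]

# With threshold=0.4 only the first segment is kept; the likely
# non-speech "(breathing)" segment is excluded.
print(filter_segments(segments, threshold=0.4))
```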