Overview

WhisperSTTService provides offline speech recognition using OpenAI's Whisper models running locally. It supports multiple model sizes and hardware acceleration options (CPU, CUDA, and Apple Silicon via MLX), enabling privacy-focused transcription without external API calls.

Installation

Choose your installation based on your hardware:

Standard Whisper (CPU/CUDA)

pip install "pipecat-ai[whisper]"

MLX Whisper (Apple Silicon)

pip install "pipecat-ai[mlx-whisper]"

Prerequisites

Local Model Setup

Before using Whisper STT services, you need:
  1. Model Selection: Choose appropriate Whisper model size (tiny, base, small, medium, large)
  2. Hardware Configuration: Set up CPU, CUDA, or Apple Silicon acceleration
  3. Storage Space: Ensure sufficient disk space for model downloads
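As a quick sanity check before choosing a configuration, you can probe the host for Apple Silicon or CUDA. The sketch below uses only the standard library and simple heuristics; the helper name `pick_device` is hypothetical and is not how pipecat performs its own auto-detection:

```python
import platform
import shutil

def pick_device() -> str:
    """Suggest a Whisper backend for the current host (heuristic sketch)."""
    # Apple Silicon Macs report arm64 on Darwin -> prefer the MLX service
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        return "mlx"
    # A visible nvidia-smi binary is a rough hint that CUDA may be available
    if shutil.which("nvidia-smi"):
        return "cuda"
    return "cpu"

print(pick_device())
```

On machines where this suggests "cuda", pass `device="cuda"` to WhisperSTTService; on Apple Silicon, use WhisperSTTServiceMLX instead.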

Configuration Options

  • Model Size: Balance between accuracy and performance based on your hardware
  • Hardware Acceleration: Configure CUDA for NVIDIA GPUs or MLX for Apple Silicon
  • Language Support: Whisper supports 99+ languages out of the box
No API keys are required: Whisper runs entirely locally for complete privacy.

Configuration

WhisperSTTService

Uses Faster Whisper for efficient local transcription on CPU or CUDA devices.
  • model (str | Model, default: Model.DISTIL_MEDIUM_EN): Whisper model to use. Can be a Model enum value or a string. Available models: TINY, BASE, SMALL, MEDIUM, LARGE (large-v3), LARGE_V3_TURBO, DISTIL_LARGE_V2, DISTIL_MEDIUM_EN (English-only).
  • device (str, default: "auto"): Device for inference. Options: "cpu", "cuda", or "auto" (auto-detect).
  • compute_type (str, default: "default"): Compute type for inference. Options include "default", "int8", "int8_float16", "float16", etc.
  • no_speech_prob (float, default: 0.4): Probability threshold for filtering out non-speech segments. Segments with a no-speech probability above this value are excluded.
  • language (Language, default: Language.EN): Default language for transcription.
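The no_speech_prob filtering behavior can be sketched as a simple threshold check. This is illustrative only, not the service's actual implementation; `keep_segment` is a hypothetical helper:

```python
def keep_segment(no_speech_prob: float, threshold: float = 0.4) -> bool:
    """Keep a segment unless Whisper thinks it is probably not speech.

    Segments whose no-speech probability exceeds the threshold are dropped.
    """
    return no_speech_prob <= threshold

# With the default threshold of 0.4:
print(keep_segment(0.1))  # → True: speech-like segment is kept
print(keep_segment(0.9))  # → False: likely silence/noise is dropped
```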

WhisperSTTServiceMLX

Optimized for Apple Silicon using MLX Whisper. Models are loaded on demand.
  • model (str | MLXModel, default: MLXModel.TINY): MLX Whisper model to use. Can be an MLXModel enum value or a string. Available models: TINY, MEDIUM, LARGE_V3, LARGE_V3_TURBO, DISTIL_LARGE_V3, LARGE_V3_TURBO_Q4 (quantized).
  • no_speech_prob (float, default: 0.6): Probability threshold for filtering out non-speech segments.
  • language (Language, default: Language.EN): Default language for transcription.
  • temperature (float, default: 0.0): Sampling temperature. Lower values produce more deterministic results.

Usage

Basic Faster Whisper Setup

from pipecat.services.whisper import WhisperSTTService

stt = WhisperSTTService(
    model="base",
)

With CUDA Acceleration

from pipecat.services.whisper import WhisperSTTService, Model

stt = WhisperSTTService(
    model=Model.LARGE,
    device="cuda",
    compute_type="float16",
)

With Custom Language

from pipecat.services.whisper import WhisperSTTService, Model
from pipecat.transcriptions.language import Language

stt = WhisperSTTService(
    model=Model.MEDIUM,
    language=Language.FR,
    no_speech_prob=0.5,
)

MLX Whisper on Apple Silicon

from pipecat.services.whisper import WhisperSTTServiceMLX, MLXModel
from pipecat.transcriptions.language import Language

stt = WhisperSTTServiceMLX(
    model=MLXModel.LARGE_V3_TURBO,
    language=Language.EN,
    temperature=0.0,
)

Notes

  • First run downloads: If the selected model hasn’t been downloaded previously, the first run will download it from the Hugging Face model hub. This may take significant time depending on model size.
  • Segmented transcription: Both WhisperSTTService and WhisperSTTServiceMLX extend SegmentedSTTService, meaning they process complete audio segments after VAD detects the user has stopped speaking.
  • No-speech filtering: The no_speech_prob threshold helps filter out hallucinations. Increase it to be more permissive, decrease it to filter more aggressively.
  • MLX quantization: The LARGE_V3_TURBO_Q4 model provides reduced memory usage with minimal quality loss on Apple Silicon.
  • Language support: Whisper supports 99+ languages. Use the Language enum for type-safe language selection.
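To illustrate the segmented model mentioned in the notes: a SegmentedSTTService-style component buffers audio while the user is speaking and transcribes the complete segment once VAD reports that speech has stopped. The following is a minimal stand-alone sketch of that control flow; the class and method names are hypothetical and do not reflect pipecat's actual API:

```python
from typing import Callable, List, Optional

class SegmentedTranscriber:
    """Buffer audio frames during speech; transcribe once speech ends."""

    def __init__(self, transcribe: Callable[[bytes], str]):
        self._transcribe = transcribe
        self._buffer: List[bytes] = []
        self._speaking = False

    def on_vad(self, is_speech: bool, frame: bytes) -> Optional[str]:
        if is_speech:
            # User is speaking: accumulate audio, no transcription yet
            self._speaking = True
            self._buffer.append(frame)
            return None
        if self._speaking:
            # Speech just ended: run the model on the complete segment
            segment = b"".join(self._buffer)
            self._buffer.clear()
            self._speaking = False
            return self._transcribe(segment)
        return None

# Stand-in "model" that just reports the segment length
t = SegmentedTranscriber(lambda audio: f"{len(audio)} bytes transcribed")
t.on_vad(True, b"abcd")
t.on_vad(True, b"efgh")
print(t.on_vad(False, b""))  # → "8 bytes transcribed"
```

This is why Whisper services pair naturally with a VAD in the pipeline: transcription latency is paid once per utterance rather than per audio frame.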