
Overview

SpeechmaticsSTTService enables real-time speech transcription using Speechmatics’ WebSocket API, with partial and final results, speaker diarization, and VAD-based end of utterance detection for comprehensive conversation analysis.
Since Speechmatics provides its own user turn start and end detection, you should use ExternalUserTurnStrategies to let Speechmatics handle turn management. See User Turn Strategies for configuration details.

Installation

To use Speechmatics services, install the required dependencies:
pip install "pipecat-ai[speechmatics]"

Prerequisites

Speechmatics Account Setup

Before using Speechmatics STT services, you need:
  1. Speechmatics Account: Sign up at Speechmatics
  2. API Key: Generate an API key from your account dashboard
  3. Feature Selection: Configure transcription features like speaker diarization

Select Endpoint

Speechmatics STT supports the following endpoints (defaults to EU2):
| Region | Environment | STT Endpoint | Access |
|--------|-------------|--------------|--------|
| EU | EU1 | wss://neu.rt.speechmatics.com/ | Self-Service / Enterprise |
| EU | EU2 (Default) | wss://eu2.rt.speechmatics.com/ | Self-Service / Enterprise |
| US | US1 | wss://wus.rt.speechmatics.com/ | Enterprise |

Required Environment Variables

  • SPEECHMATICS_API_KEY: Your Speechmatics API key for authentication
  • SPEECHMATICS_RT_URL: Speechmatics endpoint URL (optional, defaults to EU2)
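The fallback order described above (explicit argument, then environment variable, then the EU2 default) can be sketched in plain Python; `resolve_config` is illustrative and not part of the Pipecat API:

```python
import os

# Default EU2 real-time endpoint, as noted in the Configuration section below.
DEFAULT_RT_URL = "wss://eu2.rt.speechmatics.com/v2"

def resolve_config(api_key=None, base_url=None):
    """Resolve the API key and endpoint URL: explicit argument first,
    then environment variable, then the EU2 default for the URL."""
    api_key = api_key or os.getenv("SPEECHMATICS_API_KEY")
    base_url = base_url or os.getenv("SPEECHMATICS_RT_URL") or DEFAULT_RT_URL
    return api_key, base_url
```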

Configuration

SpeechmaticsSTTService

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| api_key | str | None | Speechmatics API key. Falls back to the SPEECHMATICS_API_KEY environment variable. |
| base_url | str | None | Base URL for the Speechmatics API. Falls back to the SPEECHMATICS_RT_URL environment variable, then defaults to wss://eu2.rt.speechmatics.com/v2. |
| sample_rate | int | None | Audio sample rate in Hz. When None, uses the pipeline’s configured sample rate. |
| params | InputParams | None | Configuration parameters. See InputParams below. |
| should_interrupt | bool | True | Whether to interrupt bot output when Speechmatics detects user speech. Only applies when turn_detection_mode is set to detect speech (ADAPTIVE or SMART_TURN). |

InputParams

Settings that can be set at initialization via the params constructor argument.
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| domain | str | None | Domain for Speechmatics API (e.g. for bilingual transcription). |
| language | Language \| str | Language.EN | Language code for transcription. |
| turn_detection_mode | TurnDetectionMode | EXTERNAL | Endpoint handling mode. EXTERNAL (default) uses Pipecat’s VAD, ADAPTIVE uses Speechmatics’ VAD, SMART_TURN uses Speechmatics’ ML-based turn detection. |
| speaker_active_format | str | None | Formatter for active speaker output. Available attributes: {speaker_id}, {text}. Example: "@{speaker_id}: {text}". |
| speaker_passive_format | str | None | Formatter for passive/background speaker output. Same attributes as active format. |
| focus_speakers | list[str] | [] | Speaker IDs to focus on. Only these speakers drive end of turn and conversation flow. |
| ignore_speakers | list[str] | [] | Speaker IDs to exclude from transcription entirely. |
| focus_mode | SpeakerFocusMode | RETAIN | RETAIN keeps words from non-focused speakers; IGNORE drops them. |
| known_speakers | list[SpeakerIdentifier] | [] | Known speaker labels and identifiers for speaker attribution. |
| additional_vocab | list[AdditionalVocabEntry] | [] | Additional vocabulary to boost recognition of specific words. |
| audio_encoding | AudioEncoding | PCM_S16LE | Audio encoding format. |
| operating_point | OperatingPoint | None | Transcription accuracy vs. latency tradeoff. ENHANCED recommended for most use cases. |
| max_delay | float | None | Maximum delay in seconds for transcription. Lower values reduce latency but may impact accuracy. |
| end_of_utterance_silence_trigger | float | None | Silence duration in seconds to trigger end of utterance. Must be lower than max_delay. |
| end_of_utterance_max_delay | float | None | Maximum delay for end of utterance. Must be greater than end_of_utterance_silence_trigger. |
| punctuation_overrides | dict | None | Custom punctuation overrides for the STT engine. |
| include_partials | bool | None | Include partial word fragments in partial segment output. |
| split_sentences | bool | None | Emit finalized sentences mid-turn as they are completed. |
| enable_diarization | bool | None | Enable speaker diarization to attribute words to unique speakers. |
| speaker_sensitivity | float | None | Diarization sensitivity. Higher values help distinguish similar voices. |
| max_speakers | int | None | Maximum number of speakers to detect. Only use when the speaker count is known. |
| prefer_current_speaker | bool | None | Give extra weight to grouping nearby words as the same speaker. |
| extra_params | dict | None | Additional parameters passed to the STT engine. |
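The timing parameters carry two constraints: end_of_utterance_silence_trigger must be lower than max_delay, and end_of_utterance_max_delay must be greater than the trigger. A small helper can check a configuration up front; this validator is illustrative, not part of the Pipecat API:

```python
def validate_eou_timings(max_delay, silence_trigger, eou_max_delay):
    """Raise ValueError if the end-of-utterance timings are inconsistent."""
    if silence_trigger is not None and max_delay is not None:
        if silence_trigger >= max_delay:
            raise ValueError(
                "end_of_utterance_silence_trigger must be lower than max_delay"
            )
    if silence_trigger is not None and eou_max_delay is not None:
        if eou_max_delay <= silence_trigger:
            raise ValueError(
                "end_of_utterance_max_delay must be greater than "
                "end_of_utterance_silence_trigger"
            )
```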

End of Turn detection

The Speechmatics STT service supports Pipecat’s own end of turn detection (Silero VAD and Smart Turn) without any additional configuration. When using Pipecat’s features, the turn_detection_mode must be set to TurnDetectionMode.EXTERNAL (which is the default).

Default mode

By default, Speechmatics uses signals from Pipecat’s VAD / smart turn detection as input to trigger the end of turn and finalization of the current transcript segment. This provides a seamless integration where Pipecat’s voice activity detection and turn detection work in conjunction with Speechmatics’ real-time processing capabilities.
If you wish to use features such as focusing on or ignoring specific speakers, you may benefit from the TurnDetectionMode.ADAPTIVE or TurnDetectionMode.SMART_TURN modes.

Adaptive End of Turn detection

This mode uses the content of the speech, the pace of speaking, and other acoustic information (via VAD) to determine when the user has finished speaking. This is especially important when using the plugin’s ability to focus on a specific speaker, so that other speakers do not interrupt the agent or conversation. To use this mode, set turn_detection_mode to TurnDetectionMode.ADAPTIVE in your STT configuration. You must also remove any other VAD / smart turn features within Pipecat to avoid conflicts.
transport_params = TransportParams(
    audio_in_enabled=True,
    audio_out_enabled=True,
    # vad_analyzer=... <- REMOVE (use Speechmatics' built-in VAD)
    # turn_analyzer=... <- REMOVE (use Speechmatics' built-in end-of-turn detection)
)

...

stt = SpeechmaticsSTTService(
    api_key=os.getenv("SPEECHMATICS_API_KEY"),
    params=SpeechmaticsSTTService.InputParams(
        language=Language.EN,
        turn_detection_mode=SpeechmaticsSTTService.TurnDetectionMode.ADAPTIVE,
        speaker_active_format="<{speaker_id}>{text}</{speaker_id}>",
    ),
)

Smart Turn detection

In addition to ADAPTIVE, Speechmatics provides its own smart turn detection, which combines VAD with Pipecat’s Smart Turn v3 model. Enable it by setting the turn_detection_mode parameter to TurnDetectionMode.SMART_TURN.
transport_params = TransportParams(
    audio_in_enabled=True,
    audio_out_enabled=True,
    # vad_analyzer=... <- REMOVE (use Speechmatics' built-in VAD)
    # turn_analyzer=... <- REMOVE (use Speechmatics' built-in end-of-turn detection)
)

...

stt = SpeechmaticsSTTService(
    api_key=os.getenv("SPEECHMATICS_API_KEY"),
    params=SpeechmaticsSTTService.InputParams(
        language=Language.EN,
        turn_detection_mode=SpeechmaticsSTTService.TurnDetectionMode.SMART_TURN,
        speaker_active_format="<{speaker_id}>{text}</{speaker_id}>",
    ),
)

Speaker Diarization

Speechmatics STT supports speaker diarization, which separates different speakers in the audio. Each speaker’s identity is returned in the user_id attribute of the TranscriptionFrame objects. If speaker_active_format or speaker_passive_format is provided, the text output of the TranscriptionFrame is formatted to that specification; you can then update your system context to describe the format so the LLM knows which speaker said which words. The passive format is optional: when the engine has been told to focus on specific speakers, the remaining speakers are formatted using speaker_passive_format.
  • speaker_active_format -> the formatter for active speakers
  • speaker_passive_format -> the formatter for passive / background speakers
Examples:
  • <{speaker_id}>{text}</{speaker_id}> -> <S1>Good morning.</S1>
  • @{speaker_id}: {text} -> @S1: Good morning.
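The formatters are standard Python format strings, so their output can be previewed directly; the speaker label and text below are illustrative values:

```python
# The two formatter styles shown above.
active_format = "<{speaker_id}>{text}</{speaker_id}>"
passive_format = "@{speaker_id}: {text}"

# Render a transcript fragment with each formatter.
active = active_format.format(speaker_id="S1", text="Good morning.")
passive = passive_format.format(speaker_id="S2", text="Good morning.")
```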

Available attributes

| Attribute | Description | Example |
|-----------|-------------|---------|
| speaker_id | The label of the speaker | S1 |
| text / content | The transcribed text | Good morning. |
| ts | The timestamp of the transcription | 2025-09-15T19:47:29.096+00:00 |
| start_time | The start time of the transcription segment | 0.0 |
| end_time | The end time of the transcription segment | 2.5 |
| lang | The language of the transcription segment | en |

Speaker Lock

In conjunction with speaker diarization, you can decide at the start of or during a conversation to focus on a specific speaker, retain or drop words from non-focused speakers, or exclude one or more speakers altogether. In the example below, the following will happen:
  • S1 will be transcribed as normal and drive the end of turn and the conversation flow
  • S2 will be ignored completely
  • All other speakers’ words will be transcribed and emitted as tagged segments, but ONLY when a speaker in focus also speaks
This means that if S3 says “Hello”, the transcription is not emitted until S1 speaks again.
stt = SpeechmaticsSTTService(
    api_key=os.getenv("SPEECHMATICS_API_KEY"),
    params=SpeechmaticsSTTService.InputParams(
        language=Language.EN,
        focus_speakers=["S1"],
        ignore_speakers=["S2"],
        focus_mode=SpeechmaticsSTTService.SpeakerFocusMode.RETAIN,
        speaker_active_format="<{speaker_id}>{text}</{speaker_id}>",
    ),
)
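The buffering behaviour described above can be sketched with a toy filter: ignored speakers are dropped, non-focused speech is held (RETAIN) or dropped (IGNORE), and held words are flushed only when a focused speaker talks. This is an illustration of the documented semantics, not Pipecat internals:

```python
def filter_transcript(segments, focus, ignore, retain=True):
    """segments: list of (speaker_id, text) in order of arrival.
    Returns the (speaker_id, text) pairs that would be emitted."""
    emitted, held = [], []
    for speaker, text in segments:
        if speaker in ignore:
            continue                        # ignored speakers are dropped entirely
        if speaker in focus:
            emitted.extend(held)            # focused speech flushes retained words
            held.clear()
            emitted.append((speaker, text))
        elif retain:
            held.append((speaker, text))    # held until a focused speaker speaks
    return emitted
```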

Language Support

Refer to the Speechmatics docs for more information on supported languages.
Speechmatics STT supports the following languages and regional variants. Set the language using the language parameter when creating the STT object. The exception is English / Mandarin, which uses the code cmn_en.
| Language Code | Description | Locales |
|---------------|-------------|---------|
| Language.AR | Arabic | - |
| Language.BA | Bashkir | - |
| Language.EU | Basque | - |
| Language.BE | Belarusian | - |
| Language.BG | Bulgarian | - |
| Language.BN | Bengali | - |
| Language.YUE | Cantonese | - |
| Language.CA | Catalan | - |
| Language.HR | Croatian | - |
| Language.CS | Czech | - |
| Language.DA | Danish | - |
| Language.NL | Dutch | - |
| Language.EN | English | en-US, en-GB, en-AU |
| Language.EO | Esperanto | - |
| Language.ET | Estonian | - |
| Language.FA | Persian | - |
| Language.FI | Finnish | - |
| Language.FR | French | - |
| Language.GL | Galician | - |
| Language.DE | German | - |
| Language.EL | Greek | - |
| Language.HE | Hebrew | - |
| Language.HI | Hindi | - |
| Language.HU | Hungarian | - |
| Language.IA | Interlingua | - |
| Language.IT | Italian | - |
| Language.ID | Indonesian | - |
| Language.GA | Irish | - |
| Language.JA | Japanese | - |
| Language.KO | Korean | - |
| Language.LV | Latvian | - |
| Language.LT | Lithuanian | - |
| Language.MS | Malay | - |
| Language.MT | Maltese | - |
| Language.CMN | Mandarin | cmn-Hans, cmn-Hant |
| Language.MR | Marathi | - |
| Language.MN | Mongolian | - |
| Language.NO | Norwegian | - |
| Language.PL | Polish | - |
| Language.PT | Portuguese | - |
| Language.RO | Romanian | - |
| Language.RU | Russian | - |
| Language.SK | Slovakian | - |
| Language.SL | Slovenian | - |
| Language.ES | Spanish | - |
| Language.SV | Swedish | - |
| Language.SW | Swahili | - |
| Language.TA | Tamil | - |
| Language.TH | Thai | - |
| Language.TR | Turkish | - |
| Language.UG | Uyghur | - |
| Language.UK | Ukrainian | - |
| Language.UR | Urdu | - |
| Language.VI | Vietnamese | - |
| Language.CY | Welsh | - |
For bilingual transcription, use the language and domain parameters as follows:
| Language Code | Description | Domain Options |
|---------------|-------------|----------------|
| cmn_en | English / Mandarin | - |
| en_ms | English / Malay | - |
| Language.ES | English / Spanish | bilingual-en |
| en_ta | English / Tamil | - |
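For example, English / Spanish bilingual transcription combines the language and domain values from the table above. This is a configuration sketch in the same style as the examples below; it requires the pipecat-ai[speechmatics] package:

```python
import os

from pipecat.services.speechmatics.stt import SpeechmaticsSTTService
from pipecat.transcriptions.language import Language

# English / Spanish bilingual transcription, per the table above.
stt = SpeechmaticsSTTService(
    api_key=os.getenv("SPEECHMATICS_API_KEY"),
    params=SpeechmaticsSTTService.InputParams(
        language=Language.ES,
        domain="bilingual-en",
    ),
)
```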

Usage Examples

Examples are included in the Pipecat project’s sample projects.

Basic Configuration

Initialize the SpeechmaticsSTTService and use it in a pipeline:
from pipecat.services.speechmatics.stt import SpeechmaticsSTTService
from pipecat.transcriptions.language import Language

# Configure service
stt = SpeechmaticsSTTService(
    api_key="your-api-key",
    params=SpeechmaticsSTTService.InputParams(
        language=Language.FR,
    )
)

# Use in pipeline
pipeline = Pipeline([
    transport.input(),
    stt,
    context_aggregator.user(),
    llm,
    tts,
    transport.output(),
    context_aggregator.assistant()
])

With Diarization

This example enables diarization and only sends output to the LLM when the first speaker (S1) speaks. Words from other speakers are transcribed but are emitted only when the first speaker speaks. When using TurnDetectionMode.ADAPTIVE or TurnDetectionMode.SMART_TURN, speaker diarization is used to determine when a speaker is speaking. You will need to disable VAD options in the selected transport object for this to work correctly (see 07b-interruptible-speechmatics-vad.py for an example). Initialize the SpeechmaticsSTTService and use it in a pipeline:
from pipecat.services.speechmatics.stt import SpeechmaticsSTTService
from pipecat.transcriptions.language import Language

# Configure service
stt = SpeechmaticsSTTService(
    api_key="your-api-key",
    params=SpeechmaticsSTTService.InputParams(
        language=Language.EN,
        turn_detection_mode=SpeechmaticsSTTService.TurnDetectionMode.ADAPTIVE,
        focus_speakers=["S1"],
        speaker_active_format="<{speaker_id}>{text}</{speaker_id}>",
        speaker_passive_format="<PASSIVE><{speaker_id}>{text}</{speaker_id}></PASSIVE>",
    )
)

# Use in pipeline
pipeline = Pipeline([
    transport.input(),
    stt,
    context_aggregator.user(),
    llm,
    tts,
    transport.output(),
    context_aggregator.assistant()
])

Additional Notes

  • Connection Management: Automatically handles WebSocket connections and reconnections
  • Sample Rate: The default sample rate is 16000 Hz with pcm_s16le encoding
  • VAD Integration: Optionally supports Speechmatics’ built-in VAD and end of utterance detection

Event Handlers

In addition to the standard service connection events (on_connected, on_disconnected, on_connection_error), Speechmatics provides:
| Event | Description |
|-------|-------------|
| on_speakers_result | Speaker identification result received |
@stt.event_handler("on_speakers_result")
async def on_speakers_result(service, message):
    print(f"Speaker result: {message}")