Overview
GrokRealtimeLLMService provides real-time, multimodal conversation capabilities using xAI’s Grok Voice Agent API. It supports speech-to-speech interactions with integrated LLM processing, function calling, and advanced conversation management with low-latency response times.
- Grok Realtime API Reference: Pipecat’s API methods for Grok Realtime integration
- Example Implementation: Complete Grok Realtime conversation example
- Grok Voice Documentation: Official xAI Grok Voice Agent API documentation
- xAI Console: Access Grok models and manage API keys
Installation
To use Grok Realtime services, install the required dependencies.
Prerequisites
xAI Account Setup
Before using Grok Realtime services, you need:
- xAI Account: Sign up at xAI Console
- API Key: Generate a Grok API key from your account dashboard
- Model Access: Ensure access to Grok Voice Agent models
- Usage Limits: Configure appropriate usage limits and billing
Required Environment Variables
XAI_API_KEY: Your xAI API key for authentication
Key Features
- Real-time Speech-to-Speech: Direct audio processing with low latency
- Multilingual Support: Support for multiple languages
- Voice Activity Detection: Server-side VAD for automatic speech detection
- Function Calling: Seamless support for external functions and tool integration
- Multiple Voice Options: Various voice personalities available
- WebSocket Support: Real-time bidirectional audio streaming
Configuration
GrokRealtimeLLMService
- xAI API key for authentication.
- WebSocket base URL for the Grok Realtime API. Override for custom deployments.
- Configuration properties for the realtime session. If None, uses default SessionProperties with voice "Ara" and server-side VAD enabled. See SessionProperties below.
- Whether to start with audio input paused.
SessionProperties
Session-level configuration passed via the session_properties constructor argument. These settings can be updated during the session using LLMUpdateSettingsFrame.
| Parameter | Type | Default | Description |
|---|---|---|---|
| instructions | str | None | System instructions for the assistant. |
| voice | Literal["Ara", "Rex", "Sal", "Eve", "Leo"] | "Ara" | Voice the model uses to respond. |
| turn_detection | TurnDetection | TurnDetection(type="server_vad") | Turn detection configuration. Set to None for manual turn detection. |
| audio | AudioConfiguration | None | Configuration for input and output audio formats. |
| tools | List[GrokTool] | None | Available tools: web_search, x_search, file_search, or custom function tools. |
AudioConfiguration
The audio field in SessionProperties accepts an AudioConfiguration with input and output sub-configurations:
AudioInput (audio.input):
| Parameter | Type | Default | Description |
|---|---|---|---|
| format | AudioFormat | None | Input audio format. Supports PCMAudioFormat (configurable rate), PCMUAudioFormat (8 kHz), or PCMAAudioFormat (8 kHz). |
AudioOutput (audio.output):
| Parameter | Type | Default | Description |
|---|---|---|---|
| format | AudioFormat | None | Output audio format. Same format options as input. |
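The sample-rate rules in the two tables above can be summarized in a few lines. This is a minimal sketch that uses plain strings as stand-ins for the format classes; the function name `resolve_sample_rate` and the `"pcm"`, `"pcmu"`, and `"pcma"` labels are hypothetical and not part of Pipecat.

```python
from typing import Optional

# Hypothetical labels standing in for PCMAudioFormat, PCMUAudioFormat, PCMAAudioFormat.
G711_FORMATS = {"pcmu", "pcma"}
G711_RATE_HZ = 8000  # PCMU and PCMA are fixed at 8 kHz per the tables above


def resolve_sample_rate(kind: str, requested_rate: Optional[int] = None) -> int:
    """Return the sample rate implied by a format label."""
    if kind in G711_FORMATS:
        # G.711 formats accept no rate other than 8000 Hz.
        if requested_rate not in (None, G711_RATE_HZ):
            raise ValueError(f"{kind} only supports {G711_RATE_HZ} Hz")
        return G711_RATE_HZ
    if kind == "pcm":
        # PCM is the only format with a configurable rate.
        if requested_rate is None:
            raise ValueError("pcm needs an explicit sample rate")
        return requested_rate
    raise ValueError(f"unknown audio format: {kind}")
```

A check like this is mainly useful in telephony integrations, where a mismatch between the pipeline rate and the 8 kHz G.711 rate is easy to miss.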
Built-in Tools
Grok provides several built-in tools in addition to custom function tools:

| Tool | Type | Description |
|---|---|---|
| WebSearchTool | web_search | Search the web for current information |
| XSearchTool | x_search | Search X (Twitter) for posts. Supports allowed_x_handles filter. |
| FileSearchTool | file_search | Search uploaded document collections by vector_store_ids |
Usage
Basic Setup
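A minimal sketch of constructing the service, assuming the key is stored in `XAI_API_KEY`. The import path `pipecat.services.grok.realtime.llm` and the `api_key` keyword argument are assumptions that this page does not confirm, so check them against the API reference. The Pipecat import sits inside the function so that the key check can run on its own.

```python
import os


def load_xai_api_key() -> str:
    """Read the xAI API key from the environment, failing early if it is missing."""
    key = os.environ.get("XAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("XAI_API_KEY is not set")
    return key


def create_grok_service():
    # Assumed import path and keyword name; verify against the API reference.
    from pipecat.services.grok.realtime.llm import GrokRealtimeLLMService

    # With no session_properties, the service falls back to its defaults
    # (voice "Ara", server-side VAD).
    return GrokRealtimeLLMService(api_key=load_xai_api_key())
```

Failing early on a missing key tends to give a clearer error than a rejected WebSocket connection later in the pipeline.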
With Session Configuration
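A sketch of passing custom session settings through the `session_properties` constructor argument. The field names, voice value, and `TurnDetection(type="server_vad")` come from the SessionProperties table; the import paths and the `api_key` keyword are assumptions to verify against the API reference.

```python
# Field names and the voice value come from the SessionProperties table.
SESSION_FIELDS = {
    "instructions": "You are a concise, friendly voice assistant.",
    "voice": "Rex",
}


def create_configured_service(api_key: str):
    # Assumed import paths and `api_key` keyword; verify against the API reference.
    from pipecat.services.grok.realtime.events import SessionProperties, TurnDetection
    from pipecat.services.grok.realtime.llm import GrokRealtimeLLMService

    session_properties = SessionProperties(
        **SESSION_FIELDS,
        turn_detection=TurnDetection(type="server_vad"),  # the documented default
    )
    return GrokRealtimeLLMService(api_key=api_key, session_properties=session_properties)
```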
With Built-in Tools
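A sketch of enabling built-in tools through the `tools` field of SessionProperties. The class names and the `allowed_x_handles` filter are the ones listed in the Built-in Tools table; the import path, the no-argument `WebSearchTool()` constructor, and the example handle are assumptions.

```python
X_HANDLES = ["xai"]  # example handle, purely illustrative


def create_session_with_search_tools():
    # Assumed import path; verify against the API reference.
    from pipecat.services.grok.realtime.events import (
        SessionProperties,
        WebSearchTool,
        XSearchTool,
    )

    tools = [
        WebSearchTool(),  # assumed to need no arguments
        XSearchTool(allowed_x_handles=X_HANDLES),  # restrict X search to these handles
    ]
    return SessionProperties(tools=tools)
```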
Updating Settings at Runtime
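A sketch of changing the voice mid-session with LLMUpdateSettingsFrame. It assumes that `task` is a running Pipecat pipeline task and that the frame accepts a dict keyed by SessionProperties field names; the helper names `voice_update` and `switch_voice` are hypothetical.

```python
VOICES = ("Ara", "Rex", "Sal", "Eve", "Leo")


def voice_update(voice: str) -> dict:
    """Build a settings payload that changes only the voice."""
    if voice not in VOICES:
        raise ValueError(f"unknown voice {voice!r}; expected one of {VOICES}")
    return {"voice": voice}


async def switch_voice(task, voice: str) -> None:
    # Passing a dict keyed by SessionProperties field names is an
    # assumption here; verify the expected payload in the API reference.
    from pipecat.frames.frames import LLMUpdateSettingsFrame

    await task.queue_frame(LLMUpdateSettingsFrame(settings=voice_update(voice)))
```

Validating the voice name before queuing the frame keeps a typo from reaching the server.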
Notes
- Audio format auto-configuration: If audio format is not specified in session_properties, the service automatically configures PCM input/output using the pipeline’s sample rates.
- Server-side VAD: Enabled by default. When VAD is enabled, the server handles speech detection and turn management automatically. Set turn_detection to None to manage turns manually.
- Audio before setup: Audio is not sent to Grok until the conversation setup is complete, preventing sample rate mismatches.
- Available voices: Ara (default), Rex, Sal, Eve, and Leo.
- G.711 support: PCMU and PCMA formats are supported at a fixed 8000 Hz rate, useful for telephony integrations.
Event Handlers
| Event | Description |
|---|---|
| on_conversation_item_created | Called when a new conversation item is created in the session |
| on_conversation_item_updated | Called when a conversation item is updated or completed |
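The two events above can be observed by registering handlers on the service. This sketch assumes the service exposes Pipecat's `event_handler` decorator and passes itself as the first handler argument; the remaining arguments are not documented on this page, so the handlers accept them as `*args`. The helper name `register_item_logging` is hypothetical.

```python
def register_item_logging(llm, log: list) -> None:
    """Attach handlers that record conversation item events in `log`."""

    @llm.event_handler("on_conversation_item_created")
    async def on_created(service, *args):
        # Extra arguments are kept as-is because their shape is not documented here.
        log.append(("created", args))

    @llm.event_handler("on_conversation_item_updated")
    async def on_updated(service, *args):
        log.append(("updated", args))
```

Call `register_item_logging(llm, events)` once after constructing the service and before starting the pipeline.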