Overview

GrokRealtimeLLMService provides real-time, multimodal conversation using xAI’s Grok Voice Agent API. It handles speech-to-speech interaction in a single service, combining LLM processing, function calling, and conversation management with low-latency responses.

Installation

To use Grok Realtime services, install the required dependencies:
pip install "pipecat-ai[grok]"

Prerequisites

xAI Account Setup

Before using Grok Realtime services, you need:
  1. xAI Account: Sign up at xAI Console
  2. API Key: Generate a Grok API key from your account dashboard
  3. Model Access: Ensure access to Grok Voice Agent models
  4. Usage Limits: Configure appropriate usage limits and billing

Required Environment Variables

  • XAI_API_KEY: Your xAI API key for authentication
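
A missing key is easier to diagnose at startup than as a failed WebSocket connection later. The sketch below reads the variable and fails fast; `load_xai_api_key` is a hypothetical helper written for this example, not part of Pipecat.

```python
import os


def load_xai_api_key(env_var: str = "XAI_API_KEY") -> str:
    """Return the API key from the environment, or raise if it is missing."""
    # Treat an unset variable and a blank value the same way.
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"{env_var} is not set; export it before starting the bot.")
    return key
```

The returned string can then be passed as the `api_key` argument of the service constructor.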

Key Features

  • Real-time Speech-to-Speech: Direct audio processing with low latency
  • Multilingual Support: Support for multiple languages
  • Voice Activity Detection: Server-side VAD for automatic speech detection
  • Function Calling: Seamless support for external functions and tool integration
  • Multiple Voice Options: Various voice personalities available
  • WebSocket Support: Real-time bidirectional audio streaming

Configuration

GrokRealtimeLLMService

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| api_key | str | required | xAI API key for authentication. |
| base_url | str | "wss://api.x.ai/v1/realtime" | WebSocket base URL for the Grok Realtime API. Override for custom deployments. |
| session_properties | SessionProperties | None | Configuration properties for the realtime session. If None, uses default SessionProperties with voice "Ara" and server-side VAD enabled. See SessionProperties below. |
| start_audio_paused | bool | False | Whether to start with audio input paused. |

SessionProperties

Session-level configuration passed via the session_properties constructor argument. These settings can be updated during the session using LLMUpdateSettingsFrame.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| instructions | str | None | System instructions for the assistant. |
| voice | Literal["Ara", "Rex", "Sal", "Eve", "Leo"] | "Ara" | Voice the model uses to respond. |
| turn_detection | TurnDetection | TurnDetection(type="server_vad") | Turn detection configuration. Set to None for manual turn detection. |
| audio | AudioConfiguration | None | Configuration for input and output audio formats. |
| tools | List[GrokTool] | None | Available tools: web_search, x_search, file_search, or custom function tools. |

AudioConfiguration

The audio field in SessionProperties accepts an AudioConfiguration with input and output sub-configurations.

AudioInput (audio.input):

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| format | AudioFormat | None | Input audio format. Supports PCMAudioFormat (configurable rate), PCMUAudioFormat (8 kHz), or PCMAAudioFormat (8 kHz). |

AudioOutput (audio.output):

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| format | AudioFormat | None | Output audio format. Same format options as input. |

Grok PCM audio supports sample rates: 8000, 16000, 21050, 24000, 32000, 44100, and 48000 Hz.
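
Because only a fixed set of rates is accepted, it can help to check a pipeline's sample rate before building the session configuration. This is a minimal sketch: the tuple is copied from the list above, and both helper names are hypothetical and not part of Pipecat.

```python
# Rates copied from the list of supported PCM sample rates above.
GROK_PCM_SAMPLE_RATES = (8000, 16000, 21050, 24000, 32000, 44100, 48000)


def is_supported_pcm_rate(rate: int) -> bool:
    """True if `rate` appears in the supported list."""
    return rate in GROK_PCM_SAMPLE_RATES


def nearest_supported_pcm_rate(rate: int) -> int:
    """Return the listed rate closest to `rate` (ties go to the lower rate)."""
    return min(GROK_PCM_SAMPLE_RATES, key=lambda r: (abs(r - rate), r))
```

A pipeline running at an unlisted rate would need resampling to a listed one; the second helper only suggests a target and does not resample anything itself.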

Built-in Tools

Grok provides several built-in tools in addition to custom function tools:

| Tool | Type | Description |
| --- | --- | --- |
| WebSearchTool | web_search | Search the web for current information. |
| XSearchTool | x_search | Search X (Twitter) for posts. Supports the allowed_x_handles filter. |
| FileSearchTool | file_search | Search uploaded document collections by vector_store_ids. |

Usage

Basic Setup

import os
from pipecat.services.grok.realtime import GrokRealtimeLLMService

llm = GrokRealtimeLLMService(
    api_key=os.getenv("XAI_API_KEY"),
)

With Session Configuration

import os
from pipecat.services.grok.realtime import GrokRealtimeLLMService
from pipecat.services.grok.realtime.events import (
    SessionProperties,
    TurnDetection,
    AudioConfiguration,
    AudioInput,
    AudioOutput,
    PCMAudioFormat,
)

session_properties = SessionProperties(
    instructions="You are a helpful assistant.",
    voice="Rex",
    turn_detection=TurnDetection(type="server_vad"),
    audio=AudioConfiguration(
        input=AudioInput(format=PCMAudioFormat(rate=16000)),
        output=AudioOutput(format=PCMAudioFormat(rate=16000)),
    ),
)

llm = GrokRealtimeLLMService(
    api_key=os.getenv("XAI_API_KEY"),
    session_properties=session_properties,
)

With Built-in Tools

import os
from pipecat.services.grok.realtime import GrokRealtimeLLMService
from pipecat.services.grok.realtime.events import (
    SessionProperties,
    WebSearchTool,
    XSearchTool,
)

session_properties = SessionProperties(
    instructions="You are a helpful assistant with access to web search.",
    voice="Ara",
    tools=[
        WebSearchTool(),
        XSearchTool(allowed_x_handles=["@elonmusk"]),
    ],
)

llm = GrokRealtimeLLMService(
    api_key=os.getenv("XAI_API_KEY"),
    session_properties=session_properties,
)

Updating Settings at Runtime

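from pipecat.frames.frames import LLMUpdateSettingsFrame

# Assumes `task` is the running PipelineTask and that this call is made
# from inside an async function (for example, an event handler).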
await task.queue_frame(
    LLMUpdateSettingsFrame(
        settings={
            "instructions": "Now speak in Spanish.",
            "voice": "Eve",
        }
    )
)
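
Since the settings payload is a plain dictionary, a misspelled voice name would only surface once the frame reaches the service. The sketch below builds that dictionary and validates the voice locally. `build_settings_update` is a hypothetical helper for this example, and it covers only the two keys used in the snippet above.

```python
from typing import Any, Dict, Optional

# Voice names taken from the SessionProperties description.
GROK_VOICES = ("Ara", "Rex", "Sal", "Eve", "Leo")


def build_settings_update(
    instructions: Optional[str] = None,
    voice: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a settings dict containing only the fields that were given."""
    settings: Dict[str, Any] = {}
    if instructions is not None:
        settings["instructions"] = instructions
    if voice is not None:
        if voice not in GROK_VOICES:
            raise ValueError(f"Unknown voice {voice!r}; expected one of {GROK_VOICES}")
        settings["voice"] = voice
    return settings
```

The result can be passed as the `settings` argument of `LLMUpdateSettingsFrame`, exactly as the literal dictionary is in the snippet above.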

Notes

  • Audio format auto-configuration: If audio format is not specified in session_properties, the service automatically configures PCM input/output using the pipeline’s sample rates.
  • Server-side VAD: Enabled by default. When VAD is enabled, the server handles speech detection and turn management automatically. Set turn_detection to None to manage turns manually.
  • Audio before setup: Audio is not sent to Grok until the conversation setup is complete, preventing sample rate mismatches.
  • Available voices: Ara (default), Rex, Sal, Eve, and Leo.
  • G.711 support: PCMU and PCMA formats are supported at a fixed 8000 Hz rate, useful for telephony integrations.
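
For telephony work it is often useful to know how many bytes one audio chunk occupies in each format. The arithmetic below is a sketch under two stated assumptions: PCM is 16-bit linear mono, and the audio is single-channel. G.711 (PCMU/PCMA) stores 8 bits per sample by definition. The function name and the lowercase encoding labels are hypothetical, not Pipecat identifiers.

```python
G711_SAMPLE_RATE = 8000  # fixed rate for PCMU/PCMA, per the notes above


def chunk_size_bytes(encoding: str, sample_rate: int, chunk_ms: int = 20) -> int:
    """Bytes in one mono audio chunk lasting `chunk_ms` milliseconds."""
    if encoding in ("pcmu", "pcma"):
        if sample_rate != G711_SAMPLE_RATE:
            raise ValueError("G.711 (PCMU/PCMA) is fixed at 8000 Hz")
        bytes_per_sample = 1  # G.711 companding stores 8 bits per sample
    elif encoding == "pcm":
        bytes_per_sample = 2  # assumption: 16-bit linear PCM
    else:
        raise ValueError(f"Unknown encoding: {encoding!r}")
    samples = sample_rate * chunk_ms // 1000
    return samples * bytes_per_sample
```

Under these assumptions, a 20 ms chunk is 160 bytes for PCMU at 8000 Hz and 640 bytes for PCM at 16000 Hz.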

Event Handlers

| Event | Description |
| --- | --- |
| on_conversation_item_created | Called when a new conversation item is created in the session. |
| on_conversation_item_updated | Called when a conversation item is updated or completed. |

@llm.event_handler("on_conversation_item_created")
async def on_item_created(service, item_id, item):
    print(f"New conversation item: {item_id}")

@llm.event_handler("on_conversation_item_updated")
async def on_item_updated(service, item_id, item):
    print(f"Conversation item updated: {item_id}")
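
Beyond printing, the two handlers can feed a small in-memory record of the conversation. The sketch below is one way to do that. `ConversationItemLog` is a hypothetical class for this example; its methods mirror the `(service, item_id, item)` handler signature shown above and make no assumption about what type `item` is.

```python
import asyncio
from typing import Any, Dict


class ConversationItemLog:
    """Keeps the latest version of each conversation item and an update count."""

    def __init__(self) -> None:
        self.items: Dict[str, Any] = {}
        self.update_counts: Dict[str, int] = {}

    async def on_created(self, service: Any, item_id: str, item: Any) -> None:
        # Record the item as first seen.
        self.items[item_id] = item
        self.update_counts[item_id] = 0

    async def on_updated(self, service: Any, item_id: str, item: Any) -> None:
        # Overwrite with the newest version and count the update.
        self.items[item_id] = item
        self.update_counts[item_id] = self.update_counts.get(item_id, 0) + 1


if __name__ == "__main__":
    # Stand-alone demonstration with placeholder items; no service is needed.
    log = ConversationItemLog()
    asyncio.run(log.on_created(None, "item_1", {"status": "in_progress"}))
    asyncio.run(log.on_updated(None, "item_1", {"status": "completed"}))
    print(log.items, log.update_counts)
```

To use it with the service, the registered handlers can delegate to the log, for example by calling `await log.on_created(service, item_id, item)` inside `on_item_created`.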