Overview
GeminiLiveLLMService enables natural, real-time conversations with Google’s Gemini model. It provides built-in audio transcription, voice activity detection, and context management for creating interactive AI experiences with multimodal capabilities including audio, video, and text processing.
Gemini Live API Reference
Pipecat’s API methods for Gemini Live integration
Example Implementation
Complete Gemini Live function calling example
Gemini Documentation
Official Google Gemini Live API documentation
Gemini Live Model Card
Available Gemini Live models
Installation
To use Gemini Live services, install the required dependencies:

Prerequisites
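A typical installation, assuming the Gemini Live dependencies ship in Pipecat's `google` extra (verify the extra name against the Pipecat installation docs):

```shell
# Install Pipecat with its Google extra (assumed to include the Gemini client)
pip install "pipecat-ai[google]"
```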
Google AI Setup
Before using Gemini Live services, you need:

- Google Account: Set up at Google AI Studio
- API Key: Generate a Gemini API key from AI Studio
- Model Access: Ensure access to Gemini Live models
- Multimodal Configuration: Set up audio, video, and text modalities
Required Environment Variables
GOOGLE_API_KEY: Your Google Gemini API key for authentication
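The key can be exported in the shell before starting your bot process, for example:

```shell
# Make the Gemini API key available to the bot process
export GOOGLE_API_KEY="your-api-key-here"
```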
Key Features
- Multimodal Processing: Handle audio, video, and text inputs simultaneously
- Real-time Streaming: Low-latency audio and video processing
- Voice Activity Detection: Automatic speech detection and turn management
- Function Calling: Advanced tool integration and API calling capabilities
- Context Management: Intelligent conversation history and system instruction handling
Configuration
GeminiLiveLLMService
Google AI API key for authentication.
Gemini model identifier to use.
TTS voice identifier for audio responses.
System prompt for the model. Can also be provided via the LLM context.
Tools/functions available to the model. Can also be provided via the LLM context.
Runtime-configurable generation and session settings. See InputParams below.
Whether to start with audio input paused.
Whether to start with video input paused.
Whether to generate a response when context is first set. Set to False to wait for user input before the model responds.
HTTP options for the Google API client. Use this to set the API version (e.g. HttpOptions(api_version="v1alpha")) or other request options.
Base URL for the Gemini File API.
InputParams
Generation and session settings that can be set at initialization via the params constructor argument.
| Parameter | Type | Default | Description |
|---|---|---|---|
| frequency_penalty | float | None | Frequency penalty for generation (0.0-2.0). |
| max_tokens | int | 4096 | Maximum tokens to generate. |
| presence_penalty | float | None | Presence penalty for generation (0.0-2.0). |
| temperature | float | None | Sampling temperature (0.0-2.0). |
| top_k | int | None | Top-k sampling parameter. |
| top_p | float | None | Top-p (nucleus) sampling parameter (0.0-1.0). |
| modalities | GeminiModalities | AUDIO | Response modality: GeminiModalities.AUDIO or GeminiModalities.TEXT. |
| language | Language | EN_US | Language for generation and transcription. |
| media_resolution | GeminiMediaResolution | UNSPECIFIED | Media resolution for video input: UNSPECIFIED, LOW (64 tokens), MEDIUM (256 tokens), or HIGH (256 tokens with zoom). |
| vad | GeminiVADParams | None | Voice activity detection parameters. See GeminiVADParams below. |
| context_window_compression | ContextWindowCompressionParams | None | Context window compression settings. |
| thinking | ThinkingConfig | None | Thinking/reasoning configuration. Requires a model that supports it. |
| enable_affective_dialog | bool | None | Enable affective dialog for expression and tone adaptation. Requires a supporting model and API version (e.g. v1alpha). |
| proactivity | ProactivityConfig | None | Proactivity settings for model behavior. Requires a supporting model and API version. |
| extra | Dict[str, Any] | {} | Additional parameters passed to the API. |
GeminiVADParams
Voice activity detection configuration passed via InputParams.vad:
| Parameter | Type | Default | Description |
|---|---|---|---|
| disabled | bool | None | Whether to disable server-side VAD entirely. |
| start_sensitivity | StartSensitivity | None | Sensitivity for speech start detection. |
| end_sensitivity | EndSensitivity | None | Sensitivity for speech end detection. |
| prefix_padding_ms | int | None | Padding before speech starts in milliseconds. |
| silence_duration_ms | int | None | Silence duration threshold in milliseconds to detect speech end. |
ContextWindowCompressionParams
| Parameter | Type | Default | Description |
|---|---|---|---|
| enabled | bool | False | Whether context window compression is enabled. |
| trigger_tokens | int | None | Token count to trigger compression. None uses the default (80% of context window). |
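Putting the two tables above together, a sketch of configuring VAD and context window compression. The import path is an assumption based on Pipecat's module layout; verify it against the API reference linked above:

```python
import os

# Import path assumed; check the Pipecat API reference for your version.
from pipecat.services.google.gemini_live.llm import (
    ContextWindowCompressionParams,
    GeminiLiveLLMService,
    GeminiVADParams,
    InputParams,
)

llm = GeminiLiveLLMService(
    api_key=os.getenv("GOOGLE_API_KEY"),
    params=InputParams(
        vad=GeminiVADParams(
            prefix_padding_ms=300,    # keep 300 ms of audio preceding detected speech
            silence_duration_ms=800,  # 800 ms of silence ends the user's turn
        ),
        context_window_compression=ContextWindowCompressionParams(
            enabled=True,  # trigger_tokens left as None -> default 80% threshold
        ),
    ),
)
```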
Usage
Basic Setup
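A minimal sketch of constructing the service; the import path and voice name are assumptions, so check the API reference and model card linked above:

```python
import os

# Import path assumed; see the Pipecat API reference for the exact module.
from pipecat.services.google.gemini_live.llm import GeminiLiveLLMService

llm = GeminiLiveLLMService(
    api_key=os.getenv("GOOGLE_API_KEY"),
    voice_id="Puck",  # example voice; see the model card for available voices
    system_instruction="You are a friendly, concise voice assistant.",
)
```

The service is then placed in a pipeline between transport input and output like any other Pipecat LLM service.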
With Custom Parameters
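A hedged sketch of passing InputParams at construction time (import path assumed, as above):

```python
import os

from pipecat.services.google.gemini_live.llm import (  # path assumed
    GeminiLiveLLMService,
    InputParams,
)

llm = GeminiLiveLLMService(
    api_key=os.getenv("GOOGLE_API_KEY"),
    params=InputParams(
        temperature=0.7,  # lower values yield more deterministic replies
        max_tokens=1024,
        top_p=0.9,
    ),
)
```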
Text-Only Mode
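Switching the response modality from the default AUDIO to TEXT might look like this (import path assumed):

```python
import os

from pipecat.services.google.gemini_live.llm import (  # path assumed
    GeminiLiveLLMService,
    GeminiModalities,
    InputParams,
)

llm = GeminiLiveLLMService(
    api_key=os.getenv("GOOGLE_API_KEY"),
    params=InputParams(modalities=GeminiModalities.TEXT),  # text responses instead of audio
)
```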
With Thinking Enabled
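A sketch of enabling thinking via ThinkingConfig from the google-genai SDK; the Pipecat import path is an assumption, and the chosen model must support thinking (see the model card linked above):

```python
import os

from google.genai.types import ThinkingConfig  # from the google-genai SDK
from pipecat.services.google.gemini_live.llm import (  # path assumed
    GeminiLiveLLMService,
    InputParams,
)

llm = GeminiLiveLLMService(
    api_key=os.getenv("GOOGLE_API_KEY"),
    # Select a thinking-capable Gemini Live model here.
    params=InputParams(
        thinking=ThinkingConfig(include_thoughts=True),
    ),
)
```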
Notes
- System instruction precedence: If a system instruction is provided both at init time and in the LLM context, the context-provided value takes precedence.
- Tools precedence: Similarly, tools provided in the context override tools provided at init time.
- Transcription aggregation: Gemini Live sends user transcriptions in small chunks. The service aggregates them into complete sentences using end-of-sentence detection with a 0.5-second timeout fallback.
- Session resumption: The service automatically handles session resumption on reconnection using session resumption handles.
- Connection resilience: The service will attempt up to 3 consecutive reconnections before treating a connection failure as fatal.
- Video frame rate: Video frames are throttled to a maximum of one per second.
- Affective dialog and proactivity: These features require both a supporting model and API version (v1alpha).