Overview

GeminiLiveLLMService enables natural, real-time conversations with Google’s Gemini model. It provides built-in audio transcription, voice activity detection, and context management for creating interactive AI experiences with multimodal capabilities including audio, video, and text processing.
Want to start building? Check out our Gemini Live Guide.

Installation

To use Gemini Live services, install the required dependencies:
pip install "pipecat-ai[google]"

Prerequisites

Google AI Setup

Before using Gemini Live services, you need:
  1. Google Account: Set up at Google AI Studio
  2. API Key: Generate a Gemini API key from AI Studio
  3. Model Access: Ensure access to Gemini Live models
  4. Multimodal Configuration: Set up audio, video, and text modalities

Required Environment Variables

  • GOOGLE_API_KEY: Your Google Gemini API key for authentication

Key Features

  • Multimodal Processing: Handle audio, video, and text inputs simultaneously
  • Real-time Streaming: Low-latency audio and video processing
  • Voice Activity Detection: Automatic speech detection and turn management
  • Function Calling: Advanced tool integration and API calling capabilities
  • Context Management: Intelligent conversation history and system instruction handling

Configuration

GeminiLiveLLMService

api_key
str
required
Google AI API key for authentication.
model
str
Gemini model identifier to use.
voice_id
str
default:"Charon"
TTS voice identifier for audio responses.
system_instruction
str
default:"None"
System prompt for the model. Can also be provided via the LLM context.
tools
List[dict] | ToolsSchema
default:"None"
Tools/functions available to the model. Can also be provided via the LLM context.
params
InputParams
default:"InputParams()"
Runtime-configurable generation and session settings. See InputParams below.
start_audio_paused
bool
default:"False"
Whether to start with audio input paused.
start_video_paused
bool
default:"False"
Whether to start with video input paused.
inference_on_context_initialization
bool
default:"True"
Whether to generate a response when context is first set. Set to False to wait for user input before the model responds.
http_options
HttpOptions
default:"None"
HTTP options for the Google API client. Use this to set API version (e.g. HttpOptions(api_version="v1alpha")) or other request options.
file_api_base_url
str
Base URL for the Gemini File API.
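
As an example of using http_options, features gated behind a newer API version (such as affective dialog) can be enabled by pinning the client to v1alpha. This is a minimal sketch; whether a given feature is available depends on your model and pipecat version:

```python
import os

from google.genai.types import HttpOptions
from pipecat.services.google.gemini_live import GeminiLiveLLMService, InputParams

# Pin the API version so v1alpha-only features can be requested.
llm = GeminiLiveLLMService(
    api_key=os.getenv("GOOGLE_API_KEY"),
    http_options=HttpOptions(api_version="v1alpha"),
    params=InputParams(enable_affective_dialog=True),
)
```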

InputParams

Generation and session settings that can be set at initialization via the params constructor argument.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| frequency_penalty | float | None | Frequency penalty for generation (0.0-2.0). |
| max_tokens | int | 4096 | Maximum tokens to generate. |
| presence_penalty | float | None | Presence penalty for generation (0.0-2.0). |
| temperature | float | None | Sampling temperature (0.0-2.0). |
| top_k | int | None | Top-k sampling parameter. |
| top_p | float | None | Top-p (nucleus) sampling parameter (0.0-1.0). |
| modalities | GeminiModalities | AUDIO | Response modality: GeminiModalities.AUDIO or GeminiModalities.TEXT. |
| language | Language | EN_US | Language for generation and transcription. |
| media_resolution | GeminiMediaResolution | UNSPECIFIED | Media resolution for video input: UNSPECIFIED, LOW (64 tokens), MEDIUM (256 tokens), or HIGH (256 tokens with zoom). |
| vad | GeminiVADParams | None | Voice activity detection parameters. See GeminiVADParams below. |
| context_window_compression | ContextWindowCompressionParams | None | Context window compression settings. |
| thinking | ThinkingConfig | None | Thinking/reasoning configuration. Requires a model that supports it. |
| enable_affective_dialog | bool | None | Enable affective dialog for expression and tone adaptation. Requires a supporting model and API version (e.g. v1alpha). |
| proactivity | ProactivityConfig | None | Proactivity settings for model behavior. Requires a supporting model and API version. |
| extra | Dict[str, Any] | {} | Additional parameters passed to the API. |

GeminiVADParams

Voice activity detection configuration passed via InputParams.vad:
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| disabled | bool | None | Whether to disable server-side VAD entirely. |
| start_sensitivity | StartSensitivity | None | Sensitivity for speech start detection. |
| end_sensitivity | EndSensitivity | None | Sensitivity for speech end detection. |
| prefix_padding_ms | int | None | Padding before speech starts in milliseconds. |
| silence_duration_ms | int | None | Silence duration threshold in milliseconds to detect speech end. |
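
For example, to make turn-taking more conservative, you can lengthen the silence threshold and keep some audio from before detected speech. The values below are illustrative, not recommended defaults:

```python
from pipecat.services.google.gemini_live import GeminiVADParams, InputParams

# Wait 800 ms of silence before ending the user's turn, and include
# 300 ms of audio captured before speech was detected.
params = InputParams(
    vad=GeminiVADParams(
        prefix_padding_ms=300,
        silence_duration_ms=800,
    ),
)
```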

ContextWindowCompressionParams

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | bool | False | Whether context window compression is enabled. |
| trigger_tokens | int | None | Token count to trigger compression. None uses the default (80% of context window). |
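
For instance, to compress earlier than the default 80% threshold, you can set an explicit trigger. The token count here is an arbitrary example value:

```python
from pipecat.services.google.gemini_live import (
    ContextWindowCompressionParams,
    InputParams,
)

# Compress once roughly 16k tokens have accumulated.
params = InputParams(
    context_window_compression=ContextWindowCompressionParams(
        enabled=True,
        trigger_tokens=16000,
    ),
)
```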

Usage

Basic Setup

import os
from pipecat.services.google.gemini_live import GeminiLiveLLMService

llm = GeminiLiveLLMService(
    api_key=os.getenv("GOOGLE_API_KEY"),
    voice_id="Charon",
    system_instruction="You are a helpful assistant.",
)
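
In a full application, the service typically sits in a pipeline between a transport and context aggregators. The sketch below assumes a `transport` has already been configured and follows the common pipecat wiring pattern; exact context classes may vary across pipecat versions:

```python
import os

from pipecat.pipeline.pipeline import Pipeline
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext
from pipecat.services.google.gemini_live import GeminiLiveLLMService

llm = GeminiLiveLLMService(
    api_key=os.getenv("GOOGLE_API_KEY"),
    system_instruction="You are a helpful assistant.",
)

context = OpenAILLMContext([{"role": "user", "content": "Say hello."}])
context_aggregator = llm.create_context_aggregator(context)

pipeline = Pipeline(
    [
        transport.input(),              # audio/video in from the user
        context_aggregator.user(),      # add user turns to the context
        llm,                            # Gemini Live speech-to-speech
        transport.output(),             # audio out to the user
        context_aggregator.assistant(), # add model turns to the context
    ]
)
```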

With Custom Parameters

import os

from pipecat.services.google.gemini_live import (
    GeminiLiveLLMService,
    InputParams,
    GeminiVADParams,
    ContextWindowCompressionParams,
)
from pipecat.transcriptions.language import Language

llm = GeminiLiveLLMService(
    api_key=os.getenv("GOOGLE_API_KEY"),
    model="models/gemini-2.5-flash-native-audio-preview-12-2025",
    voice_id="Puck",
    system_instruction="You are a helpful assistant.",
    params=InputParams(
        temperature=0.7,
        max_tokens=2048,
        language=Language.EN_US,
        vad=GeminiVADParams(
            silence_duration_ms=500,
        ),
        context_window_compression=ContextWindowCompressionParams(
            enabled=True,
        ),
    ),
)

Text-Only Mode

import os

from pipecat.services.google.gemini_live import (
    GeminiLiveLLMService,
    InputParams,
    GeminiModalities,
)

llm = GeminiLiveLLMService(
    api_key=os.getenv("GOOGLE_API_KEY"),
    system_instruction="You are a helpful assistant.",
    params=InputParams(
        modalities=GeminiModalities.TEXT,
    ),
)

With Thinking Enabled

import os

from google.genai.types import ThinkingConfig
from pipecat.services.google.gemini_live import GeminiLiveLLMService, InputParams

llm = GeminiLiveLLMService(
    api_key=os.getenv("GOOGLE_API_KEY"),
    model="models/gemini-2.5-flash-native-audio-preview-12-2025",
    system_instruction="You are a helpful assistant.",
    params=InputParams(
        thinking=ThinkingConfig(include_thoughts=True),
    ),
)

Notes

  • System instruction precedence: If a system instruction is provided both at init time and in the LLM context, the context-provided value takes precedence.
  • Tools precedence: Similarly, tools provided in the context override tools provided at init time.
  • Transcription aggregation: Gemini Live sends user transcriptions in small chunks. The service aggregates them into complete sentences using end-of-sentence detection with a 0.5-second timeout fallback.
  • Session resumption: The service automatically handles session resumption on reconnection using session resumption handles.
  • Connection resilience: The service will attempt up to 3 consecutive reconnections before treating a connection failure as fatal.
  • Video frame rate: Video frames are throttled to a maximum of one per second.
  • Affective dialog and proactivity: These features require both a supporting model and API version (v1alpha).
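
The transcription aggregation behavior can be illustrated with a simplified sketch. This is an assumption about the general shape of the logic, not pipecat's actual implementation; `flush()` stands in for the 0.5-second timeout fallback:

```python
import re

# End-of-sentence detection: terminal punctuation at the end of a chunk.
SENTENCE_END = re.compile(r"[.!?]\s*$")

class TranscriptAggregator:
    """Collect transcript chunks until a sentence boundary is seen."""

    def __init__(self):
        self._parts = []

    def add_chunk(self, chunk):
        """Append a chunk; return the completed sentence, or None if still open."""
        self._parts.append(chunk)
        if SENTENCE_END.search(chunk):
            return self.flush()
        return None

    def flush(self):
        """Emit whatever has accumulated (the timeout fallback path)."""
        if not self._parts:
            return None
        text = "".join(self._parts).strip()
        self._parts = []
        return text

agg = TranscriptAggregator()
agg.add_chunk("So what I ")        # still open: returns None
agg.add_chunk("wanted to ask is")  # still open: returns None
agg.add_chunk(" this?")            # sentence complete: returns the full text
```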