VideoSDK AI Agents

Open-source framework for building real-time multimodal conversational AI agents.

The VideoSDK AI Agents framework connects your infrastructure, agent worker, VideoSDK room, and user devices, enabling real-time, natural voice and multimodal interactions between users and intelligent agents.

Overview

The AI Agent SDK is a Python framework built on top of the VideoSDK Python SDK that enables AI-powered agents to join VideoSDK rooms as participants. This SDK serves as a real-time bridge between AI models (like OpenAI or Gemini) and your users, facilitating seamless voice and media interactions.

🎙️ Agent with Cascading Pipeline: Test an AI Voice Agent that uses a Cascading Pipeline for STT → LLM → TTS.
📞 AI Telephony Agent: Test an AI Agent that answers and interacts over phone calls using SIP.
💻 Agent Documentation: The official VideoSDK Agent documentation.
📚 SDK Reference: Reference docs for the Agents framework.

#	Feature	Description
1	🎤 Real-time Communication (Audio/Video)	Agents can listen, speak, and interact live in meetings.
2	📞 SIP & Telephony Integration	Seamlessly connect agents to phone systems via SIP for call handling, routing, and PSTN access.
3	🧍 Virtual Avatars	Add lifelike avatars to enhance interaction and presence using Simli.
4	🤖 Multi-Model Support	Integrate with OpenAI, Gemini, AWS Nova Sonic, and more.
5	🧩 Cascading Pipeline	Integrates with different providers of STT, LLM, and TTS seamlessly.
6	⚡ Realtime Pipeline	Use unified realtime models (OpenAI Realtime, AWS Nova Sonic, Gemini Live) for the lowest latency.
7	🧠 Conversational Flow	Manages turn detection and VAD for smooth interactions.
8	🛠️ Function Tools	Extend agent capabilities with event scheduling, expense tracking, and more.
9	🌐 MCP Integration	Connect agents to external data sources and tools using Model Context Protocol.
10	🔗 A2A Protocol	Enable agent-to-agent interactions for complex workflows.
11	📊 Observability	Built-in OpenTelemetry tracing and metrics collection.
12	🚀 CLI Tool	Run and test agents locally with the `videosdk` CLI.

Important

Star VideoSDK Repositories ⭐️

Get instant notifications for new releases and updates. Your support helps us grow and improve VideoSDK!

Pre-requisites

Before you begin, ensure you have:

- A VideoSDK authentication token (generate one at app.videosdk.live)
- A VideoSDK meeting ID (generate one using the Create Room API or through the VideoSDK dashboard)
- Python 3.12 or higher
- Third-party API keys for the services you intend to use (e.g., OpenAI for LLM/STT/TTS, ElevenLabs for TTS, Google for Gemini)
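
The checklist above can be verified with a small preflight script before you start building. This is a minimal sketch using only the standard library; the helper names are hypothetical, and the two environment variable names are the ones mentioned in the code comments of the later steps, so adjust them to the providers you actually use.

```python
import os
import sys

# Assumption: these are the variables your setup relies on; edit as needed.
REQUIRED_ENV_VARS = ("VIDEOSDK_AUTH_TOKEN", "GOOGLE_API_KEY")


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True when the interpreter is Python 3.12 or newer."""
    return tuple(version_info[:2]) >= (3, 12)


def missing_env_vars(names=REQUIRED_ENV_VARS, env=None) -> list[str]:
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    if not python_is_supported():
        print("Python 3.12 or higher is required.")
    for name in missing_env_vars():
        print(f"Missing environment variable: {name}")
```

Running the script prints one line per unmet requirement and nothing when everything is in place.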

Installation

Create and activate a virtual environment with Python 3.12 or higher.
macOS / Linux
```
python3 -m venv venv
source venv/bin/activate
```
Windows
```
python -m venv venv
venv\Scripts\activate
```
Install the core VideoSDK AI Agent package
```
pip install videosdk-agents
```
Install Optional Plugins. Plugins help integrate different providers for Realtime, STT, LLM, TTS, and more. Install what your use case needs:
```
# Example: Install the Turn Detector plugin
pip install videosdk-plugins-turn-detector
```
👉 Supported plugins (Realtime, LLM, STT, TTS, VAD, Avatar, SIP) are listed in the Supported Libraries section below.

Generating a VideoSDK Meeting ID

Before your AI agent can join a meeting, you'll need to create a meeting ID. You can generate one using the VideoSDK Create Room API:

Using cURL

curl -X POST https://api.videosdk.live/v2/rooms \
  -H "Authorization: YOUR_JWT_TOKEN_HERE" \
  -H "Content-Type: application/json"

For more details on the Create Room API, refer to the VideoSDK documentation.
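
If you would rather create the room from Python, the cURL command above can be mirrored with the standard library. This is a sketch rather than an official client: the function names are hypothetical, and the `roomId` response field is an assumption that you should confirm against the Create Room API documentation.

```python
import json
import urllib.request

CREATE_ROOM_URL = "https://api.videosdk.live/v2/rooms"


def build_create_room_request(token: str) -> urllib.request.Request:
    """Build the same POST request as the cURL command, without sending it."""
    return urllib.request.Request(
        CREATE_ROOM_URL,
        headers={"Authorization": token, "Content-Type": "application/json"},
        method="POST",  # like the cURL command, no request body is sent
    )


def create_room(token: str) -> str:
    """Send the request and return the new meeting ID.

    Assumption: the JSON response carries the ID under the "roomId" key.
    """
    request = build_create_room_request(token)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)["roomId"]
```

Calling `create_room("YOUR_JWT_TOKEN_HERE")` with a valid token performs the network call; `build_create_room_request` can be inspected offline.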

Getting Started: Your First Agent

Quick Start

Now that you've installed the necessary packages, you're ready to build!

Step 1: Creating a Custom Agent

First, let's create a custom voice agent by inheriting from the base Agent class:

from videosdk.agents import Agent, function_tool

# External tool: get_weather is defined in Step 2 and must be defined
# before VoiceAgent is instantiated.
# async def get_weather(latitude: str, longitude: str): ...

class VoiceAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="You are a helpful voice assistant that can answer questions and help with tasks.",
            tools=[get_weather],  # Register any external tool defined outside this class
        )

    async def on_enter(self) -> None:
        """Called when the agent first joins the meeting"""
        await self.session.say("Hi there! How can I help you today?")

    async def on_exit(self) -> None:
        """Called when the agent exits the meeting"""
        await self.session.say("Goodbye!")

This code defines a basic voice agent with:

Custom instructions that define the agent's personality and capabilities
An entry message when joining a meeting
An exit message when leaving the meeting

Step 2: Implementing Function Tools

Function tools allow your agent to perform actions beyond conversation. There are two ways to define tools:

External Tools: Defined as standalone functions outside the agent class and registered via the tools argument in the agent's constructor.
Internal Tools: Defined as methods inside the agent class and decorated with @function_tool.

Below is an example of both:

import aiohttp

# External function tool
# Declared with "async def" because the body uses "async with" and "await".
@function_tool
async def get_weather(latitude: str, longitude: str):
    print(f"Getting weather for {latitude}, {longitude}")
    url = f"https://api.open-meteo.com/v1/forecast?latitude={latitude}&longitude={longitude}&current=temperature_2m"
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            if response.status == 200:
                data = await response.json()
                return {
                    "temperature": data["current"]["temperature_2m"],
                    "temperature_unit": "Celsius",
                }
            else:
                raise Exception(
                    f"Failed to get weather data, status code: {response.status}"
                )

class VoiceAgent(Agent):
    # ... previous code ...

    # Internal function tool
    @function_tool
    async def get_horoscope(self, sign: str) -> dict:
        horoscopes = {
            "Aries": "Today is your lucky day!",
            "Taurus": "Focus on your goals today.",
            "Gemini": "Communication will be important today.",
        }
        return {
            "sign": sign,
            "horoscope": horoscopes.get(sign, "The stars are aligned for you today!"),
        }

Use external tools for reusable, standalone functions (registered via tools=[...]).
Use internal tools for agent-specific logic as class methods.
Both must be decorated with @function_tool for the agent to recognize and use them.
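
The external tool above combines URL building, network I/O, and response parsing in one function. One possible refinement, shown here as a sketch, is to keep the non-network parts in plain functions so they can be checked offline. The helper names are hypothetical and not part of the framework; the URL and the response shape are the ones used by `get_weather` above.

```python
from urllib.parse import urlencode

OPEN_METEO_URL = "https://api.open-meteo.com/v1/forecast"


def build_weather_url(latitude: str, longitude: str) -> str:
    """Build the forecast URL requested by get_weather."""
    query = urlencode(
        {"latitude": latitude, "longitude": longitude, "current": "temperature_2m"}
    )
    return f"{OPEN_METEO_URL}?{query}"


def parse_weather(data: dict) -> dict:
    """Extract the fields the tool returns from a decoded JSON response."""
    return {
        "temperature": data["current"]["temperature_2m"],
        "temperature_unit": "Celsius",
    }
```

With these helpers, `get_weather` only has to fetch `build_weather_url(...)` and pass the decoded JSON to `parse_weather`.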

Step 3: Setting Up the Pipeline

The pipeline connects your agent to an AI model. Here, we are using Google's Gemini for a Real-time Pipeline. You could also use a Cascading Pipeline.

from videosdk.plugins.google import GeminiRealtime, GeminiLiveConfig
from videosdk.agents import RealTimePipeline, JobContext

async def start_session(context: JobContext):
    # Initialize the AI model
    model = GeminiRealtime(
        model="gemini-2.0-flash-live-001",
        # If GOOGLE_API_KEY is set in .env, omit the api_key argument
        api_key="AKZSXXXXXXXXXXXXXXXXXXXX",  # placeholder, replace with your own key
        config=GeminiLiveConfig(
            voice="Leda", # Puck, Charon, Kore, Fenrir, Aoede, Leda, Orus, and Zephyr.
            response_modalities=["AUDIO"]
        )
    )

    pipeline = RealTimePipeline(model=model)

    # Continue to the next steps...

Step 4: Assembling and Starting the Agent Session

Now, let's put everything together and start the agent session:

import asyncio
from videosdk.agents import AgentSession, WorkerJob, RoomOptions, JobContext

async def start_session(context: JobContext):
    # ... previous setup code ...

    # Create the agent session
    session = AgentSession(
        agent=VoiceAgent(),
        pipeline=pipeline
    )

    try:
        await context.connect()
        # Start the session
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
        room_id="<meeting_id>",  # Replace with your actual meeting ID
        auth_token="<VIDEOSDK_AUTH_TOKEN>",  # If VIDEOSDK_AUTH_TOKEN is set in .env, omit auth_token
        name="Test Agent",
        playground=True,
        # vision=True  # Only available when using the Google Gemini Live API
    )
    
    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
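
The `try` / `finally` shape in `start_session` is what guarantees cleanup: `asyncio.Event().wait()` never returns on its own because nothing sets the event, so the coroutine waits until it is cancelled, and the `finally` block then runs. The standalone sketch below illustrates that behavior with the standard library only. The list entries stand in for `session.start()`, `session.close()`, and `context.shutdown()`; none of this is framework code.

```python
import asyncio


async def run_until_cancelled(events: list[str]) -> None:
    """Mimic the start / wait / clean-up shape of start_session."""
    try:
        events.append("started")      # stands in for session.start()
        await asyncio.Event().wait()  # never set, so this waits until cancelled
    finally:
        events.append("cleaned up")   # stands in for close() and shutdown()


async def demo() -> list[str]:
    events: list[str] = []
    task = asyncio.create_task(run_until_cancelled(events))
    await asyncio.sleep(0)            # give the task a chance to reach the wait
    task.cancel()                     # stands in for a manual stop
    try:
        await task
    except asyncio.CancelledError:
        pass                          # expected: the task was cancelled on purpose
    return events
```

Running `asyncio.run(demo())` returns the recorded events in order, showing that the cleanup step runs even though the wait was interrupted.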

Step 5: Connecting with VideoSDK Client Applications

After setting up your AI Agent, you'll need a client application to connect with it. You can use any of the VideoSDK quickstart examples to create a client that joins the same meeting.

When setting up your client application, make sure to use the same meeting ID that your AI Agent is using.

Step 6: Running the Project

Once you have completed the setup, you can run your AI Voice Agent project using Python. Make sure your .env file is properly configured and all dependencies are installed.

python main.py

Tip

Test Your Agent Instantly with the CLI Tool

Run your agent locally using:

python main.py console

Experience real-time interactions right from your terminal - no meeting room required!
Speak and listen through your system’s mic and speakers for quick testing and rapid development.

Step 7: Deployment

For deployment options and guides, check out the official Deployment documentation.

Supported Libraries and Plugins

The framework supports integration with various AI models and tools, across multiple categories:

Category	Services
Real-time Models	OpenAI | Gemini | AWS Nova Sonic | Azure Voice Live
Speech-to-Text (STT)	OpenAI | Google | Azure AI Speech | Azure OpenAI | Sarvam AI | Deepgram | Cartesia | AssemblyAI | Navana
Language Models (LLM)	OpenAI | Azure OpenAI | Google | Sarvam AI | Anthropic | Cerebras
Text-to-Speech (TTS)	OpenAI | Google | AWS Polly | Azure AI Speech | Azure OpenAI | Deepgram | Sarvam AI | ElevenLabs | Cartesia | Resemble AI | Smallest AI | Speechify | InWorld | Neuphonic | Rime AI | Hume AI | Groq | LMNT AI | Papla Media
Voice Activity Detection (VAD)	SileroVAD
Turn Detection Model	Namo Turn Detector
Virtual Avatar	Simli
Denoise	RNNoise

Tip

Installation Examples

# Install with specific plugins (the quotes keep shells such as zsh from expanding the brackets)
pip install "videosdk-agents[openai,elevenlabs,silero]"

# Install individual plugins
pip install videosdk-plugins-anthropic
pip install videosdk-plugins-deepgram

Examples

Explore the following examples to see the framework in action:

🤖 AI Voice Agent Use Cases

📞 AI Telephony Agent Quickstart. Use case: hospital appointment booking via a voice-enabled agent.
✈️ AI WhatsApp Agent Quickstart. Use case: ask about available hotel rooms and book on the go.
👨‍🏫 Multi Agent System. Use case: customer care agent that transfers loan-related queries to a Loan Specialist Agent.
🛒 Agent with Knowledge (RAG). Use case: agent that answers questions based on documentation knowledge.
👨‍🏫 Agent with MCP Server. Use case: Stock Market Analyst Agent with real-time market data access.
🛒 Virtual Avatar Agent. Use case: a Virtual Avatar Agent that presents the weather forecast.

Documentation

For comprehensive guides and API references:

📄 Official Documentation

Complete framework documentation

📝 API Reference

Detailed API documentation

📂 Examples Directory

Additional code examples

Contributing

We welcome contributions! Here's how you can help:

🐞 Report Issues

Open an issue for bugs or feature requests

🔀 Submit PRs

Create a pull request with improvements

🛠️ Build Plugins

Follow our plugin development guide

💬 Join Community

Connect with us on Discord

The framework is under active development, so contributions in the form of new plugins, features, bug fixes, or documentation improvements are highly appreciated.

🛠️ Building Custom Plugins

Want to integrate a new AI provider? Check out BUILD_YOUR_OWN_PLUGIN.md for:

Step-by-step plugin creation guide
Directory structure and file requirements
Implementation examples for STT, LLM, and TTS
Testing and submission guidelines

Community & Support

Stay connected with VideoSDK:

💬 Discord

Join our community

🐦 Twitter

@video_sdk

▶️ YouTube

VideoSDK Channel

🔗 LinkedIn

VideoSDK Company

Tip

Support the Project! ⭐️
Star the repository, join the community, and help us improve VideoSDK by providing feedback, reporting bugs, or contributing plugins.

Made with ❤️ by The VideoSDK Team
