voiceai

English version 中文版本

A curated, developer-friendly learning path for building real-time voice AI agents, from your first STT call to scaling production telephony.

Voice AI has moved from research demos into shipping products in under three years. The modern stack is converging around a clear pattern: a real-time transport layer (WebRTC or telephony), a streaming pipeline of speech-to-text → LLM → text-to-speech, and a turn-taking model that decides when the agent should speak. This list is structured to mirror that learning order: start with the foundations, pick a framework, then drill into individual components and production concerns.

Resources are tagged 🟢 Beginner, 🟡 Intermediate, or 🔴 Advanced. The list prefers free official docs and vendor-neutral guides, and flags where authors have commercial interests.

How to use this list

Read top-to-bottom if you’re brand new. The recommended path:

Foundations → understand the pipeline and latency budget
Frameworks → pick one (LiveKit Agents or Pipecat are the safest open-source bets) and ship a hello-world
Components (STT, TTS, LLM, VAD, turn detection) → swap pieces to learn what each layer does
Transport & telephony → connect to a real phone number
Evaluation, production, ethics → make it safe enough to ship

📘 Companion book: Voice Agents Handbook

If you want this material in a tighter, opinionated, production-grade form, I wrote the Voice Agents Handbook: building production voice AI with LiveKit, plus appendices on choosing your stack and the LiveKit ecosystem beyond agents. Ships June 1, 2026 on Kindle.

The README you’re reading collects the field’s best free resources. The book is the curated path through them, with the patterns I’ve used shipping voice agents for tradespeople, lawyers, and immigration consultants.

Disclosure: I maintain this repo and authored the handbook. Free sample (Introduction + Chapter 1) at handbook.mahimai.ca.

1. Foundational concepts and learning paths

Start here. These resources establish the mental model of the voice agent pipeline and the latency budget you’ll fight for the rest of your career.

Voice AI & Voice Agents An Illustrated Primer Kwindla Hultman Kramer’s free, regularly-updated long-form primer. The de facto textbook for the field. 🟢 Beginner
Voice Agent Architecture: STT, LLM, and TTS Pipelines Explained (LiveKit) Visual walkthrough of streaming patterns, turn detection, and where latency accumulates. 🟢 Beginner
Everything You Need to Know About Voice AI Agents (Deepgram) End-to-end primer covering feature extraction, ASR, LLM reasoning, and synthesis. 🟢 Beginner
AI Voice Agents (LiveKit Docs) The canonical “what is a voice agent” reference, covering pipeline vs multimodal and agent state. 🟢 Beginner
Core Latency in AI Voice Agents (Twilio) Visual explanation of end-of-turn detection, silence thresholds, and smart endpointing. 🟢 Beginner
Advice on Building Voice AI in June 2025 (Daily.co) Practical P50/P95 latency-budget guidance from Pipecat’s creators. 🟡 Intermediate
How Intelligent Turn Detection Solves the Biggest Challenge in Voice Agents (AssemblyAI) Endpointing is the most underestimated problem; this is the clearest deep-dive. 🟡 Intermediate
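
To make the "latency budget" idea concrete, here is a minimal sketch that adds up per-stage delays for one chained voice-to-voice turn and reports how far the total sits from a target. The stage names, the millisecond values, and the 800 ms target are placeholder assumptions for illustration, not measurements of any provider or recommendations from the resources above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    latency_ms: float


def total_latency_ms(stages):
    """Sum the per-stage delays of one voice-to-voice turn."""
    return sum(stage.latency_ms for stage in stages)


def slowest_stage(stages):
    """Return the stage that contributes the most delay."""
    return max(stages, key=lambda stage: stage.latency_ms)


def over_budget_ms(stages, budget_ms):
    """Positive result: over budget. Negative result: headroom left."""
    return total_latency_ms(stages) - budget_ms


# Placeholder numbers for illustration only (assumptions, not benchmarks).
EXAMPLE_STAGES = [
    Stage("transport: mic to server", 40.0),
    Stage("end-of-turn detection", 200.0),
    Stage("STT final transcript", 150.0),
    Stage("LLM time to first token", 300.0),
    Stage("TTS time to first byte", 150.0),
    Stage("transport: server to speaker", 40.0),
]

if __name__ == "__main__":
    print("total ms:", total_latency_ms(EXAMPLE_STAGES))
    print("slowest:", slowest_stage(EXAMPLE_STAGES).name)
    print("over budget by ms:", over_budget_ms(EXAMPLE_STAGES, budget_ms=800.0))
```

Replacing the placeholder values with your own measured percentiles shows which stage is worth optimizing first.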

2. Frameworks and orchestration platforms

The frameworks below all let you wire STT, an LLM, and TTS together. For open-source production work, LiveKit Agents and Pipecat are the two safest bets; for managed dashboards, Vapi, Retell, and Bland win on time-to-first-call.

Open-source frameworks

LiveKit Agents Voice AI Quickstart Working assistant in <10 min via Python or TypeScript, runs on top of WebRTC. 🟢 Beginner
Pipecat Quickstart Scaffolds a Deepgram + OpenAI + Cartesia pipeline you can talk to in the browser in 5 minutes. 🟢 Beginner
Ultravox (fixie-ai/ultravox) Open-weight multimodal speech LLM (Llama/Gemma/Qwen variants) that skips the separate ASR stage for ~150 ms TTFT. 🔴 Advanced

Managed platforms

Vapi Quickstart Dashboard-first; ship an agent on a free US phone number in under 5 minutes. 🟢 Beginner
Retell AI Introduction & Quickstart Phone-agent platform with $10 free credit on signup. 🟢 Beginner
Bland AI Send Your First Phone Call Minimal API tutorial for placing your first AI phone call. 🟢 Beginner
ElevenLabs Conversational AI Quickstart Build and embed a voice agent widget on any website in 5 minutes. 🟢 Beginner

Realtime / speech-to-speech APIs

OpenAI Realtime API Guide Official guide to gpt-realtime over WebRTC, WebSockets, or SIP. 🟡 Intermediate
Google Gemini Live API Overview Low-latency, bidirectional voice + vision agents with barge-in and tool use. 🟡 Intermediate
Twilio ConversationRelay WebSocket bridge that handles STT/TTS so you focus on LLM logic; works with any LLM. 🟡 Intermediate

Vendor-neutral comparisons

Vapi vs Pipecat vs LiveKit (AssemblyAI) Architecture-focused comparison of pipeline control and transport choices. 🟡 Intermediate
11 Voice Agent Platforms Compared (Softcery) Broad market map with use-case recommendations. 🟢 Beginner
Best Voice Agent Stack (Hamming AI) Buy-vs-build framework with concrete cost, latency, and time-to-launch numbers. 🟡 Intermediate

3. Speech-to-text (STT / ASR)

Pick one streaming STT and learn it deeply before shopping around. Deepgram, AssemblyAI, and Whisper-derivatives cover most use cases.

Commercial APIs

Deepgram Nova-3 STT benchmarks Primer on WER, latency, and cost alongside Deepgram’s product reference. 🟢 Beginner
AssemblyAI Universal-Streaming Streaming STT walkthrough that doubles as a function-calling tutorial. 🟡 Intermediate
OpenAI Whisper / gpt-4o-transcribe API docs Easiest cloud STT if you already use OpenAI. 🟢 Beginner
Soniox multilingual benchmark Public WER comparison across 60 languages. 🟢 Beginner
Cartesia Ink Streaming STT paired with Sonic TTS for a single-vendor low-latency stack. 🟢 Beginner

Open source

openai/whisper The original repo and the de facto starting point for any DIY ASR project. 🟢 Beginner
SYSTRAN/faster-whisper CTranslate2 reimplementation up to 4× faster with INT8; recommended for self-hosted Whisper. 🟡 Intermediate
NVIDIA NeMo (Parakeet / Canary) Top-of-leaderboard open ASR models with streaming inference recipes. 🔴 Advanced
Moonshine Tiny on-device ASR (~190 MB) optimized for live streaming on edge devices. 🟡 Intermediate

Benchmarks and explainers

Open ASR Leaderboard (HuggingFace) Community leaderboard across 11 datasets; your reference for open-source picks. 🟢 Beginner
Artificial Analysis Speech-to-Text Independent leaderboard ranking 48+ STT providers by WER, speed, and cost. 🟢 Beginner
Streaming vs Batch ASR (Arun Baby) Engineer-friendly explainer of RNN-T and Conformer streaming architectures. 🟡 Intermediate
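
Most of the benchmarks above rank models by word error rate (WER): substitutions, deletions, and insertions divided by the number of words in the reference transcript. The sketch below computes it with a word-level edit distance. It assumes whitespace tokenization and lowercasing only; real leaderboards apply their own text normalization (punctuation, numerals, casing), so its numbers will not match theirs exactly.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("book a table for two", "book the table for two"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.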

4. Text-to-speech (TTS)

Latency, not raw quality, is what kills voice agents: prioritize providers offering true streaming with first-byte under 200 ms.

Commercial APIs

ElevenLabs Docs Industry-leading quality, voice cloning, and Conversational AI in one SDK. 🟢 Beginner
Cartesia Sonic Quickstart Sub-100 ms first-byte latency, designed specifically for voice agents. 🟢 Beginner
Deepgram Aura Low-latency streaming TTS that pairs cleanly with Deepgram STT. 🟢 Beginner
OpenAI TTS (gpt-4o-mini-tts) Easiest plug-in TTS for the OpenAI stack. 🟢 Beginner
Artificial Analysis TTS leaderboard ELO, price, and speed comparison covering Rime, PlayHT, Hume, Inworld, and others. 🟢 Beginner

Open source

Coqui TTS (idiap fork) Maintained fork of Coqui-TTS / XTTS v2; the most battle-tested OSS TTS toolkit. 🟡 Intermediate
Piper (OHF-Voice/piper1-gpl) Fast local neural TTS optimized for Raspberry Pi; perfect for offline projects. 🟢 Beginner
Kokoro 82M Tiny Apache-licensed model that tops community ELO arenas; runs on CPU. 🟢 Beginner
F5-TTS Diffusion-transformer TTS with high-quality zero-shot voice cloning. 🟡 Intermediate
Orpheus-TTS Llama-3B-based emotive TTS with ~200 ms streaming and emotion tags. 🟡 Intermediate
Sesame CSM Conversational, context-aware multi-speaker TTS using a Llama backbone with the Mimi codec. 🔴 Advanced

Streaming and ethics

Streaming TTS for Low-Latency Agents (Picovoice) Clear taxonomy of single, output-streaming, and dual-streaming TTS. 🟡 Intermediate
Ethics of Voice Cloning & Deepfakes (Deepgram) Vendor-neutral discussion of misuse, regulation, and developer responsibility. 🟢 Beginner
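
One common way to cut time-to-first-audio in a chained pipeline is to start synthesis as soon as the LLM has streamed its first complete sentence, rather than waiting for the whole reply. The sketch below shows that pattern with stand-ins: `fake_llm` and `fake_tts` are hypothetical placeholders, not any vendor's API, and the sentence splitter is deliberately naive (it would split on abbreviations such as "Dr.").

```python
from typing import Iterable, Iterator


def fake_llm(prompt: str) -> Iterator[str]:
    """Hypothetical stand-in for a streaming LLM: yields one token at a time."""
    for token in "Sure. Your table is booked for two.".split(" "):
        yield token + " "


def sentences(tokens: Iterable[str]) -> Iterator[str]:
    """Group streamed tokens into sentences as soon as each one completes."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith((".", "?", "!")):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        # Flush whatever is left when the stream ends mid-sentence.
        yield buffer.strip()


def fake_tts(text: str) -> bytes:
    """Hypothetical stand-in for a TTS call: returns 'audio' for one sentence."""
    return text.encode("utf-8")


def respond(transcript: str) -> Iterator[bytes]:
    """Yield audio sentence by sentence while the LLM is still generating."""
    for sentence in sentences(fake_llm(transcript)):
        yield fake_tts(sentence)


if __name__ == "__main__":
    for chunk in respond("Can I get a table for two?"):
        print(chunk)
```

Because every step is a generator, the first audio chunk is available after the first sentence, not after the full reply.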

5. LLMs for voice and real-time AI

A voice agent’s perceived intelligence is bounded by how fast the LLM streams its first token. Sub-300 ms TTFT changes the conversation feel entirely.

Low-latency inference

Groq LPU-based inference cloud delivering ~10× faster Llama tokens/sec than commodity GPUs. 🟢 Beginner
Cerebras Inference Wafer-scale chip inference with very high throughput on Llama models. 🟢 Beginner
SambaNova Cloud Reconfigurable Dataflow inference; stable throughput at low latency. 🟢 Beginner

Speech-to-speech models

OpenAI Realtime API guide Flagship S2S product with WebRTC/WebSocket transport. 🟡 Intermediate
Google Gemini Live Real-time multimodal voice/video with barge-in and 70-language support. 🟡 Intermediate
Moshi (kyutai-labs) Open-source full-duplex speech-text foundation model with 200 ms latency; the premier OSS S2S model to study. 🔴 Advanced

Voice-specific prompting and tools

OpenAI Voice Agents Guide Compares chained vs S2S architectures with prompt and tool best practices. 🟢 Beginner
ElevenLabs Voice Agent Prompting Guide Production-grade prompt structure tuned for voice; vendor-neutral lessons. 🟡 Intermediate
Voice AI Prompt Engineering Guide (VoiceInfra) Explains why voice prompts must be 60–70% shorter than chat prompts, with templates. 🟢 Beginner
Function Calling for Voice Agents (LiveKit Docs) Concise guide to defining tools and RPC inside a voice agent. 🟡 Intermediate

6. Voice activity detection and turn-taking

Pure VAD is no longer enough: modern agents combine acoustic VAD with a small semantic model that predicts end-of-utterance from words and prosody.

Silero VAD MIT-licensed pre-trained VAD; <1 ms per chunk on CPU. The de facto VAD inside LiveKit and Pipecat. 🟢 Beginner
py-webrtcvad Python bindings for Google’s classic WebRTC VAD; lightweight baseline. 🟢 Beginner
LiveKit Turn Detector blog post How a SmolLM-based EOU model complements VAD with semantic context. 🟡 Intermediate
LiveKit turn-detector model on HuggingFace Open-weights multilingual EOU model running ONNX on CPU in under 500 MB. 🟡 Intermediate
Pipecat Smart Turn v3 Whisper-Tiny-based audio semantic VAD with 12 ms CPU inference, BSD-2 licensed. 🟡 Intermediate
pipecat-ai/smart-turn Repo with model code, training scripts, and integration examples. 🟡 Intermediate
The Complete Guide to AI Turn-Taking (Tavus) Reader-friendly overview of why pure VAD fails in real conversations. 🟢 Beginner
Tackling Turn Detection in Voice AI (Notch) Engineer-first walkthrough combining VAD probability, volume, and TTS markers. 🟡 Intermediate
ai-coustics VAD VAD bundled with real-time speech enhancement, noise suppression, and voice isolation in a single audio preprocessing SDK; useful when you want cleanup and turn-taking signals from the same component. 🟢 Beginner
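
The combination described at the top of this section, acoustic silence plus a semantic end-of-utterance (EOU) score, can be sketched as a small decision rule. This is an illustrative assumption, not the algorithm used by LiveKit, Pipecat, or any other project listed here: the thresholds are made-up defaults, and `eou_probability` stands for the output of whatever EOU model you plug in.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TurnConfig:
    # All three values are hypothetical defaults; tune them on real calls.
    min_silence_ms: int = 200    # shorter pauses never end the turn
    max_silence_ms: int = 1500   # longer pauses always end the turn
    eou_threshold: float = 0.6   # semantic score needed in between


def should_end_turn(
    silence_ms: int,
    eou_probability: float,
    cfg: TurnConfig = TurnConfig(),
) -> bool:
    """Decide whether the agent may start speaking.

    silence_ms: trailing silence reported by the VAD.
    eou_probability: score in [0, 1] from a semantic end-of-utterance model.
    """
    if silence_ms < cfg.min_silence_ms:
        return False  # the user is probably still mid-word or mid-breath
    if silence_ms >= cfg.max_silence_ms:
        return True   # hard timeout, regardless of what the text looks like
    return eou_probability >= cfg.eou_threshold


if __name__ == "__main__":
    # "I'd like to book a table for..." followed by a 400 ms pause.
    print(should_end_turn(silence_ms=400, eou_probability=0.2))
    # "I'd like to book a table for two." followed by the same pause.
    print(should_end_turn(silence_ms=400, eou_probability=0.9))
```

The hard timeout keeps the agent from waiting forever when the semantic model is unsure; the minimum silence keeps it from barging in on short pauses.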

7. Audio enhancement and noise suppression

The audio reaching your VAD and STT is often noisy, reverberant, or mixed with background voices. Cleaning the signal before the rest of the pipeline is frequently the difference between an agent that ships and one that frustrates users in real-world conditions (cars, cafés, call centres).

ai-coustics Real-time speech enhancement SDK covering noise cancellation, voice isolation, and VAD; on-device and cloud deployment. See the docs and developer platform. 🟢 Beginner

8. WebRTC fundamentals

WebRTC is the default transport for voice agents that don’t run over the phone network. Understanding ICE, STUN, TURN, and SFU architecture is non-negotiable for production work.

MDN WebRTC API Authoritative free reference for RTCPeerConnection, getUserMedia, and signaling. 🟢 Beginner
MDN: Introduction to WebRTC Protocols Beginner-friendly explanation of ICE, STUN, TURN, and SDP. 🟢 Beginner
WebRTC.org Getting Started Official Google-maintained intro, splitting WebRTC into media-capture and connectivity. 🟢 Beginner
GetStream WebRTC for the Brave Free multi-module tutorial covering networking basics through advanced topics. 🟢 Beginner
Why WebRTC Beats WebSockets for Voice AI (LiveKit) 2025 explainer aimed at AI builders, comparing transports in plain English. 🟡 Intermediate
Daily Docs Intro to Video Architecture (P2P vs SFU) One of the clearest beginner write-ups of P2P vs SFU. 🟢 Beginner
Agora How WebRTC Works Side-by-side WebRTC vs WebSockets walkthrough with signaling diagrams. 🟢 Beginner

9. Telephony and SIP

The phone network has its own physics. Once you know which SIP trunk provider to point at LiveKit or Pipecat, you can ship.

Twilio Programmable Voice TwiML, Voice API, and PSTN connectivity in one hub; the default starting point. 🟢 Beginner
Twilio: Voice AI Assistant with OpenAI Realtime + Python Step-by-step junior-friendly tutorial wiring Twilio Media Streams to an LLM. 🟢 Beginner
Twilio SIP Quickstart Clearest beginner explainer of SIP basics, SIP Domains, and softphone setup. 🟢 Beginner
Telnyx Voice API Strong Twilio alternative with WebSocket media streaming and AI Assistant tooling. 🟢 Beginner
Telnyx How to Set Up a SIP Trunk Friendly walkthrough of SIP trunking architecture, codecs, and authentication. 🟢 Beginner
Plivo Voice API Documentation XML call control and audio-streaming integrations for AI agents. 🟢 Beginner
SignalWire Voice Docs Built on FreeSWITCH; SWML, TwiML-compatible API, and an AI Agents SDK. 🟡 Intermediate
LiveKit SIP Primer Best diagram of how a call flows from PSTN → trunk → SIP service → agent. 🟢 Beginner
LiveKit SIP Trunk Setup Practical guide for wiring Twilio/Telnyx/Plivo trunks into LiveKit. 🟡 Intermediate
Pipecat Telephony Overview Differences between WebSocket-based telephony and SIP-based call control. 🟡 Intermediate

10. Tutorials and hands-on projects

Pick one tutorial and finish it before starting another. Voice AI is unforgiving of half-built pipelines.

LiveKit Voice AI Quickstart Official 10-minute walkthrough in Python or Node with starter templates. 🟢 Beginner
Build Your First AI Voice Agent in Python (LiveKit) End-to-end Python tutorial covering streaming, latency, and deployment. 🟢 Beginner
Pipecat Quickstart Build and deploy a Deepgram + OpenAI + Cartesia bot in roughly 10 minutes. 🟢 Beginner
How to Build a Real-Time Voice Agent with Pipecat (AssemblyAI) Production-oriented walkthrough including local testing and Pipecat Cloud deployment. 🟡 Intermediate
Deepgram Build a Voice AI Agent Step-by-step guide wiring Deepgram STT, GPT, and Aura TTS. 🟢 Beginner
Build a Voice Assistant with Twilio ConversationRelay + LiteLLM Provider-agnostic tutorial supporting OpenAI, Anthropic, or DeepSeek. 🟡 Intermediate
freeCodeCamp Build Advanced AI Agents (LiveKit, Exa, LangChain) Free 3-part video course covering interactive voice agents end-to-end. 🟢 Beginner
freeCodeCamp Private On-Device Voice Assistant Hands-on local stack with Whisper, a local LLM, and system TTS. 🟡 Intermediate

11. GitHub starter repos and awesome lists

Clone these instead of writing boilerplate from scratch.

livekit/agents The flagship open-source Python/Node framework for production voice agents. 🟢 → 🔴
pipecat-ai/pipecat Vendor-neutral framework with 40+ STT/LLM/TTS service plugins. 🟢 → 🔴
livekit-examples/agent-starter-python Production-ready starter with Dockerfile, eval suite, turn detector, and core plugins. 🟢 Beginner
livekit-examples (org) Official collection of LiveKit Python/React/Swift/Android starters. 🟢 Beginner
pipecat-ai/pipecat-examples Sample apps for push-to-talk, websocket, telephony, and multimodal use cases. 🟢 → 🟡
elevenlabs/elevenlabs-examples Runnable Next.js and Python examples for TTS, STT, and real-time agents. 🟢 Beginner
vocodedev/vocode-core Open-source modular framework for voice-LLM agents on phone, Zoom, or system audio. 🟡 Intermediate (less actively maintained than LiveKit/Pipecat)
kwindla/macos-local-voice-agents Pipecat example hitting sub-800 ms voice-to-voice latency entirely on M-series Macs. 🟡 Intermediate
zzw922cn/awesome-speech-recognition-speech-synthesis-papers Comprehensive curated index of ASR, TTS, voice conversion, and speech-LLM papers. 🟡 Intermediate
wildminder/awesome-ai-voice Up-to-date 2025–2026 list of open-source TTS and voice-cloning models.
CorentinJ/Real-Time-Voice-Cloning Classic 5-second voice cloning project for understanding TTS fundamentals. 🟡 Intermediate

12. Datasets and benchmarks

You’ll rarely train from scratch, but knowing which dataset a model was trained on explains its accents, languages, and failure modes.

LibriSpeech ASR Corpus ~1,000 hours of English audiobooks; nearly every ASR paper benchmarks against it. 🟢 Beginner
Mozilla Common Voice Crowdsourced multilingual dataset (100+ languages); the easiest legal way to fine-tune ASR. 🟢 Beginner
Common Voice on HuggingFace One-line load_dataset() access for hands-on experiments. 🟢 Beginner
Open ASR Leaderboard Live comparison of 60+ ASR models on WER and real-time factor. 🟢 Beginner
Artificial Analysis Speech Independent benchmarks of commercial STT and TTS providers. 🟢 Beginner
LJSpeech Dataset ~24 hours of single-speaker English audio; baseline corpus for Tacotron 2 and VITS. 🟢 Beginner
VCTK Corpus ~110 English speakers with diverse accents; widely used for multi-speaker TTS. 🟡 Intermediate
VoxCeleb (Oxford VGG) Million-utterance “in the wild” dataset for speaker identification and verification. 🟡 Intermediate

13. Beginner-accessible research papers

These are the landmark papers behind the models you’ll actually use. Read the Whisper and Common Voice papers first; they’re unusually approachable.

Whisper Robust Speech Recognition via Large-Scale Weak Supervision (2022) Behind the most popular open ASR model; unusually clear prose for an ML paper. 🟡 Intermediate
HuggingFace Whisper fine-tuning blog (companion) Hands-on walkthrough that lets you “feel” the Whisper paper in code. 🟢 Beginner
VITS Conditional VAE with Adversarial Learning for End-to-End TTS (2021) The single-stage TTS model behind many open-source voice cloners. 🟡 Intermediate
Tacotron 2 Natural TTS Synthesis (2017) Landmark seq2seq + WaveNet-vocoder paper that made neural TTS sound natural. 🟡 Intermediate
Conformer Convolution-augmented Transformer for ASR (2020) The architecture inside NVIDIA Parakeet, Canary, and many leaderboard models. 🟡 Intermediate
wav2vec 2.0 Self-Supervised Learning of Speech Representations (2020) Showed that pretraining on unlabeled audio drastically reduces labeled-data needs. 🟡 Intermediate
Common Voice A Massively-Multilingual Speech Corpus (2020) Short, accessible paper describing how Common Voice is built and validated. 🟢 Beginner
Open ASR Leaderboard preprint (2025) Reproducible benchmark of 60+ ASR models across 11 datasets; the modern landscape map. 🟡 Intermediate

14. Evaluation and testing

You can’t ship what you can’t measure. Voice-agent evaluation is fundamentally probabilistic: the same conversation can pass on one run and fail on the next, so simulation and statistics matter more than fixed test cases.

Coval Voice AI Testing Platform Defines the core voice-agent metrics: TTFB, WER, resolution rate, simulated accents, and interruptions. 🟢 Beginner
Coval How to Evaluate Voice Agents (Practical Guide) One of the most cited 2025 guides on probabilistic vs deterministic evaluation. 🟢 Beginner
Cekura Metrics Overview Predefined metrics, instruction-following checks, and simulation framework. 🟢 Beginner
Cekura Performance Testing for Voice Agents Practical 2025 guide on multi-turn simulation and edge-case generation. 🟡 Intermediate
Hamming AI Production-focused QA platform with simulation, load testing, and 50+ metrics. 🟡 Intermediate
Hamming Voice Agent Evaluation Metrics Guide Reference of latency percentiles, WER, MOS-style quality, and task completion with formulas. 🟡 Intermediate
LiveKit Understand and Improve Agent Latency Per-turn latency metrics (e2e, LLM TTFT, TTS TTFB) and where to optimize. 🟡 Intermediate
Twilio How Do You Know if Your Voice AI Agents Are Working? Vendor-neutral 2025 guide arguing for business-outcome metrics over raw WER/latency. 🟢 Beginner
Future AGI simulate-sdk Open-source voice AI simulation SDK for testing AI agents; generates synthetic conversations for evaluation. 🟡 Intermediate
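
Because results vary between runs, a single pass or fail says little; a pass rate over repeated simulated runs, reported with an uncertainty range, says more. The sketch below uses the Wilson score interval for that range. The choice of interval and the example counts (18 passes in 20 runs) are illustrative assumptions, not a method prescribed by any of the tools above.

```python
import math


def wilson_interval(passes: int, runs: int, z: float = 1.96):
    """Wilson score interval for a pass rate (z = 1.96 is roughly 95%)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    if not 0 <= passes <= runs:
        raise ValueError("passes must be between 0 and runs")

    p = passes / runs
    denom = 1 + z * z / runs
    center = (p + z * z / (2 * runs)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    )
    # Clamp to [0, 1] to absorb floating-point noise at the extremes.
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    # Hypothetical result: the same scenario simulated 20 times, 18 passes.
    low, high = wilson_interval(passes=18, runs=20)
    print(f"pass rate 0.90, plausible range {low:.2f} to {high:.2f}")
```

With only 20 runs the range is wide, which is the practical argument for running many simulations per scenario before comparing two agent versions.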

15. Production, deployment, and scaling

Real production voice infrastructure is the hardest unsolved problem in this space. Read these before quoting anyone a per-minute price.

LiveKit Deploy and scale agents on LiveKit Cloud Real-world write-up on stateful load balancing, autoscaling, and warm pools. 🟡 Intermediate
LiveKit Why You Shouldn’t Build Voice Agents Directly on Model APIs Honest breakdown of what raw model APIs don’t give you. 🟡 Intermediate
Latent Space OpenAI Realtime API: The Missing Manual Field-tested guide from Pipecat’s creator on Realtime API production realities. 🟡 Intermediate
TWIML Building Voice AI Agents That Don’t Suck (Kwindla Kramer) One-hour discussion on real production architecture and turn-taking. 🟡 Intermediate
AWS Voice Agents with Pipecat and Amazon Bedrock Full architecture walkthrough including latency optimization and Nova Sonic. 🟡 Intermediate
Deepgram STT API Pricing Breakdown Vendor-by-vendor per-minute economics; required reading before signing any contract. 🟢 Beginner
Sierra Shipping and Scaling AI Agents Case study on Sonos, SiriusXM, and OluKai voice deployments. 🟡 Intermediate
Sierra Constellation of Models How a leading CX company composes 15+ models per agent. 🟡 Intermediate
LiveKit Agent Observability Built-in tracing, transcripts, and per-stage latency for LiveKit Cloud. 🟢 Beginner

16. Ethics, safety, and regulation

If you’re shipping a voice agent in 2026, disclosure and consent are no longer optional. The FCC and EU AI Act both have teeth.

FCC AI-Generated Voices in Robocalls Illegal (Feb 2024) The landmark TCPA ruling every U.S. voice-agent dev must read. 🟢 Beginner
EU AI Act Article 50 (Transparency for Deepfakes & AI Interactions) Authoritative text of EU disclosure rules; takes effect August 2026. 🟡 Intermediate
European Commission Code of Practice on AI-Generated Content Official EU implementation guidance on watermarking and labelling. 🟡 Intermediate
FTC Approaches to Address AI-Enabled Voice Cloning Plain-English summary of the Voice Cloning Challenge winners and Impersonation Rule. 🟢 Beginner
FTC Final Impersonation Rule (Feb 2024) Direct source on U.S. impersonation-fraud rules covering AI deepfakes. 🟢 Beginner
Pindrop 2025 Voice Intelligence & Security Report Industry report documenting a 1,300% rise in deepfake fraud attempts. 🟢 Beginner
Voice Cloning Ethics (CAMB.AI) Practical overview of consent frameworks, ELVIS Act, and EU AI Act. 🟢 Beginner
NCLC Top Six TCPA/Robocall Developments 2024/2025 Consumer-protection lens on what’s actually being enforced. 🟡 Intermediate

17. Blogs and newsletters

Subscribe to two or three to stay current; the field moves quickly.

LiveKit Blog Engineering deep-dives on WebRTC, agents framework releases, and production patterns.
Deepgram Learn Tutorials on STT/TTS, voice agent design, evals, and pipeline architecture.
Cartesia Blog State-space TTS models, Sonic releases, and yearly “State of Voice AI” reports.
ElevenLabs Blog Product and research announcements with implementation notes.
Daily.co Blog (Pipecat) Posts from Pipecat’s maintainers covering scaling and feature releases.
Voice AI & Voice Agents Illustrated Primer Free, regularly-updated long-form primer.
Latent Space (swyx & Alessio) AI Engineer newsletter and podcast with frequent voice-AI episodes.
Voice AI Newsletter (Krisp) “Future of Voice AI” interview series with founders; published weekly in 2025.
Voice AI Weekly (Vapi) Weekly Substack rounding up news, products, and tools.
Voicebot.ai (Synthedia) Long-running daily news and paid newsletter on industry trends.

18. Podcasts

The Voicebot Podcast (Bret Kinsella) Longest-running serious voice-tech podcast; weekly founder interviews.
Latent Space The AI Engineer Podcast Top US tech podcast; regularly covers Realtime API, Pipecat, Voxtral, Gemini Live.
The Future of Voice AI (Krisp) Weekly founder interviews focused on enterprise voice AI architecture.
TWIML AI Podcast voice episodes Strong technical interviews; the Kwin Kramer episode is a great starting point.
This Week In Voice (Project Voice) News-roundtable format covering conversational AI.

19. Communities

LiveKit Community Slack Direct access to maintainers and other agent builders.
Pipecat Discord Active community with weekly office hours; invite link from the homepage.
HuggingFace Discord #ml-for-audio-and-speech 200k-member server with strong audio/speech channels.
Vapi Discord Builder community for Vapi voice agents; invite from the homepage.
Retell AI Discord Discord for Retell developers building phone-call voice agents.
ElevenLabs Discord Large TTS, voice cloning, and Conversational AI community with daily help threads.
Deepgram Discord STT/TTS/Voice Agent API support and build-with-us threads.
Reddit r/LocalLLaMA Active threads on local Whisper/Parakeet, on-device TTS, and end-to-end voice stacks.
Reddit r/AI_Agents General AI-agent community where voice topics surface frequently.

20. Conferences and events

AI Engineer World’s Fair Biggest AI-engineering conference; the Voice track has hosted major launches from ElevenLabs, Vapi, LiveKit, Pipecat, and Cartesia. 🟢 Beginner
AI Engineer YouTube channel All World’s Fair and Summit talks are posted free; the best library of recent voice-AI talks. 🟢 Beginner
AI Engineer Summit Online Voice playlist Curated playlist including voice-track sessions from leading labs. 🟢 Beginner
AIEWF 2025 Recap (Latent Space) Written deep-dive into 2025’s voice-track talks and major launches. 🟢 Beginner
VOICE & AI (Modev) Long-running voice technology conference with broader CX and voicebot focus. 🟢 Beginner
Project Voice Main U.S. event for conversational AI across voice, text, and chat. 🟢 Beginner
Interspeech Top academic speech-science conference; intimidating but worth knowing, since most landmark papers debut here. 🔴 Advanced

21. Hackathons and competitions

ElevenLabs Worldwide Hackathon Flagship global hackathon for conversational agents; 30+ cities and a $200K+ prize pool. 🟢 Beginner
ElevenHacks (weekly sprints) Weekly themed challenges with credits and prizes; low-pressure way to ship one project per week. 🟢 Beginner
AI Engineer World’s Fair Hackathon Co-located with the conference; $10K prizes judged by 3,000+ AI engineers, with a strong voice track. 🟡 Intermediate
lablab.ai AI Hackathons Continuous calendar of short online hackathons frequently sponsored by voice-AI vendors. 🟢 Beginner
Devpost Voice AI Hackathons Centralized search for active voice-AI hackathons; the best way to find what’s open right now. 🟢 Beginner

Suggested learning path

Week 1 Foundations: Read the LiveKit pipeline post and Voice AI Illustrated Primer (sections 1, 8).
Week 2 First agent: Finish the LiveKit or Pipecat quickstart end-to-end (sections 2, 10).
Week 3 Components: Swap STT, TTS, and LLM providers; benchmark latency (sections 3, 4, 5).
Week 4 Turn-taking, audio cleanup & telephony: Add Silero VAD, a turn detector, and a speech-enhancement pass; connect a SIP trunk (sections 6, 7, 9).
Week 5 Production: Add evaluation, observability, and read the FCC/EU AI Act material (sections 14, 15, 16).
Ongoing: Subscribe to two newsletters and join a voice AI community on LinkedIn (sections 17, 18, 19).

Contributing

Pull requests welcome. Resources must be active in the last 12 months, accessible to developers, and vendor-neutral or clearly labeled when authored by a commercial party. Open an issue to suggest additions or removals.

📜 License

MIT. Fork it, ship it.
