Credit goes to mahimairaja.github.io

voiceai



A curated, developer-friendly learning path for building real-time voice AI agents, from your first STT call to scaling production telephony.

Voice AI has moved from research demos into shipping products in under three years. The modern stack is converging on a clear pattern: a real-time transport layer (WebRTC or telephony), a streaming pipeline of speech-to-text → LLM → text-to-speech, and a turn-taking model that decides when the agent should speak. This list is structured to mirror that learning order: start with the foundations, pick a framework, then drill into individual components and production concerns.
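
To make the streaming pattern concrete, here is a minimal sketch of an STT → LLM → TTS chain built from Python async generators. Every stage (`fake_stt`, `fake_llm`, `fake_tts`, `microphone`) is a hypothetical stand-in that only passes strings around; no real provider or framework API is used, and real frameworks add transport, interruption handling, and turn-taking on top.

```python
import asyncio
from typing import AsyncIterator


async def microphone() -> AsyncIterator[bytes]:
    """Hypothetical audio source: each chunk stands in for a slice of speech."""
    for chunk in (b"book", b"a", b"table"):
        yield chunk


async def fake_stt(audio: AsyncIterator[bytes]) -> AsyncIterator[str]:
    """Stand-in STT: emits one 'word' per audio chunk as it arrives."""
    async for chunk in audio:
        yield chunk.decode("utf-8")


async def fake_llm(words: AsyncIterator[str]) -> AsyncIterator[str]:
    """Stand-in LLM: waits for the whole user turn, then streams reply tokens."""
    user_turn = " ".join([word async for word in words])
    for token in f"You said: {user_turn}".split():
        await asyncio.sleep(0)  # hand control back, as a real token stream would
        yield token


async def fake_tts(tokens: AsyncIterator[str]) -> AsyncIterator[bytes]:
    """Stand-in TTS: produces an 'audio frame' for each token immediately."""
    async for token in tokens:
        yield token.encode("utf-8")


async def run_pipeline() -> list[bytes]:
    """Chain the stages so each one consumes the previous stage's stream."""
    frames = []
    async for frame in fake_tts(fake_llm(fake_stt(microphone()))):
        frames.append(frame)
    return frames


if __name__ == "__main__":
    print(asyncio.run(run_pipeline()))
```

The point of the generator chain is that the TTS stage starts working on the first LLM token instead of waiting for the full reply, which is where most of the perceived latency savings come from.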

Resources are tagged 🟢 Beginner, 🟡 Intermediate, or 🔴 Advanced. The list prefers free official docs and vendor-neutral guides, and flags where authors have commercial interests.


How to use this list

Read top-to-bottom if you're brand new. The recommended path:

  1. Foundations → understand the pipeline and latency budget
  2. Frameworks → pick one (LiveKit Agents or Pipecat are the safest open-source bets) and ship a hello-world
  3. Components (STT, TTS, LLM, VAD, turn detection) → swap pieces to learn what each layer does
  4. Transport & telephony → connect to a real phone number
  5. Evaluation, production, ethics → make it safe enough to ship


📘 Companion book: Voice Agents Handbook

If you want this material in a tighter, opinionated, production-grade form, I wrote the Voice Agents Handbook: building production voice AI with LiveKit, plus appendices on choosing your stack and the LiveKit ecosystem beyond agents. Ships June 1, 2026 on Kindle.

The README you're reading collects the field's best free resources. The book is the curated path through them, with the patterns I've used shipping voice agents for tradespeople, lawyers, and immigration consultants.

Disclosure: I maintain this repo and authored the handbook. Free sample (Introduction + Chapter 1) at handbook.mahimai.ca.


Table of contents

  1. Foundational concepts and learning paths
  2. Frameworks and orchestration platforms
  3. Speech-to-text (STT / ASR)
  4. Text-to-speech (TTS)
  5. LLMs for voice and real-time AI
  6. Voice activity detection and turn-taking
  7. Audio enhancement and noise suppression
  8. WebRTC fundamentals
  9. Telephony and SIP
  10. Tutorials and hands-on projects
  11. GitHub starter repos and awesome lists
  12. Datasets and benchmarks
  13. Beginner-accessible research papers
  14. Evaluation and testing
  15. Production, deployment, and scaling
  16. Ethics, safety, and regulation
  17. Blogs and newsletters
  18. Podcasts
  19. Communities
  20. Conferences and events
  21. Hackathons and competitions

1. Foundational concepts and learning paths

Start here. These resources establish the mental model of the voice agent pipeline and the latency budget you'll fight for the rest of your career.
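
A latency budget is just addition: each stage contributes a delay, and the sum is what the caller hears as the pause before the agent answers. The sketch below shows that arithmetic. All stage names and numbers are placeholder assumptions for illustration (the 800 ms target in particular is hypothetical); measure your own stack before drawing conclusions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    latency_ms: float


# Placeholder figures for illustration only; they are not measurements.
STAGES = [
    Stage("network transport (round trip)", 50),
    Stage("end-of-turn detection", 200),
    Stage("STT final transcript", 150),
    Stage("LLM time to first token", 300),
    Stage("TTS time to first byte", 200),
]


def total_latency_ms(stages):
    """Sum of all stage latencies: the pause the user actually perceives."""
    return sum(stage.latency_ms for stage in stages)


def headroom_ms(stages, budget_ms):
    """Positive means under budget; negative means over budget."""
    return budget_ms - total_latency_ms(stages)


def slowest(stages):
    """The stage to optimize first."""
    return max(stages, key=lambda stage: stage.latency_ms)


if __name__ == "__main__":
    budget_ms = 800  # hypothetical target, not a standard
    print(f"total: {total_latency_ms(STAGES)} ms")
    print(f"headroom vs {budget_ms} ms budget: {headroom_ms(STAGES, budget_ms)} ms")
    print(f"slowest stage: {slowest(STAGES).name}")
```

With these placeholder numbers the pipeline is 100 ms over the assumed budget, and the LLM's first token is the largest single contributor.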

2. Frameworks and orchestration platforms

The frameworks below all let you wire STT, an LLM, and TTS together. For open-source production work, LiveKit Agents and Pipecat are the two safest bets; for managed dashboards, Vapi, Retell, and Bland win on time-to-first-call.

Open-source frameworks

Managed platforms

Realtime / speech-to-speech APIs

Vendor-neutral comparisons

3. Speech-to-text (STT / ASR)

Pick one streaming STT provider and learn it deeply before shopping around. Deepgram, AssemblyAI, and Whisper derivatives cover most use cases.

Commercial APIs

Open source

Benchmarks and explainers

4. Text-to-speech (TTS)

Latency, not raw quality, is what kills voice agents. Prioritize providers offering true streaming with first-byte latency under 200 ms.

Commercial APIs

Open source

Streaming and ethics

5. LLMs for voice and real-time AI

A voice agent's perceived intelligence is bounded by how fast the LLM streams its first token. A time to first token (TTFT) under 300 ms changes the conversation feel entirely.
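
Time to first token is easy to measure for any provider that exposes its reply as an iterator of text chunks. The sketch below times the gap between starting to consume the stream and receiving the first chunk. `slow_stream` is a hypothetical stand-in for a real streaming client; in a real measurement, start the clock before the request is sent.

```python
import time
from typing import Iterable, Iterator


def measure_ttft(token_stream: Iterable[str]) -> tuple[float, str]:
    """Return (time to first token in ms, full reply text)."""
    start = time.perf_counter()
    iterator = iter(token_stream)
    first = next(iterator)  # blocks until the first token arrives
    ttft_ms = (time.perf_counter() - start) * 1000
    rest = "".join(iterator)  # drain the remaining tokens
    return ttft_ms, first + rest


def slow_stream(first_token_delay_s: float = 0.05) -> Iterator[str]:
    """Hypothetical stand-in for an LLM stream with a fixed startup delay."""
    time.sleep(first_token_delay_s)
    yield "Hello"
    yield ", "
    yield "world"


if __name__ == "__main__":
    ttft_ms, text = measure_ttft(slow_stream())
    print(f"TTFT: {ttft_ms:.0f} ms, reply: {text!r}")
```

Run this many times and look at the median and the tail rather than a single sample, since TTFT varies with load and prompt length.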

Low-latency inference

Speech-to-speech models

Voice-specific prompting and tools

6. Voice activity detection and turn-taking

Pure VAD is no longer enough: modern agents combine acoustic VAD with a small semantic model that predicts end-of-utterance from words and prosody.
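
One way to picture the combination is a two-timeout rule: end the turn quickly when the words look complete, and wait longer when they do not. The sketch below is an illustration under heavy assumptions, not any framework's algorithm. The energy threshold stands in for a real acoustic VAD, `toy_eou_probability` is a hypothetical word-list heuristic standing in for a trained semantic model (it ignores prosody entirely), and every threshold is a placeholder.

```python
import math

# Words that suggest the speaker is mid-thought (toy list, an assumption).
DANGLING = {"and", "but", "so", "because", "um", "uh", "the", "to"}


def is_speech(frame: list[float], threshold: float = 0.02) -> bool:
    """Energy-based stand-in for an acoustic VAD: RMS above a threshold."""
    if not frame:
        return False
    rms = math.sqrt(sum(x * x for x in frame) / len(frame))
    return rms >= threshold


def trailing_silence_ms(frames: list[list[float]], frame_ms: int = 20) -> int:
    """Duration of the silent run at the end of the audio."""
    silent = 0
    for frame in reversed(frames):
        if is_speech(frame):
            break
        silent += 1
    return silent * frame_ms


def toy_eou_probability(transcript: str) -> float:
    """Toy stand-in for a semantic end-of-utterance model."""
    words = transcript.lower().split()
    if not words:
        return 0.0
    return 0.1 if words[-1].strip(".,?!") in DANGLING else 0.9


def should_end_turn(silence_ms: int, eou_prob: float,
                    short_ms: int = 300, long_ms: int = 1500,
                    min_prob: float = 0.5) -> bool:
    """End the turn after a long silence, or a short one if the words look done."""
    if silence_ms >= long_ms:
        return True
    return silence_ms >= short_ms and eou_prob >= min_prob


if __name__ == "__main__":
    audio = [[0.5] * 160] * 10 + [[0.0] * 160] * 20  # speech, then 400 ms silence
    silence = trailing_silence_ms(audio)
    for text in ("I want to book a table", "I want to book a table and"):
        print(text, "->", should_end_turn(silence, toy_eou_probability(text)))
```

The same 400 ms pause ends the turn after a complete sentence but not after a trailing "and", which is the behavior pure silence-based VAD cannot provide.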

7. Audio enhancement and noise suppression

The audio reaching your VAD and STT is often noisy, reverberant, or mixed with background voices. Cleaning the signal before the rest of the pipeline is frequently the difference between an agent that ships and one that frustrates users in real-world conditions (cars, cafés, call centres).

8. WebRTC fundamentals

WebRTC is the default transport for voice agents that don't run over the phone network. Understanding ICE, STUN, TURN, and SFU architecture is non-negotiable for production work.

9. Telephony and SIP

The phone network has its own physics. Once you know which SIP trunk provider to point at LiveKit or Pipecat, you can ship.

10. Tutorials and hands-on projects

Pick one tutorial and finish it before starting another. Voice AI is unforgiving of half-built pipelines.

11. GitHub starter repos and awesome lists

Clone these instead of writing boilerplate from scratch.

12. Datasets and benchmarks

You'll rarely train from scratch, but knowing which dataset a model was trained on explains its accents, languages, and failure modes.

13. Beginner-accessible research papers

These are the landmark papers behind the models you'll actually use. Read the Whisper and Common Voice papers first; they're unusually approachable.

14. Evaluation and testing

You can't ship what you can't measure. Voice-agent evaluation is fundamentally probabilistic: the same test can pass on one run and fail on the next, so simulation and statistics matter more than fixed test cases.
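
In practice this means running each scenario many times and reasoning about a pass rate with an uncertainty range, rather than a single pass or fail. The sketch below computes a Wilson score interval for a pass rate. `flaky_scenario` is a hypothetical stand-in for a simulated call, and the 85% underlying pass rate is an arbitrary assumption.

```python
import math
import random


def wilson_interval(passes: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 gives roughly 95%)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = passes / runs
    denom = 1 + z * z / runs
    centre = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return centre - half, centre + half


def pass_rate(scenario, runs: int) -> tuple[int, tuple[float, float]]:
    """Run a scenario repeatedly; return pass count and its Wilson interval."""
    passes = sum(1 for _ in range(runs) if scenario())
    return passes, wilson_interval(passes, runs)


def flaky_scenario(rng: random.Random = random.Random(0)) -> bool:
    """Hypothetical simulated call that passes about 85% of the time."""
    return rng.random() < 0.85


if __name__ == "__main__":
    passes, (low, high) = pass_rate(flaky_scenario, runs=50)
    print(f"{passes}/50 passed, 95% interval: {low:.2f} to {high:.2f}")
```

A release gate can then compare the lower bound of the interval to a required pass rate, so that a lucky run of 10 out of 10 does not count as proof of reliability.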

15. Production, deployment, and scaling

Real production voice infrastructure is the hardest unsolved problem in this space. Read these before quoting anyone a per-minute price.

16. Ethics, safety, and regulation

If you're shipping a voice agent in 2026, disclosure and consent are no longer optional. The FCC and EU AI Act both have teeth.

17. Blogs and newsletters

Subscribe to two or three to stay current; the field moves quickly.

18. Podcasts

19. Communities

20. Conferences and events

21. Hackathons and competitions


Suggested learning path

  1. Week 1 (Foundations): Read the LiveKit pipeline post and Voice AI Illustrated Primer (sections 1, 8).
  2. Week 2 (First agent): Finish the LiveKit or Pipecat quickstart end-to-end (sections 2, 10).
  3. Week 3 (Components): Swap STT, TTS, and LLM providers; benchmark latency (sections 3, 4, 5).
  4. Week 4 (Turn-taking, audio cleanup & telephony): Add Silero VAD, a turn detector, and a speech-enhancement pass; connect a SIP trunk (sections 6, 7, 9).
  5. Week 5 (Production): Add evaluation and observability, and read the FCC/EU AI Act material (sections 14, 15, 16).
  6. Ongoing: Subscribe to two newsletters and join a voice AI community on LinkedIn (sections 17, 18, 19).

Contributing

Pull requests welcome. Resources must be active in the last 12 months, accessible to developers, and vendor-neutral or clearly labeled when authored by a commercial party. Open an issue to suggest additions or removals.

⭐ Stargazers and contributors


📜 License

MIT. Fork it, ship it.