GitHub repository for podaboutlist.vercel.app, a semantic search engine for transcripts of the podcast "Podcast about List" by Cameron Fetter, Patrick Doran, and Caleb Pitts.
Even though the scripts and files in this repo are specific to this project, the general workflow could easily be adapted to achieve the same result for any podcast (or similar innovations in the audiovisual space):
- Generate transcripts (for example, using WhisperX) or, if applicable, fetch auto-generated subtitles from YouTube
- Generate vector embeddings from the transcripts (you will need to separate the text into chunks first)
- Insert the embeddings, along with some metadata (episode title, timestamp), into a vector database (for example, Pinecone)
- Create a back-end API that receives search requests, generates embeddings for them, and queries the database with those embeddings
- Create a front-end that allows the user to send text to the back-end and displays the results
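
As a rough illustration of the chunking step, here is a minimal sketch, not the code used in this repo. It assumes the transcript is already available as a list of `(start_seconds, text)` segments; the `Chunk` class, the `chunk_segments` function, and the `max_words` limit are hypothetical names chosen for this example. Each chunk keeps the episode title and the timestamp of its first segment, so both can be stored as metadata next to the embedding.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    episode: str   # episode title, stored as metadata
    start: float   # timestamp (seconds) of the first segment in the chunk
    text: str      # the text that would be sent to the embedding API


def chunk_segments(episode, segments, max_words=40):
    """Group consecutive (start_seconds, text) segments into chunks of
    at most roughly `max_words` words, keeping the first start time."""
    chunks = []
    buffer, buffer_start, word_count = [], None, 0
    for start, text in segments:
        words = len(text.split())
        # Close the current chunk if this segment would push it over the limit.
        if buffer and word_count + words > max_words:
            chunks.append(Chunk(episode, buffer_start, " ".join(buffer)))
            buffer, buffer_start, word_count = [], None, 0
        if buffer_start is None:
            buffer_start = start
        buffer.append(text.strip())
        word_count += words
    # Flush whatever is left after the last segment.
    if buffer:
        chunks.append(Chunk(episode, buffer_start, " ".join(buffer)))
    return chunks


# Made-up segments, only to show the shape of the data.
segments = [
    (0.0, "Welcome to the show."),
    (2.5, "Today we are ranking things."),
    (5.0, "Let us start with number ten."),
]
chunks = chunk_segments("Example episode", segments, max_words=9)
for chunk in chunks:
    print(f"{chunk.start:>6.1f}s  {chunk.text}")
```

Splitting on segment boundaries rather than at a fixed character count means every search result can link to a real timestamp in the episode. The right chunk size depends on the embedding model and on how precise the results should be.
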
Amazingly, all these steps can be achieved for free (if you own hardware that can handle the first step):
- VoyageAI offers an embedding API with 50 million free tokens
- Pinecone offers a free plan that can host up to 100,000 vectors (the ca. 450 transcripts of 60-minute episodes amount to only about 80,000 vectors)
- PythonAnywhere offers free hosting for Python web applications
- For the front-end, there are many options, since even a statically served page could do the job (I used Vercel because I wanted to learn Next.js and React)
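
To check whether another podcast would fit into the same free vector limit, the figures above can be turned into a quick estimate. This is only back-of-the-envelope arithmetic using the approximate numbers quoted in this README; it assumes every chunk covers roughly the same amount of audio, which is a simplification.

```python
episodes = 450             # "ca. 450 transcripts" from the text above
minutes_per_episode = 60   # approximate episode length
total_vectors = 80_000     # approximate vector count quoted above
free_vector_limit = 100_000

vectors_per_episode = total_vectors / episodes
seconds_per_vector = episodes * minutes_per_episode * 60 / total_vectors
remaining = free_vector_limit - total_vectors

print(f"{vectors_per_episode:.0f} vectors per episode")
print(f"{seconds_per_vector:.2f} seconds of audio per vector")
print(f"{remaining} vectors left before reaching the limit")
```

With these numbers, each vector corresponds to roughly 20 seconds of audio, so a podcast with longer episodes or finer chunks would reach the limit sooner.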