Podcast about Search

GitHub repository for podaboutlist.vercel.app, a semantic search engine for transcripts of the podcast "Podcast about List" by Cameron Fetter, Patrick Doran, and Caleb Pitts.

Even though the scripts and files in this repo are specific to this project, the general workflow could easily be adapted to achieve the same result for any podcast (or similar innovations in the audiovisual space):

Generate transcripts (for example using WhisperX) or, if applicable, fetch auto-generated subtitles from YouTube
Generate vector embeddings from the transcripts (you will need to separate the text into chunks first)
Insert the embeddings along with some metadata (episode title, timestamp) into a vector database (for example Pinecone)
Create a back-end API that receives search requests, generates embeddings for them, and queries the database with those embeddings
Create a front-end that lets the user send text to the back-end and displays the results
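
The chunk, embed, store, and query steps can be sketched end to end in plain Python. This is only an illustration under stated assumptions, not the code used in this repo: `toy_embed` is a bag-of-words stand-in for a real embedding API, the in-memory list stands in for a vector database, and all function names, field names, and sample segments are made up for the example.

```python
import math
import re
from collections import Counter


def chunk_segments(segments, max_words=40):
    """Group consecutive (start_seconds, text) segments into chunks.

    A chunk is closed before it would exceed max_words; a single
    oversized segment still becomes its own chunk.
    """
    chunks, words, start = [], [], None
    for seg_start, text in segments:
        seg_words = text.split()
        if words and len(words) + len(seg_words) > max_words:
            chunks.append({"start": start, "text": " ".join(words)})
            words, start = [], None
        if start is None:
            start = seg_start  # timestamp of the first segment in the chunk
        words.extend(seg_words)
    if words:
        chunks.append({"start": start, "text": " ".join(words)})
    return chunks


def toy_embed(text):
    """ASSUMPTION: stand-in for a real embedding model (sparse word counts)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(episode_title, segments, max_words=40):
    """ASSUMPTION: a plain list stands in for the vector database."""
    return [
        {"episode": episode_title, "start": c["start"], "text": c["text"],
         "vector": toy_embed(c["text"])}
        for c in chunk_segments(segments, max_words)
    ]


def search(index, query, top_k=3):
    """Embed the query and return the top_k most similar chunks."""
    q = toy_embed(query)
    ranked = sorted(index, key=lambda r: cosine(q, r["vector"]), reverse=True)
    return ranked[:top_k]


# Made-up transcript segments: (start time in seconds, text)
segments = [
    (0.0, "welcome to the show today we rank breakfast cereals"),
    (12.5, "my number one cereal is definitely the chocolate one"),
    (30.0, "next week we will talk about roller coasters"),
]
index = build_index("Episode 1", segments, max_words=10)
best = search(index, "roller coasters", top_k=1)[0]
print(best["episode"], best["start"], best["text"])
```

Storing the start timestamp of each chunk as metadata is what lets a search result point back to the right moment in an episode. With a real embedding model the query does not need to share exact words with the transcript, which the toy word-count vectors cannot do.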

Amazingly, all these steps can be achieved for free (if you own hardware that can handle the first step):

VoyageAI offers an embedding API with 50 million free tokens
Pinecone offers a free tier that can host up to 100,000 vectors (the roughly 450 transcripts of 60-minute episodes amount to only 80,000 vectors)
PythonAnywhere offers free hosting for Python web applications
For the front-end there are many options, since even a statically served page could do the job (I used Vercel because I wanted to learn Next.js and React)
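
To judge whether another podcast would fit into the same free vector limit, the figures above can be turned into a quick back-of-the-envelope calculation. The inputs are the approximate numbers quoted in this readme. The derived chunk length is an average implied by those numbers, not a setting taken from the scripts.

```python
# Approximate figures quoted in the readme
episodes = 450
minutes_per_episode = 60
vectors_used = 80_000
free_vector_limit = 100_000

# Average number of vectors (chunks) per episode
vectors_per_episode = vectors_used / episodes

# Average stretch of audio covered by one vector, in seconds
seconds_per_vector = minutes_per_episode * 60 / vectors_per_episode

# How many more episodes of the same length would still fit
remaining_episodes = int((free_vector_limit - vectors_used) / vectors_per_episode)

print(f"{vectors_per_episode:.1f} vectors per episode")
print(f"{seconds_per_vector:.2f} seconds of audio per vector")
print(f"room for about {remaining_episodes} more episodes")
```

So each vector covers about 20 seconds of audio on average, and the remaining 20,000 vectors leave room for roughly another 110 hour-long episodes at the same chunk size.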

Folders and files in this repository:

backend
frontend/podaboutlist
transcripts_embeddings
transcripts_metadata
transcripts_raw
transcripts_raw_json
.gitignore
convert_srt_json.py
generate_transcript_dictionary.py
pinecone_integration.py
process_transcripts.py
readme.md
transcript_list.json
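
The file name convert_srt_json.py suggests that subtitle files in SRT format are converted to JSON before further processing. The script itself is not shown here, so the following is only a sketch of what such a conversion might look like. The function names and the output fields (`start`, `end`, `text`) are assumptions, not the repo's actual schema.

```python
import json
import re

# SRT timestamps look like 00:01:02,250 (hours:minutes:seconds,milliseconds)
TIMESTAMP = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def to_seconds(stamp):
    """Convert an SRT timestamp string to seconds as a float."""
    h, m, s, ms = map(int, TIMESTAMP.fullmatch(stamp.strip()).groups())
    return h * 3600 + m * 60 + s + ms / 1000


def parse_srt(srt_text):
    """Parse SRT text into a list of {start, end, text} dictionaries."""
    entries = []
    # Subtitle blocks are separated by blank lines
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        # Expected layout: index line, timing line, one or more text lines
        if len(lines) < 2 or "-->" not in lines[1]:
            continue
        start, end = lines[1].split("-->")
        entries.append({
            "start": to_seconds(start),
            "end": to_seconds(end),
            "text": " ".join(line.strip() for line in lines[2:]),
        })
    return entries


# Made-up sample input
sample = """1
00:00:01,000 --> 00:00:03,500
Welcome to the show.

2
00:01:02,250 --> 00:01:05,000
Today we rank
breakfast cereals.
"""

entries = parse_srt(sample)
print(json.dumps(entries, indent=2))
```

Converting timestamps to plain seconds keeps the later steps simple, since a chunk's start time can be stored directly as metadata next to its embedding.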


ryxxxx/podaboutsearch
