Multi-ASR Toolkit is a flexible and extensible speech recognition toolkit supporting multiple backend engines such as Whisper, Faster-Whisper, WhisperX, SpeechRecognition, and Vosk. It provides both a command-line interface and a web-based interface via Gradio, facilitating easy transcription of audio files using various ASR models.
- Python 3.10 or higher (required by Gradio)
- pygame
- pydub
- ffmpeg: converts mp3 files to wav
- PyTorch 2.1+, TensorFlow 2.6+
- transformers
- SpeechRecognition
- Whisper
- Faster Whisper
- MLX Whisper
- demucs
- yt-dlp
- Python packages
  $ pip3 install -r requirements.txt
- ffmpeg
  # Ubuntu
  $ sudo apt install ffmpeg
  # Mac
  $ brew install ffmpeg
  For Windows, you can refer to this website: ffmpeg install
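Because ffmpeg handles the mp3-to-wav conversion, it can help to confirm that it is on the PATH before running the toolkit. The following is a minimal sketch and not part of the toolkit itself: the helper names `ffmpeg_available`, `build_convert_command`, and `convert_to_wav` are hypothetical, and only the standard ffmpeg flags `-y` (overwrite output) and `-i` (input file) are used.

```python
import shutil
import subprocess
from pathlib import Path


def ffmpeg_available() -> bool:
    """Return True if an ffmpeg executable is found on the PATH."""
    return shutil.which("ffmpeg") is not None


def build_convert_command(src: str, dst: str) -> list[str]:
    """Build the ffmpeg argument list that converts src to dst.

    -y overwrites dst if it already exists; ffmpeg infers the output
    format from the destination file extension.
    """
    return ["ffmpeg", "-y", "-i", src, dst]


def convert_to_wav(src: str) -> str:
    """Convert an audio file to wav next to the source and return the new path."""
    if not ffmpeg_available():
        raise RuntimeError("ffmpeg not found on PATH; install it first")
    dst = str(Path(src).with_suffix(".wav"))
    # check=True raises CalledProcessError if ffmpeg exits with an error.
    subprocess.run(build_convert_command(src, dst), check=True)
    return dst


if __name__ == "__main__":
    print("ffmpeg found:", ffmpeg_available())
```

Calling `convert_to_wav("data/test.mp3")` would write `data/test.wav` alongside the source, assuming ffmpeg is installed and the input file exists.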
# python app.py --mode cli <wav/mp3 file>
$ python app.py --mode cli data/test.mp3
# python app.py --mode cli <wav/mp3 file> --backend <asr backend> --language <language> --model-size <model size>
$ python app.py --mode cli data/test.mp3 --backend faster-whisper --language en --model-size base
$ python3 app.py
- Open a new incognito/private window and log in to your YouTube account.
- In the same tab, open https://www.youtube.com/robots.txt, ensuring that only this tab is using the login session.
- Use a browser extension (e.g., "cookies.txt" for Chrome) to export the youtube.com cookies for this session to a file named cookies.txt, and then immediately close the incognito window.
- Use the manually exported cookies.txt in Python.
  Place your cookies.txt at a fixed path, then specify it in ydl_opts like this:

      ydl_opts = {
          'format': 'bestvideo[ext=mp4]+bestaudio[ext=m4a]/mp4',
          'outtmpl': filepath,
          'verbose': True,  # for debugging; can be removed in production
          'merge_output_format': 'mp4',
          'cookiefile': '<Path>/cookies.txt',  # yt-dlp's Python API reads cookies via 'cookiefile'
      }

  Reference: yt-dlp/wiki/Extractors
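Putting the cookie steps together, a download call might look like the sketch below. It assumes yt-dlp is installed; `build_ydl_opts` and `download` are hypothetical helper names, and the output path and cookie path are placeholders. The cookies file is passed through yt-dlp's `cookiefile` option.

```python
from pathlib import Path


def build_ydl_opts(filepath: str, cookie_path: str, debug: bool = False) -> dict:
    """Assemble yt-dlp options that reuse a manually exported cookies.txt."""
    return {
        "format": "bestvideo[ext=mp4]+bestaudio[ext=m4a]/mp4",
        "outtmpl": filepath,
        "verbose": debug,  # enable only while debugging
        "merge_output_format": "mp4",
        "cookiefile": cookie_path,
    }


def download(url: str, filepath: str, cookie_path: str) -> None:
    """Download one video with yt-dlp using the exported cookies."""
    if not Path(cookie_path).is_file():
        raise FileNotFoundError(f"cookies file not found: {cookie_path}")
    # Imported lazily so build_ydl_opts can be used without yt-dlp installed.
    from yt_dlp import YoutubeDL

    with YoutubeDL(build_ydl_opts(filepath, cookie_path)) as ydl:
        ydl.download([url])


if __name__ == "__main__":
    opts = build_ydl_opts("downloads/video.mp4", "cookies.txt")
    print(opts["cookiefile"])
```

Checking that the cookies file exists before calling yt-dlp gives a clearer error than a failed login partway through the download.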
- I just copied the demucs folder from the demucs repo into the backends folder.