A Multi-Modal, Multi-Label, Multilingual Prompt Database
Explore the documentation of this project »

View Demo (Demo and Subjective Test) · Report Bug · Make a Suggestion
This README.md is intended for developers.
- Getting Started Guide
- File Directory Description
- Dataset Construction
- Dataset Usage
- How to Contribute to the Open Source Project
- Version Control
- Contact
- License
- Acknowledgements
## Getting Started Guide

Because the models in this study have very different environment requirements, in practice we run each model in its own separate environment; the models collaborate with one another through API calls (a minimal sketch of such a call follows the setup steps below). The configuration method for each model's environment is documented separately in its respective folder.
- Get a free API Key at https://chatgpt.com/
- Clone the repo

  ```sh
  git clone https://github.com/hizening/M3PDB.git
  ```

- Different systems require different environments. Please refer to the `readme.md` of each subsystem for configuration.
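As a minimal sketch of the API-based collaboration described above, the snippet below sends a prompt to a hosted model over HTTP. The endpoint, the `gpt-4o` model name, and the `OPENAI_API_KEY` environment variable are illustrative assumptions, not settings prescribed by this repository.

```python
# Sketch: one component asking a hosted model for an annotation.
# Endpoint, model name, and env-var name are illustrative assumptions.
import os
import requests

def ask_model(prompt: str) -> str:
    resp = requests.post(
        "https://api.openai.com/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        json={"model": "gpt-4o",
              "messages": [{"role": "user", "content": prompt}]},
        timeout=60,
    )
    resp.raise_for_status()  # surface HTTP errors instead of bad parses
    return resp.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(ask_model("Summarize the speaking style of this transcript: ..."))
```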
## File Directory Description

```
├── annotation_system/
│   ├── Qwen2-Audio/
│   ├── SenseVoice/
│   ├── emotion2vec/
│   ├── llmware/
│   └── readme.md
├── latency_aware_online_system/
│   ├── latency_aware_online_selection.py
│   └── readme.md
├── multi-model_prompt_registration/
│   ├── facetts/
│   ├── f2s.py
│   ├── s2s.py
│   ├── t2s.py
│   └── readme.md
├── multimodal_data_preprocessing/
│   ├── 3D-Speaker/
│   ├── speech/
│   ├── video/
│   └── readme.md
└── unseen_language_annotation/
    ├── lang_prob_confirm/
    ├── selection/
    └── readme.md
```
## Dataset Construction

### Multimodal Data Preprocessing

1. Run the command below to separate the audio and video streams.

   ```sh
   python multimodal_data_preprocessing/video/split_media.py
   ```

2. Run the command below to standardize the speech format.

   ```sh
   python multimodal_data_preprocessing/speech/format_standardization.py
   ```

3. Run the command below to standardize the video format.

   ```sh
   python multimodal_data_preprocessing/video/format_standardization.py
   ```

4. Run the command below to perform speech enhancement.

   ```sh
   python multimodal_data_preprocessing/speech/speech_enhancement.py
   ```

5. Run the command below to perform video quality enhancement.

   ```sh
   python multimodal_data_preprocessing/video/VideoSuperResolution/Train/eval.py
   ```

6. Run the commands below to perform multimodal speaker diarization and VAD.

   ```sh
   cd multimodal_data_preprocessing/3D-Speaker/egs/3dspeaker/speaker-diarization/
   bash run_audio.sh
   bash run_video.sh
   ```

…
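To run the whole preprocessing chain end to end, a wrapper along the lines of the sketch below simply replays steps 1-6 in order. The script paths are taken from the commands above; treat it as an illustration, not a maintained entry point.

```python
# Sketch: run preprocessing steps 1-6 in sequence, stopping on failure.
import subprocess

STEPS = [
    ["python", "multimodal_data_preprocessing/video/split_media.py"],
    ["python", "multimodal_data_preprocessing/speech/format_standardization.py"],
    ["python", "multimodal_data_preprocessing/video/format_standardization.py"],
    ["python", "multimodal_data_preprocessing/speech/speech_enhancement.py"],
    ["python", "multimodal_data_preprocessing/video/VideoSuperResolution/Train/eval.py"],
]
for cmd in STEPS:
    subprocess.run(cmd, check=True)  # check=True aborts on a non-zero exit

# Step 6 (speaker diarization + VAD) expects to run from its own directory.
DIAR_DIR = "multimodal_data_preprocessing/3D-Speaker/egs/3dspeaker/speaker-diarization"
subprocess.run(["bash", "run_audio.sh"], check=True, cwd=DIAR_DIR)
subprocess.run(["bash", "run_video.sh"], check=True, cwd=DIAR_DIR)
```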
For more detailed information, please read `/multimodal_data_preprocessing/readme.md`.
### Annotation System

For more detailed information, please read `/annotation_system/readme.md`.
### Unseen Language Annotation

1. Run the command below to generate speech.

   ```sh
   python unseen_language_annotation/lang_prob_confirm/tts/tts.py
   ```

2. Run the command below to evaluate the quality of the synthesized speech.

   ```sh
   python dnsmos_local.py -t C:\temp\SampleClips -o sample.csv
   ```

…
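The two steps pair naturally: synthesize, score, then keep only the clips whose estimated quality clears a threshold. The sketch below assumes `sample.csv` has `filename` and `OVRL` (overall score) columns and uses an arbitrary 3.0 cut-off; check both against the CSV DNSMOS actually produces on your machine.

```python
# Sketch: keep only synthesized clips with a high enough DNSMOS score.
# Column names and the threshold are assumptions; adjust to your output.
import csv

THRESHOLD = 3.0  # illustrative cut-off on the overall MOS estimate

with open("sample.csv", newline="") as f:
    rows = list(csv.DictReader(f))

kept = [r["filename"] for r in rows if float(r["OVRL"]) >= THRESHOLD]
print(f"kept {len(kept)} of {len(rows)} clips")
```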
For more detailed information, please read `/unseen_language_annotation/readme.md`.
### Multi-Modal Prompt Registration

1. Run the command below to match and register speech similar to the registered speech.

   ```sh
   python multi-model_prompt_registration/s2s.py
   ```

2. Run the command below to generate face-based reference speech from the registered face.

   ```sh
   python multi-model_prompt_registration/facetts/inference.py
   ```

3. Run the command below to match and register speech similar to the registered face.

   ```sh
   python multi-model_prompt_registration/f2s.py
   ```

4. Run the command below to match and register speech similar to the registered text.

   ```sh
   python multi-model_prompt_registration/t2s.py
   ```

…
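Conceptually, the three matching modes (`s2s.py`, `f2s.py`, `t2s.py`) reduce to a nearest-neighbour search over embeddings of the registered speech; the sketch below shows that core step with cosine similarity. The embedding dimensionality, the dictionary database, and the random query are placeholder assumptions, not the repo's actual interfaces.

```python
# Sketch: register the stored speech whose embedding is closest to the
# query embedding (the query may come from speech, a face, or text).
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # cosine similarity between two embedding vectors
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def best_match(query_emb: np.ndarray, speech_db: dict) -> str:
    # speech_db maps an utterance id to its speech embedding
    return max(speech_db, key=lambda k: cosine(query_emb, speech_db[k]))

# Toy usage with random placeholder embeddings (192-dim is illustrative).
rng = np.random.default_rng(0)
db = {f"utt_{i}": rng.normal(size=192) for i in range(5)}
print(best_match(rng.normal(size=192), db))
```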
For more detailed information, please read `/multi-model_prompt_registration/readme.md`.
## Dataset Usage

1. Run the command below to dynamically find the most suitable speech.

   ```sh
   python latency_aware_online_system/latency_aware_online_selection.py
   ```

…
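At a high level, latency-aware selection trades prompt quality against a response-time budget. The sketch below captures only that idea; the scores, latency estimates, and budget are made up for illustration and do not reproduce the logic of `latency_aware_online_selection.py`.

```python
# Sketch: pick the highest-scoring prompt whose estimated latency
# still fits the time budget; fall back to the fastest one otherwise.
def select_prompt(candidates, budget_ms):
    """candidates: list of (prompt_id, quality_score, est_latency_ms)."""
    feasible = [c for c in candidates if c[2] <= budget_ms]
    if not feasible:
        return min(candidates, key=lambda c: c[2])[0]
    return max(feasible, key=lambda c: c[1])[0]

# Illustrative values only.
print(select_prompt([("p1", 0.9, 120), ("p2", 0.7, 40)], budget_ms=80))
```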
For more detailed information, please read `/latency_aware_online_system/readme.md`.
## How to Contribute to the Open Source Project

Contributions make the open-source community an excellent place for learning, inspiration, and creation. Any contribution you make is greatly appreciated.
- Fork the Project
- Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
- Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the Branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
## Version Control

This project uses Git for version control. You can check the currently available versions in the repository.
## Contact

If you have any comments or questions about M3PDB, please contact us by

- email: [email protected]
## License

M3PDB is released under the CC BY-NC-4.0 license.
## Acknowledgements

M3PDB contains third-party components and code modified from some open-source repos, including:

- datasets: Emilia Dataset, voxceleb, voxpopuli
- code: 3D-Speaker, Side-Profile-Detection, SenseVoice, emotion2vec, seamless_communication, CosyVoice, whisper, Imaginary Voice, gpt-4o, deepface, OSUM, XTTS-v2