M3PDB

A Multi-Modal, Multi-Label, Multilingual Prompt Database
Explore the documentation of this project »

View Demo (Demo and Subjective Test) · Report Bug · Make a Suggestion

This README.md is intended for developers.

What's new 🔥

Table of Contents

Getting Started Guide

Development Configuration Requirements

The models used in this study have very different environment requirements, so in practice each model runs in its own dedicated environment, and the models collaborate through API calls. The environment setup for each model is documented separately in its respective folder.
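
As a rough illustration of this cross-environment collaboration (not the repo's actual interface), each model could be exposed as a local HTTP service and called from an orchestrator. The endpoint URL, port, and JSON fields below are hypothetical:

```python
# Hypothetical sketch: each model runs in its own environment behind a
# local HTTP service; an orchestrator calls it per file. The URL, port,
# and JSON fields are illustrative, not this repo's actual API.
import requests

def annotate(audio_path: str,
             service_url: str = "http://localhost:8000/annotate") -> dict:
    """Send one audio file path to a model service and return its labels."""
    resp = requests.post(service_url, json={"audio_path": audio_path}, timeout=60)
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    print(annotate("example.wav"))
```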

Installation Steps
  1. Get a free API Key at https://chatgpt.com/ (a quick sanity check is sketched after these steps)
  2. Clone the repo
git clone https://github.com/hizening/M3PDB.git
  3. Set up the environments. Different subsystems require different environments; please refer to the readme.md of each subsystem for configuration.
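
Subsystems that call the LLM API need the key at runtime. Assuming it is exposed through an environment variable (the name OPENAI_API_KEY is an assumption here; check each subsystem's readme.md for what it actually reads), a minimal check looks like this:

```python
# Verify the API key is visible before running the annotation scripts.
# The variable name OPENAI_API_KEY is an assumption; each subsystem's
# readme.md documents what it actually reads.
import os
import sys

key = os.environ.get("OPENAI_API_KEY")
if not key:
    sys.exit("API key not set: export OPENAI_API_KEY first.")
print(f"API key found ({len(key)} characters).")
```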

File Directory Description

filetree
├── annotation_system/
│   ├── Qwen2-Audio/
│   ├── SenseVoice/
│   ├── emotion2vec/
│   ├── llmware/
│   └── readme.md
├── latency_aware_online_system/
│   ├── latency_aware_online_selection.py
│   └── readme.md
├── multi-model_prompt_registration/
│   ├── facetts/
│   ├── f2s.py
│   ├── s2s.py
│   ├── t2s.py
│   └── readme.md
├── multimodal_data_preprocessing/
│   ├── 3D-Speaker/
│   ├── speech/
│   ├── video/
│   └── readme.md
└── unseen_language_annotation/
    ├── lang_prob_confirm/
    ├── selection/
    └── readme.md

Dataset Construction

Multimodal Data Preprocessing


1. Run the script below to separate the audio and video streams (a minimal stand-in for steps 1 and 2 is sketched after step 6).

python multimodal_data_preprocessing/video/split_media.py

2. Run the script below to standardize the speech format.

python multimodal_data_preprocessing/speech/format_standardization.py

3. Run the script below to standardize the video format.

python multimodal_data_preprocessing/video/format_standardization.py

4. Run the script below to perform speech enhancement.

python multimodal_data_preprocessing/speech/speech_enhancement.py

5. Run the script below to perform video quality enhancement (super-resolution).

python multimodal_data_preprocessing/video/VideoSuperResolution/Train/eval.py

6. Run the scripts below to perform multimodal speaker diarization and VAD.

cd multimodal_data_preprocessing/3D-Speaker/egs/3dspeaker/speaker-diarization/
bash run_audio.sh
bash run_video.sh
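
The repo's own scripts implement these steps; as a minimal stand-in for steps 1 and 2, assuming ffmpeg is on the PATH and that the target speech format is 16 kHz mono 16-bit WAV (an assumed convention, not this repo's documented setting):

```python
# Illustrative stand-in for audio-video separation (step 1) and speech
# format standardization (step 2). Requires ffmpeg on PATH; the 16 kHz
# mono 16-bit WAV target is an assumption, not taken from this repo.
import subprocess
from pathlib import Path

def split_media(video_path: str, out_dir: str) -> tuple[str, str]:
    """Demux one video into a silent video track and a separate audio track."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    stem = Path(video_path).stem
    video_only = str(out / f"{stem}_video.mp4")
    audio_only = str(out / f"{stem}_audio.wav")
    subprocess.run(["ffmpeg", "-y", "-i", video_path, "-an", "-c:v", "copy", video_only], check=True)
    subprocess.run(["ffmpeg", "-y", "-i", video_path, "-vn", audio_only], check=True)
    return video_only, audio_only

def standardize_speech(audio_path: str, out_path: str, sr: int = 16000) -> None:
    """Re-encode audio to mono 16-bit PCM WAV at the target sample rate."""
    subprocess.run(["ffmpeg", "-y", "-i", audio_path, "-ac", "1", "-ar", str(sr),
                    "-acodec", "pcm_s16le", out_path], check=True)

if __name__ == "__main__":
    _, wav = split_media("example.mp4", "out")
    standardize_speech(wav, "out/example_16k.wav")
```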

… For more details, please read /multimodal_data_preprocessing/readme.md.

Annotation System


For more details, please read /annotation_system/readme.md.
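
Per the file tree, the annotation system combines several backends (Qwen2-Audio, SenseVoice, emotion2vec, llmware). As a hedged sketch of the multi-label idea, not the repo's actual code, per-model outputs for one clip could be merged into a single record; every key and label value below is illustrative:

```python
# Hypothetical multi-label aggregation: each backend model returns its own
# labels for a clip and the results are merged into one record. The keys
# and label values below are illustrative, not this repo's schema.
from typing import Callable, Dict

def aggregate_labels(audio_path: str,
                     annotators: Dict[str, Callable[[str], dict]]) -> dict:
    """Run every annotator on one clip and merge their label dicts."""
    record: dict = {"audio_path": audio_path}
    for name, annotate in annotators.items():
        record[name] = annotate(audio_path)
    return record

if __name__ == "__main__":
    fake_annotators = {
        "emotion": lambda p: {"label": "happy", "confidence": 0.91},
        "language": lambda p: {"label": "en", "confidence": 0.99},
    }
    print(aggregate_labels("example.wav", fake_annotators))
```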

Unseen Language Annotation


1. Run the script below to synthesize speech.

python unseen_language_annotation/lang_prob_confirm/tts/tts.py

2. Run the script below (DNSMOS) to evaluate the quality of the synthesized speech; a sketch for filtering the resulting CSV follows.

python dnsmos_local.py -t C:\temp\SampleClips -o sample.csv
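
dnsmos_local.py writes per-clip scores to the CSV passed with -o. Assuming the usual DNSMOS column names (OVRL for overall quality; verify against your own output file), low-scoring clips could be filtered like this; the 3.0 threshold is an arbitrary illustration:

```python
# Filter synthesized clips by DNSMOS overall score. The OVRL column name
# follows common DNSMOS output; verify it against your own CSV. The 3.0
# threshold is an arbitrary choice for illustration.
import pandas as pd

scores = pd.read_csv("sample.csv")
keep = scores[scores["OVRL"] >= 3.0]
print(f"Kept {len(keep)} of {len(scores)} clips.")
keep.to_csv("sample_filtered.csv", index=False)
```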

… For more details, please read /unseen_language_annotation/readme.md.

Dataset Usage

Multi-model Prompt Registration


1. Run the script below to retrieve and register speech similar to an enrolled speech prompt (a generic similarity-matching sketch follows step 4).

python multi-model_prompt_registration/s2s.py

2. Run the script below to generate face-based reference speech from the registered face.

python multi-model_prompt_registration/facetts/inference.py

3. Run the script below to retrieve and register speech matching the registered face.

python multi-model_prompt_registration/f2s.py

4. Run the script below to retrieve and register speech matching the registered text.

python multi-model_prompt_registration/t2s.py
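
s2s.py, f2s.py, and t2s.py each match a query against registered prompts in some embedding space. The shared retrieval step can be illustrated with cosine similarity over precomputed embeddings; how those embeddings are produced (speaker, face, or text encoders) is model-specific, and every name and dimension below is a placeholder:

```python
# Generic retrieval step shared by s2s/f2s/t2s: rank registered prompt
# embeddings by cosine similarity to a query embedding. The encoders that
# produce these embeddings are not shown; sizes below are placeholders.
import numpy as np

def top_k_matches(query: np.ndarray, registry: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k registered embeddings most similar to the query."""
    q = query / np.linalg.norm(query)
    r = registry / np.linalg.norm(registry, axis=1, keepdims=True)
    sims = r @ q                      # cosine similarity, one value per prompt
    return np.argsort(-sims)[:k]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    registry = rng.normal(size=(100, 192))  # 100 placeholder prompt embeddings
    query = rng.normal(size=192)
    print(top_k_matches(query, registry))
```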

… For more details, please read /multi-model_prompt_registration/readme.md.

Latency Aware Online Selection


1. Run the script below to dynamically select the most suitable speech prompt under a latency budget (an illustrative scoring sketch follows).

python latency_aware_online_system/latency_aware_online_selection.py
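
The actual policy lives in latency_aware_online_selection.py; one plausible shape, offered purely as an assumption rather than the repo's algorithm, is to score candidates by quality penalized by latency and pick the best one within a budget (the scoring rule and the 0.1 weight are illustrative):

```python
# Illustrative latency-aware selection, not this repo's algorithm: among
# candidates that fit the latency budget, pick the best quality-latency
# trade-off. The scoring rule and the 0.1 weight are assumptions.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Candidate:
    prompt_id: str
    quality: float      # e.g., a similarity or MOS-like score
    latency_ms: float   # estimated time to fetch/use this prompt

def select(candidates: List[Candidate], budget_ms: float,
           lam: float = 0.1) -> Optional[Candidate]:
    """Return the best candidate within the latency budget, or None."""
    feasible = [c for c in candidates if c.latency_ms <= budget_ms]
    if not feasible:
        return None
    return max(feasible, key=lambda c: c.quality - lam * c.latency_ms / budget_ms)

if __name__ == "__main__":
    pool = [Candidate("a", 0.90, 120), Candidate("b", 0.85, 40),
            Candidate("c", 0.95, 300)]
    print(select(pool, budget_ms=200))
```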

… For more details, please read /latency_aware_online_system/readme.md.

How to Contribute to the Open Source Project

Contributions make the open-source community an excellent place for learning, inspiration, and creation. Any contribution you make is greatly appreciated.

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

Version Control

This project uses Git for version control. You can check the current available version in the repository.

Contact

If you have any comments or questions about M3PDB, please contact us.

License

M3PDB is released under the CC BY-NC-4.0 license.

Acknowledgements

M3PDB contains third-party components and code modified from some open-source repos, including:

  1. Datasets: Emilia Dataset, VoxCeleb, VoxPopuli
  2. Code: 3D-Speaker, Side-Profile-Detection, SenseVoice, emotion2vec, seamless_communication, CosyVoice, Whisper, Imaginary Voice, GPT-4o, deepface, OSUM, XTTS-v2
