A Real-time and Light-weight Software for Generation of Non-Linguistic Behaviors in Conversational AIs
(Real-time Implementation of Voice Activity Projection)
README: English | Japanese (日本語)
MaAI is a state-of-the-art, lightweight software package that generates (predicts) non-linguistic behaviors in real time and continuously. It supports essential interaction elements such as (1) Turn-Taking, (2) Backchanneling, and (3) Head Nodding. Currently available for English, Chinese, and Japanese, MaAI will continue to expand its language coverage and its repertoire of non-linguistic behaviors. Designed specifically for conversational AI, including spoken dialogue systems and interactive robots, MaAI handles audio input in either a two-channel (user-system) or single-channel (user-only) setting. Thanks to its lightweight design, MaAI runs efficiently even on CPU-only hardware.
The name MaAI is derived from the Japanese words Ma (間) or Maai (間合い), which refer to the subtle timing and spacing that humans adjust using various modalities during interactions.
The AI in MaAI literally stands for Artificial Intelligence, reflecting the aim to develop AI technologies related to these interactional dynamics.
The currently supported models are mainly based on the Voice Activity Projection (VAP) model and its extensions. Details about the VAP model can be found in the following repository: https://github.com/ErikEkstedt/VoiceActivityProjection
For system development or collaborative research using MaAI software, please contact Koji Inoue at Kyoto University.
Demo video on YouTube (https://www.youtube.com/watch?v=-uwB6yl2WtI)
- We launched the MaAI project and repository here! (August 13th, 2025)
To quickly get started with MaAI, you can install it using pip:
pip install maai

Note: By default, the CPU version of PyTorch will be installed. If you wish to run MaAI on a GPU, please install the GPU version of PyTorch that matches your CUDA environment before proceeding.
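If you want to confirm which PyTorch build ended up in your environment, a quick check using standard PyTorch calls is:

import torch

# Prints the installed PyTorch version and whether a CUDA-capable GPU is visible.
print(torch.__version__)
print(torch.cuda.is_available())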
You can run it as follows.
The appropriate model for the task (mode) and parameters will be downloaded automatically.
Below is an example of using the turn-taking model (VAP) with the first channel as microphone input (user) and the second channel as silence (system).
from maai import Maai, MaaiInput, MaaiOutput
mic = MaaiInput.Mic()    # channel 1: microphone input (user)
zero = MaaiInput.Zero()  # channel 2: constant silence (system)
maai = Maai(mode="vap", lang="jp", frame_rate=10, context_len_sec=5, audio_ch1=mic, audio_ch2=zero, device="cpu")
maai_output_bar = MaaiOutput.ConsoleBar(bar_type="balance")
maai.start()
while True:
    result = maai.get_result()
    maai_output_bar.update(result)

We support the following models (behaviors, languages, audio settings, etc.), and more models will be added in the future. The currently available models can be found in our HuggingFace repository.
The turn-taking model uses the original VAP model as-is and predicts which participant will speak in the next moment; a small sketch of how this prediction can be consumed follows the list below.
- VAP Model
- Noise-Robust VAP Model (Recommended)
- [Single-Channel VAP Model] (In Preparation ...)
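As a minimal sketch of consuming the turn-taking prediction, the snippet below continues the quick-start example above. The field name p_future follows the VAP model's usual outputs but is an assumption here; inspect the dictionary returned by get_result in your installed version.

# Minimal sketch: read one prediction and decide who is expected to speak next.
# NOTE: the "p_future" key name and its scalar form are assumptions; check result.keys() first.
result = maai.get_result()
p_future = result.get("p_future")  # assumed: probability that channel 1 (the user) takes the upcoming turn

if p_future is not None:
    next_speaker = "user (ch1)" if p_future > 0.5 else "system (ch2)"
    print(f"Predicted next speaker: {next_speaker} (p = {p_future:.2f})")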
Backchannels are short listener responses, such as yeah and oh, which are also closely related to turn-taking.
- VAP-based Backchannel Prediction Model - Timing and type (two types)
- [VAP-based Backchannel Prediction Model - Timing Only] (In Preparation ...)
Nodding refers to the up-and-down movement of the head and is closely related to backchanneling. Unlike backchannels, which involve vocal responses, nodding allows the listener to express a reaction non-verbally.
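As a minimal sketch, the backchannel and nodding models can presumably be run with the same pipeline as the turn-taking quick start by changing the mode argument. The mode strings "bc" and "nod" below are assumptions; use the mode names listed for these models in the HuggingFace repository.

from maai import Maai, MaaiInput

# Minimal sketch: same two-channel input pipeline, different behavior model.
# NOTE: mode="bc" (backchannel) and mode="nod" (nodding) are assumed mode names.
mic = MaaiInput.Mic()
zero = MaaiInput.Zero()

maai_bc = Maai(mode="bc", lang="jp", frame_rate=10, context_len_sec=5,
               audio_ch1=mic, audio_ch2=zero, device="cpu")
maai_bc.start()

while True:
    result = maai_bc.get_result()
    print(result)  # inspect the returned fields (e.g., backchannel timing/type probabilities)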
For input to the MaAI model, you can directly call the process method of a Maai class instance.
The MaaiInput class also provides flexible input options, supporting audio from WAV files, microphone input, TCP communication, and raw audio chunks.
- WAV file input: Wav class
- Microphone input: Mic class
- TCP communication: TCPReceiver / TCPTransmitter classes
- Chunk input: Chunk class
By using these classes, you can easily adapt the audio input method to your specific use case.
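For example, a minimal sketch of feeding two WAV files instead of a microphone could look like the following; the Wav constructor argument name is an assumption, so check the input README for the actual signature.

from maai import Maai, MaaiInput

# Minimal sketch: two-channel input from WAV files (user on ch1, system on ch2).
# NOTE: the "file_path" argument name is an assumption.
wav_user = MaaiInput.Wav(file_path="user.wav")
wav_system = MaaiInput.Wav(file_path="system.wav")

maai = Maai(mode="vap", lang="jp", frame_rate=10, context_len_sec=5,
            audio_ch1=wav_user, audio_ch2=wav_system, device="cpu")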
For output, you can retrieve the processing results using the get_result method of the Maai class instance.
The MaaiOutput class also supports several kinds of visualization as well as TCP communication; a minimal sketch of consuming results directly is shown after the list below.
- Console Dynamic Output: ConsoleBar class
- GUI bar graph output: GuiBar class
- GUI plot output: GuiPlot class
- TCP communication: TCPReceiver / TCPTransmitter classes
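If none of the built-in outputs fits your application, the results can also be consumed directly via get_result, as in the minimal sketch below; my_dialogue_system and its handler are hypothetical placeholders for your own code, and the structure of each result depends on the selected mode.

# Minimal sketch: poll results and forward them to your own application logic.
maai.start()
while True:
    result = maai.get_result()
    my_dialogue_system.on_maai_result(result)  # hypothetical handler in your own application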
For more details, please refer to the following README files:
You can find example implementations of MaAI models in the example directory of this repository.
- Turn-Taking (VAP)
  - With 1 mic input
  - With 2 mic inputs
  - With 1 mic and 1 wav file input
  - With 2 wav file inputs
  - With 1 mic chunk input
  - With 1 mic input via TCP
  - With 2 wav file inputs and TCP output
  - With 1 mic chunk input via TCP
- Noise-Robust Turn-Taking (VAP)
  - With 1 mic input
- Backchannel
  - With 1 mic input
- Nodding
  - With 1 mic input
- Output
  - Console Dynamic Output
  - GUI bar graph output
  - GUI plot output
  - TCP communication
Please cite the following paper if you make any publications using this repository.
Koji Inoue, Bing'er Jiang, Erik Ekstedt, Tatsuya Kawahara, Gabriel Skantze
Real-time and Continuous Turn-taking Prediction Using Voice Activity Projection
International Workshop on Spoken Dialogue Systems Technology (IWSDS), 2024
https://arxiv.org/abs/2401.04868
@inproceedings{inoue2024iwsds,
author = {Koji Inoue and Bing'er Jiang and Erik Ekstedt and Tatsuya Kawahara and Gabriel Skantze},
title = {Real-time and Continuous Turn-taking Prediction Using Voice Activity Projection},
booktitle = {International Workshop on Spoken Dialogue Systems Technology (IWSDS)},
year = {2024},
url = {https://arxiv.org/abs/2401.04868},
}
If you use the multi-lingual VAP model, please also cite the following paper.
Koji Inoue, Bing'er Jiang, Erik Ekstedt, Tatsuya Kawahara, Gabriel Skantze
Multilingual Turn-taking Prediction Using Voice Activity Projection
Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING), pages 11873-11883, 2024
https://aclanthology.org/2024.lrec-main.1036/
@inproceedings{inoue2024lreccoling,
author = {Koji Inoue and Bing'er Jiang and Erik Ekstedt and Tatsuya Kawahara and Gabriel Skantze},
title = {Multilingual Turn-taking Prediction Using Voice Activity Projection},
booktitle = {Proceedings of the Joint International Conference on Computational Linguistics and Language Resources and Evaluation (LREC-COLING)},
pages = {11873--11883},
year = {2024},
url = {https://aclanthology.org/2024.lrec-main.1036/},
}
If you use the noise-robust VAP model, please also cite the following paper.
Koji Inoue, Yuki Okafuji, Jun Baba, Yoshiki Ohira, Katsuya Hyodo, Tatsuya Kawahara
A Noise-Robust Turn-Taking System for Real-World Dialogue Robots: A Field Experiment
arXiv preprint, arXiv:2503.06241, 2025
https://www.arxiv.org/abs/2503.06241
@misc{inoue2025noisevap,
author = {Koji Inoue and Yuki Okafuji and Jun Baba and Yoshiki Ohira and Katsuya Hyodo and Tatsuya Kawahara},
title = {A Noise-Robust Turn-Taking System for Real-World Dialogue Robots: A Field Experiment},
year = {2025},
note = {arXiv:2503.06241},
url = {https://www.arxiv.org/abs/2503.06241},
}
If you use the backchannel VAP model, please also cite the following paper.
Koji Inoue, Divesh Lala, Gabriel Skantze, Tatsuya Kawahara
Yeah, Un, Oh: Continuous and Real-time Backchannel Prediction with Fine-tuning of Voice Activity Projection
Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), pages 7171-7181, 2025
https://aclanthology.org/2025.naacl-long.367/
@inproceedings{inoue2025vapbc,
author = {Koji Inoue and Divesh Lala and Gabriel Skantze and Tatsuya Kawahara},
title = {Yeah, Un, Oh: Continuous and Real-time Backchannel Prediction with Fine-tuning of Voice Activity Projection},
booktitle = {Proceedings of the Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL)},
pages = {7171--7181},
year = {2025},
url = {https://aclanthology.org/2025.naacl-long.367/},
}
The source code in this repository is licensed under the MIT License. For the trained models, please follow the license described in the README of each model or on its Hugging Face repository.
The pre-trained CPC model comes from the original CPC project; please follow its specific license. Refer to the original repository at https://github.com/facebookresearch/CPC_audio for more details.