Pyadintool is a pre-processing toolkit covering voice activity detection, recording, splitting, and sending of audio streams. It was developed as a simple Python clone of adintool from the Julius Japanese ASR system. It is intended mainly for academic research (ease of use) and as example code. Note that its coding standards, error handling, and comments are not suited to joint development.
- 2025/11: released version 1.0, packaged as pyadin. It can be installed with pip install from the GitHub repository.
- Caution: the directory structure and paths have changed
Suitable for real-time applications on PC, e.g., spoken dialogue system
- Support sending segmented audio data to (adin) servers
- Support long recording and saving it to files
- Support GUI plot for realtime monitoring
- Support batch processing using filelist
Tentative function
- Echo cancellation (suppression of a known signal) using pre-trained filter
Real-time processing on CPUs using multi-threading (at least two or three cores are desirable)
- Power-based VAD in time domain
- DNN-HMM VAD in STFT domain
- Silero VAD
- Julius (in Japanese)
- ESPnet streaming ASR (in Japanese)
- Faster Whisper
- Better SNR (signal-to-noise ratio), better performance
- Better microphone (stand alone), better performance
- Near-field recording, better performance
- Uncompressed audio, better performance
- Performance may be affected by other settings
- sampling frequency
- amplitude characteristics of low-pass filter used for band limitation
- source separation, speech enhancement, noise reduction and echo cancellation methods
- Activity estimation based on signal power and its threshold parameter
- Assumptions
- number of speakers: only one (single speaker)
- 😄 fast and light
- ☹️ not robust against noise
- Activity estimation based on machine learning model: HMM and DNN
- Assumptions
- number of speakers: only one (single speaker)
- language: Japanese may be better (due to the model's training set)
- acceptable latency: 0.2 sec.
- 😄 scale-invariant processing and multi-conditioned training of model
- less influenced by the gain setting of audio devices
- robust against assumed non-speech signals
- stable for long recording such as spoken dialogue data
- ☹️ performance depends on the model and training data (common to ML methods)
- the latter part of long vowels tends not to be detected
- coughs are sometimes detected (not included in training data)
- consonant-like noises are sometimes detected
- 👌 Several model parameters (2025/7/16 updated)
- v1 -- Trial setup for scale-invariant processing: Acc. 93.20, F1 94.14.
- v2 -- Robustness against noise was improved to some extent, and model size becomes lighter: Acc. 94.20, F1 94.61.
- Activity estimation based on machine learning model: LSTM
- 😄 moderately fast and light, and stable performance
- ☹️ performance depends on the model and training data (common to ML methods)
- the latter part of long vowels tends not to be detected
- CPU: multi-core is better
- No confirmation using CUDA
- Ubuntu 22.04, Ubuntu 22.04 on Windows WSL
- 😄 Available: Power-based VAD, DNN-HMM VAD, Silero VAD
- Required libraries
- alsa-utils
- libasound2-dev
- libportaudio2
- ..., and other GUI and audio libraries
- Windows 11 (basically not supported)
- 😄 Available: Power-based VAD, Silero VAD
- ☹️ Unavailable: DNN-HMM VAD (PyTorch TransformerEncoderLayer seems to misbehave on Windows?)
- Latest Windows updates
- Resolve the DLL problems (fbgemm.dll and its dependency libomp140.x86_64.dll) as described in the issues
- Python3.10 or Python3.11 (Python3.9 if GUI plot is not used) and main libraries. See "requirements.txt" for details.
- torch
- torchaudio
- torchcodec (added 2025/11/06; required for audio files)
- numpy
- pyyaml
- sounddevice
- huggingface_hub
- safetensors
- pyqtgraph (for real-time plot)
- PySide6 (for real-time plot)
- ASR examples
- ESPnet (on Ubuntu): Python3.10
- Python3.11 may cause an error when installing "sentencepiece"
- Faster Whisper (on Ubuntu): Python3.10, Python3.11
- Copy and edit the shell script: change the python version and other options
cp setup_ubuntu.sh setup_ubuntu_local.sh
python=python3.10
enable_espnet=true #false
python_espnet=python3.10
enable_whisper=false #true
python_whisper=python3.10
- Run "setup_ubuntu_local.sh" to automatically install necessary libraries for ubuntu environment. "sudo apt install" and "pip install" commands are used in the script.
- Note: we have not identified the exact set of required libraries. Therefore, some unnecessary libraries may be installed by "apt install".
bash setup_ubuntu_local.sh
- Activate the venv when you run our Python scripts. The above script creates a "venv" environment (venv/main) in the current directory.
venv/
+ main/ # venv for pyadintool
+ espnet/ # venv for ESPnet ASR (valid if enable_espnet=true)
+ whisper/ # venv for Whisper ASR (valid if enable_whisper=true)
- Appropriate python version and virtual environment are assumed
- Python libraries can be also installed by using "requirements.txt" for "pyadintool" (exact versions in our environment)
pip3 install -r requirements.txt
- (Optional) ESPnet and Faster Whisper can be installed simply by
pip3 install espnet torchaudio torchcodec espnet_model_zoo
pip3 install faster_whisper
- Create a virtual environment
python3 -m venv venv\main
.\venv\main\Scripts\activate
- Install python libraries by using batch file
setup_win.bat
- Edit the batch file if you need to change the Python version
- Activate appropriate virtual environment
. venv/main/bin/activate # for ubuntu
.\venv\main\Scripts\activate # for windows
- Pyadintool requires a configuration file for execution
python3 pyadintool.py [conf]
- Check available sound devices (device list) if necessary.
python3 pyadintool.py devinfo
--- available device list ---
0 oss, ALSA (6 in, 6 out)
1 pulse, ALSA (32 in, 32 out)
* 2 default, ALSA (32 in, 32 out)
3 /dev/dsp, OSS (16 in, 16 out)
- Use the default configuration with DNN-HMM VAD
- input stream: "mic"
- output stream: "file" (saved in "result/" directory)
- sampling frequency and channel: 16k Hz and 1
python3 pyadintool.py conf/default4asr.yaml
- We can also try our latest model version as
python3 pyadintool.py conf/default4asr_v2.yaml
- Change the audio device by using "--device" option. The device ID (or name) must be selected from the device list.
python3 pyadintool.py conf/default4asr.yaml --device 2
- Switch to power-based VAD or Silero VAD configuration if you want
python3 pyadintool.py conf/power4asr.yaml
python3 pyadintool.py conf/silero4asr.yaml
The following command displays the input signal and detection results for monitoring.
python3 pyadintool.py conf/default4asr_v2.yaml --enable_plot
python3 pyadintool.py conf/default4asr.yaml --in file
echo audio.wav | python3 pyadintool.py conf/default4asr.yaml --in file
python3 pyadintool.py conf/default4asr.yaml --out file
python3 pyadintool.py conf/default4asr.yaml --out file --filename segs/result_%Y%m%d_%H%M_%R.wav --startid 0
- Available format
- %Y: year
- %m: month
- %d: day
- %H: hour
- %M: minutes
- %S: second
- %u: host name
- %R: rotation id
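As a rough illustration of how such a filename pattern could be expanded, the sketch below replaces %u and %R by hand and leaves the date codes to strftime. The function name `expand_pattern` and the 4-digit zero padding of the rotation ID are assumptions made for this example; they are not taken from pyadintool itself.

```python
import socket
from datetime import datetime


def expand_pattern(pattern, rotation_id, now=None, host=None):
    """Expand a pattern such as 'segs/result_%Y%m%d_%H%M_%R.wav' (hypothetical helper)."""
    now = now or datetime.now()
    host = host or socket.gethostname()
    # %u and %R are replaced first because strftime has its own meaning
    # for them (weekday number and HH:MM on many platforms).
    # Assumption: the rotation ID is printed as a zero-padded 4-digit number.
    text = pattern.replace("%u", host).replace("%R", f"{rotation_id:04d}")
    # The remaining codes (%Y, %m, %d, %H, %M, %S) are standard strftime codes.
    return now.strftime(text)
```

Because each rotation produces a different %R value, a pattern containing %R yields a new filename even when the date and time codes are identical.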
- Run "adinnet" server (ASR example) before running "pyadintool.py". This server receives segmented audio data from "pyadintool.py" client.
- Set up ESPnet or Whisper
. venv/espnet/bin/activate
python3 egs_asr.py ESPnet
. venv/whisper/bin/activate
python3 egs_asr.py Whisper
- or, set up Julius
sudo apt install git-lfs
git lfs install
git clone https://github.com/julius-speech/dictation-kit
cd dictation-kit
sh run-linux-dnn.sh -input adinnet -adport 5530
- Then, run the main script with adinnet option. Stop it by Ctrl-C.
python3 pyadintool.py conf/default4asr.yaml --out adinnet
python3 pyadintool.py conf/default4asr.yaml --out adinnet --server localhost --port 5530
- Send data to several ASRs
python3 pyadintool.py conf/default4asr.yaml --out adinnet --server localhost,192.168.1.30 --port 5530,5530
Just include several options in a string, such as "adinnet" and "file".
python3 pyadintool.py conf/default.yaml --out adinnet-file
python3 pyadintool.py conf/default.yaml --enable_timestamp --timestampfile result.lab
- In the case of long recording, the logging is important to check the behavior of "pyadintool.py"
- e.g. "buffer overflow" may happen while reading audio signal from device
python3 pyadintool.py conf/default4asr.yaml --enable_logsave
python3 pyadintool.py conf/default4asr.yaml --enable_logsave --logfilefmt log_%Y%m%d.log
- Available file format
- %Y: year
- %m: month
- %d: day
- %H: hour
- %M: minutes
- %S: second
- %u: host name
- %R: rotation id
python3 pyadintool.py conf/default4asr.yaml --enable_list --inlist wavlist.txt --tslist tslist.txt
Filenames of the audio and label data are listed in "wavlist.txt" and "tslist.txt", respectively, one per line.
data001.wav
data002.wav
data001.lab
data002.lab
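The two list files can be written by hand, or generated with a few lines of Python. The sketch below is only an illustration: the helper name `write_lists`, the idea that line i of "tslist.txt" corresponds to line i of "wavlist.txt", and the ".lab" naming rule are assumptions based on the example above.

```python
from pathlib import Path


def write_lists(wav_dir, lab_dir, wavlist="wavlist.txt", tslist="tslist.txt"):
    """Write one audio filename and one label filename per line (hypothetical helper)."""
    wavs = sorted(Path(wav_dir).glob("*.wav"))
    # Assumption: each label file shares the stem of its audio file.
    labs = [Path(lab_dir) / (wav.stem + ".lab") for wav in wavs]
    Path(wavlist).write_text("".join(f"{wav}\n" for wav in wavs))
    Path(tslist).write_text("".join(f"{lab}\n" for lab in labs))
    return wavs, labs
```

For example, `write_lists("data", "labels")` would list every .wav file under "data" and pair it with a .lab path under "labels".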
python3 pyadintool.py conf/default4asr.yaml --device default
python3 pyadintool.py conf/default4asr.yaml --enable_plot
Output file will be automatically rotated after "rotate_min".
python3 pyadintool.py conf/default4asr.yaml --enable_rawsave
"%R" should be used in the fileformat options to avoid overwriting.
python3 pyadintool.py conf/default4asr.yaml --enable_rawsave --rawfilefmt raw/%Y%m%d/record_%u_%R_%H%M%S.wav
python3 pyadintool.py conf/default4asr.yaml --enable_rawsave --rawfilefmt raw/%Y%m%d/record_%u_%R_%H%M%S.wav --rotate_min 30
It is better to create a new configuration file for this purpose.
python3 pyadintool.py conf/default4asr.yaml --in mic --out file-adinnet --enable_logsave --enable_rawsave --server localhost --port 5530
This function is intended to cancel system utterances in a spoken dialogue system.
It is available only under limited conditions.
- Valid only for "--in mic" option
- 2-channel audio inputs
- ch1: microphone input signal
- ch2: loopback signal (output signal from loud speaker)
- Static transfer function
- position of mic. and loud speaker never changes
- Filter estimation in advance for stable performance
- Incomplete cancellation
- VAD may still detect system utterances
Run "auxtool" to estimate filter in advance. Filter parameters are saved in "conf/ecfilter.txt" file.
python3 auxtool.py calib_filter
Run "pyadintool" with the configuration file for echo cancellation.
python3 pyadintool.py conf/default4ecasr.yaml --in mic --enable_plot
If you want to update filter parameters dynamically, change the learning rate ("mu") of "lms" in the configuration file.
enable_ec: True
lms:
L: 512
mu: 0.0
filterfile: conf/ecfilter.txt
floorfile:
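To give an idea of what the "L" and "mu" settings control, here is a minimal sketch of an adaptive echo canceller. It uses a normalized LMS (NLMS) update, which is a common variant chosen here for numerical stability; the update rule actually used by pyadintool is not documented in this README, and the function name `nlms_cancel` is hypothetical. With mu = 0.0 the filter stays fixed at the pre-estimated values, which matches the idea of the default setting above.

```python
import numpy as np


def nlms_cancel(mic, ref, w, mu=0.0, eps=1e-8):
    """Subtract the filtered loopback signal (ch2) from the microphone signal (ch1)."""
    w = np.asarray(w, dtype=np.float64).copy()  # filter taps, length L
    buf = np.zeros(len(w))                      # latest L reference samples, newest first
    out = np.empty(len(mic))
    for n in range(len(mic)):
        buf = np.roll(buf, 1)
        buf[0] = ref[n]
        err = mic[n] - w @ buf                  # residual after removing the estimated echo
        out[n] = err
        if mu > 0.0:
            # NLMS update: step size normalized by the reference power.
            w += mu * err * buf / (buf @ buf + eps)
    return out, w
```

A small positive "mu" lets the filter follow slow changes in the acoustic path, at the risk of adapting to the user's own speech while it overlaps with the system utterance.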
Some parameters should be set through yaml configuration files such as "default4asr.yaml", "power4asr.yaml" and "silero4asr.yaml".
- Change "freq" parameter in configuration file
- "freq" parameter is included in several modules
- We also need to change several parameters because they are described in "sample" unit
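The rescaling of sample-unit parameters is simple arithmetic, sketched below. Which parameters are actually expressed in samples is an assumption here ("n_win" and "n_skip" of the power-based VAD are used as examples), and `rescale_samples` is a hypothetical helper. Pre-trained models may only work at the sampling frequency they were trained with, so this sketch does not imply that every module supports other rates.

```python
def rescale_samples(n_samples, old_freq, new_freq):
    """Keep the same duration in seconds when the sampling frequency changes."""
    return int(round(n_samples * new_freq / old_freq))


# Example values at 16 kHz: 800 samples = 50 ms, 80 samples = 5 ms.
params_16k = {"n_win": 800, "n_skip": 80}
params_48k = {key: rescale_samples(value, 16000, 48000)
              for key, value in params_16k.items()}
```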
- Change "margin_begin" and "margin_end" parameters. Their unit is "second".
- "shift_time" represents a buffering time (inevitable latency) of each method
- the detected times of each segment are modified by this parameter in order to set the internal time to actual time.
- These default configurations are different among methods.
postproc:
package: usr.tdvad
class: PostProc
params:
freq: 16000
margin_begin: 0.20
margin_end: 0.20
shift_time: 0.23
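The effect of these three parameters on a detected segment can be sketched as follows. This is an assumption-laden illustration, not the PostProc implementation: it assumes that "shift_time" is subtracted from the detected times to compensate for the buffering latency, and that the margins then widen the segment on both sides. The helper name `adjust_segment` is hypothetical.

```python
def adjust_segment(begin, end, margin_begin=0.20, margin_end=0.20,
                   shift_time=0.23, freq=16000):
    """Return (begin, end) in samples after latency compensation and padding."""
    # Assumption: detected times lag behind actual time by shift_time seconds.
    new_begin = max(0.0, begin - shift_time - margin_begin)
    new_end = max(new_begin, end - shift_time + margin_end)
    return int(round(new_begin * freq)), int(round(new_end * freq))
```

Larger margins reduce the chance of clipping the start or end of an utterance, but they also include more non-speech audio in each segment.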
- Change the "flramp" parameter, which takes a value in [0, 32768]. Smaller values are more sensitive to signal power
- Change "n_win" parameter to use longer window for calculation of moving averaged power
package: usr.tdvad
class: SimpleVAD
params:
n_win: 800
n_skip: 80
flramp: 500
thre: 0.5
nbits: 16
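For intuition, a generic power-based detector with a moving window can be sketched as below. It is not the SimpleVAD implementation: how SimpleVAD combines "flramp" and "thre" is not described in this README, so the sketch uses a single amplitude floor (`floor_amp`, an assumed stand-in for an amplitude on the 16-bit scale) and the hypothetical function name `power_vad`.

```python
import numpy as np


def power_vad(x, n_win=800, n_skip=80, floor_amp=500.0):
    """Return one flag per hop: True where the windowed mean power exceeds floor_amp**2."""
    x = np.asarray(x, dtype=np.float64)
    flags = []
    # Slide a window of n_win samples in steps of n_skip samples.
    for start in range(0, len(x) - n_win + 1, n_skip):
        frame = x[start:start + n_win]
        flags.append(np.mean(frame ** 2) > floor_amp ** 2)
    return np.array(flags, dtype=bool)
```

A longer window smooths the power estimate and suppresses short clicks, while a lower floor makes the detector react to quieter input, including background noise.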
- Change "pw_1" parameter in "dnnhmmfilter.yaml". Smaller is more sensitive to speech signal.
- The value "0.1" or smaller may be effective under high SNR environments.
probfilter:
classname: BinaryProbFilter
package: usr.fdvad
params:
trp1_self: 0.99
trp2_self: 0.99
pw_1: 0.5
- In addition, a detection threshold can be set to ignore low-power background noise and residual signals from the echo canceller. Change the "min_thre" value according to your environment.
tagger:
package: usr.fdvad
class: stftSlidingVAD
params:
yamlfile: conf/dnnhmmfilter.yaml
min_frame: 2
nshift: 160
nbuffer: 12000
device: cpu
dtype: float32
nthread: 3
min_thre: 1.0 # no threshold if we set it to 0.0.
- The threshold above can be estimated in advance from a pre-recording using "auxtool.py".
$ python3 auxtool.py calib_framepower
[LOG]: calibrate power
[LOG]: now recording ...
[LOG]: estimated frame-power: mean: 0.7391, std: 0.1577
- Change "thre" parameter. Smaller is more sensitive to signal power
tagger:
package: usr.silerovad
class: SileroVAD
params:
freq: 16000
thre: 0.5
All default parameters need to be set in the configuration file. Command line options override these defaults.
- Set input stream. "mic" or "file".
- Set output stream: "file", "adinnet", or both ("adinnet-file")
- Data format of "adinnet"
- segmented data
- 4-byte int: represents audio data length in bytes (N)
- N bytes: binary audio data
- end of segment
- 4-byte int: 0 (zero)
- segmented data
- Set output filename
- Set start id for rotation filename, e.g., 0
- "%R" in filename is replaced into the current rotation ID
- Set hostnames of adinserver, e.g., localhost
- Set ports of adinserver, e.g., 5530
- Set sampling frequency of input stream in Hz, e.g., 16000
- Set number of channels of input stream, e.g., 1
- Set target channels, e.g. --tgt_chs 0 1.
- Selected channels will be extracted from audio input stream.
- Set ID or name of audio device, e.g., 1
- Set input audio filename if "--in file" is valid
- Available only if "--in file" option is set
- Save log to the file
- Set fileformat for log
- Available only if "--enable_logsave" option is set
- Save raw input stream to the file
- Set fileformat for raw audio data, e.g., "rawfile_%Y%d%m_%R.wav"
- Available only if "--enable_rawsave" option is set
- Set duration time in minutes for saving raw audio files, e.g., 30
- Available only if "--enable_rawsave" option is set
- Save timestamp of audio segments to the file
- Set filename for saving timestamps
- Available only if "--enable_timestamp" option is set
- Plot waveform and speech activity on GUI
- Run batch processing
- Set audio file list for batch processing
- Available only if "--enable_list" option is set
- Set timestamp file list for batch processing
- Available only if "--enable_list" option is set
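The "adinnet" data format listed above (a 4-byte length followed by the audio bytes, and a 4-byte zero at the end of a segment) can be sketched as a small sender and receiver. The byte order is not stated in this README; little-endian is assumed here, and the function names and the chunk size are hypothetical.

```python
import socket
import struct


def send_segment(sock, pcm_bytes, chunk_bytes=4096):
    """Send one audio segment as length-prefixed chunks, then the end marker."""
    for i in range(0, len(pcm_bytes), chunk_bytes):
        chunk = pcm_bytes[i:i + chunk_bytes]
        # Assumption: the 4-byte length is a little-endian signed int.
        sock.sendall(struct.pack("<i", len(chunk)) + chunk)
    sock.sendall(struct.pack("<i", 0))  # end of segment


def recv_exact(sock, n):
    """Read exactly n bytes or raise if the peer closes the connection."""
    buf = bytearray()
    while len(buf) < n:
        part = sock.recv(n - len(buf))
        if not part:
            raise ConnectionError("peer closed the connection")
        buf.extend(part)
    return bytes(buf)


def recv_segment(sock):
    """Collect chunks until the zero-length end marker arrives."""
    data = bytearray()
    while True:
        (n,) = struct.unpack("<i", recv_exact(sock, 4))
        if n == 0:
            return bytes(data)
        data.extend(recv_exact(sock, n))
```

A test server for the client could accept a connection and call `recv_segment` in a loop, passing each returned byte string to an ASR engine.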
The system libraries must be installed first, e.g., with the apt install command on Ubuntu.
They may include alsa-utils, libasound2-dev, libportaudio2, and so on.
It is better to follow setup_ubuntu.sh for their installation.
python3 -m venv venv
. venv/bin/activate
python3 -m pip install git+https://github.com/ouktlab/pyadintool.git
Please use the following command if you want to enable wave plot.
python3 -m pip install pyadintool[gui]@git+https://github.com/ouktlab/pyadintool.git
import pyadin
For example, please create main.py which is the same source code as pyadintool.py
import pyadin
if __name__ == "__main__":
pyadin.app_pyadintool()
Then, run the main.py with package's default configuration file egs_conf/default4asr.yaml.
python3 main.py egs_conf/default4asr.yaml --enable_plot
@inproceedings {
author={Ryu Takeda and Kazunori Komatani},
title={Scale-invariant Online Voice Activity Detection under Various Environments},
year={2024},
booktitle={Proceedings of Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)},
pages={1--6},
doi={10.1109/APSIPAASC63619.2025.10848584},
}