Pyadintool

Pyadintool is a pre-processing toolkit that covers voice activity detection, recording, splitting, and sending of audio streams. It has been developed as a simple Python clone of adintool from the Julius Japanese ASR system. The toolkit is intended mainly for academic research (easy to use) and as example code. Note that its coding standards, error handling, comments, and so on are not suitable for joint development.

2025/11: Released version 1.0, packaged as pyadin. It can be installed with pip from the GitHub repository.
- Caution: the directory structure and paths have been changed

Key Features

Interface

Suitable for real-time applications on PC, e.g., spoken dialogue system

Support sending segmented audio data to (adin) servers
Support long recording and saving it to files
Support GUI plot for realtime monitoring
Support batch processing using filelist

Tentative function

Echo cancellation (suppression of a known signal) using pre-trained filter

Supported Voice Activity Detection (VAD)

Real-time processing on CPUs using multi-threading (at least two or three cores are desirable)

Power-based VAD in time domain
DNN-HMM VAD in STFT domain
Silero VAD

Example of ASR interface

Julius (in Japanese)
ESPnet streaming ASR (in Japanese)
Faster Whisper

VAD Features

Common

Better SNR (signal-to-noise ratio), better performance
Better microphone (stand alone), better performance
Near-field recording, better performance
Uncompressed audio, better performance
Performance may be affected by other settings
- sampling frequency
- amplitude characteristics of low-pass filter used for band limitation
- source separation, speech enhancement, noise reduction and echo cancellation methods

Power-based VAD

Activity estimation based on signal power and its threshold parameter
Assumptions
- number of speakers: only one (single speaker)
😄 fast and light
☹️ not robust against noise
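
As a rough illustration of the idea above, the following sketch marks a frame as active when its mean power over a sliding window exceeds a threshold. The function name `power_vad` and the parameters `win`, `skip`, and `threshold` are hypothetical; this is not the implementation used in this toolkit, only a minimal example of power thresholding.

```python
def power_vad(samples, win=800, skip=80, threshold=500.0 ** 2):
    """Return one boolean per frame: True if the mean power exceeds threshold.

    samples: sequence of integer amplitudes (16-bit range is assumed).
    win, skip: window length and hop size in samples (hypothetical names).
    """
    flags = []
    for start in range(0, len(samples) - win + 1, skip):
        frame = samples[start:start + win]
        power = sum(x * x for x in frame) / win  # mean squared amplitude
        flags.append(power > threshold)
    return flags


if __name__ == "__main__":
    silence = [10] * 1600            # low-amplitude background
    speech = [2000, -2000] * 800     # high-amplitude toy signal
    flags = power_vad(silence + speech)
    print(flags[0], flags[-1])
```

Because the decision depends only on absolute power, the result changes with the device gain and with background noise, which is the weakness noted above.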

DNN-HMM VAD

Activity estimation based on machine learning model: HMM and DNN
Assumptions
- number of speakers: only one (single speaker)
- language: Japanese may be better (due to the model's training set)
- acceptable latency: 0.2 sec.
😄 scale-invariant processing and multi-conditioned training of model
- less influenced by the gain setting of audio devices
- robust against assumed non-speech signals
- stable for long recording such as spoken dialogue data
☹️ performance dependency on model and training data (general in ML methods)
- latter part of long vowels tends not to be detected
- coughs are sometimes detected (not included in training data)
- consonant-like noises are sometimes detected
👌 Several model parameters (2025/7/16 updated)
- v1 -- Trial setup for scale-invariant processing: Acc. 93.20, F1 94.14.
- v2 -- Robustness against noise was improved to some extent, and the model became lighter: Acc. 94.20, F1 94.61.

Silero VAD

Activity estimation based on machine learning model: LSTM
😄 moderately fast and light, and stable performance
☹️ performance dependency on model and training data (general in ML methods)
- latter part of long vowels tends not to be detected

Installation

System Requirements

CPU: multi-core is better
- No confirmation using CUDA
Ubuntu 22.04, Ubuntu 22.04 on Windows WSL
- 😄 Available: Power-based VAD, DNN-HMM VAD, Silero VAD
- Required libraries
  - alsa-utils
  - libasound2-dev
  - libportaudio2
  - ..., and other GUI and audio libraries
Windows 11 (basically not supported)
- 😄 Available: Power-based VAD, Silero VAD
- ☹️ Unavailable: DNN-HMM VAD (PyTorch TransformerEncoderLayer may not work correctly on Windows)
- Apply the latest Windows updates
- Resolve the problems with fbgemm.dll and its dependency libomp140.x86_64.dll, as described in the issues
  - latest Visual C++ x86/64 build tools from Microsoft Visual Studio
  - latest Visual C++ redistributable package from Microsoft (for Windows 10?)
  - (optional) other required modules for torchaudio, etc.
Python3.10 or Python3.11 (Python3.9 if GUI plot is not used) and main libraries. See "requirements.txt" for details.
- torch
- torchaudio
- torchcodec (added 2025/11/06; required for audio files)
- numpy
- pyyaml
- sounddevice
- huggingface_hub
- safetensors
- pyqtgraph (for real-time plot)
- PySide6 (for real-time plot)
ASR examples
- ESPnet (on Ubuntu): Python3.10
  - Python3.11 may cause error in installing "sentencepiece"
- Faster Whisper (on Ubuntu): Python3.10, Python3.11

Setup on Ubuntu

Shell script

Copy and edit the shell script: change the python version and other options

cp setup_ubuntu.sh setup_ubuntu_local.sh

python=python3.10
enable_espnet=true #false
python_espnet=python3.10
enable_whisper=false #true
python_whisper=python3.10

Run "setup_ubuntu_local.sh" to automatically install necessary libraries for ubuntu environment. "sudo apt install" and "pip install" commands are used in the script.
- Note: we have not identified the exact set of required libraries. Therefore, some unnecessary libraries may be installed by "apt install".

bash setup_ubuntu_local.sh

Activate the venv when you run our Python scripts. The above script creates "venv" environments (e.g., venv/main) in the current directory.

  venv/
    + main/    # venv for pyadintool
    + espnet/  # venv for ESPnet ASR (valid if enable_espnet=true)
    + whisper/ # venv for Whisper ASR (valid if enable_whisper=true)

pip command for python libraries

An appropriate Python version and virtual environment are assumed
Python libraries can also be installed by using "requirements.txt" for "pyadintool" (exact versions in our environment)

pip3 install -r requirements.txt

(Optional) ESPnet and Faster Whisper can be installed simply by

pip3 install espnet torchaudio torchcodec espnet_model_zoo

pip3 install faster_whisper

Setup on Windows

Create virtual environment

python3 -m venv venv\main
.\venv\main\Scripts\activate

Install python libraries by using batch file

setup_win.bat

Edit the batch file if you need to change the Python version

Run with default settings

General procedure

Activate appropriate virtual environment

. venv/main/bin/activate   # for ubuntu

.\venv\main\Scripts\activate    # for windows

Pyadintool requires a configuration file for execution

python3 pyadintool.py [conf]

Check available sound devices (device list) if necessary.

python3 pyadintool.py devinfo
--- available device list ---
  0 oss, ALSA (6 in, 6 out)
  1 pulse, ALSA (32 in, 32 out)
* 2 default, ALSA (32 in, 32 out)
  3 /dev/dsp, OSS (16 in, 16 out)

Use the default configuration with DNN-HMM VAD
- input stream: "mic"
- output stream: "file" (saved in "result/" directory)
- sampling frequency and channel: 16k Hz and 1

python3 pyadintool.py conf/default4asr.yaml

The latest model version can be tried with

python3 pyadintool.py conf/default4asr_v2.yaml

Change the audio device by using "--device" option. The device ID (or name) must be selected from the device list.

python3 pyadintool.py conf/default4asr.yaml --device 2

Switch to power-based VAD or Silero VAD configuration if you want

python3 pyadintool.py conf/power4asr.yaml

python3 pyadintool.py conf/silero4asr.yaml

Recommended trial command for monaural microphone input

The following command displays the input signal and detection results for monitoring.

python3 pyadintool.py conf/default4asr_v2.yaml --enable_plot

Examples

Example-01: Set an audio file as input stream

python3 pyadintool.py conf/default4asr.yaml --in file

echo audio.wav | python3 pyadintool.py conf/default4asr.yaml --in file

Example-02: Save segmented audio signals to files

python3 pyadintool.py conf/default4asr.yaml --out file

python3 pyadintool.py conf/default4asr.yaml --out file --filename segs/result_%Y%m%d_%H%M_%R.wav --startid 0

Available format
- %Y: year
- %m: month
- %d: day
- %H: hour
- %M: minutes
- %S: second
- %u: host name
- %R: rotation id

Example-03: Send segmented audio signals to ASR servers

Run "adinnet" server (ASR example) before running "pyadintool.py". This server receives segmented audio data from "pyadintool.py" client.
Set up ESPnet or Whisper

. venv/espnet/bin/activate
python3 egs_asr.py ESPnet

. venv/whisper/bin/activate
python3 egs_asr.py Whisper

or, set up Julius

sudo apt install git-lfs
git lfs install
git clone https://github.com/julius-speech/dictation-kit
cd dictation-kit
sh run-linux-dnn.sh -input adinnet -adport 5530

Then, run the main script with adinnet option. Stop it by Ctrl-C.

python3 pyadintool.py conf/default4asr.yaml --out adinnet

python3 pyadintool.py conf/default4asr.yaml --out adinnet --server localhost --port 5530

Send data to several ASRs

python3 pyadintool.py conf/default4asr.yaml --out adinnet --server localhost,192.168.1.30 --port 5530,5530

Example-04: Set multiple output streams

Join several output names with a hyphen in a single string, e.g., "adinnet-file" for both "adinnet" and "file".

python3 pyadintool.py conf/default.yaml --out adinnet-file

Example-05: Save timestamps of VAD to a file

python3 pyadintool.py conf/default.yaml --enable_timestamp --timestampfile result.lab

Example-06: Logging

For long recordings, logging is important for checking the behavior of "pyadintool.py"
- e.g. "buffer overflow" may happen while reading audio signal from device

python3 pyadintool.py conf/default4asr.yaml --enable_logsave

python3 pyadintool.py conf/default4asr.yaml --enable_logsave --logfilefmt log_%Y%m%d.log

Available file format
- %Y: year
- %m: month
- %d: day
- %H: hour
- %M: minutes
- %S: second
- %u: host name
- %R: rotation id

Example-07: Batch processing for filelist

python3 pyadintool.py conf/default4asr.yaml --enable_list --inlist wavlist.txt --tslist tslist.txt

Filenames of audio and label data are listed in "wavlist.txt" and "tslist.txt", respectively

data001.wav
data002.wav

data001.lab
data002.lab
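
The two list files can be generated with a short script such as the one below. It is a minimal sketch that assumes one filename per line and that the two lists are paired by line position; the function name `write_filelists` is hypothetical and not part of the toolkit.

```python
from pathlib import Path


def write_filelists(wav_dir, inlist="wavlist.txt", tslist="tslist.txt"):
    """Write paired audio/label list files for all .wav files in wav_dir."""
    wavs = sorted(Path(wav_dir).glob("*.wav"))
    # Each label file gets the same name as its audio file, with ".lab".
    labs = [w.with_suffix(".lab") for w in wavs]
    Path(inlist).write_text("".join(f"{w}\n" for w in wavs))
    Path(tslist).write_text("".join(f"{lab}\n" for lab in labs))
    return len(wavs)


if __name__ == "__main__":
    n = write_filelists(".")
    print(f"listed {n} files")
```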

Example-08: Set input device name or ID

python3 pyadintool.py conf/default4asr.yaml --device default

Example-09: Real-time plot for monitoring

python3 pyadintool.py conf/default4asr.yaml --enable_plot

Click the close button of the window to stop

Example-10: Save raw recording to files

Output file will be automatically rotated after "rotate_min".

python3 pyadintool.py conf/default4asr.yaml --enable_rawsave

"%R" should be used in the fileformat options to avoid overwriting.

python3 pyadintool.py conf/default4asr.yaml --enable_rawsave --rawfilefmt raw/%Y%m%d/record_%u_%R_%H%M%S.wav

python3 pyadintool.py conf/default4asr.yaml --enable_rawsave --rawfilefmt raw/%Y%m%d/record_%u_%R_%H%M%S.wav --rotate_min 30

Example-11: Set up for real-time ASR applications using adinnet

It is better to create a new configuration file for this purpose.

python3 pyadintool.py conf/default4asr.yaml --in mic --out file-adinnet --enable_logsave --enable_rawsave --server localhost --port 5530

Example-12: Use echo canceller (tentative)

This function is intended to cancel system utterances in a spoken dialogue system.
It is available only under limited conditions.

Valid only for "--in mic" option
2-channel audio inputs
- ch1: microphone input signal
- ch2: loopback signal (output signal from loud speaker)
Static transfer function
- position of mic. and loud speaker never changes
Filter estimation in advance for stable performance
Incomplete cancellation
- VAD may still detect system utterances

Run "auxtools.py" to estimate the filter in advance. Filter parameters are saved in the "conf/ecfilter.txt" file.

python3 auxtools.py calib_filter

Run "pyadintool" with the configuration file for echo cancellation.

python3 pyadintool.py conf/default4ecasr.yaml --in mic --enable_plot

If you want to update filter parameters dynamically, change the learning rate ("mu") of "lms" in the configuration file.

enable_ec: True
lms:
  L: 512
  mu: 0.0
  filterfile: conf/ecfilter.txt
  floorfile:

Tuning/Change Configuration

Some parameters should be set through yaml configuration files such as "default4asr.yaml", "power4asr.yaml" and "silero4asr.yaml".

Common: sampling frequency

Change the "freq" parameter in the configuration file
The "freq" parameter is included in several modules
Several other parameters must also be changed because they are specified in units of samples
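
A parameter given in samples can be rescaled so that it covers the same duration at the new sampling frequency. The helper below is a minimal sketch; the name `rescale_samples` is hypothetical, and treating "n_win" and "n_skip" as sample-unit parameters is an assumption based on the 16 kHz examples in this document. Check each module to see which parameters actually need conversion.

```python
def rescale_samples(n_samples, old_freq, new_freq):
    """Convert a length in samples so that it covers the same duration."""
    return round(n_samples * new_freq / old_freq)


if __name__ == "__main__":
    # Example values taken from the 16 kHz power-based VAD configuration.
    for name, value in {"n_win": 800, "n_skip": 80}.items():
        print(name, rescale_samples(value, 16000, 48000))
```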

Common: margin parameters for audio segmentation

Change the "margin_begin" and "margin_end" parameters. Their unit is seconds.
"shift_time" represents the buffering time (inevitable latency) of each method
- the detected times of each segment are corrected by this parameter so that the internal time matches the actual time.
The default values differ among methods.

postproc:
  package: usr.tdvad
  class: PostProc
  params:
    freq: 16000
    margin_begin: 0.20
    margin_end: 0.20
    shift_time: 0.23
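
The effect of these parameters on a detected segment can be sketched as below. This assumes that "shift_time" is subtracted from the detected times to compensate for latency and that the margins widen the segment on both sides; the actual sign conventions in the toolkit may differ, and the function name `adjust_segment` is hypothetical.

```python
def adjust_segment(start, end, margin_begin=0.20, margin_end=0.20,
                   shift_time=0.23):
    """Return (start, end) in seconds after latency compensation and padding."""
    # Assumption: detection lags the actual signal by shift_time seconds.
    new_start = max(0.0, start - shift_time - margin_begin)
    new_end = max(new_start, end - shift_time + margin_end)
    return new_start, new_end


if __name__ == "__main__":
    print(adjust_segment(1.0, 2.0))
```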

Power-based VAD: threshold parameter

Change the "flramp" parameter, which ranges over [0, 32768]. A smaller value is more sensitive to signal power
Change the "n_win" parameter to use a longer window for calculating the moving-average power

  package: usr.tdvad
  class: SimpleVAD
  params:
    n_win: 800
    n_skip: 80
    flramp: 500
    thre: 0.5
    nbits: 16

DNN-HMM VAD: threshold parameter

Change "pw_1" parameter in "dnnhmmfilter.yaml". Smaller is more sensitive to speech signal.
The value "0.1" or smaller may be effective under high SNR environments.

probfilter:
  classname: BinaryProbFilter
  package: usr.fdvad
  params:
    trp1_self: 0.99
    trp2_self: 0.99
    pw_1: 0.5

In addition, a detection threshold can be set to ignore low-power background noises and residual signals from the echo canceller. Change the "min_thre" value according to your environment.

tagger:
  package: usr.fdvad
  class: stftSlidingVAD
  params:
    yamlfile: conf/dnnhmmfilter.yaml
    min_frame: 2
    nshift: 160
    nbuffer: 12000
    device: cpu
    dtype: float32
    nthread: 3
    min_thre: 1.0  # no threshold if we set it to 0.0.

The threshold above can be estimated in advance from a pre-recording by using "auxtools.py".

$ python3 auxtools.py calib_framepower
[LOG]: calibrate power
[LOG]: now recording ...
[LOG]: estimated frame-power: mean: 0.7391, std: 0.1577

Silero VAD: threshold parameter

Change the "thre" parameter. A smaller value is more sensitive to speech

tagger:
  package: usr.silerovad
  class: SileroVAD
  params:
    freq: 16000
    thre: 0.5

Options

All default parameters need to be set in the configuration file. Command line options override the default configuration.

--in [IN]

Set input stream. "mic" or "file".

--out [OUT]

Set output stream: "file", "adinnet", or both ("adinnet-file")
Data format of "adinnet"
- segmented data
  - 4-byte int: represents audio data length in bytes (N)
  - N bytes: binary audio data
- end of segment
  - 4-byte int: 0 (zero)
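
The framing described above can be sketched with Python's `struct` module. This is a minimal illustration, not the toolkit's code: the function names are hypothetical, and little-endian byte order for the 4-byte integer is an assumption that should be checked against the receiving server.

```python
import struct


def pack_segment(audio_bytes):
    """Frame one chunk of audio: 4-byte length (N) followed by N bytes."""
    return struct.pack("<i", len(audio_bytes)) + audio_bytes


def pack_end_of_segment():
    """Frame the end-of-segment marker: a 4-byte zero."""
    return struct.pack("<i", 0)


def unpack_stream(data):
    """Split a byte string into segments; each segment is a list of chunks."""
    segments, chunks, pos = [], [], 0
    while pos + 4 <= len(data):
        (n,) = struct.unpack_from("<i", data, pos)
        pos += 4
        if n == 0:  # end of segment
            segments.append(chunks)
            chunks = []
        else:
            chunks.append(data[pos:pos + n])
            pos += n
    return segments


if __name__ == "__main__":
    stream = pack_segment(b"\x01\x02") + pack_end_of_segment()
    print(unpack_stream(stream))
```

A client would write these frames to a TCP socket connected to the adinnet server; the parser is included only to show how the receiver can recover the chunk boundaries.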

--filename [FILENAME]

Set output filename

--startid [ID]

Set start id for rotation filename, e.g., 0
"%R" in the filename is replaced with the current rotation ID

--server [HOSTNAME]

Set hostnames of adinserver, e.g., localhost

--port [PORT]

Set ports of adinserver, e.g., 5530

--freq [FREQ]

Set sampling frequency of input stream in Hz, e.g., 16000

--nch [NCH]

Set number of channels of input stream, e.g., 1

--tgt_chs [TGT_CHS]

Set target channels, e.g. --tgt_chs 0 1.
Selected channels will be extracted from audio input stream.

--device [DEVICE]

Set ID or name of audio device, e.g., 1

--infile [INFILE]

Set input audio filename
Available only if "--in file" option is set

--enable_logsave

Save log to the file

--logfilefmt [LOGFILEFMT]

Set fileformat for log
Available only if "--enable_logsave" option is set

--enable_rawsave

Save raw input stream to the file

--rawfilefmt [RAWFILEFMT]

Set fileformat for raw audio data, e.g., "rawfile_%Y%d%m_%R.wav"
Available only if "--enable_rawsave" option is set

--rotate_min [ROTATE_MIN]

Set duration time in minutes for saving raw audio files, e.g., 30
Available only if "--enable_rawsave" option is set

--enable_timestamp

Save timestamp of audio segments to the file

--timestampfile [TIMESTAMPFILE]

Set filename for saving timestamps
Available only if "--enable_timestamp" option is set

--enable_plot

Plot waveform and speech activity on GUI

--enable_list

Run batch processing

--inlist [INLIST]

Set audio file list for batch processing
Available only if "--enable_list" option is set

--tslist [TSLIST]

Set timestamp file list for batch processing
Available only if "--enable_list" option is set

Use as Package

Pip install via github

0. Install system libraries

System libraries must be installed first, e.g., with the apt install command on Ubuntu.
They may include alsa-utils, libasound2-dev, libportaudio2, and so on.
It is better to follow setup_ubuntu.sh for their installation.

1. Activate virtual environment

python3 -m venv venv
. venv/bin/activate

2. Install pyadintool by pip from GitHub

python3 -m pip install git+https://github.com/ouktlab/pyadintool.git

Please use the following command if you want to enable wave plot.

python3 -m pip install pyadintool[gui]@git+https://github.com/ouktlab/pyadintool.git

3. Import "pyadin" package (not "pyadintool")

import pyadin

Example

Run test program

For example, create main.py with the same source code as pyadintool.py

import pyadin
if __name__ == "__main__":
    pyadin.app_pyadintool()

Then, run the main.py with package's default configuration file egs_conf/default4asr.yaml.

python3 main.py egs_conf/default4asr.yaml --enable_plot

Citations

@inproceedings {
  author={Ryu Takeda and Kazunori Komatani},
  title={Scale-invariant Online Voice Activity Detection under Various Environments},
  year={2024},
  booktitle={Proceedings of Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)},
  pages={1--6},
  doi={10.1109/APSIPAASC63619.2025.10848584},
}
