Thanks to k2-fsa/sherpa-onnx, we can easily build a voice API (ASR and TTS) with Python. The following models are supported:
| Model | Language | Type | Description |
|---|---|---|---|
| zipformer-bilingual-zh-en-2023-02-20 | Chinese + English | Online ASR | Streaming Zipformer, Bilingual |
| sense-voice-zh-en-ja-ko-yue-2024-07-17 | Chinese + English + Japanese + Korean + Cantonese | Offline ASR | SenseVoice, Multilingual |
| sense-voice-zh-en-ja-ko-yue-int8-2025-09-09 | Chinese + English + Japanese + Korean + Cantonese | Offline ASR | SenseVoice-small (int8), Multilingual |
| paraformer-trilingual-zh-cantonese-en | Chinese + Cantonese + English | Offline ASR | Paraformer, Trilingual |
| paraformer-en-2024-03-09 | English | Offline ASR | Paraformer, English |
| vits-zh-hf-theresa | Chinese | TTS | VITS, Chinese, 804 speakers |
| melo-tts-zh_en | Chinese + English | TTS | MeloTTS, Chinese + English, 1 speaker |
| kokoro-multi-lang-v1_0 | Chinese + English | TTS | Kokoro, Chinese + English, 53 speakers |
Python 3.10+ is required.

```bash
python3 -m venv venv
. venv/bin/activate
pip install -r requirements.txt
python app.py
```

Visit http://localhost:8000/ to see the demo page.
To build the CUDA-enabled Docker image:

```bash
docker build -t voiceapi:cuda_dev -f Dockerfile.cuda.cn .
```

Streaming ASR over the `/asr` WebSocket: send PCM 16-bit audio data to the server, and the server will return the transcription result.
`samplerate` can be set in the query string; the default is 16000.

The server will return the transcription result in JSON format, with the following fields:

- `text`: the transcription result
- `finished`: whether the segment is finished
- `idx`: the index of the segment
```javascript
const ws = new WebSocket('ws://localhost:8000/asr?samplerate=16000');

ws.onopen = () => {
  console.log('connected');
  ws.send('{"sid": 0}');
};

ws.onmessage = (e) => {
  const data = JSON.parse(e.data);
  const { text, finished, idx } = data;
  // do something with text
  // finished is true when the segment is finished
};

// send audio data: PCM 16-bit, at the configured samplerate
ws.send(int16Array.buffer);
```
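For reference, here is the same exchange from Python, as a minimal sketch using the third-party `websockets` package (`pip install websockets`); the raw-PCM file name and the 100 ms chunking are illustrative assumptions, not part of the API.

```python
# Minimal /asr client sketch: stream raw 16-bit mono PCM (16000 Hz) and
# print transcription updates. 'audio.pcm' is a hypothetical input file.
import asyncio
import json

import websockets

async def transcribe(path):
    async with websockets.connect('ws://localhost:8000/asr?samplerate=16000') as ws:
        async def send_audio():
            with open(path, 'rb') as f:
                while chunk := f.read(3200):  # 100 ms of 16 kHz 16-bit mono audio
                    await ws.send(chunk)
                    await asyncio.sleep(0.1)  # pace the stream like a live microphone

        sender = asyncio.create_task(send_audio())  # send while we receive
        async for message in ws:  # iterate results until the server closes
            result = json.loads(message)
            print(result['idx'], result['text'], '(finished)' if result['finished'] else '')

asyncio.run(transcribe('audio.pcm'))
```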
Streaming TTS over the `/tts` WebSocket: send text to the server, and the server will return the synthesized audio data.

- `samplerate` can be set in the query string; the default is 16000.
- `sid` is the speaker ID; the default is 0.
- `speed` is the speed of the synthesized audio; the default is 1.0.
- `chunk_size` is the size of each audio chunk; the default is 1024.
The server will return the synthesized audio data in binary format:

- The audio data is PCM 16-bit, sent as binary messages.
- The server will also return the synthesis status in JSON format, with the following fields:
  - `elapsed`: the elapsed time
  - `progress`: the progress of the synthesis
  - `duration`: the duration of the synthesized audio
  - `size`: the size of the synthesized audio data
```javascript
const ws = new WebSocket('ws://localhost:8000/tts?samplerate=16000');

ws.onopen = () => {
  console.log('connected');
  ws.send('Your text here');
};

ws.onmessage = (e) => {
  if (e.data instanceof Blob) {
    // Binary message: a chunk of PCM 16-bit audio
    e.data.arrayBuffer().then((arrayBuffer) => {
      const int16Array = new Int16Array(arrayBuffer);
      const float32Array = new Float32Array(int16Array.length);
      for (let i = 0; i < int16Array.length; i++) {
        float32Array[i] = int16Array[i] / 32768.0;
      }
      // playNode is an AudioWorkletNode created elsewhere (not shown here)
      playNode.port.postMessage({ message: 'audioData', audioData: float32Array });
    });
  } else {
    // Text message: the synthesis status in JSON
    const { elapsed, progress, duration, size } = JSON.parse(e.data);
    console.log('elapsed:', elapsed, 'progress:', progress);
  }
};
```
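A corresponding Python sketch for the `/tts` WebSocket, again using the `websockets` package; writing the raw PCM chunks to a file is an illustrative choice, and the loop simply ends when the server closes the connection (an assumption of this sketch).

```python
# Minimal /tts client sketch: send text, collect PCM 16-bit chunks into a
# raw file ('output.pcm' is a hypothetical name), and print status updates.
import asyncio
import json

import websockets

async def synthesize(text, out_path='output.pcm'):
    url = 'ws://localhost:8000/tts?samplerate=16000&sid=0&speed=1.0'
    async with websockets.connect(url) as ws:
        await ws.send(text)
        with open(out_path, 'wb') as f:
            async for message in ws:
                if isinstance(message, bytes):
                    f.write(message)              # binary message: audio chunk
                else:
                    status = json.loads(message)  # text message: synthesis status
                    print('progress:', status['progress'], 'elapsed:', status['elapsed'])

asyncio.run(synthesize('Hello, world!'))
```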
You can also synthesize speech with a plain HTTP POST to `/tts`: send text to the server, and the server will return the synthesized audio.

- `text` is the text to be synthesized.
- `samplerate` sets the output sample rate; the default is 16000.
- `sid` is the speaker ID; the default is 0.
- `speed` is the speed of the synthesized audio; the default is 1.0.
curl -X POST "http://localhost:8000/tts" \
-H "Content-Type: application/json" \
-d '{
"text": "Hello, world!",
"sid": 0,
"samplerate": 16000
}' -o helloworkd.wavSend an audio file (wav or ogg) to the server, and the server will return the transcription with timestamps for each segment.
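The same request from Python with the `requests` package (`pip install requests`), mirroring the curl call above:

```python
# Mirrors the curl example above; saves the returned WAV file to disk.
import requests

resp = requests.post(
    'http://localhost:8000/tts',
    json={'text': 'Hello, world!', 'sid': 0, 'samplerate': 16000},
    timeout=60,
)
resp.raise_for_status()
with open('helloworld.wav', 'wb') as f:
    f.write(resp.content)
```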
To transcribe a whole file, POST it to `/asr_file`: send an audio file (wav or ogg) to the server, and the server will return the transcription with timestamps for each segment.

- `file`: the audio file to transcribe (wav or ogg).
- `samplerate`: target sample rate for processing; the default is 16000.
The server will return the transcription results in JSON format, with the following fields:

- `segments`: an array of transcription segments, each containing:
  - `text`: the transcribed text for the segment.
  - `finished`: always true for file processing.
  - `idx`: the index of the segment.
  - `start`: the start time of the segment in seconds.
  - `end`: the end time of the segment in seconds.
curl -X POST "http://localhost:8000/asr_file" \
-F "[email protected]" \
-o result.jsonAll models are stored in the models directory
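The equivalent upload from Python, again with `requests`; the input file name is illustrative:

```python
# Mirrors the curl example above; prints each segment with its timestamps.
import requests

with open('test.wav', 'rb') as f:
    resp = requests.post('http://localhost:8000/asr_file',
                         files={'file': f}, timeout=120)
resp.raise_for_status()
for seg in resp.json()['segments']:
    print(f"[{seg['start']:.2f}s - {seg['end']:.2f}s] {seg['text']}")
```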
All models are stored in the `models` directory. Only download the models you need. The default models are:
- ASR: `sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20` (streaming, Chinese + English)
- TTS: `vits-zh-hf-theresa` (Chinese)
The `silero_vad` model is required for ASR (it provides voice activity detection):
```bash
# Run these inside the models directory.
mkdir -p silero_vad
curl -SL -o silero_vad/silero_vad.onnx https://github.com/snakers4/silero-vad/raw/master/src/silero_vad/data/silero_vad.onnx

# ASR models
curl -SL -O https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20.tar.bz2
curl -SL -O https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-streaming-paraformer-bilingual-zh-en.tar.bz2
curl -SL -O https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-paraformer-trilingual-zh-cantonese-en.tar.bz2
curl -SL -O https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-paraformer-en-2024-03-09.tar.bz2
curl -SL -O https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17.tar.bz2
curl -SL -O https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-sense-voice-zh-en-ja-ko-yue-int8-2025-09-09.tar.bz2
curl -SL -O https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-fire-red-asr-large-zh_en-2025-02-16.tar.bz2
curl -SL -O https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-whisper-tiny.en.tar.bz2

# TTS models
curl -SL -O https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-zh-hf-theresa.tar.bz2
curl -SL -O https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-melo-tts-zh_en.tar.bz2
curl -SL -O https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/kokoro-multi-lang-v1_0.tar.bz2
```
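Note: the downloaded `.tar.bz2` archives still need to be extracted (e.g. `tar xvf sherpa-onnx-paraformer-en-2024-03-09.tar.bz2`) inside the `models` directory before they can be used.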