This repository aims to share and develop the most efficient speech-to-text and text-to-speech inference solution, with a strong focus on self-hosting, cloud hosting, and on-device inference across multiple devices.
For its first release, this repository provides open-source transcription models with streaming inference support, along with:
- Hugging Face open weights for Whisper models with flexible chunk sizes (the original models are fixed at 30 s)
- High-performance TheStage AI inference engines (NVIDIA GPU): 220 tok/s on an L40S for the whisper-large-v3 model
- CoreML engines for macOS / Apple Silicon with best-in-class power consumption
- Local REST API with frontend examples using JS and Electron; see below for details
- Electron demo app built by TheStage AI (Certified by Apple): TheNotes for macOS
Demo video: thewhisper.mp4
It is optimized for low latency, low power usage, and scalable streaming transcription, making it ideal for real-time captioning, live meetings, voice interfaces, and edge deployments.
- ✨ Features
- ⚡ Quick Start
- 🛠️ Support Matrix
- 💡 Usage
- 🖥️ Build On-Device Desktop Application for Apple
- 📊 Quality Benchmarks
- 🏢 Enterprise License Summary
- 🏃 Ongoing Development
- 🙌 Acknowledgements
- Open-weights fine-tuned versions of Whisper models
- Fine-tuned models support inference with 10 s, 15 s, 20 s, and 30 s chunks
- CoreML engines for macOS and Apple Silicon, ~2W of power consumption, ~2GB RAM usage
- Optimized engines for NVIDIA GPUs through TheStage AI ElasticModels (free for small orgs)
- Streaming implementation (NVIDIA + macOS)
- Benchmarks: latency, memory, power, and ASR accuracy (OpenASR)
- Simple Python API, plus examples of deploying a macOS desktop app with Electron and ReactJS
git clone https://github.com/TheStageAI/TheWhisper.git
cd TheWhisper

# For macOS / Apple Silicon:
pip install .[apple]

# For NVIDIA GPUs:
pip install .[nvidia]

# For TheStage AI optimized engines on NVIDIA, additionally install:
pip install thestage-elastic-models[nvidia] --extra-index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple
pip install thestage
# additional dependencies
pip install flash_attn==2.8.2 --no-build-isolation

Then generate an access token on the TheStage AI Platform in your profile and execute the following command:
thestage config set --api-token <YOUR_API_TOKEN>

| Feature | whisper-large-v3 (NVIDIA) | whisper-large-v3 (Apple) | whisper-large-v3-turbo (NVIDIA) | whisper-large-v3-turbo (Apple) |
|---|---|---|---|---|
| Streaming | ❌ | ✅ | ❌ | ✅ |
| Accelerated | ✅ | ✅ | ✅ | ✅ |
| Word Timestamps | ❌ | ✅ | ❌ | ✅ |
| Multilingual | ✅ | ✅ | ✅ | ✅ |
| 10s Chunk Mode | ✅ | ✅ | ✅ | ✅ |
| 15s Chunk Mode | ✅ | ✅ | ✅ | ✅ |
| 20s Chunk Mode | ✅ | ✅ | ✅ | ✅ |
| 30s Chunk Mode | ✅ | ✅ | ✅ | ✅ |
- Supported GPUs: RTX 4090, L40s
- Operating System: Ubuntu 20.04+
- Minimum RAM: 2.5 GB (5 GB recommended for large-v3 model)
- CUDA Version: 11.8 or higher
- Driver Version: 520.0 or higher
- Python version: 3.10-3.12
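To quickly check that a machine meets these CUDA-side requirements, a short PyTorch sanity check can be used (a sketch; the expected values simply mirror the list above):

```python
# Environment sanity check against the NVIDIA requirements listed above (illustrative only).
import sys

import torch

print("Python:", sys.version.split()[0])           # expected 3.10-3.12
print("CUDA available:", torch.cuda.is_available())
print("CUDA runtime:", torch.version.cuda)         # expected 11.8 or higher
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))   # expected RTX 4090 or L40s
```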
- Supported Chipsets: M1, M1 Pro, M1 Max, M1 Ultra, M2, M2 Pro, M2 Max, M2 Ultra, M3, M3 Pro, M3 Max, M4, M4 Pro, M4 Max
- Operating System: macOS 15.0 (Sequoia) or later, iOS 18.0 or later
- Minimum RAM: 2 GB (4 GB recommended for large-v3 model)
- Python version: 3.10-3.12
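The Apple-side requirements can be checked in a similar way (a sketch; the expected values mirror the list above):

```python
# Environment sanity check against the Apple requirements listed above (illustrative only).
import platform
import sys

print("Python:", sys.version.split()[0])       # expected 3.10-3.12
print("macOS:", platform.mac_ver()[0])         # expected 15.0 or later
print("Architecture:", platform.machine())     # expected 'arm64' on Apple Silicon
```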
import torch
from thestage_speechkit.apple import ASRPipeline
model = ASRPipeline(
model='TheStageAI/thewhisper-large-v3-turbo',
# optimized model with ANNA
model_size='S',
chunk_length_s=10
)
# inference
result = model(
"path_to_your_audio.wav",
return_timestamps="word"
)
print(result["text"])

from thestage_speechkit.apple import StreamingPipeline
from thestage_speechkit.streaming import MicStream, FileStream, StdoutStream
streaming_pipe = StreamingPipeline(
model='TheStageAI/thewhisper-large-v3-turbo',
# Optimized model by ANNA
model_size='S',
# Window length
chunk_length_s=15,
platform='apple',
language='en'
)
# set the stride (step size) in seconds
mic_stream = MicStream(step_size_s=0.5)
output_stream = StdoutStream()
while True:
    chunk = mic_stream.next_chunk()
    if chunk:
        approved_text, assumption = streaming_pipe(chunk)
        output_stream.rewrite(approved_text, assumption)
    else:
        break

import torch
from thestage_speechkit.nvidia import ASRPipeline
model = ASRPipeline(
model='TheStageAI/thewhisper-large-v3-turbo',
# allowed: 10s, 15s, 20s, 30s
chunk_length_s=10,
# optimized TheStage AI engines
batch_size=32,
device='cuda'
)
# inference
result = model(
audio="path_to_your_audio.wav",
chunk_length_s=10,
generate_kwargs={'do_sample': False, 'use_cache': True}
)
print(result["text"])

import torch
from thestage_speechkit.nvidia import ASRPipeline
model = ASRPipeline(
model='TheStageAI/thewhisper-large-v3-turbo',
# allowed: 10s, 15s, 20s, 30s
chunk_length_s=10,
# optimized TheStage AI engines
mode='S',
batch_size=32,
device='cuda'
)
# inference
result = model(
"path_to_your_audio.wav",
chunk_length_s=10,
generate_kwargs={'do_sample': False, 'use_cache': True}
)
print(result["text"])

You can build a macOS desktop app with real-time transcription. A simple ReactJS application is available here: Link to React Frontend. You can also download our app built with this backend here: TheNotes for macOS.
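For a backend behind such a frontend, the local REST API can be a thin wrapper around the pipelines shown above. Below is a minimal sketch using FastAPI; FastAPI itself, the `/transcribe` route, and the upload handling are illustrative assumptions, not the repository's actual server implementation:

```python
# Minimal local transcription server sketch.
# Assumes `fastapi` and `uvicorn` are installed; route and field names are illustrative.
import tempfile

from fastapi import FastAPI, UploadFile
from thestage_speechkit.apple import ASRPipeline

app = FastAPI()

# Load the pipeline once at startup so all requests reuse it.
pipeline = ASRPipeline(
    model='TheStageAI/thewhisper-large-v3-turbo',
    model_size='S',
    chunk_length_s=10
)

@app.post("/transcribe")
async def transcribe(file: UploadFile):
    # Persist the uploaded audio to a temporary file and run the pipeline on it.
    with tempfile.NamedTemporaryFile(suffix=".wav") as tmp:
        tmp.write(await file.read())
        tmp.flush()
        result = pipeline(tmp.name, return_timestamps="word")
    return {"text": result["text"]}

# Run with, e.g.: uvicorn server:app --port 8000 (assuming this file is saved as server.py)
```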
TheWhisper is a fine-tuned Whisper model that can process audio chunks of any size up to 30 seconds; unlike the original Whisper models, it doesn't require padding audio with silence to reach 30 seconds. We benchmarked quality across chunk sizes of 10, 15, 20, and 30 seconds using the multilingual Open ASR Leaderboard benchmarks; the tables below report the mean WER for each evaluated chunk size.
| Model | Mean WER |
|---|---|
| openai/whisper-large-v3-turbo | 7.81 |
| openai/whisper-large-v3 | 7.45 |
| thewhisper-large-v3-turbo | 7.88 |
| thewhisper-large-v3 | 7.8 |
| Model | Mean WER |
|---|---|
| openai/whisper-large-v3-turbo | 7.61 |
| openai/whisper-large-v3 | 7.22 |
| thewhisper-large-v3-turbo | 7.45 |
| thewhisper-large-v3 | 7.34 |
| Model | Mean WER |
|---|---|
| openai/whisper-large-v3-turbo | 7.63 |
| openai/whisper-large-v3 | 7.29 |
| thewhisper-large-v3-turbo | 7.47 |
| thewhisper-large-v3 | 7.31 |
| Model | Mean WER |
|---|---|
| openai/whisper-large-v3-turbo | 7.61 |
| openai/whisper-large-v3 | 7.32 |
| thewhisper-large-v3-turbo | 7.45 |
| thewhisper-large-v3 | 7.28 |
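The WER figures above can be reproduced in spirit with a short evaluation loop; `jiwer`, the file list, and the lowercase normalization below are illustrative assumptions rather than the exact Open ASR Leaderboard harness:

```python
# Rough word error rate (WER) evaluation sketch (illustrative only).
import jiwer  # assumed to be installed separately: pip install jiwer

from thestage_speechkit.nvidia import ASRPipeline

model = ASRPipeline(
    model='TheStageAI/thewhisper-large-v3-turbo',
    chunk_length_s=15,
    device='cuda'
)

# In a real benchmark these would come from an ASR test set; placeholders here.
audio_paths = ["sample_0.wav", "sample_1.wav"]
references = ["first reference transcript", "second reference transcript"]

hypotheses = [model(path)["text"] for path in audio_paths]

# Lowercasing is a crude stand-in for the leaderboard's full text normalization.
wer = jiwer.wer(
    [ref.lower() for ref in references],
    [hyp.lower() for hyp in hypotheses]
)
print(f"Mean WER: {wer:.2%}")
```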
To get a commercial license for a larger number of GPUs with TheStage AI optimized engines, please contact us here: Service request
| Platform | Engine Type | Status | License |
|---|---|---|---|
| NVIDIA GPUs (CUDA) | PyTorch HF Transformers | ✅ Stable | Free |
| macOS / Apple Silicon | CoreML Engine + MLX | ✅ Stable | Free |
| NVIDIA GPUs (CUDA) | TheStage AI (Optimized) | ✅ Stable | Free ≤ 4 GPUs/year for small orgs |
- Ready-to-go containers for inference on NVIDIA GPUs with an OpenAI-compatible API
- NVIDIA Jetson support
- Timestamp support on NVIDIA GPUs
- Streaming containers for NVIDIA GPUs
- Speaker diarization, speaker identification
- Silero VAD: Used for voice activity detection in thestage_speechkit/vad.py. See @snakers4.
- OpenAI Whisper: Original Whisper model and pretrained checkpoints. See @openai.
- Hugging Face Transformers: Model, tokenizer, and inference utilities. See @transformers.
- MLX community: MLX Whisper implementation for Apple Silicon. See @mlx-explore.