TheWhisper: High-Performance Speech-to-Text

License: MIT


🚀 Overview

This repository aims to share and develop the most efficient speech-to-text and text-to-speech inference solution, with a strong focus on self-hosting, cloud hosting, and on-device inference across multiple devices.

For the first release, this repository provides open-source transcription models with streaming inference support, including:

  • Hugging Face open weights for Whisper models with flexible chunk sizes (the original models require 30s chunks)
  • High-performance TheStage AI inference engines for NVIDIA GPUs: 220 tok/s on an L40S for the whisper-large-v3 model
  • CoreML engines for macOS / Apple Silicon with the world's lowest power consumption on macOS
  • Local REST API with frontend examples using JS and Electron (see the frontend examples for details)
  • Electron demo app built by TheStage AI (certified by Apple): TheNotes for macOS
Demo video: thewhisper.mp4

It is optimized for low latency, low power usage, and scalable streaming transcription, making it ideal for real-time captioning, live meetings, voice interfaces, and edge deployments.


✨ Features

  • Open-weight, fine-tuned versions of Whisper models
  • Fine-tuned models support inference with 10s, 15s, 20s, and 30s chunks
  • CoreML engines for macOS and Apple Silicon: ~2W power consumption, ~2GB RAM usage
  • Optimized engines for NVIDIA GPUs through TheStage AI ElasticModels (free for small orgs)
  • Streaming implementation (NVIDIA + macOS)
  • Benchmarks: latency, memory, power, and ASR accuracy (OpenASR)
  • Simple Python API; deployment examples for a macOS desktop app with Electron and ReactJS
Benchmark charts: Apple M2 and NVIDIA L40S

📦 Quick start

Clone the repository

git clone https://github.com/TheStageAI/TheWhisper.git
cd TheWhisper

Install for Apple

pip install .[apple]

Install for Nvidia

pip install .[nvidia]

Install for Nvidia with TheStage AI optimized engines

pip install .[nvidia]
pip install thestage-elastic-models[nvidia] --extra-index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple
pip install thestage
# additional dependencies
pip install flash_attn==2.8.2 --no-build-isolation

Then generate an access token in your profile on the TheStage AI Platform and run:

thestage config set --api-token <YOUR_API_TOKEN>

🏗️ Support Matrix and System Requirements

| Feature | whisper-large-v3 (Nvidia) | whisper-large-v3 (Apple) | whisper-large-v3-turbo (Nvidia) | whisper-large-v3-turbo (Apple) |
| --- | --- | --- | --- | --- |
| Streaming | ✅ | ✅ | ✅ | ✅ |
| Accelerated | ✅ | ✅ | ✅ | ✅ |
| Word Timestamps | ❌ | ✅ | ❌ | ✅ |
| Multilingual | ✅ | ✅ | ✅ | ✅ |
| 10s Chunk Mode | ✅ | ✅ | ✅ | ✅ |
| 15s Chunk Mode | ✅ | ✅ | ✅ | ✅ |
| 20s Chunk Mode | ✅ | ✅ | ✅ | ✅ |
| 30s Chunk Mode | ✅ | ✅ | ✅ | ✅ |

Nvidia GPU Requirements

  • Supported GPUs: RTX 4090, L40s
  • Operating System: Ubuntu 20.04+
  • Minimum RAM: 2.5 GB (5 GB recommended for large-v3 model)
  • CUDA Version: 11.8 or higher
  • Driver Version: 520.0 or higher
  • Python version: 3.10-3.12
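A quick way to sanity-check the CUDA setup against the requirements above is to query PyTorch directly (a minimal sketch):

import torch

# Verify that a supported GPU, CUDA runtime, and driver are visible to PyTorch.
print("CUDA available:", torch.cuda.is_available())
print("CUDA version:", torch.version.cuda)
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))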

Apple Silicon Requirements

  • Supported Chipsets: M1, M1 Pro, M1 Max, M1 Ultra, M2, M2 Pro, M2 Max, M2 Ultra, M3, M3 Pro, M3 Max, M4, M4 Pro, M4 Max
  • Operating System: macOS 15.0 (Sequoia) or later, iOS 18.0 or later
  • Minimum RAM: 2 GB (4 GB recommended for large-v3 model)
  • Python version: 3.10-3.12

▶️ Usage and Deployment

Apple Usage

from thestage_speechkit.apple import ASRPipeline

model = ASRPipeline(
    model='TheStageAI/thewhisper-large-v3-turbo',
    # optimized model with ANNA
    model_size='S',
    chunk_length_s=10
)

# inference
result = model(
    "path_to_your_audio.wav", 
    return_timestamps="word"
)

print(result["text"])
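If the pipeline mirrors the Hugging Face output schema (an assumption; this README doesn't spell out the return format), the word timestamps requested via return_timestamps="word" can be read from the chunks field:

# Assumes a Hugging Face-style output schema; adjust to the actual return format.
for chunk in result.get("chunks", []):
    start, end = chunk["timestamp"]
    print(f"{start:.2f}-{end:.2f}: {chunk['text']}")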

Apple Usage with Streaming

from thestage_speechkit.apple import StreamingPipeline
from thestage_speechkit.streaming import MicStream, FileStream, StdoutStream

streaming_pipe = StreamingPipeline(
    model='TheStageAI/thewhisper-large-v3-turbo',
    # Optimized model by ANNA
    model_size='S',
    # Window length
    chunk_length_s=15,
    platform='apple',
    language='en'
)

# set stride in seconds
mic_stream = MicStream(step_size_s=0.5)
output_stream = StdoutStream()

while True:
    chunk = mic_stream.next_chunk()
    if chunk:
        approved_text, assumption = streaming_pipe(chunk)
        output_stream.rewrite(approved_text, assumption)
    else:
        break
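FileStream is imported above but not shown; here is a sketch of the same loop driven by an audio file instead of the microphone, assuming FileStream mirrors MicStream's constructor and next_chunk() interface:

# Hypothetical usage: assumes FileStream takes a path and the same step size.
file_stream = FileStream("path_to_your_audio.wav", step_size_s=0.5)
output_stream = StdoutStream()

while True:
    chunk = file_stream.next_chunk()
    if chunk:
        approved_text, assumption = streaming_pipe(chunk)
        output_stream.rewrite(approved_text, assumption)
    else:
        break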

Nvidia Usage (Hugging Face Transformers)

from thestage_speechkit.nvidia import ASRPipeline

model = ASRPipeline(
    model='TheStageAI/thewhisper-large-v3-turbo',
    # allowed: 10s, 15s, 20s, 30s
    chunk_length_s=10,
    # batch size for chunked inference
    batch_size=32,
    device='cuda'
)

# inference
result = model(
    audio="path_to_your_audio.wav", 
    chunk_length_s=10,
    generate_kwargs={'do_sample': False, 'use_cache': True}
)

print(result["text"])
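generate_kwargs are forwarded to Whisper generation. If the pipeline follows the standard Hugging Face Whisper interface (an assumption), the language and task can be pinned the same way:

# Assumes standard Hugging Face Whisper generate_kwargs are accepted.
result = model(
    audio="path_to_your_audio.wav",
    chunk_length_s=10,
    generate_kwargs={
        'do_sample': False,
        'use_cache': True,
        'language': 'en',      # force the transcription language
        'task': 'transcribe'   # or 'translate' for X-to-English output
    }
)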

Nvidia Usage (TheStage AI engines)

from thestage_speechkit.nvidia import ASRPipeline

model = ASRPipeline(
    model='TheStageAI/thewhisper-large-v3-turbo',
    # allowed: 10s, 15s, 20s, 30s
    chunk_length_s=10,
    # optimized TheStage AI engines
    mode='S',
    batch_size=32,
    device='cuda'
)

# inference
result = model(
    "path_to_your_audio.wav", 
    chunk_length_s=10,
    generate_kwargs={'do_sample': False, 'use_cache': True}
)

print(result["text"])
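The pipeline object is reusable, so batch jobs can amortize engine initialization across many files. A minimal sketch (the file list is illustrative):

# Reuse one pipeline across files to avoid re-initializing the engine.
audio_files = ["meeting_01.wav", "meeting_02.wav"]  # illustrative paths
for path in audio_files:
    result = model(
        path,
        chunk_length_s=10,
        generate_kwargs={'do_sample': False, 'use_cache': True}
    )
    print(path, "->", result["text"])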

💻 Build On-Device Desktop Application for Apple

You can build a macOS desktop app with real-time transcription. A simple ReactJS application is provided (see the React frontend example in this repository). You can also download our app built on this backend: TheNotes for macOS.
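The Electron frontend talks to a local REST backend. The sketch below shows one way such a backend could look with FastAPI; it is illustrative only, and the actual REST API shipped with the examples may use different routes and payloads:

# Hypothetical local backend sketch; the repository's real REST API may differ.
from fastapi import FastAPI, UploadFile
from thestage_speechkit.apple import ASRPipeline

app = FastAPI()
model = ASRPipeline(
    model='TheStageAI/thewhisper-large-v3-turbo',
    model_size='S',
    chunk_length_s=10
)

@app.post("/transcribe")
async def transcribe(file: UploadFile):
    # Persist the upload, then run the on-device pipeline on it.
    path = f"/tmp/{file.filename}"
    with open(path, "wb") as f:
        f.write(await file.read())
    result = model(path, return_timestamps="word")
    return {"text": result["text"]}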


📊 Quality Benchmarks

TheWhisper is a fine-tuned Whisper model that can process audio chunks of any size up to 30 seconds. Unlike the original Whisper models, it doesn't require padding audio with silence to reach 30 seconds. We conducted quality benchmarking across chunk sizes of 10, 15, 20, and 30 seconds, using the multilingual Open ASR Leaderboard benchmarks.
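Mean WER in the tables below is the standard word error rate averaged across the leaderboard's test sets. A minimal sketch of how per-dataset WER is typically computed with the jiwer library (not necessarily the exact evaluation code used here):

# Illustrative WER computation with jiwer; the leaderboard's own
# text normalization and scoring pipeline may differ.
import jiwer

references = ["the quick brown fox", "hello world"]
hypotheses = ["the quick brown fox", "hello word"]

wer = jiwer.wer(references, hypotheses)
print(f"Mean WER: {wer:.2%}")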

WER comparison charts: vanilla Whisper vs. TheStage AI Whisper

10s chunks

| Model | Mean WER |
| --- | --- |
| openai/whisper-large-v3-turbo | 7.81 |
| openai/whisper-large-v3 | 7.45 |
| thewhisper-large-v3-turbo | 7.88 |
| thewhisper-large-v3 | 7.80 |

15s chunks

| Model | Mean WER |
| --- | --- |
| openai/whisper-large-v3-turbo | 7.61 |
| openai/whisper-large-v3 | 7.22 |
| thewhisper-large-v3-turbo | 7.45 |
| thewhisper-large-v3 | 7.34 |

20s chunks

| Model | Mean WER |
| --- | --- |
| openai/whisper-large-v3-turbo | 7.63 |
| openai/whisper-large-v3 | 7.29 |
| thewhisper-large-v3-turbo | 7.47 |
| thewhisper-large-v3 | 7.31 |

30s chunks

| Model | Mean WER |
| --- | --- |
| openai/whisper-large-v3-turbo | 7.61 |
| openai/whisper-large-v3 | 7.32 |
| thewhisper-large-v3-turbo | 7.45 |
| thewhisper-large-v3 | 7.28 |

🏢 Enterprise License Summary

To obtain a commercial license for a larger number of GPUs with TheStage AI optimized engines, please contact us here: Service request

| Platform | Engine Type | Status | License |
| --- | --- | --- | --- |
| NVIDIA GPUs (CUDA) | PyTorch HF Transformers | ✅ Stable | Free |
| macOS / Apple Silicon | CoreML Engine + MLX | ✅ Stable | Free |
| NVIDIA GPUs (CUDA) | TheStage AI (Optimized) | ✅ Stable | Free ≤ 4 GPUs/year for small orgs |

🏃 Ongoing development

  • Ready-to-go containers for inference on Nvidia GPUs with an OpenAI-compatible API
  • Nvidia Jetson support
  • Timestamp support on Nvidia
  • Streaming containers for Nvidia
  • Speaker diarization and speaker identification

🙌 Acknowledgements

  • Silero VAD: Used for voice activity detection in thestage_speechkit/vad.py. See @snakers4.
  • OpenAI Whisper: Original Whisper model and pretrained checkpoints. See @openai.
  • Hugging Face Transformers: Model, tokenizer, and inference utilities. See @transformers.
  • MLX community: MLX Whisper implementation for Apple Silicon. See @ml-explore.