Enterprise-grade secure pipeline combining automatic speech recognition (ASR) with speaker diarization. HIPAA-compliant, with a modular architecture and comprehensive security controls.
- 🔒 Enterprise Security: API key authentication, input validation, rate limiting
- 🎯 High-Quality Processing: DER ~8-20%, WER ~1-5% with robust speaker attribution
- 🩺 HIPAA Compliance: Secure file handling, audit logging, encrypted storage
- 🏗️ Modular Architecture: Clean separation into focused modules
- 🐳 Production Ready: Container-ready with security enhancements
- GPU: NVIDIA GPU with CUDA 13.0+ (8GB+ VRAM recommended)
- OS: Linux (Ubuntu 24.04+, CentOS 8+)
- Python: 3.12+
- Models: Access required for `nvidia/parakeet-tdt-1.1b` and `pyannote/speaker-diarization-community-1`
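The requirements above can be sanity-checked before installing. A minimal stdlib-only sketch (the `preflight` helper is hypothetical, not part of this project; it only checks the Python version and whether an NVIDIA driver is visible):

```python
import shutil
import sys

def preflight() -> list[str]:
    """Return a list of unmet requirements (empty list means OK)."""
    problems = []
    if sys.version_info < (3, 12):
        problems.append("Python 3.12+ required")
    if shutil.which("nvidia-smi") is None:
        problems.append("NVIDIA driver not found (nvidia-smi missing from PATH)")
    return problems

for problem in preflight():
    print("WARNING:", problem)
```

GPU VRAM and CUDA version still need to be verified manually (e.g. via `nvidia-smi`).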
```bash
git clone https://github.com/SunPCSolutions/DiarASR.git
cd DiarASR
python3 -m venv .venv
source .venv/bin/activate
pip install -r app/requirements.txt
cp .env.example .env
# Edit .env with your API keys and HuggingFace token
```
```bash
git clone https://github.com/SunPCSolutions/DiarASR.git
cd DiarASR
# Create required directories and set ownership for container user (1001:1001)
mkdir -p cache tmp
sudo chown -R 1001:1001 cache tmp
cp .env.example .env
# Edit .env with your API keys and HuggingFace token
docker-compose up --build -d
```
```python
import os

from app.app import process_audio

os.environ['HF_TOKEN'] = 'your-huggingface-token'
os.environ['API_KEYS'] = 'your-api-key'

result = process_audio(
    audio_path='audio.wav',
    diarize=True,
    min_speakers=2,
    max_speakers=4
)

for segment in result['segments']:
    print(f"{segment['speaker']}: {segment['text']}")
```
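Consecutive segments often belong to the same speaker; for readable transcripts they can be collapsed. A sketch assuming only the segment shape shown above (`speaker` and `text` keys) — the `format_transcript` helper is illustrative, not part of the project API:

```python
from itertools import groupby

def format_transcript(segments):
    """Collapse consecutive segments from the same speaker into one line each."""
    lines = []
    for speaker, group in groupby(segments, key=lambda s: s['speaker']):
        text = " ".join(s['text'].strip() for s in group)
        lines.append(f"{speaker}: {text}")
    return lines

# Hypothetical output for demonstration
demo = [
    {'speaker': 'SPEAKER_00', 'text': 'Hello there.'},
    {'speaker': 'SPEAKER_00', 'text': 'How are you?'},
    {'speaker': 'SPEAKER_01', 'text': 'Fine, thanks.'},
]
for line in format_transcript(demo):
    print(line)
```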
```bash
# Start server
export API_KEYS="your-api-key"
export HF_TOKEN="hf_xxx"
uvicorn app:app --host 0.0.0.0 --port 8003

# Make request
curl -H "X-API-Key: your-api-key" \
  -X POST "http://localhost:8003/transcribe_diarize/" \
  -F "file=@audio.wav"
```

See docs/API_PARAMETERS.md for complete API documentation.
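The same request can be made from Python without third-party dependencies by building the `multipart/form-data` body by hand and sending it with `urllib`. A stdlib sketch — the form field name `file` is an assumption (check docs/API_PARAMETERS.md for the actual field name):

```python
import uuid

def build_multipart(field_name: str, filename: str, data: bytes):
    """Build a multipart/form-data content type and body for one file field."""
    boundary = uuid.uuid4().hex
    body = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; filename="{filename}"\r\n'
        f"Content-Type: application/octet-stream\r\n\r\n"
    ).encode() + data + f"\r\n--{boundary}--\r\n".encode()
    return f"multipart/form-data; boundary={boundary}", body

content_type, body = build_multipart("file", "audio.wav", b"...wav bytes...")
# Then send it, e.g.:
#   req = urllib.request.Request(url, data=body, method="POST",
#       headers={"X-API-Key": "your-api-key", "Content-Type": content_type})
#   urllib.request.urlopen(req)
```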
Required HuggingFace Access:
- `nvidia/parakeet-tdt-1.1b` - ASR model
- `pyannote/speaker-diarization-community-1` - Diarization model

Set the `HF_TOKEN` environment variable with your HuggingFace token.
- ASR Accuracy: ~1-5% WER (Parakeet TDT-1.1B)
- Diarization Quality: ~8-20% DER (Pyannote Community-1)
- Processing Speed: ~12x real-time with GPU
- Memory Usage: <8GB VRAM
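The real-time factor above translates directly into wall-clock estimates; at ~12x real-time, an hour of audio takes roughly five minutes to process (the helper below is illustrative, and actual speed varies with hardware and settings):

```python
def estimated_processing_seconds(audio_seconds: float, rtf: float = 12.0) -> float:
    """Rough wall-clock estimate given a real-time factor (speed multiplier)."""
    return audio_seconds / rtf

print(estimated_processing_seconds(3600))  # → 300.0 seconds (~5 min) for 1 h of audio
```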
- API key authentication
- Multi-layer input validation
- Rate limiting (10 req/min)
- Encrypted temporary storage
- HIPAA-compliant processing
- Comprehensive audit logging
- `docs/API_PARAMETERS.md` - Complete API reference
- `memory-bank/systemPatterns.md` - Architecture details
- `memory-bank/techContext.md` - Technical context
MIT License - see LICENSE file for details.
Our greatest appreciation to the creators of:
- Pyannote.audio (Hervé Bredin et al.) for speaker diarization
- NVIDIA Parakeet TDT (NVIDIA NeMo team) for ASR
- FastAPI (Sebastián Ramírez) for the web framework
- PyTorch (Facebook AI Research) for deep learning
Please cite these works if used in your research.