A cross-platform desktop application for Text-to-Speech using Qwen3-TTS models with voice cloning, custom voices, and voice design capabilities.
- Voice Cloning: Clone any voice from a reference audio sample
- TTS Custom Voice: Generate speech using preset voice models
- Voice Design: Create custom voices by adjusting age, gender, accent, and emotion
- Auto Model Download: Automatically downloads Qwen3-TTS models from HuggingFace
- Cross-Platform: Works on Windows, macOS, and Linux
- Dark/Light Mode: Comfortable interface for any lighting condition
- Frontend: Electron + React + TypeScript + Tailwind CSS
- Backend: Python + FastAPI + PyTorch
- Models: Qwen3-TTS-1.7B from HuggingFace
- Node.js 18+ and npm
- Python 3.9+
- Git
- 8GB+ RAM recommended
- 10GB+ disk space for models
- GPU optional but recommended (CUDA-compatible)
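The prerequisites above can be partly checked from a script before installing anything. The following is a minimal sketch using only the Python standard library; the function name `check_prerequisites` and the thresholds as constants are illustrative assumptions, not part of the project. It only confirms that `node`, `npm`, and `git` are on the PATH (it does not verify the Node.js 18+ version) and it does not check RAM or GPU.

```python
import shutil
import sys

MIN_PYTHON = (3, 9)   # from the prerequisites list
MIN_FREE_GB = 10      # disk space suggested for the models


def check_prerequisites(path="."):
    """Return a dict mapping each check name to True (ok) or False (missing)."""
    free_gb = shutil.disk_usage(path).free / 1024**3
    return {
        "python": sys.version_info >= MIN_PYTHON,
        # shutil.which only tells us the tool is on PATH, not its version
        "node": shutil.which("node") is not None,
        "npm": shutil.which("npm") is not None,
        "git": shutil.which("git") is not None,
        "disk": free_gb >= MIN_FREE_GB,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```

Run it from the drive where the models will be stored so the disk check measures the right volume.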
cd c:\Development\GemTTS
cd backend
# Create virtual environment
python -m venv venv
# Activate virtual environment
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
# source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
cd ../frontend
# Install dependencies
npm install
Terminal 1 - Backend:
cd backend
venv\Scripts\activate # On Windows
python main.py
Terminal 2 - Frontend:
cd frontend
npm run electron:dev
cd frontend
npm run electron:dev
(This will automatically start the Python backend)
cd frontend
npm run electron:build:win # For Windows
npm run electron:build:mac # For macOS
npm run electron:build:linux # For Linux
The built application will be in frontend/dist-electron/
- First Launch: The application will prompt you to download the Qwen3-TTS models. This is a one-time setup that may take 10-30 minutes depending on your internet connection.
- Voice Cloning Tab:
  - Upload a reference audio file (WAV, MP3, etc.)
  - Enter the text you want to speak
  - Adjust similarity and speed parameters
  - Click "Generate Voice"
- TTS Custom Voice Tab:
  - Select a preset voice from the dropdown
  - Enter your text
  - Adjust speed and pitch
  - Click "Generate Speech"
- Voice Design Tab:
  - Enter your text
  - Adjust age, gender, accent, and emotion sliders
  - Save/load presets for later use
  - Click "Generate Voice"
GemTTS/
├── backend/
│ ├── main.py # FastAPI server
│ ├── model_manager.py # Model download and management
│ ├── tts_processor.py # TTS inference logic
│ ├── requirements.txt # Python dependencies
│ ├── models/ # Downloaded models (auto-created)
│ ├── uploads/ # Uploaded audio files (auto-created)
│ └── outputs/ # Generated audio (auto-created)
├── frontend/
│ ├── electron/
│ │ ├── main.js # Electron main process
│ │ └── preload.js # Electron preload script
│ ├── src/
│ │ ├── components/ # React components
│ │ │ ├── VoiceCloning.tsx
│ │ │ ├── TTSCustomVoice.tsx
│ │ │ ├── VoiceDesign.tsx
│ │ │ └── ModelsStatus.tsx
│ │ ├── api/
│ │ │ └── apiService.ts # API client
│ │ ├── App.tsx # Main app component
│ │ ├── App.css # Styles
│ │ └── main.tsx # Entry point
│ ├── package.json
│ ├── vite.config.ts
│ └── tsconfig.json
├── README.md
└── LICENSE
Edit backend/main.py to change:
- API host/port (default: 127.0.0.1:8000)
- Model paths
- Output settings
Edit frontend/src/api/apiService.ts to change:
- API endpoint URL
- Request timeouts
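One way to avoid editing `backend/main.py` for every host or port change is to read the values from environment variables and fall back to the documented default of 127.0.0.1:8000. The sketch below assumes this approach; the variable names `GEMTTS_HOST` and `GEMTTS_PORT` and the helper `resolve_bind_address` are hypothetical and do not exist in the project.

```python
import os

DEFAULT_HOST = "127.0.0.1"  # documented default host
DEFAULT_PORT = 8000         # documented default port


def resolve_bind_address(env=None):
    """Return (host, port), preferring environment overrides over defaults.

    GEMTTS_HOST and GEMTTS_PORT are hypothetical names used for illustration.
    """
    env = os.environ if env is None else env
    host = env.get("GEMTTS_HOST", DEFAULT_HOST)
    port = int(env.get("GEMTTS_PORT", DEFAULT_PORT))
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return host, port


if __name__ == "__main__":
    host, port = resolve_bind_address()
    print(f"Backend would listen on http://{host}:{port}")
```

If `main.py` starts the server with `uvicorn.run`, the returned pair could be passed as its `host` and `port` arguments. Whatever address the backend uses, the API endpoint URL in `frontend/src/api/apiService.ts` has to match it.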
- Check your internet connection
- Ensure you have enough disk space (10GB+)
- Check HuggingFace is accessible from your network
- Ensure models are fully downloaded
- Check Python backend logs in the terminal
- Verify your system has enough RAM (8GB+ recommended)
- Install CUDA toolkit (11.8+)
- Install PyTorch with CUDA support:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
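After installing the CUDA build of PyTorch, it can help to confirm that PyTorch actually sees the GPU. The following is a small sketch, run inside the backend's virtual environment; the function name `describe_torch_device` is illustrative, and the script falls back gracefully when PyTorch is not installed.

```python
def describe_torch_device():
    """Return a short description of the device PyTorch would use."""
    try:
        import torch  # imported lazily so the script works without PyTorch
    except ImportError:
        return "PyTorch is not installed"
    if torch.cuda.is_available():
        # Report the name of the first visible CUDA device
        return f"CUDA available: {torch.cuda.get_device_name(0)}"
    return "PyTorch installed, running on CPU"


if __name__ == "__main__":
    print(describe_torch_device())
```

If this reports CPU on a machine with an NVIDIA GPU, the installed wheel is probably the CPU-only build, and reinstalling with the `cu118` index URL above is the usual fix.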
- Ensure Python backend is running first
- Check that port 8000 is not in use
- Look for errors in the terminal
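To check whether something is already listening on port 8000, a quick probe with the standard `socket` module is usually enough. This is a minimal sketch; the helper name `port_in_use` is an assumption for illustration, and it only tests TCP connectivity on the local host.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    if port_in_use(8000):
        print("Port 8000 is in use (the backend may already be running)")
    else:
        print("Port 8000 is free")
```

A result of "in use" before the backend has been started means another process holds the port; stop that process or change the backend port.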
- Backend: Add new endpoints in main.py and processing logic in tts_processor.py
- Frontend: Create new components in src/components/ and wire them up in App.tsx
# Backend
cd backend
pytest
# Frontend
cd frontend
npm test
MIT License - see LICENSE file for details
- Qwen3-TTS by Alibaba Cloud
- Built with Electron, React, FastAPI, and PyTorch
For issues and questions, please open an issue on GitHub.
- Batch processing multiple texts
- Voice preset library with community voices
- Real-time voice morphing
- SSML support for advanced text markup
- Multi-language support
- Voice fine-tuning interface
- Audio effects and post-processing
- Export to multiple formats