GPA Logo

GPA: One Model for Speech Recognition, Text-to-Speech, and Voice Conversion

ArXiv Demo Hugging Face Interactive Demo ModelScope

TL;DR: GPA unifies three speech tasks in a single model, and this repo includes code for training, fine-tuning, and efficient deployment of GPA.

📖 Abstract

GPA stands for General Purpose Audio.

In academia, a student’s GPA (Grade Point Average) serves as a unified metric that reflects performance across diverse subjects—ranging from Calculus and Philosophy to Gym class.

Similarly, our GPA model unifies the three major pillars of audio tasks—Text-to-Speech (TTS), Automatic Speech Recognition (ASR), and Voice Conversion (VC)—into a single autoregressive transformer.

  • The open-source release supports multiple inference frameworks and provides production-ready code for cloud deployment.
  • We include concise inference examples and training pipelines for research purposes.
  • The released 0.3B model is also well suited to edge devices; edge deployment support is yet to be released.

Table of Contents

🗺️ Roadmap

| Category | Item | Status |
| --- | --- | --- |
| Core Features | Unified LLM-based audio generation & understanding | |
| | Inference Scripts (STT, TTS, VC) | |
| | Training Pipeline (DeepSpeed) | |
| | Interactive Demo | |
| | Basic Service Deployment (vLLM/FastAPI) | |
| | Paper (ArXiv) | |
| Model Releases | GPA-0.3B-preview (Edge-focused) | |
| | GPA-0.3B (Edge-focused) | |
| Edge Deployment | Android Platform | |
| | RK Series | |
| | iOS Platform | |
| Frameworks | vllm | |
| | llama-cpp | |
| | sglang | |
| | torch | |
| | mlx-lm | |
| | rknn | |

🔍 Model Overview

GPA Model Architecture
Figure 1: Architecture of the proposed GPA framework. The model utilizes a shared Large Language Model (LLM) backbone to unify three core audio tasks: Understanding (ASR), Generation (TTS), and Editing (Voice Conversion). Depending on the task, the model processes different combinations of inputs (Source Audio, Target Text, or Reference Audio) via Semantic and Acoustic modules to generate the corresponding text or audio output.
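
The caption above can be read as a task-to-input mapping, which is also how the inference scripts in the Quick Start are invoked. The sketch below is illustrative only: the dictionary and key names are descriptive placeholders rather than GPA API identifiers, while the task names (stt, tts-a, vc) match the --task values used later.

# Illustrative summary of Figure 1: inputs consumed and output produced per task.
# The names below are descriptive placeholders, not GPA identifiers.
TASK_IO = {
    "stt":   {"inputs": ["source_audio"],                    "output": "text"},   # ASR / understanding
    "tts-a": {"inputs": ["target_text", "reference_audio"],  "output": "audio"},  # TTS / generation
    "vc":    {"inputs": ["source_audio", "reference_audio"], "output": "audio"},  # VC / editing
}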

🚀 Quick Start

🧹 Environment Setup

🧩 Option A: Reproducible Setup with uv (Recommended)

⚠️ Prerequisites (Important)

The default development environment is configured for:

    • OS: Linux (x86_64)
    • GPU: NVIDIA
    • CUDA: 12.x

The provided uv.lock file was generated under this configuration.

If your system matches the above, you can use the uv-based setup for a fully reproducible environment.

If you are using:

  • CUDA 11.x (e.g. cu116)
  • CPU-only systems
  • macOS or Windows

please follow the pip-based installation described below.

We use uv for fast and reproducible Python environment management.

1. Install uv

curl -LsSf https://astral.sh/uv/install.sh | sh

# Or install via pip if you prefer:
# pip install uv

2. Sync the environment (installs all dependencies)

💡Note: If training is not required, or if building flash_attn is difficult or slow on your device, you may comment out this dependency in pyproject.toml. In that case, switch training to eager mode.

uv sync

🧩 Option B: Flexible Setup with pip (Any CUDA / CPU)

1. Create and activate a virtual environment

python -m venv .venv
source .venv/bin/activate

2. Install base dependencies

pip install -r requirements.txt
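
Regardless of the install path, a quick sanity check (assuming a CUDA build of PyTorch on GPU systems) confirms that torch imports cleanly and sees your accelerator:

# Optional sanity check after installation.
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())  # expected True on the default Linux + NVIDIA setup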

📥 Checkpoint Download

Before running inference, please download the model checkpoints from Hugging Face or ModelScope.

| Model | Hugging Face | ModelScope |
| --- | --- | --- |
| GPA-0.3B-preview | Download | Download |
| GPA-0.3B | Coming Soon | Coming Soon |
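
As one convenient option, the Hugging Face checkpoint can be fetched programmatically with huggingface_hub. The repo id below is a placeholder; substitute the actual GPA-0.3B-preview repository name from the download link above.

# Sketch: fetch checkpoints with huggingface_hub (pip install huggingface_hub).
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="<org>/GPA-0.3B-preview",        # placeholder -- use the real repo id from the table above
    local_dir="./models/GPA-0.3B-preview",   # becomes your ${GPA_MODEL_DIR}
)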

Important: After downloading the checkpoints, please verify that your model directory structure matches the hierarchy below.

${GPA_MODEL_DIR}/
├── BiCodec/
│   └── wav2vec2-large-xlsr-53/
├── glm-4-voice-tokenizer/
├── added_tokens.json
├── chat_template.jinja
├── config.json
├── generation_config.json
├── merges.txt
├── model.safetensors
├── special_tokens_map.json
├── tokenizer_config.json
├── tokenizer.json
└── vocab.json
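
A minimal check that the downloaded directory matches this hierarchy (the expected entries are taken directly from the tree above; the fallback path is only an example):

# Verify ${GPA_MODEL_DIR} against the expected layout shown above.
import os
from pathlib import Path

model_dir = Path(os.environ.get("GPA_MODEL_DIR", "./models/GPA-0.3B-preview"))  # example fallback
expected = [
    "BiCodec/wav2vec2-large-xlsr-53",
    "glm-4-voice-tokenizer",
    "added_tokens.json",
    "chat_template.jinja",
    "config.json",
    "generation_config.json",
    "merges.txt",
    "model.safetensors",
    "special_tokens_map.json",
    "tokenizer_config.json",
    "tokenizer.json",
    "vocab.json",
]
missing = [entry for entry in expected if not (model_dir / entry).exists()]
print("missing entries:", missing if missing else "none")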

💭 Inference

You can perform various tasks like Speech-to-Text, Text-to-Speech, and Voice Conversion using the provided scripts.

💡Note: Please navigate to the inference directory to ensure relative paths for audio files work correctly.

💡Note: Currently, we only support input in WAV format at a sample rate of 16 kHz.

cd scripts/inference

💡Note: To use another Python environment, replace "uv run" with the path to your Python interpreter.

Speech-to-Text (STT/ASR):

# Using uv
uv run gpa_inference.py --task stt \
    --src_audio_path "test_audio/000.wav" \
    --gpa_model_path "${GPA_MODEL_DIR}" \
    --tokenizer_path "${GPA_MODEL_DIR}/glm-4-voice-tokenizer" \
    --bicodec_tokenizer_path "${GPA_MODEL_DIR}/BiCodec" \
    --text_tokenizer_path "${GPA_MODEL_DIR}"

# Or using python
python gpa_inference.py --task stt \
    --src_audio_path "test_audio/000.wav" \
    --gpa_model_path "${GPA_MODEL_DIR}" \
    --tokenizer_path "${GPA_MODEL_DIR}/glm-4-voice-tokenizer" \
    --bicodec_tokenizer_path "${GPA_MODEL_DIR}/BiCodec" \
    --text_tokenizer_path "${GPA_MODEL_DIR}"

Text-to-Speech (TTS):

# Using uv
uv run gpa_inference.py --task tts-a \
    --text "Hello world, this is Major Tom speaking." \
    --ref_audio_path "test_audio/astro.wav" \
    --gpa_model_path "${GPA_MODEL_DIR}" \
    --tokenizer_path "${GPA_MODEL_DIR}/glm-4-voice-tokenizer" \
    --bicodec_tokenizer_path "${GPA_MODEL_DIR}/BiCodec" \
    --text_tokenizer_path "${GPA_MODEL_DIR}"

# Or using python
python gpa_inference.py --task tts-a \
    --text "Hello world, this is Major Tom speaking." \
    --ref_audio_path "test_audio/astro.wav" \
    --gpa_model_path "${GPA_MODEL_DIR}" \
    --tokenizer_path "${GPA_MODEL_DIR}/glm-4-voice-tokenizer" \
    --bicodec_tokenizer_path "${GPA_MODEL_DIR}/BiCodec" \
    --text_tokenizer_path "${GPA_MODEL_DIR}"

Voice Conversion (VC):

# Using uv
uv run gpa_inference.py --task vc \
    --src_audio_path "test_audio/vc_src.wav" \
    --ref_audio_path "test_audio/astro.wav" \
    --gpa_model_path "${GPA_MODEL_DIR}" \
    --tokenizer_path "${GPA_MODEL_DIR}/glm-4-voice-tokenizer" \
    --bicodec_tokenizer_path "${GPA_MODEL_DIR}/BiCodec" \
    --text_tokenizer_path "${GPA_MODEL_DIR}"

# Or using python
python gpa_inference.py --task vc \
    --src_audio_path "test_audio/vc_src.wav" \
    --ref_audio_path "test_audio/astro.wav" \
    --gpa_model_path "${GPA_MODEL_DIR}" \
    --tokenizer_path "${GPA_MODEL_DIR}/glm-4-voice-tokenizer" \
    --bicodec_tokenizer_path "${GPA_MODEL_DIR}/BiCodec" \
    --text_tokenizer_path "${GPA_MODEL_DIR}"

For more details on inference arguments, check out the Inference README.

🏋️ Training

🏃 Run Training

We provide a training script to help you get started. A small sample dataset is included in the repository to quickly verify that the pipeline works as expected:

  • scripts/train/merged_shuffled_train.jsonl
  • scripts/train/dataset

💡Note: Before running, be sure to update the paths in train_gpa.sh.

# Run the training script (uses DeepSpeed)
cd scripts/train
bash train_gpa.sh

The training script automatically handles environment activation via uv run.

📚 Use Your Own Dataset!

Building your own dataset is as simple as following the format of the provided .jsonl example (scripts/train/merged_shuffled_train.jsonl) and pointing the training script at your prepared data; a minimal format check is sketched below.
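
Because the training data is plain JSONL (one JSON object per line), a quick structural check of a custom file can look like the sketch below. It only verifies that every line parses as JSON; for the GPA-specific fields, use the provided example file as the schema reference.

# Sanity-check a custom JSONL dataset: every non-empty line should be valid JSON.
import json

path = "scripts/train/merged_shuffled_train.jsonl"   # replace with your own .jsonl file
with open(path, "r", encoding="utf-8") as f:
    rows = [json.loads(line) for line in f if line.strip()]

print(f"{len(rows)} samples; keys of first sample: {sorted(rows[0].keys())}")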

🛠️ Deployment

We provide a complete set of scripts for service deployment, including a FastAPI-based backend server, a Gradio-based GUI, and basic testing scripts.

⚠️ Caution: The current vLLM-based deployment may exhibit occasional audio quality degradation under large-scale concurrent workloads. For reliable evaluation and quality validation, we recommend using the basic PyTorch inference implementation provided in the inference module.

1. vLLM Deployment (Docker Recommended)

The core service is built with FastAPI. We utilize a Dockerfile to build the runtime environment, ensuring consistency and ease of deployment.

  1. Ensure you have Docker and Docker Compose installed.

  2. Set the required environment variables (e.g., in a .env file or export them).

    Please configure GPA_CODE_ROOT and GPA_MODEL_DIR. For model preparation, refer to Checkpoint Download.

    export GPA_CODE_ROOT="/absolute/path/to/this/repo"
    export GPA_MODEL_DIR="/absolute/path/to/models"
  3. Run with Docker Compose:

    cd scripts/server
    docker compose up -d --build
  4. Test: You can use the provided client script to verify that the service is working correctly.

    # Run the test client
    python test_client.py

2. Start the Gradio GUI

We provide a user-friendly web interface for interacting with the API.

💡Note: The GUI uses the original PyTorch deployment instead of vLLM.

# Install Gradio if not already installed
pip install gradio

# Start the GUI app
cd scripts/server
python gui_app.py

The GUI will be available at http://localhost:7868.

⚡ Model Performance

The following results are obtained by benchmarking services instantiated via the official deployment scripts, reflecting end-to-end performance in realistic serving scenarios rather than offline inference.

Among currently available open-source systems, our model is one of the few that natively supports both concurrent and streaming inference, while achieving performance comparable to the first tier of existing approaches.

💡Note

  • TTFC: Time To First Chunk (TTS)
  • TTFT: Time To First Token (ASR)
  • RTF: Real-Time Factor (synthesis time / audio duration; values below 1 mean faster than real time)

TTS Streaming Benchmark (Latency & Throughput)

| Concurrency | Avg TTFC (ms) | P50 TTFC (ms) | P99 TTFC (ms) | Avg RTF | P50 RTF | P99 RTF | Audio Dur (s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 258.8 | 258.8 | 258.8 | 0.197 | 0.197 | 0.197 | 6.44 |
| 5 | 385.0 | 394.7 | 396.2 | 0.218 | 0.217 | 0.248 | 6.76 |
| 10 | 544.6 | 564.2 | 566.7 | 0.282 | 0.301 | 0.313 | 6.49 |
| 20 | 977.8 | 977.9 | 982.9 | 0.470 | 0.490 | 0.538 | 7.19 |
| 40 | 1797.0 | 1736.4 | 2564.5 | 0.421 | 0.400 | 0.587 | 6.33 |
| 80 | 3786.4 | 4054.4 | 5415.8 | 0.763 | 0.763 | 1.096 | 6.32 |
| 160 | 9847.9 | 10239.9 | 14350.3 | 1.718 | 1.740 | 2.577 | 6.44 |

Table 2. TTS Streaming RTF and Audio Duration

ASR Streaming Benchmark

| Concurrency | Avg TTFT (ms) | P50 TTFT (ms) | P99 TTFT (ms) | Avg Total (ms) |
| --- | --- | --- | --- | --- |
| 1 | 157.5 | 157.5 | 157.5 | 190.9 |
| 5 | 394.1 | 393.7 | 395.9 | 400.0 |
| 10 | 589.6 | 721.3 | 723.3 | 598.1 |
| 20 | 1316.3 | 1495.6 | 1500.4 | 1317.8 |
| 40 | 2690.9 | 2678.3 | 2861.4 | 2693.7 |
| 80 | 3833.4 | 3961.3 | 4027.0 | 3845.1 |
| 160 | 5037.0 | 5689.3 | 6676.0 | 5044.0 |

Table 3. ASR Streaming Latency vs Concurrency

📊 Evaluation Metric Results

TTS Evaluation Table

| Model | Open-Source | Model Size | test-zh CER (%) ↓ | test-zh Sim (%) ↑ | test-en WER (%) ↓ | test-en Sim (%) ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| **Multi-Stage or NAR Methods** | | | | | | |
| Human | - | - | 1.26 | 75.5 | 2.14 | 73.4 |
| Seed-TTS | | - | 1.12 | 79.6 | 2.25 | 76.2 |
| MiniMax-Speech | | - | 0.83 | 78.3 | 1.65 | 69.2 |
| F5-TTS | | 0.3B | 1.52 | 74.1 | 2.00 | 64.7 |
| CosyVoice2 | | 0.5B | 1.45 | 75.7 | 2.57 | 65.9 |
| FireRedTTS2 | | 1.5B | 1.14 | 73.2 | 1.95 | 66.5 |
| Index-TTS2 | | 1.5B | 1.03 | 76.5 | 2.23 | 70.6 |
| VibeVoice-1.5B | | 1.5B | 1.16 | 74.4 | 3.04 | 68.9 |
| VibeVoice-Realtime | | 0.5B | - | - | 2.05 | 63.3 |
| HiggsAudio-v2 | | 3B | 1.50 | 74.0 | 2.44 | 67.7 |
| VoxCPM | | 0.5B | 0.93 | 77.2 | 1.85 | 72.9 |
| GLM-TTS | | 1.5B | 1.03 | 76.1 | - | - |
| GLM-TTS RL | | 1.5B | 0.89 | 76.4 | - | - |
| Fun-CosyVoice3-0.5B-2512 | | 0.5B | 1.21 | 78.0 | 2.24 | 71.8 |
| Fun-CosyVoice3-0.5B-2512_RL | | 0.5B | 0.81 | 77.4 | 1.68 | 69.5 |
| **One-Stage AR Methods** | | | | | | |
| Spark TTS | | 0.5B | 1.20 | 66.0 | 1.98 | 57.3 |
| GPA-0.3B-preview | | 0.3B | 0.95 | 65.9 | 1.51 | 56.5 |

ASR Evaluation Table

Note: ASR results on Librispeech and Aishell-1. WER (%) is reported for Librispeech, and CER (%) is reported for Aishell-1.

| Model | Model Size | Librispeech test-clean (WER %) | Aishell-1 (CER %) |
| --- | --- | --- | --- |
| **Models with < 0.5B parameters** | | | |
| Whisper-S | 0.24B | 3.13 | - |
| GPA-0.3B-preview | 0.3B | 8.88 | 4.50 |
| **Models with > 0.5B parameters** | | | |
| Fun-ASR-nano | 0.8B | 1.76 | 1.80 |
| FireRed-ASR | 1.1B | 1.84 | 0.54 |
| GLM-ASR-nano | 1.5B | 2.00 | 1.81 |
| GLM-ASR-nano* | 1.5B | 2.17 | 2.17 |
| Whisper-L | 1.55B | 1.82 | 4.72 |
| Kimi-Audio | - | 1.32 | 0.71 |
| Step-Audio2 | - | 1.17 | 0.63 |
| Seed-ASR | - | 1.58 | 0.68 |
| Seed-ASR* | - | 2.80 | 1.63 |
| Fun-ASR | 7.7B | 1.51 | 1.22 |

🙏 Acknowledgements

We borrowed a lot of code from the following excellent projects:

🔗 Citation

If you find GPA useful for your research or projects, please cite us:

@misc{gpa2026,
  title={Unifying Speech Recognition, Synthesis and Conversion with Autoregressive Transformer},
  author={Runyuan Cai and Yu Lin and Yiming Wang and Chunlin Fu and Xiaodong Zeng},
  year={2026},
  howpublished={\url{https://github.com/AutoArk/GPA}},
}
