This repository contains the official implementation of PAST: Phonetic-Acoustic Speech Tokenizer.
PAST is a unified framework that captures both phonetic and acoustic representations, outperforming previous hybrid speech tokenizers on phonetic accuracy, reconstruction fidelity, and speech language modeling performance.
Unlike prior approaches that rely on self-supervised models and vocoders, PAST uses direct phonetic supervision and a joint training scheme. We also introduce a streamable, causal variant for real-time applications.
Figure: Overview of the PAST architecture.
- [2025/07] 🔥 Initial release of code, pretrained weights, and demo page
Audio samples are available on our project demo page.
You can either install PAST directly via pip or clone the repository for development.
Create a fresh environment and install directly from GitHub:
conda create -n past_env python=3.10 -y
conda activate past_env
pip install git+https://github.com/slp-rl/PAST.git

Clone the repository and install dependencies manually:
git clone https://github.com/slp-rl/PAST.git
cd PAST
conda create -n past_env python=3.10 -y
conda activate past_env
pip install -r requirements.txt

| Model | Variant | Description |
|---|---|---|
| PAST | Full | PAST model trained on LibriSpeech + TIMIT |
| PAST-streamable | Streamable | Causal variant with 20ms look-ahead |
See the demo notebook for a full running example!
# ---------------
# load PAST model
# ---------------
import torch
from past.models.past_model import PastModel
model = PastModel.from_pretrained("PAST") # one of ['PAST', 'PAST_streamable']
# ----------------------------------------------------------------------
# Run on audio: PAST expects a batched input format [Batch, Channels, T]
# ----------------------------------------------------------------------
import torchaudio
def read_one_wav(path, target_sr):
    wav, sr = torchaudio.load(path)
    if sr != target_sr:
        wav = torchaudio.transforms.Resample(sr, target_sr)(wav)
    if wav.shape[0] == 2:
        # keep only the first channel if the file is stereo
        wav = wav[:1]
    # add a batch dimension -> [1, 1, T]
    return wav.unsqueeze(0)
wav = read_one_wav("assets/1089-134686-0004.flac", model.sample_rate).to(model.device)
with torch.no_grad():
    codes, scale = model.encode(wav)
    reconstructed = model.decode(codes, scale)

Learn how to prepare your dataset for training PAST, including alignment extraction using Wav2Vec2 and manifest generation.
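As a rough illustration of the manifest-generation step, the sketch below writes one JSON record per utterance to a JSONL file. The field names (`path`, `duration`) and the file layout are illustrative assumptions, not the official schema expected by the training code — consult the data-preparation guide for the exact format.

```python
import json

# Hypothetical manifest writer: the exact fields PAST expects are not
# shown here, so "path" and "duration" are illustrative assumptions.
def write_manifest(entries, out_path):
    """entries: iterable of dicts, one per utterance."""
    with open(out_path, "w") as f:
        for entry in entries:
            f.write(json.dumps(entry) + "\n")

entries = [
    {"path": "LibriSpeech/train-clean-100/1089/134686/1089-134686-0004.flac",
     "duration": 4.32},
]
write_manifest(entries, "train_manifest.jsonl")
```

Each line of the resulting file is an independent JSON object, which makes the manifest easy to stream and to shard during training.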
🔗 Training
Step-by-step instructions for training PAST from scratch, with explanations of key configuration parameters and recommended practices.
Evaluate the model’s performance in terms of speech reconstruction quality using SI-SNR and PESQ. The guide also outlines the other evaluation metrics (PNMI, ABX, WER) and explains how we used them in our experiments.
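For reference, SI-SNR (scale-invariant signal-to-noise ratio) can be computed directly from the target and reconstructed waveforms. The sketch below is a minimal NumPy implementation of the standard metric, not the repository's evaluation code: it zero-means both signals, projects the estimate onto the target, and measures the energy ratio between the projected component and the residual.

```python
import numpy as np

def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between a 1-D estimate and target."""
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target to isolate the "clean" component.
    s_target = (np.dot(estimate, target) / (np.dot(target, target) + eps)) * target
    e_noise = estimate - s_target
    return 10 * np.log10(np.dot(s_target, s_target) / (np.dot(e_noise, e_noise) + eps))

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
noisy = clean + 0.1 * rng.standard_normal(16000)
print(si_snr(noisy, clean))  # roughly 20 dB for 10% additive noise
```

Because the metric is scale-invariant, rescaling the reconstruction leaves the score unchanged, which is why it is preferred over plain SNR for codec evaluation.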
If you use PAST in your work, please cite:
@article{har2025past,
  title={Past: Phonetic-acoustic speech tokenizer},
  author={Har-Tuv, Nadav and Tal, Or and Adi, Yossi},
  journal={arXiv preprint arXiv:2505.14470},
  year={2025}
}
This project is released under the MIT License. See the LICENSE file for details.