This is my first truly vibe-coded project, so use at your own risk! I have to say Opus 4.5 has the sauce.
FiftyOne is a very neat tool for visualising all sorts of datasets. Here, for example, we have the ESC-50 dataset of environmental sounds (helicopters, dogs, etc.). Besides being able to easily view the spectrograms of our audio files, we can incorporate embeddings and similarity search directly into our filtering workflows. To demonstrate this I quickly trained an MAE model based on AudioMAE++ (https://arxiv.org/abs/2507.10464) to generate embeddings for this project.
You can simply select a sound file of interest and, like magic, instantly find the files with similar acoustic structure and texture. WOW!
I simulated the sound files being collected across Dartmoor, a national park near where I grew up, to show how geospatial analysis could be used to find trends.
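A minimal sketch of how simulated recording locations like these could be generated: draw a latitude and longitude uniformly inside a bounding box and attach one pair to each clip. The bounding box is only a rough approximation of Dartmoor's extent, and `simulate_locations` is a hypothetical helper, not the code used in this project.

```python
import random

# Rough bounding box around Dartmoor (approximate, for illustration only).
LAT_RANGE = (50.40, 50.73)
LON_RANGE = (-4.14, -3.68)


def simulate_locations(filenames, seed=0):
    """Assign a random (lat, lon) inside the bounding box to each file."""
    rng = random.Random(seed)  # seeded so the simulation is reproducible
    return {
        name: (rng.uniform(*LAT_RANGE), rng.uniform(*LON_RANGE))
        for name in filenames
    }


locations = simulate_locations(["dog.wav", "helicopter.wav", "rooster.wav"])
for name, (lat, lon) in locations.items():
    print(f"{name}: lat={lat:.4f}, lon={lon:.4f}")
```

Uniform sampling ignores terrain and the real park boundary, which is fine for a demo but would need replacing for any actual field study.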
We can use the embeddings to justify intuitions we have about relationships between sounds. For example, in this plot we can see that clapping and a helicopter are more similar to each other than either is to a rooster or a chicken.
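To make "similar" concrete, here is a minimal sketch of the kind of ranking a similarity search performs: score every other clip by cosine similarity to the query embedding and keep the top `k`. The vectors and clip ids are toy values invented for illustration, and `most_similar` is a hypothetical helper; FiftyOne does this for you on the real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query_id, embeddings, k=2):
    """Return the k (id, score) pairs closest to the query embedding."""
    query = embeddings[query_id]
    scores = [
        (other_id, cosine_similarity(query, vec))
        for other_id, vec in embeddings.items()
        if other_id != query_id
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


# Toy 3-D vectors, invented for illustration; the real embeddings are 768-D.
toy_embeddings = {
    "clip_a": [1.0, 0.0, 0.0],
    "clip_b": [0.9, 0.1, 0.0],
    "clip_c": [0.0, 1.0, 0.0],
    "clip_d": [0.0, 0.0, 1.0],
}

for clip_id, score in most_similar("clip_a", toy_embeddings):
    print(f"{clip_id}: {score:.3f}")
```

Comparing two classes works the same way if each class is first reduced to the mean of its clip embeddings.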

A modular audio/signal machine learning framework with a plugin-based architecture for training masked autoencoders. It supports ESC-50 environmental sound classification and can be extended to RF signals (RadioML).
- Plugin Registry System: Decorator-based registration for models, data loaders, and transforms
- Self-Contained Notebooks: Generate portable notebooks for Google Colab
- Automatic Test Generation: Verify plugin interface compliance
- FiftyOne Integration: Visualize embeddings with similarity search and UMAP/t-SNE
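For readers new to the pattern, the decorator-based registration in the first bullet can be sketched as follows. This is an illustrative stand-in, not the actual contents of `src/registry.py`: the method names `register`, `create`, and `list` mirror how the registries are used in this README, while `ToyModel` and the `plugin_version` attribute are assumptions made for the example.

```python
class PluginRegistry:
    """Maps string keys to classes; classes opt in via a decorator."""

    def __init__(self, name):
        self.name = name
        self._plugins = {}

    def register(self, key, version="1.0"):
        """Return a decorator that stores the decorated class under `key`."""
        def decorator(cls):
            if key in self._plugins:
                raise KeyError(f"{key!r} already registered in {self.name}")
            cls.plugin_version = version  # assumption: version kept on the class
            self._plugins[key] = cls
            return cls  # the decorated name still refers to the class
        return decorator

    def create(self, key, *args, **kwargs):
        """Instantiate the plugin registered under `key`."""
        if key not in self._plugins:
            raise KeyError(f"unknown {self.name} plugin: {key!r}")
        return self._plugins[key](*args, **kwargs)

    def list(self):
        """Return the registered keys in sorted order."""
        return sorted(self._plugins)


model_registry = PluginRegistry("model")


@model_registry.register("toy-model", version="0.1")
class ToyModel:
    def __init__(self, embed_dim=8):
        self.embed_dim = embed_dim


model = model_registry.create("toy-model", embed_dim=16)
print(model_registry.list(), model.embed_dim)
```

The appeal of this design is that adding a model never touches a central factory: the class announces itself, and callers only need the string key.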
uv is a fast Python package installer.
# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create virtual environment and install dependencies
uv venv .venv
source .venv/bin/activate
uv pip install -e ".[all]"

# Alternative without uv: standard venv and pip
python -m venv .venv
source .venv/bin/activate
pip install -e ".[all]"

| Group | Command | Includes |
|---|---|---|
| Base | pip install -e . | torch, numpy, pandas, einops, tqdm, pillow |
| Audio | pip install -e ".[audio]" | + librosa, scipy |
| Visualization | pip install -e ".[visualization]" | + fiftyone, umap-learn, matplotlib |
| Training | pip install -e ".[training]" | + mlflow |
| Dev | pip install -e ".[dev]" | + pytest, pytest-cov |
| All | pip install -e ".[all]" | Everything |

# Activate environment
source .venv/bin/activate
# Verify plugins are registered
python -c "from src import model_registry; print(model_registry.list())"
# Output: ['audiomae++', 'baseline']

.
├── src/ # Core framework
│ ├── registry.py # PluginRegistry class
│ ├── config.py # Config dataclass
│ ├── models/ # Model plugins
│ │ ├── audiomae.py # AudioMAE++ implementation
│ │ ├── baseline.py # Baseline MAE
│ │ └── classifier.py # Classification wrapper
│ ├── data/ # Data loader plugins
│ │ ├── esc50.py # ESC-50 loader
│ │ └── custom.py # Generic + RF loaders
│ ├── transforms/ # Transform plugins
│ │ ├── audio.py # Audio spectrograms
│ │ └── rf.py # RF spectrograms
│ ├── training/ # Loss functions
│ └── embeddings/ # Embedding utilities
├── tests/ # Test suite
│ ├── generate_tests.py # Auto test generation
│ └── generated/ # Generated tests (gitignored)
├── notebooks/ # Notebook generation
│ ├── generate.py # NotebookGenerator
│ └── generated/ # Generated notebooks (gitignored)
├── data/ # Datasets
│ └── ESC-50-master/ # ESC-50 dataset
└── checkpoints/ # Model checkpoints
from src import model_registry, data_loader_registry, transform_registry
from src.config import Config
# Create a model
config = Config(img_size=224, patch_size=16, embed_dim=768)
model = model_registry.create("audiomae++", config)
# Create a data loader
from pathlib import Path
loader = data_loader_registry.create("esc50", Path("data/ESC-50-master"))
metadata = loader.load_metadata()
# Create a transform
transform = transform_registry.create("audio_spectrogram", img_size=224)

import torch
from src import model_registry
from src.config import Config
config = Config()
model = model_registry.create("audiomae++", config)
model.eval()
# Input: batch of spectrograms [B, 3, 224, 224]
x = torch.randn(4, 3, 224, 224)
# Get embeddings
embedding = model.get_embedding(x, pooling_mode="mean")  # [4, 768]

from src.models.classifier import AudioMAEClassifier
# Wrap pretrained model for classification
classifier = AudioMAEClassifier(model, num_classes=50, freeze_encoder=True)
logits = classifier(spectrograms)  # [B, 50]

After adding new plugins, generate tests to verify interface compliance:

python tests/generate_tests.py

This creates test files in tests/generated/:
- test_models_interface.py: Model ABC compliance
- test_data_loaders_interface.py: Data loader ABC compliance
- test_transforms_interface.py: Transform ABC compliance
- test_model_architectures.py: Various input size compatibility
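As a rough idea of what an "ABC compliance" test can check, the sketch below asserts that a plugin subclasses its base class and has implemented every abstract member. It is not the output of `tests/generate_tests.py`; the base class here is a cut-down hypothetical stand-in, and `missing_members` and `check_plugin_is_concrete` are illustrative names.

```python
import inspect
from abc import ABC, abstractmethod


class BaseTransform(ABC):
    """Hypothetical stand-in for a plugin base class."""

    @abstractmethod
    def __call__(self, signal, sample_rate):
        ...

    @property
    @abstractmethod
    def output_channels(self):
        ...


class GoodTransform(BaseTransform):
    def __call__(self, signal, sample_rate):
        return signal

    @property
    def output_channels(self):
        return 3


class IncompleteTransform(BaseTransform):
    def __call__(self, signal, sample_rate):
        return signal
    # output_channels is not implemented, so the class stays abstract


def missing_members(cls):
    """Names of abstract members that `cls` has not implemented."""
    return sorted(getattr(cls, "__abstractmethods__", frozenset()))


def check_plugin_is_concrete(cls):
    """Fail with a readable message if the plugin is still abstract."""
    assert issubclass(cls, BaseTransform)
    assert not inspect.isabstract(cls), f"missing: {missing_members(cls)}"


check_plugin_is_concrete(GoodTransform)
print(missing_members(IncompleteTransform))
```

A generated test file could run a check like this over every key in a registry, so a half-finished plugin fails fast with the names of the members it still needs.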
# Run all generated tests
python -m pytest tests/generated/ -v
# Run specific test file
python -m pytest tests/generated/test_model_architectures.py -v
# Run with coverage
python -m pytest tests/generated/ --cov=src

Create self-contained notebooks for Google Colab:
# Generate notebook for AudioMAE++ on ESC-50
python notebooks/generate.py --model audiomae++ --dataset esc50
# List available modules
python notebooks/generate.py --list-modules

Generated notebooks are saved to notebooks/generated/ and contain all code inline (no external imports required).
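To illustrate what "all code inline" means, here is a minimal sketch that pastes source text into the code cells of a notebook dictionary following the nbformat 4 JSON layout. It is not how `notebooks/generate.py` is implemented; `build_notebook` and the sample source string are hypothetical.

```python
import json


def build_notebook(sources, title):
    """Build an nbformat-4 notebook dict with one code cell per source string."""
    cells = [{"cell_type": "markdown", "metadata": {}, "source": f"# {title}"}]
    for source in sources:
        cells.append({
            "cell_type": "code",
            "metadata": {},
            "execution_count": None,
            "outputs": [],
            "source": source,  # the code itself, not an import of it
        })
    return {"cells": cells, "metadata": {}, "nbformat": 4, "nbformat_minor": 4}


# Hypothetical module source that would otherwise live in a separate file.
helper_source = "def scale(values, factor):\n    return [v * factor for v in values]\n"

notebook = build_notebook([helper_source], title="Self-contained demo")
notebook_json = json.dumps(notebook, indent=1)  # write this text to a .ipynb file
print(len(notebook["cells"]))
```

Because each cell carries the code itself, the notebook runs on Colab without cloning the repository first.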
# src/models/my_model.py
from src.registry import model_registry
from src.models.base import BaseAutoencoder

@model_registry.register("my-model", version="1.0")
class MyModel(BaseAutoencoder):
    def __init__(self, config):
        super().__init__()
        self.config = config
        # ... build model

    def forward_encoder(self, x, mask_ratio=0.75):
        # Return: latent, mask, ids_restore
        ...

    def get_embedding(self, x, pooling_mode="mean"):
        # Return: embedding [B, embed_dim]
        ...

    @property
    def embed_dim(self): return self.config.embed_dim

    @property
    def num_patches(self): return self.config.num_patches

# src/data/my_dataset.py
from src.registry import data_loader_registry
from src.data.base import BaseDataLoader

@data_loader_registry.register("my-dataset")
class MyDataLoader(BaseDataLoader):
    def __init__(self, data_root):
        self.data_root = data_root

    def load_metadata(self):
        # Return DataFrame with: filepath, label, lat, lon
        ...

    def get_sample_paths(self):
        # Return list of Path objects
        ...

    def validate(self):
        # Return True if dataset is valid
        ...

# src/transforms/my_transform.py
from src.registry import transform_registry
from src.transforms.base import BaseTransform

@transform_registry.register("my-transform")
class MyTransform(BaseTransform):
    def __init__(self, img_size=224):
        self.img_size = img_size

    def __call__(self, signal, sample_rate):
        # Return tensor [3, H, W]
        ...

    @property
    def output_channels(self): return 3

    @property
    def output_size(self): return (self.img_size, self.img_size)

After adding a plugin, import it in the corresponding __init__.py to trigger registration.
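The import step matters because a registration decorator only runs when Python executes the module that contains it. The sketch below demonstrates this with a throwaway package written to a temporary directory; the package name `toy_plugins` and its plain-dict registry are hypothetical and much simpler than this project's registries.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package on disk: a registry module plus one plugin module.
package_root = Path(tempfile.mkdtemp())
package_dir = package_root / "toy_plugins"
package_dir.mkdir()
(package_dir / "__init__.py").write_text("")  # deliberately imports nothing
(package_dir / "registry.py").write_text("REGISTRY = {}\n")
(package_dir / "my_model.py").write_text(
    "from toy_plugins.registry import REGISTRY\n"
    "class MyModel: pass\n"
    "REGISTRY['my-model'] = MyModel  # runs when this module is imported\n"
)

sys.path.insert(0, str(package_root))
importlib.invalidate_caches()  # make sure the new files are visible

registry = importlib.import_module("toy_plugins.registry").REGISTRY
before = sorted(registry)  # the plugin file exists but was never imported

importlib.import_module("toy_plugins.my_model")
after = sorted(registry)  # importing the module ran the registration

print(before, after)
```

Adding the import to `__init__.py` simply guarantees that this happens whenever the package is loaded, so the key is available before anyone calls `create`.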
Registered models:

| Key | Description |
|---|---|
| audiomae++ | AudioMAE++ with Macaron blocks, SwiGLU, RoPE |
| baseline | Standard ViT-MAE for comparison |

Registered data loaders:

| Key | Description |
|---|---|
| esc50 | ESC-50 environmental sounds (2000 clips, 50 classes) |
| custom | Generic audio dataset loader |
| rf | RF/IQ signal dataset loader |

Registered transforms:

| Key | Description |
|---|---|
| audio_spectrogram | Audio to mel spectrogram (3-channel RGB) |
| audio_spectrogram_raw | Audio to mel spectrogram (1-channel) |
| iq_spectrogram | IQ signal to spectrogram |
| iq_constellation | IQ signal to constellation diagram |

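For anyone unfamiliar with constellation diagrams, one simple way to turn IQ samples into an image-like grid is a 2-D histogram of the in-phase (I) and quadrature (Q) components. The sketch below is an assumption about the general idea, not the implementation behind `iq_constellation`; `constellation_histogram`, the grid size, and the synthetic QPSK-like samples are all made up for illustration.

```python
import random


def constellation_histogram(iq_samples, bins=8, limit=2.0):
    """Count IQ samples per cell of a bins x bins grid over [-limit, limit)."""
    grid = [[0] * bins for _ in range(bins)]
    cell = 2 * limit / bins
    for sample in iq_samples:
        col = int((sample.real + limit) // cell)  # I on the horizontal axis
        row = int((sample.imag + limit) // cell)  # Q on the vertical axis
        if 0 <= row < bins and 0 <= col < bins:
            grid[row][col] += 1  # samples outside the limits are dropped
    return grid


# Synthetic QPSK-like symbols with a little Gaussian noise.
rng = random.Random(0)
points = [complex(i, q) for i in (-1, 1) for q in (-1, 1)]
samples = [
    rng.choice(points) + complex(rng.gauss(0, 0.1), rng.gauss(0, 0.1))
    for _ in range(1000)
]

grid = constellation_histogram(samples)
for row in reversed(grid):  # print with +Q at the top
    print(" ".join(f"{count:3d}" for count in row))
```

The four clusters that appear in the grid are what a vision-style model would see once the histogram is scaled to the expected image size.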
Key configuration options in src/config.py:
from src.config import Config

config = Config(
    # Audio processing
    sample_rate=22050,
    n_mels=128,
    audio_duration=5,
    # Model architecture
    img_size=224,
    patch_size=16,
    embed_dim=768,
    encoder_depth=12,
    decoder_depth=8,
    # Training
    mask_ratio=0.75,
    use_contrastive_loss=True,
    # Architecture variants
    use_macaron=True,  # Macaron-style blocks
    use_swiglu=True,  # SwiGLU activation
    use_rope=True,  # Rotary position embeddings
)

After generating embeddings, visualize with FiftyOne:
import fiftyone as fo
# Load dataset
dataset = fo.load_dataset("esc50_audiomae")
# Launch app
session = fo.launch_app(dataset)
# Similarity search
similar = dataset.sort_by_similarity(sample_id, k=10)

License: MIT