A full-stack MVP for multimodal (audio + text) sequence modeling and inference, featuring model training, ONNX export, gRPC-based serving, a Next.js frontend, and monitoring with Prometheus and Grafana.
## Table of Contents

- Project Structure
- Features
- Setup & Installation
- Training Pipeline
- Model Architecture
- Inference & Serving
- Frontend
- Monitoring
- Deployment (Docker Compose)
- Directory Overview
- Requirements
- License
## Project Structure

```
ssm-multimodal-mvp/
├── data/              # Processed datasets (e.g., synthetic_processed.pt)
├── deployment/        # Dockerfiles, docker-compose, monitoring configs
├── frontend/          # Next.js client app
├── models/            # Model checkpoints and ONNX exports
├── monitoring/        # (empty or for future monitoring scripts)
├── protoc-25.3/       # Protobuf tools
├── serving/           # Go gRPC server and proto definitions
├── training/          # Model training, data prep, and evaluation scripts
├── requirements.txt   # Python dependencies
└── ...
```
## Features

- Multimodal Model: Audio (mel spectrogram) + text input, trained with Mamba SSM blocks.
- Training Pipeline: Data preparation, training, evaluation, and ONNX export.
- gRPC Inference Server: Go-based server loads ONNX model and exposes gRPC API.
- gRPC-Web Proxy: Bridges browser/frontend to gRPC backend.
- Next.js Frontend: User interface for inference.
- Monitoring: Prometheus metrics and Grafana dashboards.
- Dockerized: All components containerized for easy deployment.
## Setup & Installation

Prerequisites:

- Docker & Docker Compose
- NVIDIA GPU + drivers (for training)
- Python 3.8+ (for local training)
- Node.js 18+ (for frontend dev)

```bash
cd deployment
docker-compose up --build
```

Once the stack is running:

- Frontend: http://localhost:3000
- Prometheus: http://localhost:9090
- Grafana: http://localhost:3001 (admin/admin)
## Training Pipeline

- Data Preparation: `training/prepare_data.py` downloads and processes LibriSpeech audio, extracts log-mel spectrograms, and tokenizes text with the GPT-2 tokenizer (see the sketch below).
- Model Training: `training/train_model.py` trains the `MultimodalMambaModel` (see below) on the processed data, logs metrics to TensorBoard, and saves checkpoints.
- Quick Training: `training/quick_train.py` provides a minimal example for rapid prototyping.
- ONNX Export: Export trained models to ONNX format for serving (see the export sketch after the commands below).
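
For orientation, the feature extraction performed by `prepare_data.py` amounts to something like the following sketch. The mel parameters, sample rate, and the use of `GPT2TokenizerFast` are illustrative assumptions, not the script's actual settings:

```python
# Hedged sketch of the prepare_data.py feature extraction; sample rate,
# n_mels, and hop length are illustrative assumptions.
import torch
import torchaudio
from transformers import GPT2TokenizerFast

SAMPLE_RATE = 16000  # LibriSpeech native rate
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE, n_fft=400, hop_length=160, n_mels=80
)
to_db = torchaudio.transforms.AmplitudeToDB()
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def process_example(waveform: torch.Tensor, transcript: str):
    """Turn one (waveform, transcript) pair into model inputs."""
    # (channels, time) -> (n_mels, frames), then log scale
    log_mel = to_db(mel(waveform)).squeeze(0)
    # GPT-2 BPE token ids for the transcript
    token_ids = tokenizer(transcript, return_tensors="pt").input_ids.squeeze(0)
    return log_mel, token_ids
```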

```bash
cd training
python prepare_data.py
python train_model.py
python export_onnx.py
```
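
The export step in `export_onnx.py` is not reproduced here, but exporting a checkpoint of this model to ONNX typically looks like the sketch below. The checkpoint path, constructor arguments, dummy input shapes, and axis names are assumptions for illustration:

```python
# Hedged sketch of an ONNX export; paths, shapes, and opset are assumptions,
# not export_onnx.py verbatim.
import torch
from multimodal_mamba import MultimodalMambaModel  # training/multimodal_mamba.py

model = MultimodalMambaModel()  # hypothetical: actual constructor args not shown here
model.load_state_dict(torch.load("../models/checkpoints/best.pt", map_location="cpu"))
model.eval()

dummy_audio = torch.randn(1, 80, 200)           # (batch, n_mels, frames)
dummy_text = torch.randint(0, 50257, (1, 32))   # (batch, tokens), GPT-2 vocab size

torch.onnx.export(
    model,
    (dummy_audio, dummy_text),
    "../models/onnx/multimodal_mamba.onnx",
    input_names=["audio_features", "text_tokens"],
    output_names=["logits"],
    dynamic_axes={"audio_features": {0: "batch", 2: "frames"},
                  "text_tokens": {0: "batch", 1: "tokens"}},
    opset_version=17,
)
```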
## Model Architecture

Defined in `training/multimodal_mamba.py` (a fuller sketch follows the skeleton below):

- Audio Encoder: 1D Conv layers project mel spectrograms to the model dimension.
- Text Embedding: Standard embedding layer for tokenized text.
- Mamba SSM Blocks: Two stacked Mamba blocks for sequence modeling.
- Output Projection: Linear layer to vocabulary size.

```python
class MultimodalMambaModel(nn.Module):
    ...
    def forward(self, audio_features, text_tokens):
        # Encode audio and text, concatenate, pass through Mamba blocks, project to vocab
        ...
```
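
The class body is elided above. A minimal sketch of the same structure, assuming the `mamba_ssm` package's `Mamba` block and illustrative dimensions (the class name, `d_model`, vocabulary size, kernel sizes, and residual wiring are assumptions, not the repository's actual configuration), could look like:

```python
# Hedged sketch of the architecture described above; hyperparameters and the
# residual wiring are illustrative assumptions, not the repo's actual config.
import torch
import torch.nn as nn
from mamba_ssm import Mamba

class MultimodalMambaSketch(nn.Module):
    def __init__(self, n_mels=80, d_model=256, vocab_size=50257):
        super().__init__()
        # Audio encoder: 1D convs project (batch, n_mels, frames) to d_model
        self.audio_encoder = nn.Sequential(
            nn.Conv1d(n_mels, d_model, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, padding=1),
        )
        # Text embedding for GPT-2 token ids
        self.text_embedding = nn.Embedding(vocab_size, d_model)
        # Two stacked Mamba SSM blocks
        self.mamba_blocks = nn.ModuleList([Mamba(d_model=d_model) for _ in range(2)])
        # Output projection to vocabulary logits
        self.output_proj = nn.Linear(d_model, vocab_size)

    def forward(self, audio_features, text_tokens):
        audio = self.audio_encoder(audio_features).transpose(1, 2)  # (B, T_a, D)
        text = self.text_embedding(text_tokens)                     # (B, T_t, D)
        x = torch.cat([audio, text], dim=1)                         # concat along time
        for block in self.mamba_blocks:
            x = x + block(x)  # residual around each Mamba block (assumed)
        return self.output_proj(x)                                  # (B, T, vocab)
```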
## Inference & Serving

- Go gRPC Server (`serving/go-server/`): Loads the ONNX model, exposes a gRPC API for inference and health checks, and serves Prometheus metrics.
- gRPC-Web Proxy: Converts HTTP requests from the frontend to gRPC calls.
- ONNX Models: Place exported models in `models/onnx/`.
API endpoints:

- `/api/inference` (POST): Accepts audio data and a text prompt, returns generated text and confidence.
- `/api/health` (GET): Health check endpoint.
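
As a rough usage illustration, a client could call the inference endpoint as in the sketch below. The port, JSON field names (`audio`, `prompt`), and base64 encoding are assumptions about the payload, not the server's documented schema:

```python
# Hedged sketch of calling the inference endpoint; the port, field names, and
# audio encoding are assumptions, not the documented request format.
import base64
import requests

with open("sample.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("ascii")

resp = requests.post(
    "http://localhost:8080/api/inference",  # hypothetical proxy port
    json={"audio": audio_b64, "prompt": "Transcribe:"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # expected to contain generated text and a confidence score
```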
## Frontend

- Framework: Next.js (TypeScript, Tailwind CSS)
- Location: `frontend/nextjs-client/`
- Dev Start:

  ```bash
  cd frontend/nextjs-client
  npm install
  npm run dev
  ```

- Build & Serve (Docker): Handled by `Dockerfile.frontend` and docker-compose.
## Monitoring

- Prometheus: Scrapes metrics from the inference server.
- Grafana: Pre-configured dashboards for inference latency, request counts, etc.
- Config: See `deployment/prometheus.yml` and `deployment/grafana/`.
## Deployment (Docker Compose)

All services are orchestrated via `deployment/docker-compose.yml`:

- `inference-server`: gRPC server (Go, ONNX)
- `grpc-web-proxy`: HTTP-to-gRPC bridge
- `frontend`: Next.js app
- `training`: Model training (GPU required)
- `prometheus`: Metrics collection
- `grafana`: Visualization

```bash
cd deployment
docker-compose up --build
```

## Directory Overview

- data/: Processed datasets (e.g., `synthetic_processed.pt`)
- models/onnx/: Exported ONNX models for serving
- models/checkpoints/: PyTorch model checkpoints
- training/: All training, data prep, and evaluation scripts
- serving/go-server/: Go gRPC server and proxy
- frontend/nextjs-client/: Next.js frontend app
- deployment/: Dockerfiles, docker-compose, monitoring configs
## Requirements

See `requirements.txt` for Python dependencies, including:
- torch, torchvision, torchaudio
- mamba-ssm
- transformers, datasets
- librosa, soundfile
- onnx, onnxruntime
- tensorboard, prometheus-client
- jupyter, ipywidgets
## License

MIT License (or specify your license here)
For more details, see the code in each subdirectory and the comments in the scripts.