ML-powered chemical toxicity prediction API built with FastAPI, RDKit, and scikit-learn
The Toxicity Predictor API provides machine learning-based predictions for chemical toxicity endpoints including:
- Ames Mutagenicity: Bacterial reverse mutation test (OECD 471)
- Carcinogenicity: Rodent carcinogenicity studies (OECD 451/453)
Built for researchers, regulatory scientists, and pharmaceutical companies who need fast, reliable toxicity predictions for chemical risk assessment.
- Fast Predictions: Sub-second response times for single compounds
- Batch Processing: Handle up to 1000 compounds per request (subject to the configured `MAX_BATCH_SIZE`)
- File Upload: Direct SDF file processing
- Chemical Lookup: Integrated PubChem database search
- Rich Metadata: Molecular properties, descriptors, confidence scores
- Health Monitoring: Built-in system diagnostics
- Auto Documentation: Interactive API docs with Swagger UI
- Docker Ready: One-command deployment
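Because each batch request is capped, a client holding a longer list needs to split it before calling the batch endpoint. The following is a minimal sketch of that splitting step, not part of the project itself: the helper name `chunked` is hypothetical, and the default of 1000 is taken from the feature list above (the sample configuration in this README sets `MAX_BATCH_SIZE=100`, so pass whatever your deployment uses).

```python
from typing import Iterator, List, Sequence


def chunked(smiles_list: Sequence[str], max_batch_size: int = 1000) -> Iterator[List[str]]:
    """Yield consecutive slices of smiles_list, each at most max_batch_size long.

    Hypothetical client-side helper; the default mirrors the documented
    per-request limit and should match the server's MAX_BATCH_SIZE.
    """
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    for start in range(0, len(smiles_list), max_batch_size):
        yield list(smiles_list[start:start + max_batch_size])


if __name__ == "__main__":
    compounds = ["CCO", "CC(=O)O", "c1ccccc1", "CN1C=NC2=C1C(=O)N(C(=O)N2C)C", "O"]
    for batch in chunked(compounds, max_batch_size=2):
        # Each batch could be sent as the "smiles_list" field of a batch request.
        print(batch)
```

Each yielded list can be posted as the `smiles_list` field of one batch request, so the order of results still follows the order of the input.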
- Python 3.9+
- Git
git clone https://github.com/Ojochogwu866/Mavhir.git
cd Mavhir
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

python data/create_example_data.py
python app/models/train_models.py

cp .env.example .env
# Edit .env file with your settings (optional)

# Development server
python -m app.main
# Or with uvicorn directly
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000

# Check health
curl http://localhost:8000/health
# Predict toxicity
curl -X POST "http://localhost:8000/api/v1/predict/smiles" \
-H "Content-Type: application/json" \
-d '{"smiles": "CCO", "endpoints": ["ames_mutagenicity"]}'

Once running, visit:
- Swagger UI: http://localhost:8000/docs
| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Basic health check |
| `/api/v1/predict/smiles` | POST | Single compound prediction |
| `/api/v1/predict/batch` | POST | Batch prediction |
| `/api/v1/predict/sdf` | POST | SDF file upload |
| `/api/v1/chemical/lookup/{name}` | GET | PubChem compound lookup |
| `/api/v1/chemical/validate` | GET | SMILES validation |
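The lookup route takes the compound name as a path segment, so names containing spaces or slashes should be percent-encoded before the request is sent. Here is a small sketch of two hypothetical client helpers (`lookup_url` and `predict_payload` are illustrative names, not functions shipped with this project). The base URL assumes the local development server, and the payload fields are the ones used in the request examples in this README.

```python
from typing import Any, Dict, Sequence
from urllib.parse import quote

# Assumption: the API is running locally on the default development port.
BASE_URL = "http://localhost:8000"


def lookup_url(name: str, base_url: str = BASE_URL) -> str:
    """Build the compound lookup URL with the name percent-encoded."""
    # safe="" also encodes "/", so a name cannot introduce extra path segments.
    return f"{base_url}/api/v1/chemical/lookup/{quote(name, safe='')}"


def predict_payload(
    smiles: str,
    endpoints: Sequence[str] = ("ames_mutagenicity", "carcinogenicity"),
    include_properties: bool = False,
) -> Dict[str, Any]:
    """Assemble the JSON body for a single-compound prediction request."""
    cleaned = smiles.strip()
    if not cleaned:
        raise ValueError("smiles must be a non-empty string")
    return {
        "smiles": cleaned,
        "endpoints": list(endpoints),
        "include_properties": include_properties,
    }


if __name__ == "__main__":
    print(lookup_url("acetic acid"))
    print(predict_payload(" CCO ", endpoints=["ames_mutagenicity"]))
```

The returned dictionary can be passed as the `json=` argument of `requests.post`, and the URL can be passed directly to `requests.get`.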
import requests

response = requests.post(
    "http://localhost:8000/api/v1/predict/smiles",
    json={
        "smiles": "CN1C=NC2=C1C(=O)N(C(=O)N2C)C",  # Caffeine
        "endpoints": ["ames_mutagenicity", "carcinogenicity"],
        "include_properties": True
    }
)
result = response.json()
print(f"Ames prediction: {result['data']['predictions']['ames_mutagenicity']['prediction']}")

compounds = [
    "CCO",       # Ethanol
    "CC(=O)O",   # Acetic acid
    "c1ccccc1"   # Benzene
]
response = requests.post(
    "http://localhost:8000/api/v1/predict/batch",
    json={
        "smiles_list": compounds,
        "endpoints": ["ames_mutagenicity"]
    }
)
results = response.json()
print(f"Processed {results['summary']['successful']} compounds successfully")

response = requests.get("http://localhost:8000/api/v1/chemical/lookup/aspirin")
compound_info = response.json()
if compound_info["found"]:
    print(f"Aspirin SMILES: {compound_info['canonical_smiles']}")
    print(f"Molecular weight: {compound_info['molecular_weight']}")

mavhir/
├── app/
│ ├── main.py
│ ├── api/
│ │ ├── health.py
│ │ ├── chemical.py
│ │ └── predict.py
│ ├── core/
│ │ ├── config.py
│ │ ├── models.py
│ │ └── exceptions.py
│ ├── services/
│ │ ├── chemical_processor.py
│ │ ├── descriptor_calculator.py
│ │ ├── predictor.py
│ │ └── pubchem_client.py
│ └── models/
├── data/
├── tests/
├── docs/
└── requirements.txt
# Run all tests
pytest
# Run with coverage
pytest --cov=app --cov-report=html
# Run specific test categories
pytest -m "not benchmark"
pytest tests/test_api.py

docker-compose up --build

# Build image
docker build -t mavhir .
# Run container
docker run -p 8000:8000 mavhir

Ames Mutagenicity model:
- Algorithm: Random Forest Classifier
- Features: ~200 Mordred molecular descriptors
- Training Data: 6,500+ compounds from literature
- Performance: 88% accuracy, 0.91 AUC-ROC
- Endpoint: Bacterial reverse mutation (Salmonella typhimurium)

Carcinogenicity model:
- Algorithm: Gradient Boosting Classifier
- Features: ~180 Mordred molecular descriptors
- Training Data: 1,200+ compounds from NTP/CPDB
- Performance: 76% accuracy, 0.82 AUC-ROC
- Endpoint: 2-year rodent bioassays
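To make the two reported metrics concrete, here is a dependency-free sketch of how accuracy and AUC-ROC are defined for a binary classifier. It is an illustration only: the project's own evaluation code is not shown in this README, and the labels and scores below are made-up toy values, not model output. The 0.5 decision threshold is also an assumption.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC-ROC as the probability that a random positive outranks a random negative.

    Ties count as half a win (the Mann-Whitney U formulation).
    """
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Toy data: 1 = toxic, 0 = non-toxic; scores are predicted probabilities.
    labels = [0, 0, 1, 1]
    probabilities = [0.1, 0.4, 0.35, 0.8]
    predicted = [1 if p >= 0.5 else 0 for p in probabilities]  # assumed 0.5 threshold
    print(accuracy(labels, predicted))     # 0.75
    print(roc_auc(labels, probabilities))  # 0.75
```

Accuracy depends on the chosen decision threshold, while AUC-ROC summarizes ranking quality across all thresholds, which is why the two numbers for a model can differ.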
Key environment variables:
# Basic settings
ENVIRONMENT=production
DEBUG=false
MAX_BATCH_SIZE=100
# Model settings
AMES_MODEL_PATH=app/models/ames_mutagenicity.pkl
CARCINOGENICITY_MODEL_PATH=app/models/carcinogenicity.pkl
# Processing settings
DESCRIPTOR_TIMEOUT=60
ENABLE_DESCRIPTOR_CACHING=true
# PubChem API
PUBCHEM_RATE_LIMIT_DELAY=0.2
PUBCHEM_MAX_RETRIES=3

- Single prediction: ~150ms average
- Batch processing: ~100 compounds/minute
- Descriptor calculation: ~50ms per compound
- Memory usage: ~500MB base + ~1MB per cached compound
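The figures above can be turned into rough capacity estimates. The sketch below does only that arithmetic: the constants are copied from the list above, the function names are hypothetical, and real numbers will vary with hardware, molecule size, and cache behavior.

```python
# Approximate figures quoted in this README; treat them as rough guides only.
BASE_MEMORY_MB = 500            # ~500MB base footprint
PER_CACHED_COMPOUND_MB = 1      # ~1MB per cached compound
BATCH_COMPOUNDS_PER_MINUTE = 100


def estimate_memory_mb(cached_compounds: int) -> int:
    """Estimated memory use in MB for a given number of cached compounds."""
    if cached_compounds < 0:
        raise ValueError("cached_compounds must be non-negative")
    return BASE_MEMORY_MB + PER_CACHED_COMPOUND_MB * cached_compounds


def estimate_batch_minutes(compound_count: int) -> float:
    """Estimated wall-clock minutes to process a batch at the quoted throughput."""
    if compound_count < 0:
        raise ValueError("compound_count must be non-negative")
    return compound_count / BATCH_COMPOUNDS_PER_MINUTE


if __name__ == "__main__":
    print(estimate_memory_mb(2000))     # 2500
    print(estimate_batch_minutes(250))  # 2.5
```

For example, a cache holding 2000 compounds would be expected to use roughly 2.5GB, which is worth checking against container memory limits before enabling descriptor caching.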
- Fork the repository
- Create feature branch (`git checkout -b feature/amazing-feature`)
- Commit changes (`git commit -m 'Add amazing feature'`)
- Push to branch (`git push origin feature/amazing-feature`)
- Open Pull Request

# Install development dependencies
pip install -r requirements-dev.txt
# Install pre-commit hooks
pre-commit install
# Run code formatting
black app/ tests/
isort app/ tests/
# Type checking
mypy app/