A PyTorch Lightning framework for molecular property prediction. This project implements a Mean Teacher semi-supervised learning approach using an Attentive GINE backbone, designed to improve performance on drug discovery datasets with sparse labels.
Clone the repo and set up the environment.
# Clone repository
git clone https://github.com/pablorocg/semi-supervised-gnn-drug-discovery
cd semi-supervised-gnn-drug-discovery
# Create environment (recommended)
conda create -n gnn_env python=3.11 -y
conda activate gnn_env
# Install dependencies
pip install -r requirements.txt

The project needs to know where to save data and logs.
- Create a .env file from the template:
  cp .env.template .env
- Open .env and set your absolute paths:
  SOURCE_DATA_DIR=/abs/path/to/data  # Datasets will be downloaded here
  CONFIGS_DIR=/abs/path/to/config    # Path to the 'config' folder in this repo
  LOGS_DIR=/abs/path/to/logs         # Where to save training logs
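A quick sanity check for these three variables can be sketched as follows. This is a minimal sketch, not the project's own loading code: it assumes the variables have already been loaded into the process environment (how the project reads .env is not shown here), and `read_project_paths` is a hypothetical helper name.

```python
import os
from pathlib import Path

# Variable names taken from the .env template described above.
REQUIRED_VARS = ("SOURCE_DATA_DIR", "CONFIGS_DIR", "LOGS_DIR")


def read_project_paths(env=None):
    """Return the three configured paths, failing early on bad values.

    `env` is any mapping of variable names to strings; it defaults to
    os.environ. This helper is hypothetical and only illustrates the check.
    """
    env = os.environ if env is None else env
    paths = {}
    for name in REQUIRED_VARS:
        value = env.get(name, "").strip()
        if not value:
            raise KeyError(f"{name} is not set")
        path = Path(value)
        if not path.is_absolute():
            raise ValueError(f"{name} must be an absolute path, got {value!r}")
        paths[name] = path
    return paths
```

Calling `read_project_paths()` before launching a run surfaces a missing or relative path immediately instead of partway through training.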
Run experiments using the scripts in src/trainers/. You can override parameters (like dataset or model) directly from the command line.
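To make the override syntax concrete, the sketch below shows how dotted `key=value` arguments map onto a nested configuration. It is plain Python for illustration only and is not how Hydra parses overrides internally; `parse_override` and `apply_override` are hypothetical names. It also checks that a splits list sums to 1.0, as the example values in this README do.

```python
import ast
import math


def parse_override(text):
    """Split 'a.b.c=value' into (['a', 'b', 'c'], parsed_value)."""
    key, _, raw = text.partition("=")
    try:
        # Numbers and lists are parsed as Python literals.
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # Bare words such as SIDER stay as strings.
        value = raw
    return key.split("."), value


def apply_override(config, text):
    """Write one override into a nested dict, creating levels as needed."""
    path, value = parse_override(text)
    node = config
    for part in path[:-1]:
        node = node.setdefault(part, {})
    node[path[-1]] = value
    return config


config = {}
for item in [
    "dataset.init.name=SIDER",
    "dataset.init.splits=[0.67, 0.03, 0.1, 0.2]",
    "dataset.init.batch_size_train=16",
]:
    apply_override(config, item)

splits = config["dataset"]["init"]["splits"]
# Compare with a tolerance because of floating-point rounding.
splits_sum_to_one = math.isclose(sum(splits), 1.0)
```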
Standard training using only labeled data.
python -m src.trainers.baseline_trainer \
dataset.init.name=SIDER \
"dataset.init.splits=[0.67, 0.03, 0.1, 0.2]" \
dataset.init.batch_size_train=16 \
dataset.init.mu=5

Training using both labeled and unlabeled data. Ideal for low-data regimes.
python -m src.trainers.mean_teacher_trainer \
dataset.init.name=SIDER \
"dataset.init.splits=[0.35, 0.35, 0.1, 0.2]" \
dataset.init.batch_size_train=32 \
dataset.init.mu=1

- config/: Hydra configuration files (datasets, models, training params).
- src/data/: DataModules for MoleculeNet and OGB datasets.
- src/models/: GNN implementations (GINE, Attentive Encoders).
- src/lightning_modules/: PyTorch Lightning modules for Baseline and Mean Teacher logic.
- src/trainers/: Entry points for training scripts.
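For readers new to the method, the general Mean Teacher idea reduces to two pieces: the teacher's weights are an exponential moving average (EMA) of the student's weights, and a consistency loss penalizes disagreement between student and teacher predictions on the same inputs. The sketch below uses plain Python lists rather than PyTorch tensors so it runs without dependencies. It is not the code in src/lightning_modules/; the function names and the decay value are illustrative assumptions.

```python
def ema_update(teacher, student, decay=0.99):
    """Return new teacher weights: decay * teacher + (1 - decay) * student.

    `teacher` and `student` map parameter names to lists of floats.
    The default decay of 0.99 is an illustrative value, not the project's.
    """
    return {
        name: [
            decay * t + (1.0 - decay) * s
            for t, s in zip(teacher[name], student[name])
        ]
        for name in teacher
    }


def consistency_loss(student_preds, teacher_preds):
    """Mean squared error between student and teacher predictions."""
    squared = [(s - t) ** 2 for s, t in zip(student_preds, teacher_preds)]
    return sum(squared) / len(squared)


# With decay=0.5 the teacher moves halfway toward the student.
teacher = {"w": [1.0, 0.0]}
student = {"w": [0.0, 1.0]}
teacher = ema_update(teacher, student, decay=0.5)
print(teacher)  # {'w': [0.5, 0.5]}
```

In a typical training loop the student is updated by gradient descent on the supervised loss plus the consistency loss, and `ema_update` is applied to the teacher after each optimizer step; the teacher itself receives no gradients.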