- Introduction
- Key Features
- Architecture
- Installation
- Quick Start
- Data Preprocessing
- Dataset
- Pre-trained Models
- Training
- Evaluation
- Interpretability Analysis
- Project Structure
- Citation
- License
Accurately predicting drug-target binding affinity (DTA) is crucial for drug discovery and repurposing. Although deep learning has been widely used in this field, it still faces three major challenges:
- Insufficient generalization performance across different protein families and chemical spaces
- Inadequate use of 3D structural information from both proteins and small molecules
- Poor interpretability limiting the understanding of binding mechanisms
To address these challenges, we developed PocketDTA, an advanced multimodal deep learning architecture that:
- Enhances generalization through pre-trained models (ESM-2 for proteins, GraphMVP for molecules)
- Leverages 3D structural data from protein binding pockets and drug conformations
- Improves interpretability via bilinear attention networks that identify key interactions
- Processes multiple binding sites by handling the top-3 predicted binding pockets
- Achieves state-of-the-art performance on the Davis and KIBA benchmark datasets
- Outperforms existing methods in comparative analysis on optimized datasets
- Demonstrates interpretability validated through molecular docking and literature confirmation
- Shows robust generalization in cold-start experiments
- Identifies key interactions between drug functional groups and amino acid residues
- Advanced Protein Encoding: Utilizes the ESM-2 protein language model for rich sequence representations
- 3D Molecular Representations: Employs GraphMVP for learning from 3D molecular structures
- Multi-Pocket Analysis: Processes the top-3 binding pockets for comprehensive protein-drug interactions
- GVP-GNN Architecture: Custom Geometric Vector Perceptron Graph Neural Networks for 3D geometry
- Bilinear Attention: Captures cross-modal interactions between proteins and molecules
- High Interpretability: Provides attention weights for understanding binding mechanisms
The PocketDTA architecture consists of three components (a minimal code sketch follows this list):

- Protein Branch:
  - ESM-2 sequence encoder
  - GVP-GNN for 3D binding pocket structures (top-3 pockets)
  - Multi-head attention for pocket aggregation
- Molecule Branch:
  - GraphMVP encoder for 3D molecular conformations
  - Graph neural network for molecular feature extraction
- Interaction Module:
  - Bilinear attention network
  - Cross-modal fusion layer
  - Affinity prediction head
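To make the data flow concrete, here is a minimal, illustrative sketch of how the three components fit together. The module names, embedding dimensions, and fusion details are assumptions for illustration only; the actual implementation lives in `src/model.py`.

```python
import torch
import torch.nn as nn


class PocketDTASketch(nn.Module):
    """Toy flow: protein branch + molecule branch -> bilinear fusion -> affinity."""

    def __init__(self, esm_dim=1280, pocket_dim=128, mol_dim=300, hidden=256):
        super().__init__()
        # Protein branch: pooled ESM-2 sequence embedding concatenated with an
        # aggregated embedding of the top-3 pocket graphs (both assumed precomputed here).
        self.prot_proj = nn.Linear(esm_dim + pocket_dim, hidden)
        # Molecule branch: pooled GraphMVP-style molecular embedding.
        self.mol_proj = nn.Linear(mol_dim, hidden)
        # Interaction module: bilinear fusion followed by a regression head.
        self.bilinear = nn.Bilinear(hidden, hidden, hidden)
        self.head = nn.Sequential(nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, esm_emb, pocket_emb, mol_emb):
        prot = self.prot_proj(torch.cat([esm_emb, pocket_emb], dim=-1))
        mol = self.mol_proj(mol_emb)
        fused = self.bilinear(prot, mol)     # cross-modal fusion
        return self.head(fused).squeeze(-1)  # predicted binding affinity


# Shape check with random embeddings for a batch of 4 protein-drug pairs.
model = PocketDTASketch()
y = model(torch.randn(4, 1280), torch.randn(4, 128), torch.randn(4, 300))
print(y.shape)  # torch.Size([4])
```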
- Operating System: Linux (recommended), Windows, or macOS
- Python: 3.7.16
- CUDA: 11.7 (for GPU support)
- RAM: 24GB+ recommended for training
```bash
git clone https://github.com/zhaolongNCU/PocketDTA.git
cd PocketDTA

# Create a new conda environment
conda create -n PocketDTA python=3.7
conda activate PocketDTA

# Install PyTorch with CUDA support
pip install torch==1.13.0+cu117 torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu117

# Install PyTorch Geometric and dependencies
pip install torch-geometric==2.3.1
pip install torch-scatter torch-sparse torch-cluster -f https://data.pyg.org/whl/torch-1.13.0+cu117.html

# Install other dependencies
pip install -r requirements.txt
```

Note: `requirements.txt` contains all necessary packages with pinned versions.
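After installation, a quick sanity check (assuming the pinned versions above installed cleanly) is to import the core packages and confirm the GPU is visible:

```python
import torch
import torch_geometric

print(torch.__version__)            # expected: 1.13.0+cu117
print(torch_geometric.__version__)  # expected: 2.3.1
print(torch.cuda.is_available())    # True if the CUDA 11.7 build can see a GPU
```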
Download the required pre-trained model weights:
- ESM-2 (650M parameters): Download Link
- GraphMVP: Already included in `dataset/Davis/` and `dataset/KIBA/`
Place ESM-2 model file in the appropriate directory (see Pre-trained Models).
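To confirm that ESM-2 is usable in your environment, a minimal sketch with the fair-esm package is shown below. This is only a tooling assumption for a quick check; PocketDTA itself reads the checkpoint you place as described in Pre-trained Models.

```python
import torch
import esm  # fair-esm package

# Load ESM-2 650M (downloads weights if not already cached) and embed a short example sequence.
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
_, _, tokens = batch_converter([("example", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")])

model.eval()
with torch.no_grad():
    out = model(tokens, repr_layers=[33])
per_residue = out["representations"][33]  # (1, seq_len + 2, 1280); includes BOS/EOS tokens
print(per_residue.shape)
```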
Download 3D structure files from Google Cloud Drive:
- Target 3D structures (`.pdb` files)
- Top-3 binding pocket files (`.pdb` files)
Place them in the corresponding dataset folders.
```bash
# Train on Davis dataset with seed 0
python main.py --task Davis --r 0

# Train on KIBA dataset
python main.py --task KIBA --r 0
```

Results will be saved in the `result/` directory, including:
- Training/validation metrics
- Model checkpoints
- Prediction results
We provide complete data preprocessing pipelines for preparing custom datasets or reproducing our results from scratch.
The data_preprocess/ directory contains standalone preprocessing tools for both Davis and KIBA datasets:
```
data_preprocess/
├── Davis/               # Davis dataset preprocessing pipeline
│   ├── README.md        # Complete documentation and usage guide
│   └── ...              # Preprocessing scripts
│
├── KIBA/                # KIBA dataset preprocessing pipeline
│   ├── README.md        # Complete documentation and usage guide
│   └── ...              # Preprocessing scripts
│
└── requirements.txt     # Preprocessing dependencies
```
```bash
# Navigate to dataset-specific directory
cd data_preprocess/Davis   # or KIBA

# Run automated pipeline
python run_pipeline.py --step all
```

The pipeline processes: AlphaFold PDB files → DoGSite3 pockets → graph representations.

Final outputs: `{Dataset}_Domain_coord_graph_top{1,2,3}seqid_dict.pickle`
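A quick way to confirm the pipeline finished is to check for the three output files. This is a small sketch under the assumption that the outputs land in the current working directory of `data_preprocess/Davis/`:

```python
from pathlib import Path

# Check that the top-1/2/3 pocket-graph dictionaries were produced.
for k in (1, 2, 3):
    out = Path(f"Davis_Domain_coord_graph_top{k}seqid_dict.pickle")
    print(out.name, "found" if out.exists() else "missing")
```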
For detailed instructions, custom dataset processing, and troubleshooting, see the `README.md` in `data_preprocess/Davis/` or `data_preprocess/KIBA/`.
We provide two benchmark datasets for drug-target affinity prediction:
| Dataset | Proteins | Compounds | Interactions | Affinity Measure | Source |
|---|---|---|---|---|---|
| Davis | 442 | 68 | 30,056 | Kd values | Kinase inhibitors |
| KIBA | 229 | 2,111 | 118,254 | KIBA scores | Bioactivity database |
```
dataset/
├── Davis/
│   ├── process.csv      # Protein-compound pairs
│   ├── GraphMVP.pth     # Pre-trained GraphMVP model
│   ├── Davis_Domain_coord_graph_top1seqid_dict.pickle
│   ├── Davis_Domain_coord_graph_top2seqid_dict.pickle
│   └── Davis_Domain_coord_graph_top3seqid_dict.pickle
│
└── KIBA/
    └── (same structure as Davis/)
```
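For a first look at the protein-compound pairs, you can load `process.csv` with pandas. The sketch below does not assume any particular column layout; it simply prints what it finds:

```python
import pandas as pd

# Inspect the Davis pairs table without assuming specific column names.
df = pd.read_csv("dataset/Davis/process.csv")
print(df.shape)
print(df.columns.tolist())
print(df.head())
```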
Download from Google Cloud Drive:
- Protein 3D structures (`.pdb` files from AlphaFold)
- Binding pocket files (top-3 predicted pockets from DoGSite3)
Each `.pickle` file contains a dictionary mapping protein IDs to `torch_geometric.data.Data` objects:

```python
{
    'PROTEIN_ID': torch_geometric.data.Data(
        x=...,           # Coordinates
        seq=...,         # Sequence
        node_s=...,      # Scalar features
        node_v=...,      # Vector features
        edge_s=...,      # Edge scalar features
        edge_v=...,      # Edge vector features
        edge_index=...,  # Graph structure
    ),
    ...
}
```
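A short sketch for loading one of these dictionaries and inspecting a single entry (field names follow the structure above; unpickling requires torch and torch-geometric to be installed):

```python
import pickle

# Load the top-1 pocket graphs for Davis and look at one protein entry.
with open("dataset/Davis/Davis_Domain_coord_graph_top1seqid_dict.pickle", "rb") as f:
    pocket_graphs = pickle.load(f)

protein_id, data = next(iter(pocket_graphs.items()))
print(protein_id)
print(data.node_s.shape, data.node_v.shape)  # scalar / vector node features
print(data.edge_index.shape)                 # (2, num_edges) connectivity
```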
PocketDTA leverages several pre-trained models for enhanced performance:
| Model | Purpose | Size | Download Link |
|---|---|---|---|
| ESM-2 | Protein sequence encoding | 650M params | Download |
| GraphMVP | Molecular graph pre-training | Included | In dataset/ folders |
| Model | Purpose | Download Link |
|---|---|---|
| ProtBert | Alternative protein encoder | Zenodo |
| ProtT5 | Alternative protein encoder | Zenodo |
| 3DInfomax | 3D molecular pre-training | GitHub |
Place downloaded models in the appropriate directories as specified in the configuration files in configs/.
Train on Davis dataset:

```bash
python main.py --task Davis --r 0
```

Train on KIBA dataset:

```bash
python main.py --task KIBA --r 0
```

Run experiments with multiple random seeds (Linux):

```bash
./training.sh
```

This script trains models with seeds 0-4 for robust performance evaluation.
Key command-line arguments:
- `--task`: Dataset to use (`Davis` or `KIBA`)
- `--r`: Random seed for reproducibility
- `--epochs`: Number of training epochs (default: 1000)
- `--batch-size`: Batch size (default: 128)
- `--lr`: Learning rate (default: 0.0001)
- `--device`: GPU device (default: `cuda:0`)
Training logs and checkpoints are saved in result/{task}/:
- Training/validation losses
- Performance metrics (MSE, CI, R², etc.)
- Model checkpoints
- Prediction results
Representation Ablation - Test different pre-trained encoders:
```bash
./Ablation.sh
```

This compares:
- ESM-2 vs ProtBert vs ProtT5 (protein encoders)
- GraphMVP vs 3DInfomax (molecular encoders)
Module Ablation - Test model components:
```bash
./Ablation_module.sh
```

This evaluates:
- Impact of multi-pocket processing
- Effect of bilinear attention
- Contribution of 3D geometric features
Test generalization to unseen proteins/compounds:
```bash
./Cold.sh
```

Three scenarios:
- Cold-start proteins: Unseen target proteins
- Cold-start compounds: Unseen drug molecules
- Cold-start pairs: Both protein and compound unseen
Identify key molecular interactions using attention weights:
```bash
python interaction_weight.py --task Davis --model DTAPredictor_test --r 2 --use-test True
```

The script generates the following (a small plotting sketch follows this list):
- Atomic attention weights: Important functional groups in the drug
- Residue attention weights: Key amino acids in the binding pocket
- Visualization files: Heatmaps of protein-drug interactions
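If you prefer to render a heatmap yourself, a minimal matplotlib sketch is shown below. This is a hypothetical example: it assumes the attention weights have been exported as a `(num_residues, num_atoms)` NumPy array, and the file name is made up for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical file: a residue-by-atom attention matrix saved by interaction_weight.py.
attn = np.load("result/Davis/attention_weights_example.npy")

plt.imshow(attn, aspect="auto", cmap="viridis")
plt.xlabel("Drug atom index")
plt.ylabel("Pocket residue index")
plt.colorbar(label="Attention weight")
plt.tight_layout()
plt.savefig("attention_heatmap.png", dpi=300)
```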
- Select a protein-compound pair for analysis
- Run `interaction_weight.py` to compute attention weights
- Visualize interactions using molecular docking software
- Validate findings with literature and experimental data
- Drug design: Identify modifications to improve binding
- Lead optimization: Focus on key interaction sites
- Mechanism understanding: Explain binding affinity predictions
```
PocketDTA/
├── README.md                 # This file
├── requirements.txt          # Python dependencies
├── PocketDTA.jpg             # Architecture diagram
│
├── data_preprocess/          # Data preprocessing pipelines
│   ├── Davis/                # Davis dataset processing
│   ├── KIBA/                 # KIBA dataset processing
│   └── requirements.txt      # Preprocessing dependencies
│
├── dataset/                  # Processed datasets
│   ├── Davis/                # Davis dataset files
│   └── KIBA/                 # KIBA dataset files
│
├── configs/                  # Configuration files
│
├── src/                      # Source code
│   ├── model.py              # Main PocketDTA model
│   ├── gvp_gnn.py            # GVP-GNN implementation
│   ├── compound_gnn_model.py # Molecular encoder
│   ├── protein_to_graph.py   # Protein graph construction
│   ├── data.py               # Data loading utilities
│   └── featurizers/          # Feature extraction modules
│
├── main.py                   # Training script
├── train_test.py             # Training/testing functions
├── interaction_weight.py     # Interpretability analysis
├── utils_dta.py              # Utility functions
│
├── Radam.py                  # RAdam optimizer
├── lookahead.py              # Lookahead optimizer
│
├── training.sh               # Multi-seed training
├── Ablation.sh               # Representation ablation
├── Ablation_module.sh        # Module ablation
└── Cold.sh                   # Cold-start experiments
```
If you use this code in your research, please cite our paper:
```bibtex
@article{10.1093/bioinformatics/btae594,
    author  = {Zhao, Long and Wang, Hongmei and Shi, Shaoping},
    title   = "{PocketDTA: An advanced multimodal architecture for enhanced prediction of drug-target affinity from 3D structural data of target binding pockets}",
    journal = {Bioinformatics},
    pages   = {btae594},
    year    = {2024},
    month   = {10},
    doi     = {10.1093/bioinformatics/btae594},
    url     = {https://doi.org/10.1093/bioinformatics/btae594},
}
```

This project is released under the MIT License. See the LICENSE file for details.
Please also check licenses for:
- AlphaFold structures: CC-BY 4.0
- DoGSite3: Academic use only
- Pre-trained models: Check individual model licenses
- AlphaFold Database
- DoGSite3 Web Server
- ESM-2 Protein Language Model
- GraphMVP
- PyTorch Geometric
- Davis Dataset
- KIBA Dataset
⭐ If you find this project helpful, please consider giving it a star!