ZhaoLongSYSU/PocketDTA

PocketDTA: An Advanced Multimodal Architecture for Enhanced Prediction of Drug-Target Affinity from 3D Structural Data of Target Binding Pockets

python pytorch rdkit torch-geometric deepchem License: MIT



📌 Introduction

Motivation

Accurately predicting drug-target binding affinity (DTA) is crucial for drug discovery and repurposing. Although deep learning has been widely used in this field, it still faces three major challenges:

  1. Insufficient generalization performance across different protein families and chemical spaces
  2. Inadequate use of 3D structural information from both proteins and small molecules
  3. Poor interpretability limiting the understanding of binding mechanisms

Our Solution

To address these challenges, we developed PocketDTA, an advanced multimodal deep learning architecture that:

  • ✅ Enhances generalization through pre-trained models (ESM-2 for proteins, GraphMVP for molecules)
  • ✅ Leverages 3D structural data from protein binding pockets and drug conformations
  • ✅ Improves interpretability via bilinear attention networks to identify key interactions
  • ✅ Processes multiple binding sites by handling top-3 predicted binding pockets
  • ✅ Achieves SOTA performance on benchmark datasets (Davis and KIBA)

Results Highlights

  • 🏆 Outperforms existing methods in comparative analysis on optimized datasets
  • 🔬 Validated interpretability through molecular docking and literature confirmation
  • 💪 Robust generalization demonstrated in cold-start experiments
  • 🎯 Identifies key interactions between drug functional groups and amino acid residues

✨ Key Features

  • 🧬 Advanced Protein Encoding: Utilizes ESM-2 protein language model for rich sequence representations
  • πŸ’Š 3D Molecular Representations: Employs GraphMVP for learning from 3D molecular structures
  • πŸ” Multi-Pocket Analysis: Processes top-3 binding pockets for comprehensive protein-drug interactions
  • 🧠 GVP-GNN Architecture: Custom Geometric Vector Perceptron Graph Neural Networks for 3D geometry
  • 🎨 Bilinear Attention: Captures cross-modal interactions between proteins and molecules
  • πŸ“Š High Interpretability: Provides attention weights for understanding binding mechanisms

🚀 Architecture

[Figure: PocketDTA architecture diagram (PocketDTA.jpg)]

The PocketDTA architecture consists of:

  1. Protein Branch:

    • ESM-2 sequence encoder
    • GVP-GNN for 3D binding pocket structures (top-3 pockets)
    • Multi-head attention for pocket aggregation
  2. Molecule Branch:

    • GraphMVP encoder for 3D molecular conformations
    • Graph neural network for molecular feature extraction
  3. Interaction Module:

    • Bilinear attention network
    • Cross-modal fusion layer
    • Affinity prediction head
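
To make the interaction module more concrete, here is a minimal pure-Python sketch of a single-head bilinear attention map between drug-atom and pocket-residue feature vectors. It is a simplification for illustration only and is not taken from the repository: the function name `bilinear_attention`, the weight matrix `W`, and the choice of one softmax over all atom-residue pairs are assumptions.

```python
import math
from typing import List

Matrix = List[List[float]]


def bilinear_attention(drug: Matrix, pocket: Matrix, W: Matrix) -> Matrix:
    """Return an (atoms x residues) attention map.

    score[i][j] = drug[i]^T W pocket[j], followed by a softmax over all pairs.
    """
    scores = []
    for d in drug:
        # Project the atom vector through W once: dW[k] = sum_a d[a] * W[a][k]
        dW = [sum(d[a] * W[a][k] for a in range(len(d))) for k in range(len(W[0]))]
        scores.append([sum(x * y for x, y in zip(dW, p)) for p in pocket])

    # Numerically stable softmax over the whole map.
    top = max(max(row) for row in scores)
    exp = [[math.exp(s - top) for s in row] for row in scores]
    total = sum(sum(row) for row in exp)
    return [[e / total for e in row] for row in exp]


# Toy example with hypothetical values: 2 atoms, 1 residue, identity weights.
attn = bilinear_attention(
    drug=[[1.0, 0.0], [0.0, 1.0]],
    pocket=[[1.0, 0.0]],
    W=[[1.0, 0.0], [0.0, 1.0]],
)
```

In this toy case the first atom is aligned with the residue vector, so it receives the larger share of the attention mass. A trained model would learn `W` and typically use several heads.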

💻 Installation

Prerequisites

  • Operating System: Linux (recommended), Windows, or macOS
  • Python: 3.7.16
  • CUDA: 11.7 (for GPU support)
  • RAM: 24GB+ recommended for training

Step 1: Clone the Repository

git clone https://github.com/zhaolongNCU/PocketDTA.git
cd PocketDTA

Step 2: Create Conda Environment

# Create a new conda environment
conda create -n PocketDTA python=3.7
conda activate PocketDTA

Step 3: Install Dependencies

# Install PyTorch with CUDA support
pip install torch==1.13.0+cu117 torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu117

# Install PyTorch Geometric and dependencies
pip install torch-geometric==2.3.1
pip install torch-scatter torch-sparse torch-cluster -f https://data.pyg.org/whl/torch-1.13.0+cu117.html

# Install other dependencies
pip install -r requirements.txt

Alternative: One-Command Installation

pip install -r requirements.txt

Note: The requirements.txt contains all necessary packages with specified versions.
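
After installation, a quick sanity check can confirm that the main packages are importable. The sketch below is not part of the repository; the helper name `check_packages` is hypothetical, and the list contains the usual top-level import names for the packages installed above.

```python
import importlib.util
from typing import Dict, Iterable

# Top-level import names for the packages installed in the steps above.
REQUIRED = ("torch", "torch_geometric", "torch_scatter", "torch_sparse", "torch_cluster", "rdkit")


def check_packages(names: Iterable[str] = REQUIRED) -> Dict[str, bool]:
    """Map each top-level import name to whether Python can locate it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_packages().items():
        print(f"{name:16s} {'ok' if found else 'MISSING'}")
```

This only checks that each package can be located. It does not verify versions or CUDA availability.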


🚀 Quick Start

1. Download Pre-trained Models

Download the required pre-trained model weights:

  • ESM-2 (650M parameters): Download Link
  • GraphMVP: Already included in dataset/Davis/ and dataset/KIBA/

Place the ESM-2 model file in the appropriate directory (see Pre-trained Models).

2. Download Dataset Files

Download 3D structure files from Google Cloud Drive:

  • Target 3D structures (.pdb files)
  • Top-3 binding pocket files (.pdb files)

Place them in the corresponding dataset folders.

3. Train the Model

# Train on Davis dataset with seed 0
python main.py --task Davis --r 0

# Train on KIBA dataset
python main.py --task KIBA --r 0

4. Evaluate Performance

Results will be saved in the result/ directory, including:

  • Training/validation metrics
  • Model checkpoints
  • Prediction results

🔧 Data Preprocessing

We provide complete data preprocessing pipelines for preparing custom datasets or reproducing our results from scratch.

Pipeline Overview

The data_preprocess/ directory contains standalone preprocessing tools for both Davis and KIBA datasets:

data_preprocess/
├── Davis/              # Davis dataset preprocessing pipeline
│   ├── README.md      # Complete documentation and usage guide
│   └── ...            # Preprocessing scripts
│
├── KIBA/              # KIBA dataset preprocessing pipeline
│   ├── README.md      # Complete documentation and usage guide
│   └── ...            # Preprocessing scripts
│
└── requirements.txt   # Preprocessing dependencies

Quick Start

# Navigate to dataset-specific directory
cd data_preprocess/Davis  # or KIBA

# Run automated pipeline
python run_pipeline.py --step all

Workflow

The pipeline processes: AlphaFold PDB files → DoGSite3 pockets → Graph representations

Final outputs: {Dataset}_Domain_coord_graph_top{1,2,3}seqid_dict.pickle

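📖 For detailed instructions, custom dataset processing, and troubleshooting, see the README.md in each dataset directory (data_preprocess/Davis/README.md and data_preprocess/KIBA/README.md).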


Benchmark Datasets

We provide two benchmark datasets for drug-target affinity prediction:

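| Dataset | Proteins | Compounds | Interactions | Affinity Measure | Source |
|---------|----------|-----------|--------------|------------------|--------|
| Davis | 442 | 68 | 30,056 | Kd values | Kinase inhibitors |
| KIBA | 229 | 2,111 | 118,254 | KIBA scores | Bioactivity database |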

Dataset Structure

dataset/
├── Davis/
│   ├── process.csv                          # Protein-compound pairs
│   ├── GraphMVP.pth                         # Pre-trained GraphMVP model
│   ├── Davis_Domain_coord_graph_top1seqid_dict.pickle
│   ├── Davis_Domain_coord_graph_top2seqid_dict.pickle
│   └── Davis_Domain_coord_graph_top3seqid_dict.pickle
│
└── KIBA/
    └── (same structure as Davis/)

Required Downloads

Download from Google Cloud Drive:

  • Protein 3D structures (.pdb files from AlphaFold)
  • Binding pocket files (top-3 predicted pockets from DoGSite3)

Data Format

Each .pickle file contains a dictionary:

{
    'PROTEIN_ID': torch_geometric.data.Data(
        x=...,          # Coordinates
        seq=...,        # Sequence
        node_s=...,     # Scalar features
        node_v=...,     # Vector features
        edge_s=...,     # Edge scalar features
        edge_v=...,     # Edge vector features
        edge_index=..., # Graph structure
    ),
    ...
}
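
A short sketch of how such a dictionary could be loaded and inspected follows. The helper names `load_pocket_graphs` and `describe` are hypothetical and not from the repository; the field names are the ones listed above.

```python
import pickle
from typing import Any, Dict

FIELDS = ("x", "seq", "node_s", "node_v", "edge_s", "edge_v", "edge_index")


def load_pocket_graphs(path: str) -> Dict[str, Any]:
    """Load a {protein_id: graph} dictionary from one .pickle file.

    Unpickling the real files needs torch and torch_geometric installed,
    because the stored objects are torch_geometric.data.Data instances.
    Only unpickle files from a source you trust.
    """
    with open(path, "rb") as handle:
        return pickle.load(handle)


def describe(graph: Any) -> Dict[str, Any]:
    """Report the shape (or length) of each documented field, None if absent."""
    summary = {}
    for name in FIELDS:
        value = getattr(graph, name, None)
        if value is None:
            summary[name] = None
        elif hasattr(value, "shape"):
            summary[name] = tuple(value.shape)  # tensors and arrays
        else:
            summary[name] = len(value)  # strings and lists
    return summary
```

For example, `describe(load_pocket_graphs(path)["PROTEIN_ID"])` gives a compact overview of one pocket graph without printing full tensors.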

🎯 Pre-trained Models

PocketDTA leverages several pre-trained models for enhanced performance:

Required Models

| Model | Purpose | Size | Download Link |
|-------|---------|------|---------------|
| ESM-2 | Protein sequence encoding | 650M params | Download |
| GraphMVP | Molecular graph pre-training | Included | In dataset/ folders |

Optional Models (for ablation studies)

| Model | Purpose | Download Link |
|-------|---------|---------------|
| ProtBert | Alternative protein encoder | Zenodo |
| ProtT5 | Alternative protein encoder | Zenodo |
| 3DInfomax | 3D molecular pre-training | GitHub |

Model Placement

Place downloaded models in the appropriate directories as specified in the configuration files in configs/.


🎓 Training

Basic Training

Train on Davis dataset:

python main.py --task Davis --r 0

Train on KIBA dataset:

python main.py --task KIBA --r 0

Multi-Seed Training

Run experiments with multiple random seeds (Linux):

./training.sh

This script trains models with seeds 0-4 for robust performance evaluation.
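
On platforms without a POSIX shell, the same seed sweep can be approximated in Python. The contents of training.sh are not shown here, so this is only a sketch under the assumption that it calls `main.py` once per seed; `build_commands` and `run_all` are hypothetical helpers.

```python
import subprocess
import sys
from typing import Iterable, List


def build_commands(task: str, seeds: Iterable[int] = range(5)) -> List[List[str]]:
    """Build one `python main.py --task <task> --r <seed>` command per seed."""
    return [[sys.executable, "main.py", "--task", task, "--r", str(seed)] for seed in seeds]


def run_all(task: str, seeds: Iterable[int] = range(5)) -> None:
    """Run the seeds one after another, stopping at the first failure."""
    for command in build_commands(task, seeds):
        subprocess.run(command, check=True)
```

Run `run_all("Davis")` from the repository root to train seeds 0 to 4 sequentially.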

Training Parameters

Key command-line arguments:

  • --task: Dataset to use (Davis or KIBA)
  • --r: Random seed for reproducibility
  • --epochs: Number of training epochs (default: 1000)
  • --batch-size: Batch size (default: 128)
  • --lr: Learning rate (default: 0.0001)
  • --device: GPU device (default: cuda:0)
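
For reference, the flags above could be declared with `argparse` roughly as follows. This mirrors the documented names and defaults only; the actual parser in main.py is not shown here and may differ. The `choices`, `required`, and the default for `--r` are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser for the documented flags (not the repository's own)."""
    parser = argparse.ArgumentParser(description="PocketDTA training (illustrative)")
    parser.add_argument("--task", choices=["Davis", "KIBA"], required=True, help="Dataset to use")
    # The default seed is not documented; 0 is a placeholder.
    parser.add_argument("--r", type=int, default=0, help="Random seed")
    parser.add_argument("--epochs", type=int, default=1000, help="Number of training epochs")
    parser.add_argument("--batch-size", type=int, default=128, help="Batch size")
    parser.add_argument("--lr", type=float, default=1e-4, help="Learning rate")
    parser.add_argument("--device", default="cuda:0", help="Device string, e.g. cuda:0 or cpu")
    return parser


# Equivalent to: python main.py --task Davis --r 0 --epochs 200
args = build_parser().parse_args(["--task", "Davis", "--r", "0", "--epochs", "200"])
```

Note that `argparse` exposes `--batch-size` as `args.batch_size`.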

Monitoring Training

Training logs and checkpoints are saved in result/{task}/:

  • Training/validation losses
  • Performance metrics (MSE, CI, R², etc.)
  • Model checkpoints
  • Prediction results
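
For readers unfamiliar with these metrics, the sketch below gives plain-Python versions using common definitions: mean squared error, the concordance index over pairs with different true affinities, and the coefficient of determination. The repository's own implementations are not shown here and may differ (for instance, some DTA papers report a squared correlation instead of this R²).

```python
from typing import Sequence


def mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def concordance_index(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Fraction of comparable pairs ranked in the right order.

    Pairs with equal true values are skipped; tied predictions count 0.5.
    """
    concordant, comparable = 0.0, 0
    for i in range(len(y_true)):
        for j in range(i + 1, len(y_true)):
            if y_true[i] == y_true[j]:
                continue
            comparable += 1
            agreement = (y_true[i] - y_true[j]) * (y_pred[i] - y_pred[j])
            if agreement > 0:
                concordant += 1.0
            elif agreement == 0:
                concordant += 0.5
    return concordant / comparable if comparable else float("nan")


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values, not real results.
y_true = [5.0, 6.0, 7.0, 8.0]
y_pred = [5.5, 5.5, 7.0, 9.0]
```

The pairwise loop is quadratic in the number of samples, so it suits small checks. For full test sets a vectorized version is preferable.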

📊 Evaluation

Ablation Studies

Representation Ablation - Test different pre-trained encoders:

./Ablation.sh

This compares:

  • ESM-2 vs ProtBert vs ProtT5 (protein encoders)
  • GraphMVP vs 3DInfomax (molecular encoders)

Module Ablation - Test model components:

./Ablation_module.sh

This evaluates:

  • Impact of multi-pocket processing
  • Effect of bilinear attention
  • Contribution of 3D geometric features

Cold-Start Experiments

Test generalization to unseen proteins/compounds:

./Cold.sh

Three scenarios:

  • Cold-start proteins: Unseen target proteins
  • Cold-start compounds: Unseen drug molecules
  • Cold-start pairs: Both protein and compound unseen
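
The three scenarios can be illustrated with a small splitting routine. How Cold.sh builds its splits is not shown here, so the following is a sketch under assumptions: the function `cold_split`, the 20% hold-out fraction, and the rule of dropping pairs with exactly one unseen entity in the pair scenario are all illustrative choices.

```python
import random
from typing import List, Set, Tuple

Pair = Tuple[str, str]  # (protein_id, compound_id)


def cold_split(pairs: List[Pair], scenario: str, test_frac: float = 0.2, seed: int = 0):
    """Split pairs so that held-out entities never appear in training.

    scenario: "protein", "compound", or "pair" (both entities unseen).
    """
    if scenario not in ("protein", "compound", "pair"):
        raise ValueError(f"unknown scenario: {scenario}")
    rng = random.Random(seed)

    def hold_out(entities: Set[str]) -> Set[str]:
        ordered = sorted(entities)  # sort first so the split is reproducible
        rng.shuffle(ordered)
        return set(ordered[: max(1, int(len(ordered) * test_frac))])

    held_p = hold_out({p for p, _ in pairs}) if scenario in ("protein", "pair") else set()
    held_c = hold_out({c for _, c in pairs}) if scenario in ("compound", "pair") else set()

    train, test = [], []
    for p, c in pairs:
        p_new, c_new = p in held_p, c in held_c
        if scenario == "pair":
            if p_new and c_new:
                test.append((p, c))
            elif not p_new and not c_new:
                train.append((p, c))
            # Pairs with exactly one unseen entity are dropped.
        elif p_new or c_new:
            test.append((p, c))
        else:
            train.append((p, c))
    return train, test
```

With a full 5 x 5 grid of pairs, the protein scenario holds out one protein (5 test pairs, 20 training pairs), while the pair scenario leaves a single test pair and 16 training pairs.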

🔬 Interpretability Analysis

Identify key molecular interactions using attention weights:

python interaction_weight.py --task Davis --model DTAPredictor_test --r 2 --use-test True

Outputs

The script generates:

  • Atomic attention weights: Important functional groups in the drug
  • Residue attention weights: Key amino acids in the binding pocket
  • Visualization files: Heatmaps of protein-drug interactions
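
One plausible way to obtain per-atom and per-residue weights from a joint attention map is to sum over the other axis and normalize. The sketch below illustrates that reduction. It is an assumption about the general approach rather than a description of interaction_weight.py, and `marginal_weights` and `top_k` are hypothetical helpers.

```python
from typing import List, Sequence, Tuple


def marginal_weights(attn: List[List[float]]) -> Tuple[List[float], List[float]]:
    """Reduce an (atoms x residues) map to per-atom and per-residue weights.

    Each returned list is normalized to sum to 1.
    """
    total = sum(sum(row) for row in attn)
    atom = [sum(row) / total for row in attn]
    residue = [sum(col) / total for col in zip(*attn)]
    return atom, residue


def top_k(weights: Sequence[float], k: int = 3) -> List[Tuple[int, float]]:
    """Return (index, weight) for the k largest entries, highest first."""
    return sorted(enumerate(weights), key=lambda item: item[1], reverse=True)[:k]


# Toy attention map with hypothetical values: 2 atoms x 2 residues.
atom_w, residue_w = marginal_weights([[0.1, 0.3], [0.2, 0.4]])
```

Here the second atom and the second residue carry the most weight, so `top_k(residue_w, 1)` points at residue index 1.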

Analysis Workflow

  1. Select protein-compound pair for analysis
  2. Run interaction_weight.py to compute attention weights
  3. Visualize interactions using molecular docking software
  4. Validate findings with literature and experimental data

Use Cases

  • Drug design: Identify modifications to improve binding
  • Lead optimization: Focus on key interaction sites
  • Mechanism understanding: Explain binding affinity predictions

📁 Project Structure

PocketDTA/
├── README.md                    # This file
├── requirements.txt             # Python dependencies
├── PocketDTA.jpg               # Architecture diagram
│
├── data_preprocess/            # Data preprocessing pipelines
│   ├── Davis/                  # Davis dataset processing
│   ├── KIBA/                   # KIBA dataset processing
│   └── requirements.txt        # Preprocessing dependencies
│
├── dataset/                    # Processed datasets
│   ├── Davis/                  # Davis dataset files
│   └── KIBA/                   # KIBA dataset files
│
├── configs/                    # Configuration files
│
├── src/                        # Source code
│   ├── model.py               # Main PocketDTA model
│   ├── gvp_gnn.py             # GVP-GNN implementation
│   ├── compound_gnn_model.py  # Molecular encoder
│   ├── protein_to_graph.py    # Protein graph construction
│   ├── data.py                # Data loading utilities
│   └── featurizers/           # Feature extraction modules
│
├── main.py                     # Training script
├── train_test.py              # Training/testing functions
├── interaction_weight.py       # Interpretability analysis
├── utils_dta.py               # Utility functions
│
├── Radam.py                    # RAdam optimizer
├── lookahead.py               # Lookahead optimizer
│
├── training.sh                 # Multi-seed training
├── Ablation.sh                # Representation ablation
├── Ablation_module.sh         # Module ablation
└── Cold.sh                    # Cold-start experiments

📖 Citation

If you use this code in your research, please cite our paper:

@article{10.1093/bioinformatics/btae594,
    author = {Zhao, Long and Wang, Hongmei and Shi, Shaoping},
    title = "{PocketDTA: An advanced multimodal architecture for enhanced prediction 
             of drug-target affinity from 3D structural data of target binding pockets}",
    journal = {Bioinformatics},
    pages = {btae594},
    year = {2024},
    month = {10},
    doi = {10.1093/bioinformatics/btae594},
    url = {https://doi.org/10.1093/bioinformatics/btae594},
}

📄 License

This project is released under the MIT License. See LICENSE file for details.

Please also check licenses for:

  • AlphaFold structures: CC-BY 4.0
  • DoGSite3: Academic use only
  • Pre-trained models: Check individual model licenses


⭐ If you find this project helpful, please consider giving it a star!
