This repository contains a post-drug-repurposing analysis workflow. The project evaluates candidate drugs identified through repurposing efforts by analyzing their Absorption, Distribution, Metabolism, and Excretion (ADME) profiles, predicting toxicities using the Tox21 dataset, and assessing molecular activity through Quantitative Structure-Activity Relationship (QSAR) modeling. The workflow is designed to accelerate drug discovery while screening candidates for both safety and efficacy.
- Project Overview
- Current Progress
- Implemented Steps
- Future Enhancements
- Technology Stack
- Example Outputs
Drug repurposing is an efficient strategy to identify new therapeutic uses for existing drugs. This project provides a robust framework for the post-analysis of drug repurposing candidates. It encompasses crucial stages such as ADME screening, toxicity prediction, and QSAR modeling, offering a holistic evaluation of drug candidates. The ultimate goal is to generate actionable insights that can expedite the drug discovery pipeline.
At this preliminary stage, the project focuses on preparing molecular data for subsequent pharmacokinetic and toxicity evaluations. This includes:
- Filtering initial drug repurposing results based on specific criteria.
- Retrieving essential chemical identifiers (CID) and structural representations (SMILES) from the PubChem database.
- Formatting data for compatibility with external ADME prediction tools like SwissADME.
- Beginning to cluster the results and identify significant drug groups using unsupervised learning methods.
The complete code and explanations for the remaining, more advanced steps are under development and will be added later.
Step 1 processes the raw drug repurposing data to refine the list of candidates based on predefined criteria.
- Input: export.csv (a CSV file containing the initial drug repurposing results, with the columns Rank, Score, Type, ID, Name, Description).
- Process:
  - Reads the export.csv file.
  - Filters rows by Type (e.g., 'cp', 'kd', 'oe', 'cc') and keeps those with Score values greater than 90.
- Output: Four distinct CSV files, each containing the filtered data for one drug type with scores exceeding 90:
  - filtered_cp_above_90.csv
  - filtered_kd_above_90.csv
  - filtered_oe_above_90.csv
  - filtered_cc_above_90.csv
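The filtering step described above can be sketched as follows. This is a minimal, dependency-free illustration using the standard library csv module rather than Pandas (which the project lists for this step); the function name `filter_export` and its parameters are hypothetical and not taken from the project's code.

```python
import csv
from pathlib import Path

# Type codes named in this README.
TYPES = ("cp", "kd", "oe", "cc")


def filter_export(export_path, out_dir=".", threshold=90.0, types=TYPES):
    """Split export.csv into one file per Type, keeping rows with Score > threshold.

    Returns a dict mapping each type code to the number of rows written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Load every row once so each type can be filtered from the same list.
    with open(export_path, newline="", encoding="utf-8") as fh:
        reader = csv.DictReader(fh)
        fieldnames = reader.fieldnames
        rows = list(reader)

    counts = {}
    for type_code in types:
        kept = [
            row for row in rows
            if row["Type"] == type_code and float(row["Score"]) > threshold
        ]
        # Produces names such as filtered_cp_above_90.csv.
        out_path = out_dir / f"filtered_{type_code}_above_{int(threshold)}.csv"
        with open(out_path, "w", newline="", encoding="utf-8") as fh:
            writer = csv.DictWriter(fh, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(kept)
        counts[type_code] = len(kept)
    return counts
```

Calling `filter_export("export.csv")` would write the four files into the current directory. The comparison is strict, so a row with a score of exactly 90 is excluded, matching "greater than 90" above.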
Step 2 integrates with the PubChem API to enrich the filtered drug candidates with the chemical identifiers and structural information needed for ADME analysis.
- Input: filtered_cp_above_90.csv (or any of the filtered files from Step 1).
- Process:
  - Reads the filtered data and extracts drug names.
  - Uses the PubChem PUG REST API to fetch the Compound ID (CID) for each drug name.
  - Uses the obtained CID to retrieve the SMILES (Simplified Molecular-Input Line-Entry System) notation, a standard chemical structure representation.
  - Includes a time.sleep delay between API requests to avoid rate limiting.
- Output:
  - compounds.csv: A CSV file listing the Compound Name, CID, and SMILES Notation for compounds successfully processed.
  - molecules_for_adme.txt: A text file containing SMILES notations, formatted for direct input into SwissADME (SMILES followed by compound name on each line).
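The lookup loop described above might look roughly like the sketch below. Several details are assumptions rather than the project's actual code: it uses urllib from the standard library instead of Requests, asks PUG REST for plain-text (TXT) responses, requests the `CanonicalSMILES` property (PubChem has revised its SMILES property names, so check the current PUG REST documentation), takes the first CID when a name matches several, and waits a hypothetical 0.5 seconds between requests. All function names are illustrative.

```python
import time
import urllib.parse
import urllib.request

PUG_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def cid_url(name):
    """URL that asks PUG REST for the CIDs matching a compound name."""
    quoted = urllib.parse.quote(name, safe="")
    return f"{PUG_BASE}/compound/name/{quoted}/cids/TXT"


def smiles_url(cid, prop="CanonicalSMILES"):
    """URL that asks PUG REST for one SMILES property of a CID as plain text.

    The property name is an assumption; adjust it to the current PubChem docs.
    """
    return f"{PUG_BASE}/compound/cid/{cid}/property/{prop}/TXT"


def fetch_first_line(url, timeout=30):
    """Return the first non-empty line of the response, or None on failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            text = resp.read().decode("utf-8")
    except OSError:  # covers URLError, HTTPError (e.g. 404 for unknown names), timeouts
        return None
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return lines[0] if lines else None


def lookup(names, delay=0.5, fetch=fetch_first_line):
    """Return (name, cid, smiles) tuples for the names that resolve fully."""
    results = []
    for name in names:
        cid = fetch(cid_url(name))
        time.sleep(delay)  # pause after each request to stay under rate limits
        if cid is None:
            continue  # name not found; skip it, as only successes are recorded
        smiles = fetch(smiles_url(cid))
        time.sleep(delay)
        if smiles is not None:
            results.append((name, cid, smiles))
    return results
```

A call such as `lookup(["cyclopamine", "methotrexate"])` performs live requests. The `fetch` parameter exists so the loop can be exercised offline with a stand-in function.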
The upcoming phases of this project will include:
- Pharmacokinetic Evaluation: Detailed ADME screening using SwissADME results (or similar tools) to assess absorption, distribution, metabolism, and excretion profiles.
- Toxicity Assessment: Prediction of toxicities utilizing the Tox21 dataset and relevant models.
- Clustering Analysis: Application of unsupervised learning methods such as KMeans and Hierarchical Clustering to identify significant drug groups and patterns within the repurposing results.
- Quantitative Structure-Activity Relationship (QSAR) Modeling: Implementation of QSAR models using various machine learning techniques including:
- Random Forest
- Logistic Regression
- Support Vector Machine (SVM)
- Gradient Boosting
- Comparison of results with Deep Neural Network (DNN) approaches.
Stay tuned for these comprehensive updates!
- Python
- Jupyter Notebook
- Pandas: For data manipulation and CSV file processing.
- Requests: For making HTTP requests to external APIs (e.g., PubChem).
- CSV: For handling CSV file writes.
- Time: For managing API request rates.
- (Future) Scikit-learn: For various machine learning algorithms (QSAR, clustering, etc.).
- (Future) Matplotlib & Seaborn: For data visualization.
- (Future) TensorFlow/Keras or PyTorch: For Deep Neural Network implementations.
Below are examples of the intermediate and final files generated by the current script:
export.csv (Input Example):
Rank,Score,Type,ID,Name,Description
1,99.98,oe,ccsbBroad304_01966,RUVBL1,ATPases / AAA-type
2,99.98,kd,CGS001-8848,TSC22D1,-
3,99.96,oe,ccsbBroad304_00841,IKBKB,IKK family
4,99.94,kd,CGS001-1196,CLK2,CDC-like kinases
5,99.88,cp,BRD-A02333338,cyclopamine,Smoothened receptor antagonist
...
filtered_cp_above_90.csv (Example Output from Step 1):
Rank,Score,Type,ID,Name,Description
5,99.88,cp,BRD-A02333338,cyclopamine,Smoothened receptor antagonist
20,99.25,cp,BRD-K90543092,levonorgestrel,Estrogen receptor agonist
21,99.21,cp,BRD-K59456551,methotrexate,Dihydrofolate reductase inhibitor
...
compounds.csv (Example Output from Step 2):
Compound Name,CID,SMILES Notation
cyclopamine,442972,CC1CC2C(C(C3(O2)CCC4C5CC=C6CC(CCC6(C5CC4=C3C)C)O)C)NC1
levonorgestrel,13109,CCC12CCC3C(C1CCC2(C#C)O)CCC4=CC(=O)CCC34
methotrexate,126941,CN(CC1=CN=C2C(=N1)C(=NC(=N2)N)N)C3=CC=C(C=C3)C(=O)NC(CCC(=O)O)C(=O)O
...
molecules_for_adme.txt (Example Output from Step 2):
CC1CC2C(C(C3(O2)CCC4C5CC=C6CC(CCC6(C5CC4=C3C)C)O)C)NC1 cyclopamine
CCC12CCC3C(C1CCC2(C#C)O)CCC4=CC(=O)CCC34 levonorgestrel
CN(CC1=CN=C2C(=N1)C(=NC(=N2)N)N)C3=CC=C(C=C3)C(=O)NC(CCC(=O)O)C(=O)O methotrexate
...
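The two Step 2 example files above are related by a simple transformation: each compounds.csv row becomes one "SMILES name" line in molecules_for_adme.txt. A minimal sketch of that conversion follows, assuming the column headers shown in the compounds.csv example; the function name `swissadme_lines` is hypothetical.

```python
import csv
import io


def swissadme_lines(compounds_csv_text):
    """Turn compounds.csv text into 'SMILES name' lines, one per compound."""
    reader = csv.DictReader(io.StringIO(compounds_csv_text))
    return [
        f"{row['SMILES Notation']} {row['Compound Name']}"
        for row in reader
        if row["SMILES Notation"]  # skip rows without a structure
    ]


# Two rows copied from the compounds.csv example above.
example = """Compound Name,CID,SMILES Notation
cyclopamine,442972,CC1CC2C(C(C3(O2)CCC4C5CC=C6CC(CCC6(C5CC4=C3C)C)O)C)NC1
levonorgestrel,13109,CCC12CCC3C(C1CCC2(C#C)O)CCC4=CC(=O)CCC34
"""
lines = swissadme_lines(example)
```

Joining `lines` with newlines and writing the result to molecules_for_adme.txt gives a file in the layout shown in the example.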