Interpreting Quantitative Structure-Activity Relationships using Shapley Values of R-Groups
For details, please refer to our J. Chem. Inf. Model. paper.
If you find this software useful in your work, please cite our paper.
- python = 3.7
- rdkit = 2021.09.2
- scikit-learn (sklearn) = 1.0.1
- shap = 0.41.0
- numpy
- scipy
- pandas
A table containing a number of congeneric compounds is required as input.
Specifically, each compound should occupy a single row, with its ID, SMILES, and Activity as columns.
The activity should be on a log scale, e.g., pKi, pIC50, or -log10 of a molar concentration.
Other properties can also be analyzed, provided they can be decomposed into side-chain contributions.
Properties such as T1/2 and Tmax (both times) are not suitable here.
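If your activities are stored as raw concentrations, they need to be converted to a log scale first. The snippet below is a minimal sketch of that conversion for IC50 values given in nanomolar; the function name `ic50_nm_to_pic50` is hypothetical and not part of this repository.

```python
import math


def ic50_nm_to_pic50(ic50_nm: float) -> float:
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in mol/L)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be a positive concentration")
    # 1 nM = 1e-9 mol/L
    return -math.log10(ic50_nm * 1e-9)


# Example: an IC50 of 1000 nM (1 uM) corresponds to a pIC50 of about 6.
print(round(ic50_nm_to_pic50(1000.0), 2))
```

The same formula applies to Ki values; only the unit factor changes if your data are in micromolar or molar.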
For example,
| ID | SMILES | Activity |
|---|---|---|
| 1 | CCN1C2=C([C@@H]([C@H](NC(=O)C3=CC=CC=C3)C1=O)C1=CC=C(F)C=C1)C(C)=NN2C1=CC=CC=C1 | 5.88 |
| 2 | CCN1C2=C([C@@H]([C@H](NC(=O)C3=CC(C)=CC=C3)C1=O)C1=CC=C(F)C=C1)C(C)=NN2C1=CC=CC=C1 | 6.70 |
| 3 | CCN1C2=C([C@@H]([C@H](NC(=O)C3=CC(Cl)=CC=C3)C1=O)C1=CC=C(F)C=C1)C(C)=NN2C1=CC=CC=C1 | 6.19 |
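Before running the scripts, it can help to check that the input table has the expected shape. The following is a rough sketch using only the standard library; it assumes the CSV header uses exactly the column names `ID`, `SMILES`, and `Activity` shown above, and the helper `check_input_table` is hypothetical rather than part of this repository. It does not check that the SMILES are chemically valid (RDKit's `Chem.MolFromSmiles` could be used for that).

```python
import csv
import io

REQUIRED_COLUMNS = ("ID", "SMILES", "Activity")


def check_input_table(handle):
    """Return a list of problems found in a CSV with ID, SMILES, Activity columns."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return ["missing columns: " + ", ".join(missing)]

    problems = []
    seen_ids = set()
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        compound_id = row["ID"]
        if compound_id in seen_ids:
            problems.append(f"line {line_no}: duplicate ID {compound_id!r}")
        seen_ids.add(compound_id)
        if not (row["SMILES"] or "").strip():
            problems.append(f"line {line_no}: empty SMILES")
        try:
            float(row["Activity"])
        except (TypeError, ValueError):
            problems.append(f"line {line_no}: Activity is not numeric")
    return problems


# Small in-memory example; replace with open("./demo/demo.csv", newline="") for a real file.
sample = io.StringIO("ID,SMILES,Activity\n1,CCO,5.88\n2,CCN,n/a\n")
print(check_input_table(sample))
```

An empty list means no problems were detected by these checks.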
Run the following script for side-chain decomposition with an automatically detected scaffold:
python decompose.py -i ./demo/demo.csv -o ./demo/
You may also specify the scaffold in SMILES format with the -c option:
python decompose.py -i ./demo/demo.csv -c O=C1C\(N[*:5]\)C\([*:4]\)c2c\([*:3]\)nn\([*:2]\)c2N1[*:1] -o ./demo/
Note that special shell characters in the SMILES (e.g., ( and )) must be escaped with a backslash (e.g., \( and \)).
You can adjust the scaffold and rerun the decomposition until the result is satisfactory.
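As an alternative to escaping each character by hand, the scaffold can be quoted as a whole. The sketch below uses Python's standard `shlex.quote` to build the command line; it assumes a POSIX-style shell (bash, zsh) and reuses the scaffold from the example above without the backslashes.

```python
import shlex

# Scaffold SMILES with numbered attachment points, written without manual escaping.
scaffold = "O=C1C(N[*:5])C([*:4])c2c([*:3])nn([*:2])c2N1[*:1]"

# shlex.quote wraps the string so the shell passes it through unchanged.
command = " ".join([
    "python", "decompose.py",
    "-i", "./demo/demo.csv",
    "-c", shlex.quote(scaffold),
    "-o", "./demo/",
])
print(command)
```

The printed line can be pasted into a terminal. Wrapping the SMILES in single quotes by hand has the same effect in such shells.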
Run the following script for MolSHAP analysis:
python molshap.py -i ./demo/demo.csv -o ./demo/
| filename | description |
|---|---|
| prefix_decomp.pk | Decomposition results |
| prefix_frag2idx.pk | Mapping from fragment SMILES to index |
| prefix_contrib.pk | Side-chain fragment contributions by index |
| prefix_contrib_by_smiles.pk | Side-chain fragment contributions by SMILES |
| prefix_opt.csv | Table of optimized compounds |
| prefix_opt.png | Scatter plot of optimized compounds |
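The `.pk` outputs can be inspected from Python. The sketch below assumes they are standard Python pickle files, which is a guess based on the extension and is not confirmed here; the helper `load_result` and the example path are hypothetical. Only unpickle files you produced yourself or otherwise trust.

```python
import pickle
from pathlib import Path


def load_result(path):
    """Load one output file, assuming it is a standard Python pickle."""
    with Path(path).open("rb") as fh:
        return pickle.load(fh)


# Hypothetical usage; replace "prefix" with the actual prefix of your output files.
# contrib = load_result("./demo/prefix_contrib_by_smiles.pk")
# print(type(contrib))
```

Printing the type of the loaded object first is a reasonable way to find out how the contents are organized before working with them.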