This is the code repository for the paper "Accurate and Rational Collision Cross Section Prediction using Voxel Projected Area and Deep Learning". We developed PACCS, a projected area-based CCS prediction method that works directly from molecular conformers. PACCS lets users generate large-scale, searchable CCS databases using the open-source Jupyter Notebook.
We recommend installing with conda or pip.
Either requirements/conda/environment.yml or requirements/pip/requirements.txt installs all the required packages.
git clone https://github.com/yuxuanliao/PACCS.git
cd PACCS
conda env create -f requirements/conda/environment.yml
conda activate PACCS
PACCS calculates the projected area with a voxel-based approach, computes the m/z, and constructs the molecular graph. The corresponding methods are implemented in VoxelProjectedArea.py, MZ.py, and MolecularRepresentations.py.
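As an illustration of the m/z step, the sketch below adds an adduct mass shift to a neutral monoisotopic mass. It is not the code in MZ.py: the function name `adduct_mz`, the `ADDUCT_SHIFT` table, and the choice of three singly charged adducts are assumptions made for this example.

```python
# Monoisotopic mass shifts (Da) for a few singly charged adducts.
# The adduct list is an illustrative assumption, not taken from MZ.py.
PROTON = 1.007276        # mass of a proton
SODIUM_ION = 22.989218   # mass of Na+ (sodium atom minus one electron)

ADDUCT_SHIFT = {
    "[M+H]+": PROTON,
    "[M+Na]+": SODIUM_ION,
    "[M-H]-": -PROTON,
}


def adduct_mz(neutral_mass, adduct):
    """Return the m/z of a singly charged adduct ion (hypothetical helper)."""
    return neutral_mass + ADDUCT_SHIFT[adduct]


# Caffeine (C8H10N4O2) has a monoisotopic mass of about 194.0804 Da.
print(round(adduct_mz(194.0804, "[M+H]+"), 4))  # 195.0877
```

Because all three adducts carry a single charge, no division by the charge number is needed here; multiply charged ions would require it.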
1. Generate 3D conformers of molecules.
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles(smiles)
mol = Chem.AddHs(mol)
ps = AllChem.ETKDGv3()
ps.randomSeed = -1
ps.maxAttempts = 1
ps.numThreads = 0
ps.useRandomCoords = True
re = AllChem.EmbedMultipleConfs(mol, numConfs = 1, params = ps)
re = AllChem.MMFFOptimizeMoleculeConfs(mol, numThreads = 0)
- ETKDGv3 returns an EmbedParameters object for version 3 of the ETKDG method (which adds macrocycle support).
- EmbedMultipleConfs generates the 3D conformers of molecules.
- MMFFOptimizeMoleculeConfs optimizes the 3D conformers of molecules.
2. Calculate voxel projected area. For details, see VoxelProjectedArea.py.
- Distribute points evenly over the surfaces of the 3D atomic spheres using the Fibonacci grid approach.
- Project the points onto the three coordinate planes (xy, xz, yz).
- Average the projected areas over the three planes.
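The three steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation in VoxelProjectedArea.py: the function names, the voxel size of 0.2 Å, the 2000 points per atom, and the toy two-atom input are all chosen for this example only.

```python
import math


def fibonacci_sphere(center, radius, n_points=2000):
    """Spread n_points roughly evenly over a sphere using a Fibonacci lattice."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    cx, cy, cz = center
    points = []
    for i in range(n_points):
        z = 1.0 - 2.0 * (i + 0.5) / n_points   # heights evenly spaced in (-1, 1)
        r = math.sqrt(1.0 - z * z)             # ring radius at that height
        theta = golden_angle * i               # rotate by the golden angle each step
        points.append((cx + radius * r * math.cos(theta),
                       cy + radius * r * math.sin(theta),
                       cz + radius * z))
    return points


def voxel_projected_area(atoms, voxel=0.2, n_points=2000):
    """Mean area of occupied grid cells over the xy, xz, and yz projections.

    atoms is a list of ((x, y, z), radius) tuples, all in the same length unit.
    """
    occupied = {"xy": set(), "xz": set(), "yz": set()}
    for center, radius in atoms:
        for x, y, z in fibonacci_sphere(center, radius, n_points):
            i, j, k = (math.floor(c / voxel) for c in (x, y, z))
            occupied["xy"].add((i, j))
            occupied["xz"].add((i, k))
            occupied["yz"].add((j, k))
    areas = [len(cells) * voxel ** 2 for cells in occupied.values()]
    return sum(areas) / len(areas)


# Toy input: two overlapping spheres with Bondi van der Waals radii (C 1.70, O 1.52 Å).
toy_atoms = [((0.0, 0.0, 0.0), 1.70), ((1.13, 0.0, 0.0), 1.52)]
print(voxel_projected_area(toy_atoms))
```

Storing occupied cells in sets means that overlapping atoms are counted only once per projection. In this sketch a smaller voxel needs more surface points per atom, otherwise unfilled cells inside each projected disc lower the area estimate.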
Train the model on your own training dataset with the PACCS_train function in Training.py.
PACCS_train(input_path, epochs, batchsize, output_model_path)
Optional args
- input_path : Path to the file containing the SMILES and adduct data.
- epochs, batchsize : Optimized training hyperparameters.
- output_model_path : File path where the model is stored.
The predicted CCS values of molecules are obtained by feeding the voxel projected area, molecular graph, one-hot encoding of adduct type, and m/z into the already trained PACCS model with Prediction.py.
PACCS_predict(input_path, model_path, output_path)
Optional args
- input_path : Path to the file containing the SMILES and adduct data.
- model_path : File path where the model is stored.
- output_path : Path to save predicted CCS values.
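One of the model inputs mentioned above is a one-hot encoding of the adduct type. A minimal sketch of such an encoding is shown below; the adduct vocabulary and the helper name `one_hot_adduct` are assumptions for illustration and may differ from what Prediction.py uses.

```python
# Assumed adduct vocabulary; the set supported by PACCS may differ.
ADDUCTS = ["[M+H]+", "[M+Na]+", "[M-H]-"]


def one_hot_adduct(adduct, vocabulary=ADDUCTS):
    """Return a list with a 1 at the adduct's position and 0 elsewhere."""
    if adduct not in vocabulary:
        raise ValueError(f"unsupported adduct: {adduct}")
    return [1 if known == adduct else 0 for known in vocabulary]


print(one_hot_adduct("[M+Na]+"))  # [0, 1, 0]
```

Raising an error for an unknown adduct, rather than returning an all-zero vector, makes unsupported input visible before it reaches the model.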
- The curated dataset is randomly split into the training, validation, and test sets in a ratio of 8:1:1.
- The external test set is used to compare the performance of different methods.
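A random 8:1:1 split like the one described above can be sketched as follows. The function name `split_811` and the fixed seed are assumptions for this example; the repository's own splitting code may differ.

```python
import random


def split_811(records, seed=0):
    """Shuffle records and split them 8:1:1 into train, validation, and test lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)   # seeded so the split is reproducible
    n_train = len(shuffled) * 8 // 10
    n_valid = len(shuffled) // 10
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]      # the remainder goes to the test set
    return train, valid, test


train, valid, test = split_811(range(1000))
print(len(train), len(valid), len(test))  # 800 100 100
```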
The example code for model training is in Model training.ipynb. By running Model training.ipynb directly, users can train the model on their own training dataset.
The example code for CCS prediction is in CCS prediction.ipynb. By running CCS prediction.ipynb directly, users can predict CCS values with PACCS.
CCS values can also be predicted via the Colab link prediction.ipynb, which lets users predict CCS values directly without downloading anything.
The example code for generating large-scale CCS databases with PACCS is in database generation.ipynb. By running database generation.ipynb directly, users can customize and generate their own large-scale CCS databases to suit their practical needs.