Scikit-Learn classes for molecular vectorization using RDKit
The intended usage is to add molecular vectorization directly into scikit-learn pipelines, so that the final model predicts directly on RDKit molecules or SMILES strings.
As an example, with the needed scikit-learn and scikit-mol imports and RDKit mol objects in the mol_list_train and mol_list_test lists:
pipe = Pipeline([('mol_transformer', MorganTransformer()), ('Regressor', Ridge())])
pipe.fit(mol_list_train, y_train)
pipe.score(mol_list_test, y_test)
pipe.predict([Chem.MolFromSmiles('c1ccccc1C(=O)C')])
>>> array([4.93858815])
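The example above starts from RDKit mol objects. To illustrate how a pipeline could instead take SMILES strings as input, here is a minimal, self-contained sketch. It does not use scikit-mol's classes: the helper functions `smiles_to_mols` and `mols_to_morgan` are hypothetical stand-ins wrapped in scikit-learn's `FunctionTransformer`, and the target values are toy data (heavy atom counts) chosen only so the code runs.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def smiles_to_mols(smiles_list):
    """Hypothetical helper: parse SMILES strings into RDKit mol objects."""
    return [Chem.MolFromSmiles(smi) for smi in smiles_list]


def mols_to_morgan(mol_list, radius=2, n_bits=1024):
    """Hypothetical helper: one row of Morgan fingerprint bits per molecule."""
    gen = rdFingerprintGenerator.GetMorganGenerator(radius=radius, fpSize=n_bits)
    return np.vstack([gen.GetFingerprintAsNumPy(mol) for mol in mol_list])


# Toy data: the "property" is just the heavy atom count of each molecule.
train_smiles = ["CCO", "CCCO", "CCCCO", "c1ccccc1", "Cc1ccccc1", "CC(=O)O"]
y_train = [Chem.MolFromSmiles(smi).GetNumHeavyAtoms() for smi in train_smiles]

pipe = Pipeline([
    ("smiles_to_mol", FunctionTransformer(smiles_to_mols)),
    ("mol_transformer", FunctionTransformer(mols_to_morgan)),
    ("regressor", Ridge()),
])
pipe.fit(train_smiles, y_train)

# The fitted pipeline now predicts directly from a SMILES string.
predictions = pipe.predict(["c1ccccc1C(=O)C"])
print(predictions)
```

With scikit-mol installed, the two stand-in steps would presumably be replaced by SmilesToMol and MorganTransformer from the list of transformer classes.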
The scikit-learn compatibility should also make it easier to include the fingerprinting step in hyperparameter tuning with scikit-learn's utilities.
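As a rough sketch of what tuning the fingerprinting step could look like, the block below runs scikit-learn's `GridSearchCV` over a pipeline. The class `MorganStandIn` and its parameter names `radius` and `n_bits` are assumptions made for this illustration, not scikit-mol's actual API, and the target values are toy data (heavy atom counts).

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class MorganStandIn(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for a fingerprint transformer (not scikit-mol code)."""

    def __init__(self, radius=2, n_bits=1024):
        # Plain attributes named after the constructor arguments let
        # BaseEstimator expose them through get_params / set_params.
        self.radius = radius
        self.n_bits = n_bits

    def fit(self, X, y=None):
        return self  # stateless: nothing is learned from the data

    def transform(self, X):
        gen = rdFingerprintGenerator.GetMorganGenerator(
            radius=self.radius, fpSize=self.n_bits
        )
        return np.vstack([gen.GetFingerprintAsNumPy(mol) for mol in X])


# Toy data: the "property" is just the heavy atom count of each molecule.
smiles = ["CO", "CCO", "CCCO", "CCCCO", "CCCCCO",
          "c1ccccc1", "Cc1ccccc1", "CCc1ccccc1", "CC(=O)O"]
mols = [Chem.MolFromSmiles(smi) for smi in smiles]
y = [mol.GetNumHeavyAtoms() for mol in mols]

pipe = Pipeline([("mol_transformer", MorganStandIn()), ("regressor", Ridge())])

# Step parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "mol_transformer__radius": [1, 2, 3],
    "regressor__alpha": [0.1, 1.0],
}
search = GridSearchCV(pipe, param_grid, cv=3, scoring="neg_mean_absolute_error")
search.fit(mols, y)
print(search.best_params_)
```

Because the fingerprint settings are ordinary estimator parameters, they are searched together with the regressor's settings in one cross-validated run.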
The first draft of the project was created at the RDKit UGM 2022 hackathon on 14 October 2022.
- Transformer Classes
- SmilesToMol
- Desc2DTransformer
- MACCSTransformer
- RDKitFPTransformer
- AtomPairFingerprintTransformer
- TopologicalTorsionFingerprintTransformer
- MorganTransformer
- Utilities
- CheckSmilesSanitazion
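The interface of CheckSmilesSanitazion is not described here, so the following is only a sketch of the kind of check such a utility might perform, written with plain RDKit. It relies on `Chem.MolFromSmiles` returning `None` when a SMILES string cannot be parsed or sanitized; the function name `split_valid_smiles` is hypothetical.

```python
from rdkit import Chem, RDLogger

# Silence RDKit's parse and sanitization error messages for this demo.
RDLogger.DisableLog("rdApp.*")


def split_valid_smiles(smiles_list):
    """Hypothetical helper: split SMILES into (valid, invalid) lists."""
    valid, invalid = [], []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)  # None if parsing or sanitization fails
        if mol is not None:
            valid.append(smi)
        else:
            invalid.append(smi)
    return valid, invalid


# "C1CC" has an unclosed ring; "CN(C)(C)(C)C" has a five-valent neutral nitrogen.
valid, invalid = split_valid_smiles(["CCO", "c1ccccc1", "C1CC", "CN(C)(C)(C)C"])
print(valid)
print(invalid)
```

Filtering this way before fitting keeps `None` entries out of the molecule lists passed to a pipeline.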
Users can install the latest tagged release from pip:
pip install scikit-mol
Bleeding edge
pip install git+https://github.com/EBjerrum/scikit-mol.git
Developers
git clone git@github.com:EBjerrum/scikit-mol.git
cd scikit-mol
pip install -e .
None yet, but the notebooks directory contains some # %% delimited examples with demonstrations.
Probably still
- Add rest of RDKit fingerprints
- Integration tests
- Docstrings for classes and methods
- NumPy style
- Make further example notebooks
- Standalone usage (not in pipeline)
- Advanced pipelining
- Hyperparameter optimization via external optimizer e.g. https://scikit-optimize.github.io/stable/
- Esben Jannik Bjerrum, [email protected]
- Carmen Esposito, @cespos
- Son Ha, [email protected]
- Oh-hyeon Choung, [email protected]
- Andreas Poehlmann, @ap--
- Ya Chen, @anya-chen
- Rafał Bachorz, @rafalbachorz