MaskMol: knowledge-guided molecular image pre-training framework for activity cliffs with pixel masking
BMC Biology volume 23, Article number: 279 (2025)
Abstract
Background
Activity cliffs, which refer to pairs of molecules that are structurally similar but show significant differences in their potency, can lead to model representation collapse and make it challenging for models to distinguish between them.
Results
Our research indicates that as molecular similarity increases, graph-based methods struggle to capture these nuances, whereas image-based approaches effectively retain the distinctions. Thus, we developed MaskMol, a knowledge-guided molecular image self-supervised learning framework. MaskMol accurately learns the representation of molecular images by considering multiple levels of molecular knowledge, such as atoms, bonds, and substructures. By utilizing pixel masking tasks, MaskMol extracts fine-grained information from molecular images, overcoming the limitations of existing deep learning models in identifying subtle structural changes. Experimental results demonstrate MaskMol’s high accuracy and transferability in activity cliff estimation and compound potency prediction across 20 different macromolecular targets, outperforming 25 state-of-the-art deep learning and machine learning approaches. Visualization analyses reveal MaskMol’s high biological interpretability in identifying activity cliff-relevant molecular substructures. Notably, through MaskMol, we identified candidate EP4 inhibitors that could be used to treat tumors.
Conclusions
This study raises awareness about activity cliffs and introduces a novel method for molecular image representation learning and virtual screening, advancing drug discovery and providing new insights into structure-activity relationships (SAR).
Background
Drug discovery has always posed a significant challenge in the life sciences, and its outcomes can tremendously impact medical research. Recent advances in machine learning and artificial intelligence are opening up new possibilities and leading to breakthroughs in the field of drug discovery [1,2,3]. Over the past few years, machine learning has made remarkable progress in various aspects of early drug discovery, such as molecular generation [4,5,6], molecular optimization [7,8,9], and molecular property prediction [10,11,12,13,14,15,16,17,18,19,20]. These technologies offer more efficient and accurate methods for developing new drugs.
Molecular property prediction plays a vital role in the drug discovery and design process, as it directly impacts the safety, effectiveness, and efficiency of drug development [21]. The fundamental concept behind molecular property prediction is that molecules with similar structures tend to have similar properties [22]. As shown in Fig. 1a left, molecules with distinct scaffolds exhibit different activities, and they can be well separated. However, there are cases called activity cliffs [23], where two molecules with similar structures have significantly different biological activities (Fig. 1a right). Predicting activity cliffs holds substantial importance in rational drug design and the efficient discovery of new therapeutic agents [24]. Anticipating cliffs provides crucial insights into SAR and optimizes lead compounds more effectively, leading to more reliable biological activity prediction.
Overview of the MaskMol framework. a Illustration of SAR (left) and activity cliffs (right) in feature space. Highly active molecules are shown in red boxes, and low-active ones in green. b Comparison between graph and image representations in feature space. Similarity is measured by the Tanimoto coefficient on ECFP pairs [33], and distance is the average Euclidean distance in the 2D feature space of 1000 molecule pairs. Images are encoded by ResNet18 [32], while graphs use GCN, MPNN, and GAT. c The MaskMol framework includes knowledge-guided pixel masking and masked pixel prediction. d Finetuning is performed on downstream tasks (e.g., activity cliff estimation, potency prediction), where both the encoder and the predictor head are trainable
The activity cliff task is important yet understudied in the field of drug discovery. Previous studies [25,26,27] have observed that graph-based models perform poorly on activity cliffs. We conjecture that graph-based representation learning methods cannot separate two similar molecules in the feature space, a phenomenon we call representation collapse, which results in poor performance on activity cliffs. To test this, we evaluate various GNN architectures on activity cliffs, such as GCN [28], GAT [29], and MPNN [30]. Figure 1b shows that as the similarity between pairs of molecules increases, the distance in the feature space of graph-based methods decreases faster than that of the image-based method, supporting our conjecture. We therefore explored other molecular representations and found that various graph-based representation learning methods are inferior to image-based representation learning methods in identifying differences between similar molecules. Although molecular graphs and images describe the same molecular information, they are essentially different due to modal differences (graph versus image) and feature extraction differences (GNN versus CNN). In the activity cliff task, pairs of molecules have very similar structures and significant differences in activity. For example, a difference of just one atom can lead to a completely different activity. Increasing the discrimination between two similar molecules is therefore the key to predicting activity cliffs with deep learning models. In a GNN, a small structural difference is over-smoothed [31] during information aggregation, resulting in little difference in the extracted features. This is also why GNN methods perform poorly on activity cliff tasks.
For images, the convolution operation in CNN has the characteristics of local connectivity and parameter sharing, which makes the model pay more attention to local features to preserve these differences [32]. These observations indicate that image-based methods can amplify the differences between two similar molecules and motivate us to develop an image-based method for more accurate activity cliff prediction.
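The similarity axis in Fig. 1b relies on the Tanimoto coefficient between ECFP fingerprints. As a minimal sketch, assuming each fingerprint is available as a set of on-bit indices (the helper name `tanimoto` and the toy bit sets are illustrative and not taken from the MaskMol code), the coefficient is the size of the intersection divided by the size of the union:

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


# Toy fingerprints (hypothetical bit indices, not real ECFP output).
fp_1 = {1, 5, 9, 42, 77}
fp_2 = {1, 5, 9, 42, 80}
similarity = tanimoto(fp_1, fp_2)  # 4 shared bits out of 6 distinct bits
```

Activity cliff pairs sit at the high end of this scale, which is the regime where the feature-space distance of graph encoders shrinks fastest in Fig. 1b.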
In addition, obtaining labels for activity cliffs requires expensive and time-consuming wet-lab experiments, and the shortage of labeled data significantly impacts model performance. We therefore turn to the pretrain-finetune paradigm [34,35,36,37,38,39], because pre-training does not need labels and only a few labels are needed in the fine-tuning phase to enhance performance. However, unlike natural images, molecular images are not information-dense and contain many blank areas. If we simply applied a pre-training framework from computer vision, such as MAE [35], directly to molecular images, it would be challenging for the model to use meaningful molecular knowledge to identify subtle changes in cliff molecules. Therefore, it is necessary to use molecular domain knowledge to guide the model in learning molecular structures.
Moreover, activity cliffs often arise due to subtle changes at various molecular levels [40, 41], such as specific atom substitutions, bond modifications, or functional group replacements. At the atomic level, substituting a hydrogen atom on a benzene ring with a chlorine atom can lead to significant changes in the molecule’s binding interactions with receptors, thereby affecting its biological activity. At the bond level, changing a single bond in a molecule to a double bond may alter the molecule’s shape and electronic distribution, thereby affecting its interactions with targets and its biological activity. At the functional group level, consider replacing a hydroxyl group on a benzene ring with a methyl group: while the structural difference is small, the hydroxyl group can form hydrogen bonds, so the replacement significantly affects the molecule’s solubility and interactions with biological targets. Consequently, our objective is to incorporate prior chemical knowledge into the model and use this activity cliff-related knowledge to guide the model in learning molecular structure. Here, we present a novel self-supervised pre-training framework called MaskMol, which focuses on learning fine-grained representations from molecular images with knowledge-guided pixel masking. We design three pixel masking-based pre-training tasks with three different levels of knowledge: atomic knowledge, bond knowledge, and motif knowledge. These tasks enable MaskMol to comprehensively learn the local regions of molecules through pixel-level knowledge prompts. In summary, our main contributions are:
- We pinpoint the bottleneck in the molecular activity cliff task: cliff molecules cause representation collapse in deep learning models. Image-based models outperform graph-based models because they alleviate representation collapse.
- We design a novel multi-level knowledge-guided molecular image self-supervised learning framework, MaskMol, using a pixel masking strategy. After pre-training on a large-scale dataset of approximately two million molecules, MaskMol demonstrates a significant performance improvement on activity cliff estimation and compound potency prediction datasets.
- An explainable case study and visualizations demonstrate that MaskMol provides cliff awareness for bioactivity estimation and extracts meaningful SAR information for intuitive interpretation.
- Through MaskMol, we identified candidate EP4 inhibitors that could be used to treat tumors, demonstrating that MaskMol is a promising method for virtual screening in activity cliff scenarios.
Results
Overview of MaskMol
This section gives an overview of MaskMol, illustrated in Fig. 1c and d. To accurately estimate molecular activity cliffs, we developed MaskMol, a knowledge-guided molecular image pre-training framework based on fine-grained pixel masking. It consists of two parts: (1) three knowledge-guided pixel masking strategies, and (2) three knowledge-guided masked pixel prediction tasks for pre-training. See the Methods section for further details on MaskMol.
First, molecular SMILES are converted to molecular images using RDKit. To eliminate any extraneous color effects, we remove all non-essential hues from the molecular images. Next, again using RDKit, we apply green highlights to atoms, bonds, and motifs separately. HSV detection then isolates the regions with green pixels within each highlighted image, yielding one masking image per atom, bond, or motif. To introduce randomness, we select a subset of atom/bond masking images by randomly choosing a fraction (determined by the masking ratio \(\gamma\)) of the available atom/bond masking image sets. Consequently, we generate \(\gamma \cdot {N_{atom}}\) and \(\gamma \cdot {N_{bond}}\) masking images, respectively. Note that to ensure that motifs do not overlap, we randomly select only one masking image from the set of motif masking images. The masking image is then combined with the original molecular image: specifically, we use the white region of the masking image to modify the corresponding region of the molecular image. This synthesis process yields three masked molecular images: the masked atom, bond, and motif images. The three images are fed into a ViT to obtain latent features, which are classified by separate fully connected layers. The pre-trained molecular encoder is then fine-tuned on downstream tasks to further improve model performance.
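The masking steps above can be sketched with NumPy alone. This is an illustration under stated assumptions, not the authors' implementation: the green hue band, the rounding of \(\gamma \cdot N\), the fill value written into masked pixels, and the function names `green_pixel_mask`, `sample_masks`, and `apply_masks` are all hypothetical, and small synthetic arrays stand in for the RDKit-rendered highlight images.

```python
import colorsys

import numpy as np


def green_pixel_mask(rgb, hue_range=(0.25, 0.42), min_sat=0.4, min_val=0.4):
    """Boolean mask of pixels whose HSV values fall in an assumed green band."""
    height, width, _ = rgb.shape
    mask = np.zeros((height, width), dtype=bool)
    for i in range(height):
        for j in range(width):
            r, g, b = rgb[i, j] / 255.0
            hue, sat, val = colorsys.rgb_to_hsv(r, g, b)
            in_band = hue_range[0] <= hue <= hue_range[1]
            mask[i, j] = in_band and sat >= min_sat and val >= min_val
    return mask


def sample_masks(masks, gamma, rng):
    """Randomly keep round(gamma * N) of the N candidate masks (at least one)."""
    n_keep = max(1, int(round(gamma * len(masks))))
    chosen = rng.choice(len(masks), size=n_keep, replace=False)
    return [masks[k] for k in chosen]


def apply_masks(image, masks, fill=0):
    """Overwrite every masked pixel of a copy of the image with a fill value."""
    out = image.copy()
    for mask in masks:
        out[mask] = fill
    return out


# Synthetic stand-in: a white canvas with four "atoms", each highlighted in its own image.
base = np.full((8, 8, 3), 255, dtype=np.uint8)
atom_masks = []
for row, col in [(0, 0), (0, 6), (6, 0), (6, 6)]:
    highlighted = base.copy()
    highlighted[row:row + 2, col:col + 2] = (0, 255, 0)  # stand-in for a green highlight
    atom_masks.append(green_pixel_mask(highlighted))

rng = np.random.default_rng(0)
chosen = sample_masks(atom_masks, gamma=0.5, rng=rng)
masked_image = apply_masks(base, chosen, fill=0)
```

For the motif task, where only one masking image is drawn, the same `apply_masks` call would receive a single randomly chosen motif mask instead of a \(\gamma\)-sized subset.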
Model performance on downstream tasks
To evaluate the effectiveness of the image-based representations learned by MaskMol, we compare it against a wide range of popular or state-of-the-art baselines on the activity cliff estimation (ACE) benchmark MoleculeACE [25], including 12 pre-training baselines and 11 traditional machine learning methods. We follow the dataset splitting strategy of the original paper [25]. To assess the generalization of MaskMol, we additionally employ a widely used splitting strategy known as scaffold split on the ACE task and the compound potency prediction (CPP) task. This is a more challenging but practical setting, since the test molecules can be structurally different from those in the training set.
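To make the scaffold split concrete, the sketch below groups molecules by scaffold so that no scaffold is shared between training and test sets. It assumes the scaffold strings were computed beforehand (for example as Bemis-Murcko scaffolds with RDKit); the greedy largest-group-first assignment and the 80% training fraction are common conventions used here for illustration, not details confirmed by the text.

```python
from collections import defaultdict


def scaffold_split(scaffolds, frac_train=0.8):
    """Split indices so that molecules sharing a scaffold stay in the same subset."""
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)
    # Largest scaffold groups first; ties broken by first index for determinism.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))
    cutoff = frac_train * len(scaffolds)
    train, test = [], []
    for group in ordered:
        if len(train) + len(group) <= cutoff:
            train.extend(group)
        else:
            test.extend(group)
    return train, test


# Toy scaffold strings (illustrative, one entry per molecule).
scaffolds = ["c1ccccc1"] * 5 + ["c1ccncc1"] * 3 + ["C1CCCCC1"] * 2
train_idx, test_idx = scaffold_split(scaffolds, frac_train=0.8)
```

Because whole scaffold groups move together, the test set contains only scaffolds the model never saw during training, which is what makes this setting harder than a random split.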
Activity cliff estimation
As shown in Fig. 2a, we compared the performance of MaskMol with five types of state-of-the-art self-supervised molecular representation models: (1) sequence-based, (2) 2D graph-based, (3) 3D graph-based, (4) image-based, and (5) multimodal-based models. Under the MoleculeACE experimental set-up, MaskMol performs better than sequence-based (for example, ChemBERTa [42]), 2D graph-based (for example, GROVER [43], MolCLR [44], EdgePred [45], Mole-BERT [20], and InstructBio [46]), 3D graph-based (for example, GEM [47]), image-based (for example, ImageMol [21]), and multimodal-based (3DInfomax [48], GraphMVP [49], and CGIP [50]) models. Compared with the second-best model (InstructBio), the RMSE improvement of MaskMol ranges from 2.3% to 22.4%, with an overall relative improvement of 11.4% across the 10 ACE datasets, in particular for the HRH3 dataset (19.4% RMSE improvement) and the ABL11 dataset (22.4% RMSE improvement). In addition, MaskMol achieved lower RMSE values (Fig. 2b) on D4R (RMSE = 0.73), DAT (RMSE = 0.59), FX (RMSE = 0.73), GSK3 (RMSE = 0.69), HRH3 (RMSE = 0.58), SOR (RMSE = 0.76), ABL11 (RMSE = 0.66), GR (RMSE = 0.68), CLK4 (RMSE = 0.85), and OX2R (RMSE = 0.67) compared with traditional ECFP-based methods across multiple machine learning algorithms, including support vector machine [51], random forest [52], k-nearest neighbors [53], multilayer perceptron [54], and gradient boosting machine [55]. In summary, MaskMol surpasses other state-of-the-art methods, achieving the lowest RMSE in these comparisons. To further substantiate MaskMol’s efficacy in identifying activity cliff pairs, we report \(\text {RMSE}_{\text {cliff}}\) as an additional performance metric. On the DAT and OX2R datasets, MaskMol achieves a 6.7% improvement in \(\text {RMSE}_{\text {cliff}}\) compared to the second-best method (\(\text {SVM}_{\text {ECFP}}\)).
Taking both RMSE and \(\text {RMSE}_{\text {cliff}}\) into account, MaskMol also achieves lower values than any other state-of-the-art molecular representation model (Fig. 2c). Furthermore, to evaluate the disparity between predictions and labels, we employ the Kullback-Leibler divergence (KLD [56]) to measure distribution differences (Fig. 2d). The KLD values are low across the ACE datasets, and the distributions of label and prediction values are close, except for CLK4. We hypothesize that the relatively pronounced discrepancy observed on the CLK4 dataset could be attributed to its limited number of molecules (731), which may have resulted in an under-fitted model.
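The label-versus-prediction comparison in Fig. 2d combines kernel density estimation with a KL divergence. The following NumPy-only sketch shows one way such a comparison can be set up; the Gaussian kernel, the bandwidth of 0.3, the evaluation grid, and the synthetic data are all assumptions made for illustration, and the paper does not state these settings.

```python
import numpy as np


def gaussian_kde_on_grid(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate on a grid, normalised to sum to 1."""
    samples = np.asarray(samples, dtype=float)
    z = (grid[:, None] - samples[None, :]) / bandwidth
    density = np.exp(-0.5 * z ** 2).sum(axis=1)
    return density / density.sum()


def kl_divergence(p, q, eps=1e-12):
    """Discrete KL(p || q) for two probability vectors defined on the same grid."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return float(np.sum(p * np.log(p / q)))


rng = np.random.default_rng(0)
labels = rng.normal(loc=7.0, scale=1.0, size=500)       # synthetic potency-like labels
good_preds = labels + rng.normal(scale=0.2, size=500)    # predictions close to the labels
poor_preds = rng.normal(loc=5.0, scale=1.0, size=500)    # predictions from a shifted distribution

grid = np.linspace(0.0, 12.0, 241)
p_label = gaussian_kde_on_grid(labels, grid, bandwidth=0.3)
kld_good = kl_divergence(p_label, gaussian_kde_on_grid(good_preds, grid, bandwidth=0.3))
kld_poor = kl_divergence(p_label, gaussian_kde_on_grid(poor_preds, grid, bandwidth=0.3))
```

A small divergence indicates that the predicted values are distributed like the labels, while a shifted or narrowed prediction distribution produces a larger value.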
Performance of the MaskMol framework on activity cliff estimation (ACE). a RMSE comparison across 10 ACE datasets from MoleculeACE using different representation pre-training methods. b Violin plot showing RMSE distributions compared with traditional machine learning methods. c Comparison of RMSE and \(\text {RMSE}_{\text {cliff}}\) across all methods. Green, purple, yellow, and blue represent the SOTA methods based on fingerprint (\(\text {SVM}_{\text {ECFP}}\)), sequence (LSTM), graph (GraphMVP), and image (ImageMol), respectively. d Distribution of label vs. prediction values on ACE evaluated by kernel density estimation; Kullback–Leibler divergence (KLD) measures distributional differences
To test the generalization of MaskMol, we split the datasets using a scaffold split (Fig. 3a). We found that MaskMol significantly outperforms \(\text {SVM}_{\text {ECFP}}\) models across all 10 ACE datasets. For instance, on ABL11 the RMSE of MaskMol (RMSE = 0.69) is about 28.9% lower than that of the \(\text {SVM}_{\text {ECFP}}\) model (RMSE = 0.97). We further evaluated \(\text {RMSE}_{\text {cliff}}\): compared with \(\text {SVM}_{\text {ECFP}}\) models, MaskMol achieves an average performance advantage of 6.4%, in particular for SOR (20.9% \(\text {RMSE}_{\text {cliff}}\) improvement). Compared with the molecular image pre-training model (ImageMol), the RMSE improvement of MaskMol ranges from 6% to 28.8% with an average advantage of 17%, and the \(\text {RMSE}_{\text {cliff}}\) improvement ranges from 9.4% to 40% with an average advantage of 19.4%.
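The percentages quoted above are relative reductions in error. The short sketch below reproduces the ABL11 figure from the two RMSE values given in the text and shows how RMSE and \(\text {RMSE}_{\text {cliff}}\) relate. It assumes, following the MoleculeACE convention as we understand it, that \(\text {RMSE}_{\text {cliff}}\) is the RMSE restricted to molecules flagged as activity cliff compounds; the toy labels, predictions, and flags are made up.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def rmse_cliff(y_true, y_pred, is_cliff):
    """RMSE restricted to the molecules flagged as activity cliff compounds."""
    flags = np.asarray(is_cliff, dtype=bool)
    return rmse(np.asarray(y_true)[flags], np.asarray(y_pred)[flags])


def relative_improvement(baseline_error, model_error):
    """Fractional reduction in error relative to the baseline (positive means better)."""
    return (baseline_error - model_error) / baseline_error


# ABL11 scaffold-split values quoted in the text: SVM_ECFP 0.97, MaskMol 0.69.
abl11_gain = relative_improvement(0.97, 0.69)  # about 0.289

# Toy example: the two cliff compounds carry all of the error.
y_true = [6.0, 7.0, 8.0, 5.0]
y_pred = [6.0, 7.5, 7.0, 5.0]
is_cliff = [False, True, True, False]
overall = rmse(y_true, y_pred)
cliff_only = rmse_cliff(y_true, y_pred, is_cliff)
```

In the toy example the cliff-only error exceeds the overall error, which is the typical pattern that motivates reporting both metrics.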
Performance of MaskMol under scaffold splitting and ablation study. a, b Performance comparison on activity cliff estimation (ACE) and compound potency prediction (CPP) tasks using RMSE and MAE, respectively. Results are averaged over three independent runs with random seeds (0, 1, 2), reported as mean ± standard deviation. c Ablation study of pretext tasks in MoleculeACE. “w/o AMPP,” “w/o BMPP,” and “w/o MMPP” indicate removal of corresponding components during pre-training. “Gain” refers to the improvement of \(\text {MaskMol}_{\text {base}}\) over \(\text {MaskMol}_{\text {Non-Pretrain}}\). d Effect of different masking ratios in pre-training; x-axis denotes the masking ratio in AMPP and BMPP tasks, with error bands indicating standard deviation. e Analysis of image size and useful pixel ratio on ACE datasets. The ratio is computed as molecule pixels divided by total image pixels across 12,590 images. RDKit is used to enhance pixel density by bolding chemical bonds. Statistical significance is assessed using the Mann–Whitney U test
These results validate MaskMol’s ability to precisely predict molecules exhibiting activity cliffs. Notably, ECFP-based methods demonstrate robust performance, whereas graph-based methods tend to underperform in activity cliff estimation. Graph-based models are vulnerable to representation collapse when faced with activity cliffs, and they face challenges in learning from non-smooth objective functions [27]. Furthermore, we found that image-based methods such as ImageMol have lower RMSE and \(\text {RMSE}_{\text {cliff}}\) than graph-based algorithms (EdgePred, GraphMVP, 3DInfomax, Mole-BERT). We compared the 2D Graph-2D Image (CGIP) and 2D Graph-3D Geometry (e.g., 3DInfomax and GraphMVP) frameworks. Results in Additional file 1: Tables S1 to S6 show that 2D image representations achieve higher performance (e.g., +8.53% RMSE) than 3D geometry-based methods. Within CGIP, replacing DeeperGCN (CGIP-DeeperGCN) with a ResNet18 (CGIP-ResNet18) improves activity cliff prediction by +2.60%, further validating that CNN-based image representations capture subtle structural variations better than GNN-based encoders. These results further demonstrate that CNN-based models can use local inductive biases to identify subtle cliff changes. Although InstructBio attempts to mitigate representation collapse by leveraging a substantial amount of unlabeled data as pseudo-labels, it still does not match the performance of ECFP-based methods. The addition of pseudo-labels helps to clarify class boundaries [57], suggesting that semi-supervised learning could emerge as a novel solution for addressing activity cliffs.
Compound potency prediction
Although MaskMol is primarily designed for fine-grained tasks such as ACE, it also performs well on the coarse-grained task of CPP. Compound potency prediction is crucial to the drug discovery and design process [58, 59]: researchers aim to forecast the biological activity of chemical compounds by measuring their potency in terms of the amount needed to produce a desired effect. As shown in Fig. 3b, MaskMol outperforms sequence-based (ChemBERTa), graph-based (MolCLR, MGSSL, MPG [60], and GraphMVP), and image-based (ImageMol) models under a scaffold split. Notably, on the BACE1 dataset, MaskMol achieves an MAE of 0.56, while the best-performing baseline model (ImageMol) achieves 0.63. It is worth mentioning that MaskMol achieves this performance using only 2M pre-training molecules, compared to the 10M used by ChemBERTa and ImageMol. This demonstrates that MaskMol can achieve superior performance with significantly less pre-training data.
Ablation studies on MaskMol
We perform comprehensive experiments to investigate the impact of each component of MaskMol on activity cliff estimation. As illustrated in Fig. 3c, seven out of ten datasets have a pre-training gain of more than 30%, and the gain peaks at 45.87% on the DAT dataset. Furthermore, the average gain across all ACE datasets is 34.43%, underscoring the substantial enhancement in MaskMol’s performance attributable to the knowledge-guided masked pixel prediction tasks. Graphs represent molecules as nodes (atoms) and edges (bonds), which encode a large amount of chemical information such as atom types and bond types. In contrast, for a freshly initialized image model, molecules are input in the form of RGB pixels, which do not contain any explicit chemical information. The model’s understanding of molecular images is limited to the fact that the image is composed of “lines.” Therefore, it is necessary to help the model understand the chemical information in the image, so that it can learn the specific meaning of these “lines.” This is why pre-training yields a greater gain for MaskMol than for the graph-based GROVER (34.43% versus 8.53%). The observed decline in performance for “w/o AMPP” (RMSE 4.5% decline), “w/o BMPP” (RMSE 16.4% decline), and “w/o MMPP” (RMSE 21% decline) indicates that removing any level of knowledge-guided task adversely affects MaskMol’s performance, with MMPP being the most influential. We also explored the impact of pre-training with different data scales.
The pre-training dataset sizes for \(\text {MaskMol}_{\text {small}}\) and \(\text {MaskMol}_{\text {base}}\) are 0.2M and 2M molecules, respectively. To explore the impact of pre-training with different data scales, we used 0 million (no pre-training), 0.2 million, 2 million, and 5 million drug-like compounds to pre-train MaskMol and then evaluated the resulting models. We found that the average RMSE improvement increased from 26.2% to 34.0% (and the average \(\text {RMSE}_{\text {cliff}}\) improvement from 17.0% to 28.0%) as the pre-training data size increased (see Additional file 1: Tables S7–S8). Thus, MaskMol can be further improved by pre-training on more drug-like molecules.
Additionally, we analyze how the masking ratio affects MaskMol’s overall performance (Fig. 3d). It is worth noting that the optimal masking ratio in our study deviates significantly from the typical ratios used in BERT and MAE. BERT typically employs a masking ratio of 15%, whereas MAE uses a masking ratio as high as 75%. In our experiments, however, a 50% masking ratio yields the best results. Molecular images are rather sparse, with most pixels being empty, so image resolution may matter in this setting. We therefore investigated the impact of image size and the ratio of useful pixels to total pixels on the learned representations (Fig. 3e). The results show that varying the image size and the useful pixel ratio yields similar performance on the ACE datasets (p > 0.05, Mann-Whitney U test [61]).
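The useful pixel ratio is described in the caption of Fig. 3 as molecule pixels divided by total image pixels. A minimal sketch of that computation follows, assuming a white background (value 255 in every channel) and a 224 × 224 canvas; both are illustrative assumptions, and the function name `useful_pixel_ratio` is hypothetical.

```python
import numpy as np


def useful_pixel_ratio(image, background=255):
    """Fraction of pixels that differ from the background colour in any channel."""
    image = np.asarray(image)
    is_molecule = (image != background).any(axis=-1)
    return float(is_molecule.mean())


canvas = np.full((224, 224, 3), 255, dtype=np.uint8)
canvas[100:124, 50:174] = 0  # a 24 x 124 block standing in for drawn atoms and bonds
ratio = useful_pixel_ratio(canvas)
```

Drawing bonds in bold, as the caption mentions, increases the number of non-background pixels and therefore this ratio without changing the image size.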
Interpretation of MaskMol
Investigation of MaskMol representation
We use t-SNE to compare the representations learned by MaskMol with ECFP fingerprint features (Fig. 4a, b). The t-SNE algorithm maps similar molecular representations to adjacent points in two dimensions. We observe that ECFP embeddings are organized by structure only, so active and inactive molecules are mixed in the feature space. Through multi-level knowledge-guided masked pixel prediction tasks, MaskMol becomes sensitive to changes in any atom, bond, or motif in the image. Thus, the representations learned by MaskMol can effectively distinguish between active and inactive molecules, with a clear boundary between them. Additionally, we have included some randomly selected activity cliff pairs in the figure to illustrate the similar and dissimilar molecules learned by MaskMol based on their biological activity. MaskMol learns similar representations for molecules with similar structures and properties and maps molecules with significant differences in structures and properties to distinct regions of the feature space. This demonstrates that MaskMol learns the topological structure information of molecules and uses properties to differentiate between them.
Feature distribution and attention interpretation of MaskMol. a, b t-SNE visualization of molecular ECFP fingerprints (a) and MaskMol representations (b) on the D4R dataset. Points are colored by their \(\text {K}_i\) values, with cooler colors for higher values and warmer colors for lower values. Structurally similar but bioactivity-divergent molecule pairs are boxed in matching colors. c Comparison of ECFP fingerprints and MaskMol in quantifying the overall relative distance of activity cliff pairs in the latent feature space. d Grad-CAM heatmaps showing MaskMol’s attention at three knowledge levels; warmer areas indicate stronger attention. e Visualization of explanatory structures identified by different deep learning methods. Image-based methods use Grad-CAM, while graph-based methods employ PGExplainer
To measure the distance between activity cliff pairs in feature space, we introduce a distance metric \(d = \frac{1}{N}\sum_{i=1}^{N} \rho_i\) with \(\rho_i = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}\), where N is the number of activity cliff pairs and \(M_1 = (x_1, y_1)\), \(M_2 = (x_2, y_2)\) are the coordinates of the two molecules of the i-th pair in the unified feature space. Figure 4c illustrates that in all ACE datasets, the distance between activity cliff pairs in the feature space generated by MaskMol is significantly greater than that of ECFP. This observation highlights the effectiveness of MaskMol in estimating activity cliffs, as it can capture subtle structural variations and use them to describe and represent molecules.
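The distance metric above is simply the mean Euclidean distance over paired 2D coordinates. A small NumPy sketch, with made-up coordinates and the hypothetical helper name `mean_pair_distance`, makes the computation explicit:

```python
import numpy as np


def mean_pair_distance(coords_a, coords_b):
    """Average Euclidean distance between paired 2D points (row i of each array forms pair i)."""
    a = np.asarray(coords_a, dtype=float)
    b = np.asarray(coords_b, dtype=float)
    rho = np.sqrt(((a - b) ** 2).sum(axis=1))
    return float(rho.mean())


# Two toy cliff pairs: first molecules in m1, their partners in m2.
m1 = np.array([[0.0, 0.0], [1.0, 1.0]])
m2 = np.array([[3.0, 4.0], [1.0, 3.0]])
d = mean_pair_distance(m1, m2)  # pair distances 5 and 2, mean 3.5
```

A larger value of d means that the two members of each cliff pair are placed further apart, which is the behaviour reported for MaskMol relative to ECFP in Fig. 4c.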
Explaining MaskMol via attention visualization
We applied three levels of knowledge-guided pixel masking to the molecular images and used Grad-CAM [62] to visualize the areas of attention (Fig. 4d). The results show that MaskMol accurately classifies the knowledge and focuses on the appropriate masked areas. This indicates that our three knowledge-guided masked pixel prediction tasks allow the model to identify different molecular chemical structures.
In Fig. 4e, we provide a comparative analysis of key substructures associated with activity cliffs, as extracted by various deep learning (DL) methods. For graph-based methods, we select the top-3 most important edges detected by PGExplainer [63]. GNNs tend to allocate attention to insignificant regions of the cliff molecule and emphasize the structure shared by both molecules. This observation supports our hypothesis that GNNs are susceptible to representation collapse when dealing with activity cliffs, which hinders their ability to correctly identify cliff molecules. ImageMol focuses on large areas of the molecules, while MaskMol without pre-training attends to the entire molecule and ignores irrelevant blank areas. However, neither of them pays attention to the important substructure that affects the activity. In contrast, pre-trained MaskMol successfully identifies the most informative substructures and judges compound activity based on them. These plots indicate that MaskMol recognizes subtle differences in the substructures of activity cliff pairs and can provide reliable and informative insights for medicinal chemists in identifying key substructures.
Chemistry-intuitive exposition of MaskMol
We use Substructure-Mask Explanation (SME [64]) to further quantify the contribution of substructures to MaskMol predictions. We define the impact of masking a substructure on the overall prediction as its attribution. We make two predictions with MaskMol, one before and one after applying the substructure masking to the molecular image, and take the difference between the predicted values as the attribution: \(\text{Attribution}_{\text{sub}} = f(x) - f(x_{\text{sub}})\), where x represents the molecular image, \(x_{\text{sub}}\) represents the molecular image with the substructure masked, and f represents MaskMol. By calculating the contribution of substructures to model predictions, we can gain insight into their impact on molecular activity. As depicted in Fig. 5a, for added substructures such as a benzene ring (Attribution = −1.93, \(\text {K}_{i}\) = 5370 nM) and ethyl alcohol (Attribution = −0.95, \(\text {K}_{i}\) = 758 nM), the attributions are lower than zero, and the influence of the benzene ring is greater than that of ethyl alcohol, which is highly consistent with the molecular activity values. The position of the propyl group also affects the activity, and the attribution values reflect the same trend. Figure 5b leads to the same conclusion on the DAT dataset. In addition to biological activity, we also present a chemically intuitive explanation of MaskMol on mutagenicity. Figure 5c and d display the analysis of different substructures based on their mutagenicity. A positive attribution indicates that the substructure contributes to toxicity, while a negative attribution suggests that the substructure has a detoxifying effect. Figure 5c reveals that nitro, amino, and quinone groups push the model’s prediction toward toxicity, while carboxyl groups push it toward non-toxicity.
This observation aligns with previous studies, which have identified aromatic nitro, aromatic amino, and quinone groups as toxic and carboxyl groups as detoxifying [65,66,67].
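The attribution defined above, the prediction on the original image minus the prediction on the image with the substructure masked, can be sketched generically. In the example below, `toy_predict` is a stand-in scoring function rather than MaskMol, and overwriting the substructure pixels with white (255) is an assumption about how the masking is rendered.

```python
import numpy as np


def substructure_attribution(predict, image, sub_mask, fill=255):
    """Attribution_sub = f(x) - f(x_sub), where x_sub has the substructure pixels overwritten."""
    masked = image.copy()
    masked[sub_mask] = fill
    return predict(image) - predict(masked)


def toy_predict(image):
    """Stand-in for a trained model: the score grows with the number of dark pixels."""
    return float((image < 128).all(axis=-1).sum()) / 10.0


image = np.full((6, 6, 3), 255, dtype=np.uint8)
image[1:3, 1:5] = 0                # 8 dark "molecule" pixels
sub_mask = np.zeros((6, 6), dtype=bool)
sub_mask[1:3, 1:3] = True          # substructure covering 4 of them
attribution = substructure_attribution(toy_predict, image, sub_mask)
```

With a real model, a negative attribution would mean that the prediction rises once the substructure is hidden, and a positive one that it falls, matching the sign convention used for Fig. 5.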
Chemistry-intuitive interpretation of MaskMol and virtual screening on the EP4 target. a, b and c, d show MaskMol’s attribution for biological activity and toxicity, respectively. a, b Attribution maps for four compounds targeting HRH3 and DAT; black indicates true values, green indicates predictions. c Attribution of three mutagenic compounds. d Attribution scores for functional groups occurring more than 20 times, with blue indicating negative and red indicating positive impact on mutagenicity. The model achieved a high ROC-AUC of 0.90. e t-SNE of MaskMol representations for EP4 compounds: gray = train data, dark gray = inhibitors, red/blue = patent set inhibitors/non-inhibitors. Red stars mark known EP4 inhibitors. f Prediction vs. ground truth for the test set (left) and the patent set (right). The black dotted line represents y = x; the closer a point is to this line, the warmer its color
In summary, this visualization provides evidence that MaskMol is aware of subtle structural differences and exploits them to make accurate predictions. Thus, MaskMol can provide meaningful and fresh SAR insights to help medicinal chemists in structural optimization and de novo design.
Virtual screening using MaskMol
The EP4 receptor has been widely investigated and recognized as a promising drug target for cancer immunotherapy [68]. We manually collected data from multiple sources, including BindingDB [69], the ChEMBL database, and patent libraries targeting EP4. Molecules were canonicalized using RDKit and duplicate SMILES were removed, resulting in a final dataset of 1633 molecules. We evaluated the performance of MaskMol on the EP4 target with a random split of 8:1:1. MaskMol achieves a low RMSE on the test set (RMSE = 0.577), and the predicted values are linearly correlated with the label values (Fig. 5f left, \(\text {R}^2\) = 0.789). The t-SNE visualization of the latent space shows a clear boundary between inhibitors and non-inhibitors (Fig. 5e, gray dots). To test the generalization ability of MaskMol, we constructed an additional patent set (131 molecules) from extended patents and literature as an external validation set (\(\text {R}^2\) = 0.755). Inhibitors and non-inhibitors in the patent set were also clearly separated. MaskMol identified 9 known EP4 inhibitors, which we visualized in the embedding space (Fig. 5e), suggesting that MaskMol learns discriminative structural information. These nine molecules (Grapiprant [70], L001 [71], CJ-042794 [72], MK-2894 [73], CR6086 [74], ONO-4578 [75], E7046 [76], HL-43 [77], and AMX12006 [78]) have been validated (through cell assays, clinical trials, or other evidence) as potential EP4 inhibitors. These findings demonstrate the ability of MaskMol to provide robust and generalizable molecular representations and inhibitor predictions, making it an efficient and effective virtual screening method.
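For orientation, the 8:1:1 random split and the \(\text {R}^2\) score can be sketched as follows. The rounding of subset sizes, the random seed, and the use of the coefficient of determination (rather than, say, a squared Pearson correlation) are assumptions; the text does not specify them, and the toy predictions are made up.

```python
import numpy as np


def random_split(n, fractions=(0.8, 0.1, 0.1), seed=0):
    """Shuffle indices 0..n-1 and cut them into train/validation/test parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n)
    n_train = int(fractions[0] * n)
    n_valid = int(fractions[1] * n)
    return order[:n_train], order[n_train:n_train + n_valid], order[n_train + n_valid:]


def r_squared(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# 1633 molecules is the EP4 dataset size stated in the text.
train_idx, valid_idx, test_idx = random_split(1633)

# Toy labels and predictions to exercise the metric.
r2 = r_squared([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
```

With truncation toward zero, the 1633 molecules divide into 1306 training, 163 validation, and 164 test molecules; a different rounding rule would shift these counts by at most one.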
Discussion
To enhance both the efficiency and predictive capability of our framework, we explored knowledge distillation by transferring expertise from our image-based MaskMol model (teacher) to a lightweight GCN (student). This distillation process yielded a 12.4% improvement in average RMSE and a 7.9% reduction in \(\text {RMSE}_{\text {cliff}}\) on activity cliff estimation tasks (see Additional file 1: Table S9). These results demonstrate that distillation reduces computational demands while enhancing performance, transferring critical knowledge from image-based representations to graph-structured models. Furthermore, we evaluated the impact of molecular representation dimensionality by comparing 3D conformation images against conventional 2D image inputs for activity cliff prediction (see Additional file 1: Table S10). Our experiments show that 3D conformation images consistently reduce both average RMSE and \(\text {RMSE}_{\text {cliff}}\) compared to 2D representations, achieving an overall 3.7% improvement in predictive accuracy. This indicates that explicitly encoding 3D structural information improves the model’s ability to discern and predict activity cliffs. Building on these insights, future work will pursue three synergistic directions: (1) integrating multimodal chemical knowledge (e.g., structural fingerprints, chemical reaction pathways) to enrich molecular representations; (2) advancing explicit 3D conformational image modeling through contrastive pre-training on conformational ensembles; and (3) developing multimodal distillation frameworks to unify knowledge transfer across image and 3D graph representations.
Conclusions
In the field of early-stage drug discovery, machine learning is gaining prominence, yet the concept of activity cliffs remains underexplored. Activity cliffs, which refer to structurally similar molecules with significant differences in potency, are critical for virtual screening and developing models that understand complex structure-activity relationships. Traditional graph-based methods often struggle with representation collapse due to high similarity between activity cliffs. To address this, we developed MaskMol, a knowledge-guided self-supervised learning framework utilizing molecular images. MaskMol employs three pre-training tasks with pixel masking, incorporating atomic, bond, and motif knowledge. This approach enables MaskMol to effectively learn local molecular regions and detect subtle changes in activity cliffs. Experimental results confirm MaskMol’s superior accuracy in predicting activity cliffs and its performance compared to other state-of-the-art algorithms. Extensive experiments and ablation studies validate the effectiveness of each MaskMol component and determine the optimal ratio for knowledge-guided pixel masking. Furthermore, MaskMol identifies critical substructures responsible for activity cliffs through visualization, enhancing researchers’ understanding of compounds and facilitating the drug discovery process. This study not only raises awareness about activity cliffs but also introduces a novel method for molecular image representation learning and virtual screening, advancing drug discovery and providing new insights into structure-activity relationships.
Methods
Knowledge-guided masked pixel prediction
Definition
A molecule’s 2D information can usually be represented as a graph \(G=(\mathcal {V}, \mathcal {E})\) with atoms \(\mathcal {V}\) as nodes and edges \(\mathcal {E}\) given by covalent bonds. In our experiments, however, the molecule is expressed as an image \(x \in \mathbb {R}^{H \times W \times C}\), where \((H, W)\) is the resolution of the molecular image and \(C\) is the number of channels.
Atom-level masked pixel prediction
We counted the atom types of molecules in the pre-training data and selected the ten most frequent atom types (e.g., C, N, O, Cl). These ten atom types serve as pseudo-labels for the atom-level masked pixel prediction (AMPP). Formally, the molecular image set and the pseudo-labels are \(\{x_i \in \mathbb {R}^{224 \times 224 \times 3}\}_{i=1}^{N}\) and \(y^{atom} \in \{0,1,\ldots ,9\}^{10}\), respectively. For each \(x_i\), masking yields the set of atom mask images \(M = \{M_j\}_{j=1}^{N_{atom}}\). We randomly sample \(M\) with a masking ratio \(\gamma\) to obtain the subset \(M^* = \{M_j\}_{j=1}^{m}\), where \(m = \gamma \cdot N_{atom}\) denotes the number of mask images in the subset. We then obtain the masked atom image \(\tilde{x}_i = x_i \Theta M^*\), where \(\Theta\) indicates setting to white the pixels of \(x_i\) that correspond to the white pixel area in \(M^*\). Following ViT [36], we divide a masked atom image \(\tilde{x}_i\) into regular non-overlapping patches. To save computation time and make the model pay more attention to the masked patches, we only calculate the loss on the masked patches \(\Omega (\tilde{x}_i)\). Finally, the cost function of the AMPP task is as follows:
where \(f_\theta\) and \(\theta\) refer to the mapping function and corresponding parameters of the molecular encoder, \(\omega\) represents the parameters of the fully connected classification layers, and \(\ell\) is the cross-entropy (CE) loss function.
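The masking steps of AMPP (sample \(m = \gamma \cdot N_{atom}\) atom masks, whiten the covered pixels, and restrict the loss to the patches that were touched) can be illustrated with a small pure-Python sketch. Everything here is an assumption made for illustration: the image is a single-channel grid of integers, each atom mask is a list of `(row, col)` pixel coordinates, \(m\) is truncated to an integer with a minimum of one, and the function names are hypothetical. The actual pipeline operates on 224 × 224 × 3 images and a ViT encoder.

```python
import random

WHITE = 255  # pixel value used to blank out masked regions


def sample_masks(masks, gamma, rng):
    """Pick m = int(gamma * N_atom) per-atom masks at random (at least one)."""
    m = max(1, int(gamma * len(masks)))  # truncation and the floor of 1 are assumptions
    return rng.sample(masks, m)


def apply_masks(image, chosen):
    """Return a copy of the image with all pixels under the chosen masks set to white."""
    out = [row[:] for row in image]
    for mask in chosen:
        for r, c in mask:
            out[r][c] = WHITE
    return out


def masked_patches(chosen, patch):
    """Indices (patch_row, patch_col) of non-overlapping patches containing masked pixels."""
    return sorted({(r // patch, c // patch) for mask in chosen for r, c in mask})


# Toy usage: an 8x8 black image, two "atoms", 4x4 patches.
image = [[0] * 8 for _ in range(8)]
atom_masks = [[(0, 0), (0, 1)], [(5, 5)]]
chosen = sample_masks(atom_masks, gamma=1.0, rng=random.Random(0))
masked = apply_masks(image, chosen)
loss_patches = masked_patches(chosen, patch=4)  # only these patches would enter the loss
```

Computing the classification loss only on `loss_patches` mirrors the role of \(\Omega (\tilde{x}_i)\) in the text.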
Bond-level masked pixel prediction
The workflow of the bond-level masked pixel prediction (BMPP) is similar to that of AMPP, and the difference is that there are only four bond types, i.e., single, double, triple, and aromatic, and the pseudo-labels are \({y^{bond}} \in {\left\{ {0,1,2,3} \right\} ^4}\). The BMPP loss function is defined as follows:
Motif-level masked pixel prediction
Breaking of Retrosynthetically Interesting Chemical Substructures (BRICS [79]), which is based on chemical reaction templates, was used to partition molecules into functional groups. However, the functional group vocabulary obtained through BRICS is somewhat redundant. To address this, we applied two rules defined in MGSSL [11] to eliminate redundant functional groups, obtaining a motif vocabulary of 9854 motifs. To reduce the time and memory cost of the MMPP task, we kept the 200 most frequent motifs and removed molecules that contain none of them. The motif-level masked pixel prediction (MMPP) process is otherwise the same as AMPP. The differences are that the pseudo-labels are \(y^{motif} \in \{0,1,\ldots ,199\}^{200}\) and that we randomly sample only one motif in \(M\) as \(M^*\), so that masked motifs do not overlap and the model can extract accurate motif information. Note that the loss is computed by classifying the classification token feature \(\tilde{x}_i^{cls}\). The MMPP loss function is defined as follows:
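The vocabulary reduction described above (keep the most frequent motifs, then drop molecules that contain none of them) can be sketched with the standard library. This is an illustrative sketch only: the input is assumed to be a hypothetical dictionary mapping a molecule identifier to its list of motif strings (for example, produced by a BRICS decomposition), and counting each motif once per molecule is an assumption, since the text does not say how occurrences are counted.

```python
from collections import Counter


def top_motifs(mol_motifs, k):
    """Return the k most frequent motifs, counting each motif once per molecule."""
    counts = Counter(m for motifs in mol_motifs.values() for m in set(motifs))
    return [motif for motif, _ in counts.most_common(k)]


def filter_molecules(mol_motifs, vocab):
    """Keep molecules with at least one in-vocabulary motif, restricted to those motifs."""
    vocab_set = set(vocab)
    kept = {}
    for mol, motifs in mol_motifs.items():
        hits = [m for m in motifs if m in vocab_set]
        if hits:
            kept[mol] = hits
    return kept


# Toy usage with k=2 in place of the 200 motifs used in the paper.
toy = {"mol1": ["a", "b"], "mol2": ["a"], "mol3": ["c"], "mol4": ["b", "a"]}
vocab = top_motifs(toy, k=2)
kept = filter_molecules(toy, vocab)
labels = {motif: idx for idx, motif in enumerate(vocab)}  # pseudo-label ids 0..k-1
```

The position of a motif in `vocab` then serves as its pseudo-label index, analogous to the range 0 to 199 in the text.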
Pre-training and fine-tuning
Here, we used ViT as our molecular encoder. After using data augmentations and masking to obtain masking molecular images \({{\tilde{x}}_i}\), we forward these images \({{\tilde{x}}_i}\) to the ViT model to extract latent features \({{f_\theta }\left( {{{\tilde{x}}_i}} \right) }\). Then, these latent features are used by three pretext tasks to calculate the total cost function \(\mathcal {L}\), which is defined as
In order to pretrain our MaskMol, we first gathered 2 million unlabeled molecules with drug-like properties from the PubChem database [80]. We divided the 2M pre-training data into a training set (95%) and a validation set (5%), and judged the pre-training performance through the accuracy of each task. Finally, the AMPP, BMPP, and MMPP accuracy can reach 99.3%, 98.0%, and 89.6%, respectively. After the initial pre-training phase, we proceed to fine-tune the pre-trained encoder for the specific downstream tasks. In particular, we incorporate an extra fully connected layer after the encoder. The output dimension of this layer is set to match the number of categories associated with the downstream tasks.
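The display equations for the three pretext losses and the total objective are not reproduced in this text. Under the definitions given in the Methods (encoder \(f_\theta\), classification parameters \(\omega\), cross-entropy \(\ell\), masked patches \(\Omega (\tilde{x}_i)\), and classification token feature \(\tilde{x}_i^{cls}\)), one consistent way to write them is sketched below. This is an assumption and not necessarily the authors' exact notation: \(g_\omega\) is a name introduced here for the fully connected classification layers, the averaging over \(N\) images is assumed, and the unweighted sum in the total objective is assumed because no task weights are mentioned.

```latex
% Hedged reconstruction; g_\omega, the 1/N averaging, and equal task weights are assumptions.
\[
\mathcal{L}_{AMPP} = \frac{1}{N}\sum_{i=1}^{N}
  \ell\Big(g_\omega\big(f_\theta(\Omega(\tilde{x}_i))\big),\; y_i^{atom}\Big)
\]
\[
\mathcal{L}_{BMPP} = \frac{1}{N}\sum_{i=1}^{N}
  \ell\Big(g_\omega\big(f_\theta(\Omega(\tilde{x}_i))\big),\; y_i^{bond}\Big)
\]
\[
\mathcal{L}_{MMPP} = \frac{1}{N}\sum_{i=1}^{N}
  \ell\Big(g_\omega\big(\tilde{x}_i^{cls}\big),\; y_i^{motif}\Big)
\]
\[
\mathcal{L} = \mathcal{L}_{AMPP} + \mathcal{L}_{BMPP} + \mathcal{L}_{MMPP}
\]
```

In each task \(\tilde{x}_i\) denotes the image masked for that task, and the classification layers would in practice have separate parameters per task.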
Training details
Baselines
The results for MLP [54], GBM [55], RF [52], SVM [51], KNN [53], AFP [81], MPNN [30], GAT [29], GCN [28], CNN [82], and LSTM [83] are taken from MoleculeACE [25]. The results for MolCLR [44], GROVER [43], GEM [47], and InstructBio [46] are taken from InstructBio. We additionally ran experiments on the activity cliff estimation datasets following the same experimental settings used in Mole-BERT [20], EdgePred [45], GraphMVP [49], 3DInfomax [48], ImageMol [21], and CGIP [50].
Evaluation metrics
The overall performance of MaskMol was quantified via the mean absolute error (MAE) or root-mean-square error (RMSE) computed on the bioactivity values (i.e., \(\textrm{pK}_{i}\) or \(\textrm{pIC}_{50}\)): \(\text {MAE}=\frac{1}{n} \sum \nolimits _{i=1}^{n}|y_i-\hat{y}_i|\), \(\text {RMSE} = \sqrt{\frac{1}{n}\sum \nolimits _{i = 1}^n {{{\left( {{{\hat{y}}_i} - {y_i}} \right) }^2}} }\), where \({{{\hat{y}}_i}}\) is the predicted bioactivity of the ith molecule, \({{y_i}}\) is the corresponding experimental value, and n represents the total number of molecules. On activity cliffs, the performance of MaskMol was quantified by computing the root-mean-square error (\(\text {RMSE}_{\text {cliff}}\)) on compounds that belonged to at least one activity cliff pair: \(\mathrm{{RMS}}{\mathrm{{E}}_{\mathrm{{cliff}}}} = \sqrt{\frac{1}{{{n_c}}}\sum \nolimits _{i = 1}^{{n_c}} {{{\left( {{{\hat{y}}_i} - {y_i}} \right) }^2}} }\), where \({{{\hat{y}}_i}}\) is the predicted bioactivity of the ith compound, \({{y_i}}\) is the corresponding experimental value, and \({{n_c}}\) represents the total number of compounds on activity cliffs.
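The three metrics defined above translate directly into code. The following is a minimal sketch using only the standard library; the function names and the boolean `is_cliff` flag (marking compounds that belong to at least one activity cliff pair) are illustrative choices, not part of the original work.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error over all molecules."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root-mean-square error over all molecules."""
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def rmse_cliff(y_true, y_pred, is_cliff):
    """RMSE restricted to compounds flagged as part of at least one activity cliff pair."""
    pairs = [(t, p) for t, p, flag in zip(y_true, y_pred, is_cliff) if flag]
    return rmse([t for t, _ in pairs], [p for _, p in pairs])


# Toy usage: one large error, located on a cliff compound.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 5.0, 4.0]
is_cliff = [False, False, True, True]
print(mae(y_true, y_pred), rmse(y_true, y_pred), rmse_cliff(y_true, y_pred, is_cliff))
```

In the toy example the single error falls on a cliff compound, so the cliff-restricted RMSE (about 1.41) exceeds the overall RMSE (1.0), which is the situation \(\text {RMSE}_{\text {cliff}}\) is designed to expose.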
Hyperparameters
MaskMol is pre-trained with the SGD optimizer with a learning rate of 0.01, weight decay of \(10^{-5}\), momentum of 0.9, and batch size of 128 for approximately 2 days on 4 NVIDIA A100 GPUs (40 GB). In downstream tasks, the pre-trained model is fine-tuned with the SGD optimizer with batch size in [8, 16, 32, 64], learning rate in [5e-5, 5e-4, 5e-3], weight decay of \(10^{-5}\), and momentum of 0.9, on Ubuntu 18.04.1 with a 15-vCPU Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90 GHz and an NVIDIA 4090 GPU (20 GB). Further computational efficiency analysis is given in Additional file 1: Table S11.
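The fine-tuning search space listed above is small enough to enumerate. The sketch below builds every combination of the stated batch sizes and learning rates with the fixed weight decay and momentum. Whether the authors searched the grid exhaustively is not stated, so the exhaustive enumeration, the dictionary keys, and the `select_best` helper with its user-supplied scoring function are all assumptions for illustration.

```python
from itertools import product

BATCH_SIZES = [8, 16, 32, 64]
LEARNING_RATES = [5e-5, 5e-4, 5e-3]


def finetune_grid():
    """All batch-size / learning-rate combinations with the fixed SGD settings."""
    return [
        {"batch_size": b, "lr": lr, "weight_decay": 1e-5, "momentum": 0.9}
        for b, lr in product(BATCH_SIZES, LEARNING_RATES)
    ]


def select_best(configs, score_fn):
    """Return the configuration with the lowest score (e.g., validation RMSE)."""
    return min(configs, key=score_fn)


# Toy usage: a made-up score stands in for training and validating a model.
grid = finetune_grid()
best = select_best(grid, lambda cfg: abs(cfg["lr"] - 5e-4) + abs(cfg["batch_size"] - 32))
```

The grid contains 4 × 3 = 12 configurations; in a real run, `score_fn` would fine-tune the pre-trained encoder with the given configuration and return a validation metric.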
Data availability
All code and materials are freely available at GitHub (https://github.com/ZhixiangCheng/MaskMol) and Zenodo (https://doi.org/10.5281/zenodo.15834481) [84]. The full dataset used in this study is available at figshare (https://figshare.com/articles/dataset/The_full_dataset_used_in_MaskMol_/29518031?file=56094683).
Abbreviations
- SAR: Structure-activity relationships
- ACE: Activity cliff estimation
- CPP: Compound potency prediction
- KLD: Kullback-Leibler divergence
- DL: Deep learning
- SME: Substructure-mask interpretation
- AMPP: Atom-level masked pixel prediction
- CE: Cross-entropy
- BMPP: Bond-level masked pixel prediction
- BRICS: Breaking of retrosynthetically interesting chemical substructures
- MMPP: Motif-level masked pixel prediction
- MAE: Mean absolute error
- RMSE: Root-mean-square error
References
Fleming N. How artificial intelligence is changing drug discovery. Nature. 2018;557(7706):S55–S55.
Zeng X, Wang F, Luo Y, Kang Sg, Tang J, Lightstone FC, et al. Deep generative molecular design reshapes drug discovery. Cell Rep Med. 2022;3:100794. https://doi.org/10.1016/j.xcrm.2022.100794.
Vert JP. How will generative ai disrupt data science in drug discovery? Nat Biotechnol. 2023;41:750–1. https://doi.org/10.1038/s41587-023-01789-6.
Diao Y, Liu D, Ge H, Zhang R, Jiang K, Bao R, et al. Macrocyclization of linear molecules by deep learning to facilitate macrocyclic drug candidates discovery. Nat Commun. 2023;14(1):4552.
Flam-Shepherd D, Zhu K, Aspuru-Guzik A. Language models can learn complex molecular distributions. Nat Commun. 2022;13(1):3293.
Mahmood O, Mansimov E, Bonneau R, Cho K. Masked graph modeling for molecule generation. Nat Commun. 2021;12(1):3156.
Yang X, Fu L, Deng Y, Liu Y, Cao D, Zeng X. GPMO: Gradient Perturbation-Based Contrastive Learning for Molecule Optimization. In: IJCAI. 2023. pp. 4940–8.
Jin W, Barzilay R, Jaakkola T. Junction tree variational autoencoder for molecular graph generation. In: International conference on machine learning. PMLR; 2018. pp. 2323–32.
Jin W, Barzilay R, Jaakkola T. Hierarchical generation of molecular graphs using structural motifs. In: International conference on machine learning. Online: PMLR; 2020. pp. 4839–48.
Xue D, Zhang H, Chen X, Xiao D, Gong Y, Chuai G, et al. X-MOL: large-scale pre-training for molecular understanding and diverse molecular analysis. Sci Bull. 2022;67(9):899–902.
Zhang Z, Liu Q, Wang H, Lu C, Lee CK. Motif-based graph self-supervised learning for molecular property prediction. Adv Neural Inf Process Syst. 2021;34:15870–82.
You Y, Chen T, Sui Y, Chen T, Wang Z, Shen Y. Graph contrastive learning with augmentations. Adv Neural Inf Process Syst. 2020;33:5812–23.
Xiang H, Jin S, Xia J, Zhou M, Wang J, Zeng L, et al. An image-enhanced molecular graph representation learning framework. In: Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence. Jeju, Korea: IJCAI; 2024. pp. 6107–15.
Luo S, Chen T, Xu Y, Zheng S, Liu TY, Wang L, et al. One Transformer Can Understand Both 2D & 3D Molecular Data. In: The Eleventh International Conference on Learning Representations. Kigali, Rwanda: ICLR; 2023.
Guo Z, Sharma P, Martinez A, Du L, Abraham R. Multilingual Molecular Representation Learning via Contrastive Pre-training. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Dublin, Ireland: ACL; 2022. pp. 3441–53.
Li H, Zhang R, Min Y, Ma D, Zhao D, Zeng J. A knowledge-guided pre-training framework for improving molecular representation learning. Nat Commun. 2023;14(1):7568.
Xiang H, Zeng L, Hou L, Li K, Fu Z, Qiu Y, et al. A molecular video-derived foundation model for scientific drug discovery. Nat Commun. 2024;15(1):9696.
Hou L, Xiang H, Zeng X, Cao D, Zeng L, Song B. Attribute-guided prototype network for few-shot molecular property prediction. Brief Bioinform. 2024;25(5):bbae394.
Zhang X, Xiang H, Yang X, Dong J, Fu X, Zeng X, et al. Dual-view learning based on images and sequences for molecular property prediction. IEEE J Biomed Health Inform. 2023;28(3):1564–74.
Xia J, Zhao C, Hu B, Gao Z, Tan C, Liu Y, et al. Mole-bert: Rethinking pre-training graph neural networks for molecules. In: The Eleventh International Conference on Learning Representations. Virtual Event: ICLR; 2022.
Zeng X, Xiang H, Yu L, Wang J, Li K, Nussinov R, et al. Accurate prediction of molecular properties and drug targets using a self-supervised image representation learning framework. Nat Mach Intell. 2022;4(11):1004–16.
Hendrickson JB. Concepts and applications of molecular similarity. Science. 1991;252(5009):1189–90.
Stumpfe D, Bajorath J. Exploring activity cliffs in medicinal chemistry: miniperspective. J Med Chem. 2012;55(7):2932–42.
Wedlake AJ, Folia M, Piechota S, Allen TE, Goodman JM, Gutsell S, et al. Structural alerts and random forest models in a consensus approach for receptor binding molecular initiating events. Chem Res Toxicol. 2019;33(2):388–401.
van Tilborg D, Alenicheva A, Grisoni F. Exposing the limitations of molecular machine learning with activity cliffs. J Chem Inf Model. 2022;62(23):5938–51.
Deng J, Yang Z, Wang H, Ojima I, Samaras D, Wang F. A systematic study of key elements underlying molecular property prediction. Nat Commun. 2023;14(1):6395.
Xia J, Zhang L, Zhu X, Liu Y, Gao Z, Hu B, et al. Understanding the limitations of deep models for molecular property prediction: Insights and solutions. Adv Neural Inf Process Syst. 2023;36:64774–92.
Kipf TN, Welling M. Semi-supervised classification with graph convolutional networks. In: International Conference on Learning Representations. France: ICLR; 2017.
Veličković P, Cucurull G, Casanova A, Romero A, Liò P, Bengio Y. Graph Attention Networks. In: International Conference on Learning Representations. Vancouver, Canada: ICLR; 2018.
Gilmer J, Schoenholz SS, Riley PF, Vinyals O, Dahl GE. Neural message passing for quantum chemistry. In: International conference on machine learning. Sydney, NSW, Australia: PMLR; 2017. pp. 1263–72.
Li Q, Han Z, Wu XM. Deeper insights into graph convolutional networks for semi-supervised learning. In: Proceedings of the AAAI conference on artificial intelligence. Louisiana, USA: AAAI; 2018. vol. 32.
He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. Las Vegas, USA: IEEE; 2016. pp. 770–8.
Rogers D, Hahn M. Extended-connectivity fingerprints. J Chem Inf Model. 2010;50(5):742–54.
Devlin J, Chang MW, Lee K, Toutanova K. Bert: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers). Florence, Italy: ACL; 2019. pp. 4171–86.
He K, Chen X, Xie S, Li Y, Dollár P, Girshick R. Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. LA, USA: IEEE; 2022. pp. 16000–9.
Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: International Conference on Learning Representations. Virtual Event: ICLR; 2021.
Kim W, Son B, Kim I. Vilt: Vision-and-language transformer without convolution or region supervision. In: International Conference on Machine Learning. Online: PMLR; 2021. pp. 5583–94.
Brown T, Mann B, Ryder N, Subbiah M, Kaplan JD, Dhariwal P, et al. Language models are few-shot learners. Adv Neural Inf Process Syst. 2020;33:1877–901.
Radford A, Narasimhan K, Salimans T, Sutskever I, et al. Improving language understanding by generative pre-training. OpenAI blog. 2018. https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf.
Hu Y, Bajorath J. Extending the activity cliff concept: structural categorization of activity cliffs and systematic identification of different types of cliffs in the ChEMBL database. J Chem Inf Model. 2012;52(7):1806–11.
Stumpfe D, Hu H, Bajorath J. Advances in exploring activity cliffs. J Comput Aided Mol Des. 2020;34:929–42.
Chithrananda S, Grand G, Ramsundar B. ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction. 2020. Preprint at https://doi.org/10.48550/arXiv.2010.09885.
Rong Y, Bian Y, Xu T, Xie W, Wei Y, Huang W, et al. Self-supervised graph transformer on large-scale molecular data. Adv Neural Inf Process Syst. 2020;33:12559–71.
Wang Y, Wang J, Cao Z, Barati Farimani A. Molecular contrastive learning of representations via graph neural networks. Nat Mach Intell. 2022;4(3):279–87.
Hu W, Liu B, Gomes J, Zitnik M, Liang P, Pande V, et al. Strategies for Pre-training Graph Neural Networks. In: International Conference on Learning Representations. Virtual Event: ICLR; 2020.
Wu F, Qin H, Gao W, Li S, Coley CW, Li SZ, et al. InstructBio: A Large-scale Semi-supervised Learning Paradigm for Biochemical Problems. 2023. arXiv preprint arXiv:2304.03906.
Fang X, Liu L, Lei J, He D, Zhang S, Zhou J, et al. Geometry-enhanced molecular representation learning for property prediction. Nat Mach Intell. 2022;4(2):127–34.
Stärk H, Beaini D, Corso G, Tossou P, Dallago C, Günnemann S, et al. 3d infomax improves gnns for molecular property prediction. In: International Conference on Machine Learning. Baltimore, Maryland, USA: PMLR; 2022. pp. 20479–502.
Liu S, Wang H, Liu W, Lasenby J, Guo H, Tang J. Pre-training Molecular Graph Representation with 3D Geometry. In: International Conference on Learning Representations. Virtual Event: ICLR; 2022.
Xiang H, Jin S, Liu X, Zeng X, Zeng L. Chemical structure-aware molecular image representation learning. Brief Bioinform. 2023;24(6):bbad404.
Zhang T. An introduction to support vector machines and other kernel-based learning methods. AI Mag. 2001;22(2):103.
Breiman L. Bagging predictors. Mach Learn. 1996;24:123–40.
Fix E, Hodges JL. Discriminatory analysis: nonparametric discrimination, consistency properties. Int Stat Rev. 1989;57(3):238–47.
LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521(7553):436–44.
Friedman JH. Greedy function approximation: a gradient boosting machine. Ann Stat. 2001;29(5):1189–1232. https://doi.org/10.1214/aos/1013203451.
Kullback S, Leibler RA. On information and sufficiency. Ann Math Statist. 1951;22(1):79–86.
Lee DH, et al. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In: Workshop on Challenges in Representation Learning, PMLR; Vol. 3, 2013, no. 2.
Torng W, Altman RB. Graph convolutional neural networks for predicting drug-target interactions. J Chem Inf Model. 2019;59(10):4131–49.
Sakai M, Nagayasu K, Shibui N, Andoh C, Takayama K, Shirakawa H, et al. Prediction of pharmacological activities from chemical structures with graph convolutional neural networks. Sci Rep. 2021;11(1):525.
Li P, Wang J, Qiao Y, Chen H, Yu Y, Yao X, et al. An effective self-supervised framework for learning expressive molecular global representations to drug discovery. Brief Bioinform. 2021;22(6):bbab109.
Fay MP, Proschan MA. Wilcoxon-Mann-Whitney or t-test? On assumptions for hypothesis tests and multiple interpretations of decision rules. Stat Surv. 2010;4:1.
Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE international conference on computer vision. Hawaii: IEEE; 2017. pp. 618–26.
Luo D, Cheng W, Xu D, Yu W, Zong B, Chen H, et al. Parameterized explainer for graph neural network. Adv Neural Inf Process Syst. 2020;33:19620–31.
Wu Z, Wang J, Du H, Jiang D, Kang Y, Li D, et al. Chemistry-intuitive explanation of graph neural networks for molecular property prediction with substructure masking. Nat Commun. 2023;14(1):2585.
Wu Z, Jiang D, Wang J, Hsieh CY, Cao D, Hou T. Mining toxicity information from large amounts of toxicity data. J Med Chem. 2021;64(10):6924–36.
Xu C, Cheng F, Chen L, Du Z, Li W, Liu G, et al. In silico prediction of chemical Ames mutagenicity. J Chem Inf Model. 2012;52(11):2840–7.
Polishchuk PG, Kuz’min VE, Artemenko AG, Muratov EN. Universal approach for structural interpretation of QSAR/QSPR models. Mol Inform. 2013;32(9–10):843–53.
Peng S, Hu P, Xiao YT, Lu W, Guo D, Hu S, et al. Single-cell analysis reveals EP4 as a target for restoring T-cell infiltration and sensitizing prostate cancer to immunotherapy. Clin Cancer Res. 2022;28(3):552–67.
Liu T, Lin Y, Wen X, Jorissen RN, Gilson MK. BindingDB: a web-accessible database of experimentally determined protein–ligand binding affinities. Nucleic Acids Res. 2007;35(Suppl_1):D198-201.
Nakao K, Murase A, Ohshiro H, Okumura T, Taniguchi K, Murata Y, et al. CJ-023,423, a novel, potent and selective prostaglandin EP4 receptor antagonist with antihyperalgesic properties. J Pharmacol Exp Ther. 2007;322(2):686–94.
He J, Lin X, Meng F, Zhao Y, Wang W, Zhang Y, et al. A novel small molecular prostaglandin receptor EP4 antagonist, L001, suppresses pancreatic cancer metastasis. Molecules. 2022;27(4):1209.
Murase A, Okumura T, Sakakibara A, Tonai-Kachi H, Nakao K, Takada J. Effect of prostanoid EP4 receptor antagonist, CJ-042,794, in rat models of pain and inflammation. Eur J Pharmacol. 2008;580(1–2):116–21.
Blouin M, Han Y, Burch J, Farand J, Mellon C, Gaudreault M, et al. The discovery of 4-\(\{\)1-[(\(\{\)2, 5-dimethyl-4-[4-(trifluoromethyl) benzyl]-3-thienyl\(\}\) carbonyl) amino] cyclopropyl\(\}\) benzoic acid (MK-2894), a potent and selective prostaglandin E2 subtype 4 receptor antagonist. J Med Chem. 2010;53(5):2227–38.
Caselli G, Bonazzi A, Lanza M, Ferrari F, Maggioni D, Ferioli C, et al. Pharmacological characterisation of CR6086, a potent prostaglandin E 2 receptor 4 antagonist, as a new potential disease-modifying anti-rheumatic drug. Arthritis Res Ther. 2018;20:1–19.
Kotani T, Takano H, Yoshida T, Hamasaki R, Kohanbash G, Takeda K, et al. Inhibition of PGE2/EP4 pathway by ONO-4578/BMS-986310, a novel EP4 antagonist, promotes T cell activation and myeloid cell differentiation to dendritic cells. Cancer Res. 2020;80(16_Supplement):4443.
Albu DI, Wang Z, Huang KC, Wu J, Twine N, Leacu S, et al. EP4 antagonism by E7046 diminishes myeloid immunosuppression and synergizes with Treg-reducing IL-2-diphtheria toxin fusion protein in restoring anti-tumor immunity. Oncoimmunology. 2017;6(8):e1338239.
Jin Y, Liu Q, Chen P, Zhao S, Jiang W, Wang F, et al. A novel prostaglandin E receptor 4 (EP4) small molecule antagonist induces articular cartilage regeneration. Cell Discov. 2022;8(1):24.
Das D, Qiao D, Liu Z, Xie L, Li Y, Wang J, et al. Discovery of novel, selective prostaglandin EP4 receptor antagonists with efficacy in cancer models. ACS Med Chem Lett. 2023;14(6):727–36.
Degen J, Wegscheid-Gerlach C, Zaliani A, Rarey M. On the Art of Compiling and Using ‘Drug-Like’ Chemical Fragment Spaces. ChemMedChem. 2008;3(10):1503–7.
Kim S, Chen J, Cheng T, Gindulyte A, He J, He S, et al. PubChem 2019 update: improved access to chemical data. Nucleic Acids Res. 2019;47(D1):D1102–9.
Chen C, Ye W, Zuo Y, Zheng C, Ong SP. Graph networks as a universal machine learning framework for molecules and crystals. Chem Mater. 2019;31(9):3564–72.
Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural networks. Adv Neural Inf Process Syst. 2012;25:1097–105.
Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput. 1997;9(8):1735–80.
Cheng Z, Xiang H, Ma P, Zeng L, Jin X, Yang X, et al. MaskMol: knowledge-guided molecular image pre-training framework for activity cliffs with Pixel Masking. Zenodo. 2025. https://doi.org/10.5281/zenodo.15834481.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Grant Nos. 62450002, 62202153, 62272151, 62372159, 62302156, 61972138, 62106073, 62122025, 62102140 and U22A2037), the Hunan Provincial Natural Science Foundation of China (Grant Nos. 2024JJ4015, 2023JJ40180, 2022JJ20016 and 2021JJ10020), the Science and Technology Innovation Program of Hunan Province (Grant Nos. 2022RC1100 and 2022RC1099), the Postgraduate Scientific Research Innovation Project of Hunan Province (Grant No. CX20220380), and the project of the Hunan Provincial Key Laboratory of Anti-Resistance Microbial Drugs (No. 2023TP1013).
Author information
Contributions
B.S. conceived the study. Z.C. and H.X. implemented the pipeline, constructed the databases, developed the code, and performed all experiments. Z.C., H.X., P.M., J.L., X.J., X.Y., B.S., L.Z., X.F., Y.D., C.D., and X.Z. performed data analyses. Z.C., H.X., and B.S. discussed and interpreted all results. B.S., H.X., and X.Z. wrote and critically revised the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Supplementary Information
12915_2025_2389_MOESM1_ESM.docx
Additional file 1: Contains comparison with more baselines, pre-training dataset scale analysis, potential optimizations, computational efficiency, and statistical significance. Tables S1 to S4 report performance evaluations using 3D-based models, and Tables S5 and S6 focus on multi-modal models. Tables S7 and S8 present the impact of pre-training data size on activity cliff estimation using RMSE and RMSEcliff, respectively. Table S9 demonstrates performance improvements achieved by transferring knowledge from image-based representations to graph structures. Table S10 validates that explicit 3D structural encoding significantly enhances cliff prediction. Table S11 summarizes computational efficiency during both pre-training and fine-tuning stages. Table S12 provides statistical significance test results. Figures S1 to S3 offer visual evidence, including cliff molecule conformation, comparisons between 2D and 3D fingerprints, and structural visualization of cliff molecules.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Cheng, Z., Xiang, H., Ma, P. et al. MaskMol: knowledge-guided molecular image pre-training framework for activity cliffs with pixel masking. BMC Biol 23, 279 (2025). https://doi.org/10.1186/s12915-025-02389-3
DOI: https://doi.org/10.1186/s12915-025-02389-3