- Methodology
- Open access
YModPred: an interpretable prediction method for multi-type RNA modification sites in S. cerevisiae based on deep learning
BMC Biology volume 23, Article number: 272 (2025)
Abstract
Background
RNA post-transcriptional modifications involve the addition of chemical groups to RNA molecules or alterations to their local structure. These modifications can change RNA base pairing, affect thermal stability, and influence RNA folding, thereby impacting alternative splicing, translation, cellular localization, stability, and interactions with proteins and other molecules. Accurate prediction of RNA modification sites is essential for understanding modification mechanisms.
Results
We propose a novel deep learning model, YModPred, which accurately predicts multiple types of RNA modification sites in S. cerevisiae based on RNA sequences. YModPred combines convolution and self-attention mechanisms to enhance the model’s ability to capture global sequence information and improve local feature learning. Comparative analysis against benchmark models demonstrates that YModPred outperforms existing state-of-the-art methods in predicting various RNA modification types. Additionally, the model’s prediction performance is further validated through visualization and motif analysis.
Conclusions
YModPred is a deep learning-based model that effectively captures sequence features and dependencies, enabling accurate prediction of multi-type RNA modification sites in S. cerevisiae. We believe it will facilitate further research into the mechanisms of RNA modifications.
Background
Post-transcriptional RNA modifications are chemical changes that occur in RNA molecules after transcription is completed. These modifications typically involve the addition of chemical groups to ribonucleotides or alterations to the local structure of RNA [1], and they are commonly found in mRNA, tRNA, rRNA, and non-coding RNA [2, 3]. RNA modifications play a crucial role in cell growth, development, and various biological functions [4]. To date, more than 100 types of RNA modifications have been discovered [5, 6], but only a few are commonly studied for predictive purposes, primarily N6-methyladenosine (m6A), 5-methylcytosine (m5C), N1-methyladenosine (m1A), pseudouridine (Ψ), 5-formylcytosine (f5C), 5-methyluridine (m5U), N1-methylguanosine (m1G), N2-methylguanosine (m2G), N2,N2-dimethylguanosine (m2,2G), and dihydrouridine (D). Different types of modification sites have distinct functional mechanisms. For example, m6A modification is closely related to RNA stability, splicing, translation, and cell differentiation [7]. m5C modification is associated with various biological processes, including metabolism, mRNA regulation, and stress responses [8]. Additionally, research has shown that different types of RNA modifications are linked to the occurrence and progression of various diseases [9]. Therefore, accurately identifying RNA modification sites not only provides deeper insights into gene expression and modification mechanisms but also plays a significant role in the discovery of new therapeutic targets and drug development.
RNA modification sites can be identified using various high-resolution methods, such as m6A-CLIP, m5C-RIP, m1A-seq, Ψ-seq, and techniques based on RNA bisulfite sequencing [10,11,12,13,14]. Although these sequencing technologies and certain biological experimental methods have proven to be reliable, they are typically time-consuming, labor-intensive, and not suitable for large-scale data analysis, with relatively low detection efficiency. Consequently, researchers have developed computational methods based on machine learning [15], framing the prediction of RNA modification sites as a binary classification task and training machine learning models to distinguish between modification and non-modification sites.
In recent years, many specific prediction tools based on machine learning or deep learning have been developed for different types of modifications. For example, in predicting multi-type modification sites, Feng et al. [16] proposed iRNA-PseColl, a computational method that utilizes sequence information and the support vector machine (SVM) algorithm. Through pseudo K-tuple nucleotide composition (PseKNC) encoding, this method can identify m6A, m1A, and m5C modification sites. The iRNA-3typeA method [17] uses SVM and PseKNC to construct a prediction model capable of identifying m1A, m6A, and adenosine-to-inosine (A-to-I) modification sites in both H. sapiens and M. musculus transcriptomes. The iMRM method [18] combines extreme gradient boosting (XGBoost) with various feature encodings to predict m6A, m5C, m1A, Ψ, and A-to-I modification sites in H. sapiens, M. musculus, and S. cerevisiae and also offers a convenient online prediction platform. The DeepPromise [19] method effectively predicts m1A and m6A using deep learning techniques combined with various feature encoding methods. The iRNA-Mod-CNN method [20] employs binary encoding as input and uses a convolutional neural network (CNN) to extract key deep features, which are then combined with Kmer features to predict m1A, m6A, and m5C. MultiRM [21] is a multi-label learning deep learning model based on the attention mechanism, capable of predicting 12 different types of RNA modification sites in H. sapiens. Rm-LR [22] is a novel RNA modification site prediction method based on long-range deep learning and an RNA language pre-training model. It can identify 10 types of RNA modification sites in H. sapiens, significantly outperforming existing methods and improving the model’s interpretability.
Although many computational methods have been developed to identify RNA modification sites, advancing the study of different modification types, existing methods still have certain limitations. First, most computational methods focus only on a few common types of modifications, especially in non-H. sapiens organisms like S. cerevisiae, where these methods can identify at most four types of modifications. Second, while some models perform well, there remains room for improvement in interpretability. Finally, prediction methods developed for S. cerevisiae typically use shorter sequences, with longer sequences rarely considered.
In this study, we propose a deep learning prediction method named YModPred, which can accurately predict multiple types of RNA modification sites in S. cerevisiae (the overall framework is shown in Fig. 1). The prediction model uses a novel encoder that combines convolutional modulation and self-attention mechanisms. This combination improves both the capture of global sequence information and the learning of local features, thereby uncovering deeper sequence information to improve prediction performance. Additionally, we explore the interpretability of the prediction model through feature visualization and motif analysis, and we use in silico mutagenesis to analyze the importance of the sequences surrounding the modification sites. YModPred supports the prediction of ten types of modifications: m6A, m5C, m1A, Ψ, f5C, D, m5U, m1G, m2G, and m2,2G. Experimental results demonstrate that YModPred’s prediction ability surpasses that of existing state-of-the-art (SOTA) methods.
Results and discussions
Prediction performance of YModPred on multi-type modification sites
We first evaluated the performance of YModPred using fivefold cross-validation (fivefold CV) and independent testing. The results are presented in Table 1. For m1G, m2G, and m2,2G, only fivefold CV was performed. For m6A, YModPred achieved an ACC of over 81% in both fivefold CV and independent testing while maintaining a good balance between Sn and Sp. Similarly, for f5C, the ACC was 81.34% in fivefold CV and 80.78% in independent testing, with Sn and Sp nearly balanced in both cases. For m5C, an ACC of 83.66% and an AUC of 0.900 were achieved in fivefold CV; in independent testing, the ACC was 84.24% and the AUC was 0.912. For D and m5U, the model showed particularly good predictive performance. For example, for D the model achieved an ACC of 98.15% and an AUC of 0.996 in independent testing, with similarly high performance in fivefold CV. However, because the sample sizes for these two modifications are small, these metrics may be overestimated, and further data collection is needed to validate these results. A similar caveat applies to m1G, m2G, and m2,2G, for which only fivefold CV results, although high, were available because no independent test sets exist. YModPred's performance for m1A and Ψ was more moderate. For example, the ACC values for Ψ in fivefold CV and independent testing were 77.02% and 78.07%, respectively. In summary, YModPred obtained high ACC values across multiple modification types, indicating strong overall predictive ability.
Performance comparison with traditional feature-based classification models
To evaluate the proposed model more comprehensively in identifying multi-type RNA modification sites, we compared it with several prediction methods based on traditional hand-crafted features. These features include Kmer, nucleic acid composition (NAC) [23], accumulated nucleotide frequency (ANF) [24], composition of k-spaced nucleic acid pairs (CKSNAP) [25], adaptive skip dinucleotide composition (ASDC) [26], dinucleotide-based auto covariance (DAC) [27], and pseudo dinucleotide composition (PseDNC) [23] (for detailed descriptions, see Additional file 1). Based on these features, four classical classification algorithms were used for model training and performance evaluation: light gradient boosting machine (LGBM), logistic regression (LR), random forest (RF), and SVM. The comparison results are shown in Fig. 2. As illustrated in Fig. 2A–D, Kmer performs strongly across most classifiers, especially with SVM, where it achieves a higher ACC than all other features. NAC and PseDNC also perform well across most classifiers, but their performance varies among algorithms; notably, NAC is slightly inferior to Kmer and PseDNC with the RF classifier. ANF, which generally performed well with some classifiers such as LGBM and SVM, showed high accuracy in predicting m5U, suggesting that ANF may have particular advantages for specific modification types. In contrast, DAC performs weakly across all classifiers, with the lowest accuracy on LR, indicating that it is not well suited to this task. Overall, Kmer not only performs well on the various modifications but also maintains consistently high accuracy across classifiers.
Subsequently, we compared the ACC, AUC, and MCC values of models trained with Kmer on the four traditional classifiers (LGBM, LR, RF, and SVM) with those of YModPred, as shown in Fig. 3. The results indicate that YModPred outperforms the models built on traditional hand-crafted features for all types of modification sites. Its advantage is most pronounced for D, m5U, m1G, m2G, and m2,2G, where it achieves an ACC of over 98%, with both AUC and MCC above 0.9. In contrast, the best traditional classifier, SVM, reaches an ACC of only 85% to 89% for these modification sites. For m6A and m5C, YModPred achieved an ACC improvement of 1.77% to 3.96%. Compared with the four traditional classifiers, YModPred therefore shows clear advantages in ACC, AUC, and MCC for multi-type RNA modification sites.
To examine what the model learned and how well its features separate the classes, we used t-distributed stochastic neighbor embedding (t-SNE) to visualize the features learned by the model and compare them with the traditional Kmer feature. Fig. 4 shows the distribution and discriminative ability of the two feature types across five types of modification sites. In the t-SNE visualizations, the feature representations learned by YModPred cluster better and show clearer boundaries in low-dimensional space than the Kmer features. For example, for m5C, the positive and negative samples can be distinguished using the features learned by YModPred, whereas with the Kmer feature there is no obvious boundary between them. This indicates that the model has learned more discriminative feature representations. These features remain separable in the reduced-dimensional space, which supports accurate identification of modification sites.
Multi-type RNA modification motif analysis
YModPred employs a multi-head attention mechanism to mine deeper sequence information, which fully accesses the input RNA sequence and selects the elements that are critical for generating prediction results. This mechanism enables the model to allocate corresponding attention to specific nucleotides in the RNA sequence based on the requirements of different prediction tasks. By visualizing the attention weights—which indicate the relative importance of each nucleotide in the input RNA sequence—we can see which parts of the input sequence the model focuses on when making predictions. Therefore, by extracting and visualizing the attention weights for different RNA modification sites, we can more intuitively demonstrate the learning capability of the model.
The comparison results between the motifs learned by YModPred and the motifs discovered by the traditional tool STREME [28] are shown in Fig. 5. Each motif represents a specific nucleotide sequence pattern, which is represented by a graphical stacking of bases (A, C, G, U), where the vertical size of each base reflects its relative frequency or conservation in the sequence. The results indicate that most motifs align with the sequence patterns revealed by STREME. To evaluate the similarity between the motifs obtained by YModPred and STREME, we used the motif comparison tool TOMTOM [29]. TOMTOM calculates p-values to quantify the similarity between motifs, with smaller p-values indicating greater similarity. It employs a statistical hypothesis testing framework, where the null hypothesis assumes that the observed alignment between two motifs arises by chance, based on a random background model. Under this framework, the p-value represents the likelihood of observing the given or higher alignment similarity if the null hypothesis is true. A smaller p-value suggests that the observed similarity is unlikely to be due to chance, indicating a significant similarity between the motifs. For example, for m5C, using TOMTOM to compare the motif discovered by STREME with the motif learned by YModPred yielded a p-value of 1.50e-04. This small p-value indicates a high consistency between the two motifs. Highly similar motifs were also found for other modifications. This confirms that the developed model can learn conserved sequence features and demonstrates the model’s strong ability to learn features.
It is worth noting that the motifs learned by YModPred show a higher emphasis on the modification sites, as indicated by the increased information content, measured in bits (a unit from information theory indicating sequence conservation), at the modification positions compared to the flanking regions. This observation suggests that YModPred, as a deep learning-based model, prioritizes the most discriminative and conserved sequence features directly linked to the modification site to optimize prediction performance. In contrast, STREME, as a traditional motif discovery tool, distributes the information content more evenly across the sequence context, as it aims to identify statistically significant patterns over a broader region. This contrast reflects the differing objectives of the two methods: YModPred focuses on features most relevant to accurate modification prediction, while STREME identifies general sequence motifs with statistical enrichment.
Analysis of modification site sequence length and regional importance
We conducted in silico mutagenesis (ISM) [30] experiments to evaluate and interpret the bioinformatics model and to explore the importance of different regions in the RNA sequence. The core of the ISM experiment involves systematically mutating single nucleotides in the RNA sequence and observing how each mutation affects the model’s prediction output. This process allows for an in-depth analysis of the features and patterns learned by the model. Specifically, for the RNA sequences in this study, which are 601 nucleotides long, mutations were made at all positions except for the central one. This process helps identify which changes significantly impact model predictions and infers the importance of these positions in the RNA modification process.
For the sequence region analysis of multi-type modification sites, we binned the ISM results. We divided the 300 positions on either side of the central site into 15 bins, each containing 20 consecutive positions. The key metric is the absolute difference between the prediction values of the mutated sequence and the original sequence. We calculated the mean and variance of these differences for positive and negative samples in each bin; the results, shown in Fig. 6, reveal how changes at positions in different regions affect model predictions. Specifically, for m6A, importance gradually increases in the regions from −15 to −9 and 9 to 15, with a faster increase in the middle section. The importance for m5C is more pronounced in the −2 to 2 region, while other regions show considerable fluctuations. The importance for m1A is mainly concentrated in the −1 to 2 region, while Ψ shows importance in the −2 to 2 region, with larger fluctuations elsewhere. The importance for f5C is concentrated in the −1 to 1 region, with relatively small fluctuations elsewhere. The importance for D is primarily in the −2 to −1 region, with lower importance in other regions. m5U shows higher importance in the −4 to 2 region. For m1G and m2G, importance is mainly concentrated in the −1 to 2 region. The importance for m2,2G is primarily at positions −1 and 2, with some fluctuations at other positions. In summary, the 10 types of RNA modifications do not follow a uniform pattern in their important regions. However, the central region near the modification site is more important for model predictions, and the pattern is asymmetric between the two sides.
These analysis results reveal the distribution of importance for different RNA modification sites within the sequence, providing valuable insights for further understanding the role of these modifications in RNA function.
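The ISM and binning procedure can be illustrated with a minimal Python sketch. Several details here are assumptions rather than the authors' implementation: `toy_predict` is a hypothetical stand-in for the trained model, each position is mutated to all three alternative nucleotides and the absolute differences are averaged (the text does not say how alternatives are chosen), and bins are labeled with signed indices from −15 to 15. The sketch processes one sequence; the study aggregates the mean and variance over positive and negative samples separately.

```python
import random

NUCLEOTIDES = "ACGU"

def toy_predict(seq):
    """Hypothetical stand-in for a trained model; returns a score in [0, 1]."""
    center = len(seq) // 2
    window = seq[center - 2 : center + 3]  # only the 5 nt around the center matter
    return window.count("A") / len(window)

def ism_effects(seq, predict):
    """Mean absolute change in prediction for every non-central position."""
    center = len(seq) // 2
    base = predict(seq)
    effects = {}
    for pos, ref in enumerate(seq):
        if pos == center:
            continue  # the candidate modification site itself is not mutated
        diffs = []
        for alt in NUCLEOTIDES:
            if alt != ref:
                mutated = seq[:pos] + alt + seq[pos + 1 :]
                diffs.append(abs(predict(mutated) - base))
        effects[pos - center] = sum(diffs) / len(diffs)  # key: offset from center
    return effects

def bin_effects(effects, bin_size=20):
    """Average effects into signed bins: offsets 1..20 -> 1, -1..-20 -> -1, etc."""
    sums, counts = {}, {}
    for offset, value in effects.items():
        index = (abs(offset) - 1) // bin_size + 1
        index = index if offset > 0 else -index
        sums[index] = sums.get(index, 0.0) + value
        counts[index] = counts.get(index, 0) + 1
    return {b: sums[b] / counts[b] for b in sorted(sums)}

random.seed(0)
sequence = "".join(random.choice(NUCLEOTIDES) for _ in range(601))
effects = ism_effects(sequence, toy_predict)
binned = bin_effects(effects)
```

With the toy scorer, only the two innermost bins receive non-zero importance, which mirrors the kind of center-weighted profile described above.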
Comparison with the state-of-the-art methods
To evaluate the performance of YModPred on multi-type RNA modification sites, we compared it with the SOTA methods. For the existing methods, we retrained these models using the updated dataset and compared them with YModPred; the results are summarized in Table 2. The findings show that YModPred significantly outperforms the existing methods on most types of modification sites. For example, in the prediction of m6A, YModPred achieved an ACC of 82.57% in independent testing, which is 12.83% higher than iMRM [18]. Notably, in the prediction of Ψ and m1A, YModPred not only surpassed other methods in terms of ACC but also demonstrated superior performance in important metrics such as AUC and MCC in independent testing. Similarly, for m5C, YModPred achieved an ACC of 84.24%, an AUC of 0.912, a Sn of 84.24%, a Sp of 84.24%, and an MCC of 0.685, outperforming both iMRM and iRNA-m5C [18, 31]. Additionally, for D and m5U, the prediction performance of YModPred is also better than that of the existing methods. These results indicate that YModPred exhibits high stability and exceptional performance in predicting multi-type RNA modification sites, demonstrating its significant advantages and broad application potential.
Conclusion
RNA modifications are essential for regulating gene expression, RNA stability, and translation efficiency. Accurate prediction of these modifications is crucial for understanding their functional significance and underlying regulatory mechanisms. Despite their importance, current research on modification site prediction in S. cerevisiae has been limited to only a few types. In this study, we propose a deep learning model called YModPred to predict multi-type RNA modification sites in S. cerevisiae. The YModPred model effectively captures local sequence features and long-range dependencies in the sequence by combining convolutional modulation and multi-head attention mechanisms, achieving high-accuracy RNA modification site predictions. Cross-validation and independent testing results show that YModPred performs well on ten types of RNA modifications. Furthermore, comparative experiments show that YModPred outperforms existing methods in predicting various types of modification sites. Additionally, by analyzing and comparing the motifs of different modification sites, we found that the model learned conserved sequence features. Finally, we also used ISM experiments to study the importance of different RNA modification site sequence regions.
While YModPred demonstrates promising performance across multiple RNA modification types, several limitations should be acknowledged. First, the limited availability of training and testing data for most modification types—apart from m6A—makes it difficult to ensure robustness and generalizability. In particular, the high accuracy observed for rare modifications should be interpreted with caution, as these results may reflect data scarcity rather than true predictive capability. Independent test sets were unavailable for some types (e.g., m1G, m2G, and m2,2G), and although fivefold CV provided internal validation, it cannot fully substitute for external validation. Consequently, these performance metrics may overestimate real-world applicability. Furthermore, the small sample sizes for rare modifications inherently increase the risk of overfitting. While interpretability analyses suggest that YModPred captures biologically meaningful features, translating these insights into concrete biological understanding remains an open challenge. To further validate the performance of YModPred and address these limitations, future work will incorporate strategies such as transfer learning, data augmentation, and cross-species integration, and will leverage newly available data to improve the model’s robustness, scalability, and interpretability.
Methods
Dataset collection and processing
In this study, the sequence data for multiple types of RNA modification sites in S. cerevisiae were obtained primarily from databases, including RMDiseaseV2.0 and RMBase v2.0, and from related literature [35,36,37]. The modification site types involved are m6A, m5C, m1A, Ψ, m5U, f5C, m1G, m2G, m2,2G, and D. The m6A modification site data were sourced from the MTDeepM6A-2S method developed by Wang et al. [38]. Data for the other modification sites were collected as follows. First, the positions of the modification sites of each type were obtained from the relevant databases. Next, the entire chromosome sequences were downloaded. Based on the positions of the modification sites, sequences of 601 nt centered on the modification sites were extracted from the chromosome sequences as positive samples. Negative samples were extracted from the same RNA transcripts as their corresponding positive samples; specifically, sequences of the same length that did not contain modification sites were used. To minimize the impact of sequence similarity on model performance, CD-HIT [39] was used to remove samples with sequence similarity higher than 70%, thereby constructing a high-quality benchmark dataset. Each dataset was constructed to be class-balanced, containing an equal number of positive and negative samples, all of fixed length (601 nt). Finally, the dataset for each modification type was split into a training set (80%) and an independent test set (20%). Detailed information about the ten benchmark datasets is presented in Table 3.
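The window extraction and the 80/20 split described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' pipeline: the function names are hypothetical, sites are taken as 0-based indices, windows that would run past a chromosome end are skipped (the text does not say how such sites were handled), and strand handling and CD-HIT filtering are left out.

```python
import random

FLANK = 300  # 300 nt on each side of the site gives 2 * 300 + 1 = 601 nt

def extract_window(chromosome, site, flank=FLANK):
    """Return the window centered on 0-based index `site`, or None near the ends."""
    start, end = site - flank, site + flank + 1
    if start < 0 or end > len(chromosome):
        return None  # assumption: incomplete windows are dropped
    return chromosome[start:end]

def split_train_test(samples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the samples and hold out `test_fraction` of them."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Toy chromosome and site list, for illustration only.
rng = random.Random(1)
chromosome = "".join(rng.choice("ACGT") for _ in range(5000))
sites = [100, 400, 2500, 4800]
windows = [w for w in (extract_window(chromosome, s) for s in sites) if w is not None]
train, test = split_train_test(list(range(10)))
```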
Model construction
The YModPred model architecture consists of three main components: the embedding module, the encoder module, and the prediction module, as shown in Fig. 1B. First, the embedding layer converts the nucleotides in the RNA sequence into numerical vector representations that can be processed by the neural network. This conversion captures the biochemical properties of the nucleotides and their specific positions within the sequence. Next, the encoder module combines convolution and multi-head attention mechanisms to effectively capture both local and global features of RNA sequences. Finally, the prediction layer maps the output of the encoder to a two-dimensional space for classification through a series of fully connected layers, predicting modification sites in RNA sequences. Below, we provide a detailed description of these three key components.
Embedding module
The primary function of the embedding layer is to convert the RNA sequence into a matrix composed of two key components: nucleotide embedding and positional encoding [40]. First, the nucleotide embedding maps each nucleotide in the RNA sequence into a vector space of predefined dimension \({d}_{m}\) (the embedding dimension), which captures the biochemical properties of the nucleotide. Second, the positional encoding adds positional information to each nucleotide vector. By using fixed-frequency positional encoding, calculated through a combination of sine and cosine functions, the model can determine the specific position of each nucleotide in the sequence. The positional encoding is combined with the nucleotide embedding vectors through simple addition, so that positional information is incorporated while the original semantic information is preserved [40]. The mathematical description of the embedding module is as follows:
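A form consistent with the symbol definitions that follow is given here as a reconstruction. It assumes the fixed-frequency sinusoidal encoding commonly used with Transformer models, with base 10000 (the base is not stated in the text):

\[
PE(i)_{2j} = \sin\left(\frac{i}{10000^{2j/d_m}}\right), \qquad PE(i)_{2j+1} = \cos\left(\frac{i}{10000^{2j/d_m}}\right)
\]

\[
\text{Embedding}_{\text{final}}(N_i) = E_i + PE(i)
\]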
where \({N}_{i}\) represents the \(i\)-th nucleotide in the sequence. \({E}_{i}\) is the nucleotide embedding vector of \({N}_{i}\), with a dimensionality of \({d}_{m}\) set to 32. \(PE\left(i\right)\) represents the positional encoding for \({N}_{i}\), where \(j\) is the index of the embedding dimension. The final embedding of \({N}_{i}\), \({\text{Embedding}}_{\text{final}}\left({N}_{i}\right)\), is obtained by combining its learnable embedding vector \({E}_{i}\) with the corresponding positional encoding \(PE\left(i\right)\).
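As a minimal sketch of this module, the Python code below builds a sinusoidal positional encoding and adds it to a per-nucleotide embedding table. The base of 10000 and the randomly initialized `table` (standing in for the learnable embeddings) are assumptions for illustration; only \({d}_{m}=32\) comes from the text.

```python
import math
import random

def positional_encoding(length, d_model, base=10000.0):
    """Fixed sinusoidal encoding: sine on even dimensions, cosine on odd ones."""
    pe = [[0.0] * d_model for _ in range(length)]
    for i in range(length):
        for j in range(0, d_model, 2):
            angle = i / base ** (j / d_model)
            pe[i][j] = math.sin(angle)
            if j + 1 < d_model:
                pe[i][j + 1] = math.cos(angle)
    return pe

def embed(sequence, table, d_model):
    """Add the positional encoding to each nucleotide's embedding vector."""
    pe = positional_encoding(len(sequence), d_model)
    return [[e + p for e, p in zip(table[nt], pe[i])]
            for i, nt in enumerate(sequence)]

D_MODEL = 32
rng = random.Random(0)
# Random vectors stand in for embeddings that would be learned during training.
table = {nt: [rng.gauss(0.0, 1.0) for _ in range(D_MODEL)] for nt in "ACGU"}
matrix = embed("AUGGCUA", table, D_MODEL)  # 7 x 32 matrix
```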
Encoder module
The output feature matrix generated by the embedding module is subsequently fed into the encoder module to extract higher-level representations of the RNA sequence. In this study, we used a novel encoder module that combines a convolutional neural network (CNN) encoder and a transformer encoder, aiming to capture both local and global features within RNA sequences more efficiently [41]. Specifically, the CNN encoder in this model is inspired by the Conv2Former structure used in computer vision [42], but it has been adapted to RNA sequences by replacing the original 2D convolution with 1D convolution. This adaptation makes the encoder more suitable for processing one-dimensional biological sequence data. The transformer encoder primarily consists of a multi-head attention mechanism, a feed-forward neural network, and residual connections. The core components of the encoder, convolutional modulation and the multi-head attention mechanism, are detailed as follows:
Convolutional modulation
In the convolutional modulation layer, convolutional features are used to modulate the output, which simplifies the computation of the similarity score matrix used in the traditional self-attention mechanism [42]. For an embedding matrix \(E\) (\(E\in {\mathbb{R}}^{L\times {d}_{m}}\)) of a given RNA sequence of length \(L\), the convolutional modulation output \(O\) is calculated using depthwise convolution and the Hadamard product. The calculation process is as follows:
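Following the Conv2Former formulation [42] on which the layer is based, and writing \(E\) as an \(L\times {d}_{m}\) matrix so that the linear layers act on the right, the operation presumably takes the form below. This is a reconstruction from the symbol definitions that follow, not a verbatim equation:

\[
O = DC_k(E W_1) \bigodot (E W_2)
\]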
where \({W}_{1}\) and \({W}_{2}\) are the weight matrices of two linear layers, \(\bigodot\) denotes the Hadamard product, and \(D{C}_{k}\) denotes the depthwise convolution with a kernel size of \(k\). After this operation, each nucleotide in the sequence is associated with all nucleotides within a region of length \(k\) centered on itself. The output at each sequence position is a weighted aggregation of all nucleotides in that region, which helps the model capture the local context of the RNA sequence.
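A small pure-Python sketch of convolutional modulation is shown below, assuming the Conv2Former-style form \(O = DC_k(EW_1) \bigodot (EW_2)\) with zero padding at the sequence ends. The function names, the padding choice, and the toy weights are illustrative assumptions; biases, normalization, and any output projection are omitted.

```python
def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def depthwise_conv1d(x, kernels):
    """Apply one length-k filter per channel along the sequence axis.

    x is an L x d matrix; kernels holds d filters of k weights each.
    The output keeps length L by treating out-of-range positions as zero.
    """
    length, dim = len(x), len(x[0])
    k = len(kernels[0])
    half = k // 2
    out = [[0.0] * dim for _ in range(length)]
    for pos in range(length):
        for ch in range(dim):
            total = 0.0
            for t in range(k):
                src = pos + t - half
                if 0 <= src < length:
                    total += kernels[ch][t] * x[src][ch]
            out[pos][ch] = total
    return out

def conv_modulation(e, w1, w2, kernels):
    """Hadamard product of a depthwise-convolved branch and a linear branch."""
    a = depthwise_conv1d(matmul(e, w1), kernels)  # local context weights
    v = matmul(e, w2)                             # value branch
    return [[ai * vi for ai, vi in zip(ra, rv)] for ra, rv in zip(a, v)]

# Toy input: L = 4 positions, d = 2 channels, kernel size k = 3.
e = [[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [4.0, 3.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
kernels = [[1.0, 1.0, 1.0], [0.0, 1.0, 0.0]]
out = conv_modulation(e, identity, identity, kernels)
```

In this toy setting, channel 0 multiplies each value by the sum over its 3-position neighborhood, and channel 1 simply squares each value, which makes the role of the kernel window easy to check by hand.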
Multi-head attention mechanism
The multi-head attention mechanism is the core of the Transformer model, enabling the association of all relevant tokens to better encode each token in the input sequence [43]. It is widely used and performs exceptionally well in the field of bioinformatics. The multi-head attention mechanism consists of multiple self-attention mechanisms, and its primary function is to capture different types of internal correlations within the same sequence. In the self-attention mechanism, the embedding vector of each input token is first multiplied by learnable parameter matrices \({W}^{Q}\), \({W}^{K}\), and \({W}^{V}\) to generate the query vector \(Q\), key vector \(K\), and value vector \(V\). Next, the dot product of the query vector \(Q\) with each key vector \(K\) is computed to measure their similarity. To avoid vanishing gradients in the \(softmax\) function caused by excessively large dot products, each dot product is divided by \(\sqrt{{d}_{k}}\), where \({d}_{k}\) is the dimension of the key vector. The scaled dot products are then fed into the \(softmax\) function to obtain the weights for the values. Finally, the attention mechanism uses these weights to compute a weighted sum of the value vectors \(V\), generating an output vector that contains information from the entire sequence. The mathematical formula for this process is as follows:
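The standard scaled dot-product form, which matches the steps described above, is:

\[
\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V
\]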
When applying the multi-head attention mechanism, \(n\) different attention heads are utilized. After calculating the attention output for each head, multiple distinct attention results are obtained. These outputs are then concatenated and passed through another linear transformation matrix \({W}^{O}\) to generate the final output. The specific calculation formula is as follows:
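In the standard formulation, which agrees with the definitions that follow, this reads:

\[
\text{MultiHead}(Q,K,V) = \text{Concat}\left(\text{head}_1,\ldots,\text{head}_n\right)W^{O}
\]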
where \(n\) represents the number of heads, \({\text{head}}_{i}=\text{Attention}\left({Q}_{i},{K}_{i},{V}_{i}\right)\), and \({W}^{O}\) is the learnable linear transformation matrix used to combine the outputs of all the heads.
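The two steps, scaled dot-product attention per head followed by concatenation and projection, can be sketched in pure Python as below. The random weights and toy dimensions are assumptions for illustration, and splitting one projected matrix into per-head column slices is a common implementation choice rather than something the text specifies.

```python
import math
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    peak = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [v / total for v in exps]

def attention(q, k, v):
    """Scaled dot-product attention; returns the output and the weight matrix."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

def multi_head_attention(x, wq, wk, wv, wo, n_heads):
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    d_head = len(q[0]) // n_heads
    heads = []
    for h in range(n_heads):
        cols = slice(h * d_head, (h + 1) * d_head)  # columns owned by head h
        out, _ = attention([r[cols] for r in q],
                           [r[cols] for r in k],
                           [r[cols] for r in v])
        heads.append(out)
    # Concatenate the head outputs position by position, then project with W^O.
    concat = [sum((head[i] for head in heads), []) for i in range(len(x))]
    return matmul(concat, wo)

rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

x = rand_matrix(5, 4)  # 5 positions, model dimension 4
wq, wk, wv, wo = (rand_matrix(4, 4) for _ in range(4))
out = multi_head_attention(x, wq, wk, wv, wo, n_heads=2)
_, weights = attention(x, x, x)
```

Each row of `weights` sums to 1 and gives the relative attention one position pays to every other position, which is the quantity visualized in the motif analysis.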
Prediction module
The output of the encoding module is an \(L\times {d}_{m}\) feature matrix, where L represents the sequence length. In this feature matrix, each row represents the contextual vector of the corresponding nucleotide at that position. We apply average pooling across all rows to obtain a one-dimensional global representation vector for the RNA sequence. This global representation vector is then fed into the prediction module. The prediction module mainly consists of fully connected layers and nonlinear activation functions. Through this module, the features are ultimately mapped to a two-dimensional space for classification tasks to determine the probability that the central site of the RNA sequence is a modification site. The mathematical formula for the prediction layer is as follows:
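A form consistent with the definitions that follow is sketched here as a reconstruction. In it, \(\sigma\) stands for the nonlinear activation, which the text does not name; bias terms are omitted because none are defined; and \(X_m^{(i)}[j]\) denotes the \(j\)-th component of the last-layer output for sample \(i\):

\[
X_l = \sigma\left(W_d X_{l-1}\right), \qquad l = 1,\ldots,m
\]

\[
P_{ij} = \frac{\exp\left(X_m^{(i)}[j]\right)}{\sum_{c=1}^{C}\exp\left(X_m^{(i)}[c]\right)}
\]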
where \(m\) represents the number of layers in the fully connected network, \({W}_{d}\) represents the weight matrix of the current layer, and \({X}_{l-1}\) represents the output of the previous layer. \({X}_{m}\) is the output of the last fully connected layer, while \({X}_{0}\) is the output of the encoder layer. In addition, \(i\) denotes the sample index, \(j\) denotes the class index, and \(C\) is the number of classes. In this study, \(C\) is equal to 2. \({P}_{ij}\) represents the probability that sample \(i\) belongs to class \(j\), with values ranging from 0 to 1. For classification, a threshold of 0.5 is applied to the predicted probability of class 1: if \({P}_{ij}\) for class 1 is greater than or equal to 0.5, the sample is classified as a modification site; otherwise, it is classified as a non-modification site.
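The pooling, fully connected layers, and thresholding described above can be sketched as follows. This is a hedged illustration, not the published architecture: the ReLU activation, the bias terms, the softmax output, the layer widths (64, 32, 2), and the name `predict_site` are all assumptions, since the text only specifies fully connected layers with nonlinear activations, two output classes, and a 0.5 threshold.

```python
import numpy as np

def predict_site(H, layers, threshold=0.5):
    # H: (L, d_m) encoder output; layers: list of (W, b) pairs, the last mapping to C = 2.
    x = H.mean(axis=0)                       # average pooling over the L positions
    for W, b in layers[:-1]:
        x = np.maximum(0.0, x @ W + b)       # hidden layer; ReLU is an assumption
    W, b = layers[-1]
    logits = x @ W + b
    e = np.exp(logits - logits.max())        # softmax over the two classes (assumed)
    probs = e / e.sum()
    # Class 1 is treated as "modification site", as in the text.
    return probs, bool(probs[1] >= threshold)

# Random stand-ins for a trained encoder output and trained weights.
rng = np.random.default_rng(0)
H = rng.normal(size=(41, 64))
layers = [
    (rng.normal(size=(64, 32)) * 0.1, np.zeros(32)),
    (rng.normal(size=(32, 2)) * 0.1, np.zeros(2)),
]
probs, is_modified = predict_site(H, layers)
```

Because \(C=2\) and the two probabilities sum to 1, thresholding the class 1 probability at 0.5 is equivalent to picking the class with the larger probability, with ties assigned to class 1.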
Performance evaluation and optimization
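To obtain a better predictive model, we employed fivefold CV and evaluated its performance using an independent test dataset (hereafter referred to as independent testing). Model performance was quantified using four evaluation metrics: sensitivity (Sn), specificity (Sp), accuracy (ACC), and Matthews correlation coefficient (MCC) [15, 44, 45]. The relevant calculation formulas are as follows:
\[\text{Sn}=\frac{\text{TP}}{\text{TP}+\text{FN}}\]
\[\text{Sp}=\frac{\text{TN}}{\text{TN}+\text{FP}}\]
\[\text{ACC}=\frac{\text{TP}+\text{TN}}{\text{TP}+\text{TN}+\text{FP}+\text{FN}}\]
\[\text{MCC}=\frac{\text{TP}\times \text{TN}-\text{FP}\times \text{FN}}{\sqrt{\left(\text{TP}+\text{FP}\right)\left(\text{TP}+\text{FN}\right)\left(\text{TN}+\text{FP}\right)\left(\text{TN}+\text{FN}\right)}}\]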
where TP represents true positives, FP represents false positives, TN represents true negatives, and FN represents false negatives. Sn measures the classifier's ability to correctly identify positive samples, while Sp measures the classifier's ability to correctly identify negative samples. ACC is the proportion of all samples that are classified correctly. MCC measures the correlation between the predicted and actual class labels, with a range of \(\left[-1,+1\right]\). In addition, the area under the receiver operating characteristic (ROC) curve (AUC) was used to evaluate the model's performance. The AUC ranges from 0 to 1, with values closer to 1 indicating better predictive performance.
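As a small worked example of these metrics, the sketch below computes Sn, Sp, ACC, and MCC from binary labels and a pairwise (rank-based) AUC from scores. The function names and the eight toy samples are hypothetical and are not data from this study. The code assumes that both classes are present, and it returns 0 for MCC when the denominator is zero, which is a common convention rather than something stated in the text.

```python
import math

def confusion_counts(y_true, y_pred):
    # Count TP, TN, FP, FN for binary labels (1 = modification site).
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def evaluate(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Sn": tp / (tp + fn),
        "Sp": tn / (tn + fp),
        "ACC": (tp + tn) / (tp + tn + fp + fn),
        "MCC": (tp * tn - fp * fn) / denom if denom else 0.0,
    }

def auc(y_true, scores):
    # Fraction of (positive, negative) pairs ranked correctly; ties count as half.
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Toy data: four positives and four negatives with predicted class 1 probabilities.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.6]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
metrics = evaluate(y_true, y_pred)
auc_value = auc(y_true, scores)
```

For this toy input there are three true positives, three true negatives, one false positive, and one false negative, so Sn, Sp, and ACC are all 0.75, MCC is 0.5, and the AUC is 0.9375.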
In the development of deep learning models, hyperparameter optimization plays a crucial role. Fivefold CV was used to evaluate the effectiveness of hyperparameter optimization, with ACC as the evaluation metric. The optimization ranges for some important hyperparameters are detailed in Additional file 2: Table S1. Additionally, we optimized the size of the convolutional kernels, and the details of the optimized parameters are shown in Additional file 2: Table S2. In this study, a kernel size of 3 was chosen as the configuration for convolution modulation. All experiments were conducted on a single NVIDIA RTX 4090 GPU. Detailed computational efficiency metrics of YModPred are provided in Additional file 2: Table S3.
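The selection procedure described above (choosing a hyperparameter value by its mean ACC over five folds) can be sketched as follows. Everything here is illustrative: `kfold_indices`, `select_by_cv`, and `toy_score` are hypothetical names, the candidate kernel sizes are placeholders rather than the search range in Table S2, and the simple shuffled split is an assumption, since the text does not state how the folds were constructed.

```python
import random

def kfold_indices(n_samples, k=5, seed=0):
    # Shuffle sample indices once, then deal them into k disjoint folds.
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, val

def select_by_cv(candidates, n_samples, fit_and_score, k=5):
    # fit_and_score(candidate, train_idx, val_idx) must return the validation ACC.
    mean_acc = {}
    for c in candidates:
        accs = [fit_and_score(c, tr, va) for tr, va in kfold_indices(n_samples, k)]
        mean_acc[c] = sum(accs) / len(accs)
    best = max(mean_acc, key=mean_acc.get)
    return best, mean_acc

def toy_score(kernel_size, train_idx, val_idx):
    # Placeholder standing in for "train the model, return ACC on the held-out fold".
    # The returned numbers are NOT results from the paper.
    return 1.0 - 0.1 * abs(kernel_size - 3)

best, mean_acc = select_by_cv((1, 3, 5, 7), n_samples=100,
                              fit_and_score=toy_score)
```

In practice `toy_score` would be replaced by a function that trains the model on `train_idx` and reports ACC on `val_idx`; the selected configuration is then checked once on the independent test set.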
Data availability
All data generated or analysed during this study are included in this published article, its supplementary information files, and publicly available repositories: the Zenodo repository (https://zenodo.org/records/16442129) and the GitHub repository (https://github.com/aochunyan123/YModPred.git).
Abbreviations
- m6A: N6-methyladenosine
- m5C: 5-Methylcytosine
- m1A: N1-methyladenosine
- Ψ: Pseudouridine
- f5C: 5-Formylcytosine
- m5U: 5-Methyluridine
- m1G: N1-methylguanosine
- m2G: N2-methylguanosine
- m2,2G: N2,N2-dimethylguanosine
- D: Dihydrouridine
- A-to-I: Adenosine-to-inosine
- SVM: Support vector machine
- CNN: Convolutional neural network
- XGBoost: Extreme Gradient Boosting
- SOTA: State-of-the-art
- PseKNC: Pseudo K-tuple nucleotide composition
- NAC: Nucleotide acid composition
- ANF: Accumulated nucleotide frequency
- CKSNAP: Composition of k-spaced nucleic acid pairs
- ASDC: Adaptive skip dinucleotide composition
- DAC: Dinucleotide-based auto covariance
- PseDNC: Pseudo dinucleotide composition
- LGBM: Light gradient boosting machines
- LR: Logistic regression
- RF: Random forest
- t-SNE: T-distributed stochastic neighbor embedding
- ISM: In silico mutagenesis
- fivefold CV: Five-fold cross-validation
- Sn: Sensitivity
- Sp: Specificity
- ACC: Accuracy
- MCC: Matthews correlation coefficient
- AUC: Area under the receiver-operator characteristic (ROC) curve
References
Carlile TM, Rojas-Duran MF, Gilbert WV. Pseudo-Seq: genome-wide detection of pseudouridine modifications in RNA. In: He C, editor. RNA modification. 2015;560:219–45.
Roundtree IA, Evans ME, Pan T, He C. Dynamic RNA modifications in gene expression regulation. Cell. 2017;169(7):1187–200.
Jaciuk M, Scherf D, Kaszuba K, Gaik M, Rau A, Koscielniak A, Krutyholowa R, Rawski M, Indyka P, Graziadei A, et al. Cryo-EM structure of the fully assembled Elongator complex. Nucleic Acids Res. 2023;51(5):2011–32.
Frye M, Jaffrey SR, Pan T, Rechavi G, Suzuki T. RNA modifications: what have we learned and where are we headed? Nat Rev Genet. 2016;17(6):365–72.
Chen KQ, Song BW, Tang YJ, Wei Z, Xu QR, Su JL, de Magalhaes JP, Rigden DJ, Meng J. RMDisease: a database of genetic variants that affect RNA modifications, with implications for epitranscriptome pathogenesis. Nucleic Acids Res. 2021;49(D1):D1396–404.
Zhang Y, Jiang J, Ma J, Wei Z, Wang Y, Song B, Meng J, Jia G, de Magalhaes JP, Rigden DJ, et al. DirectRMDB: a database of post-transcriptional RNA modifications unveiled from direct RNA sequencing technology. Nucleic Acids Res. 2023;51(D1):D106–16.
Yue Y, Liu J, He C. RNA N6-methyladenosine methylation in post-transcriptional gene expression regulation. Genes Dev. 2015;29(13):1343–55.
Acera Mateos P, J Sethi A, Ravindran A, Srivastava A, Woodward K, Mahmud S, Kanchi M, Guarnacci M, Xu J, W S Yuen Z et al. Prediction of m6A and m5C at single-molecule resolution reveals a transcriptome-wide co-occurrence of RNA modifications. Nat Commun. 2024;15(1):3899.
Delaunay S, Helm M, Frye M. RNA modifications in physiology and disease: towards clinical applications. Nat Rev Genet. 2024;25(2):104–22.
Ke S, Alemu EA, Mertens C, Gantman EC, Fak JJ, Mele A, et al. A majority of m6A residues are in the last exons, allowing the potential for 3′ UTR regulation. Genes Dev. 2015;29(19):2037–53.
Dominissini D, Nachtergaele S, Moshitch-Moshkovitz S, Peer E, Kol N, Ben-Haim MS, Dai Q, Di Segni A, Salmon-Divon M, Clark WC. The dynamic N1-methyladenosine methylome in eukaryotic messenger RNA. Nature. 2016;530(7591):441–6.
Edelheit S, Schwartz S, Mumbach MR, Wurtzel O, Sorek R. Transcriptome-wide mapping of 5-methylcytidine RNA modifications in bacteria, archaea, and yeast reveals m5C within archaeal mRNAs. PLoS Genet. 2013;9(6):e1003602.
Lovejoy AF, Riordan DP, Brown PO. Transcriptome-wide mapping of pseudouridines: pseudouridine synthases modify specific mRNAs in S. cerevisiae. PLoS One. 2014;9(10):e110799.
Khoddami V, Yerra A, Mosbruger TL, Fleming AM, Burrows CJ, Cairns BR. Transcriptome-wide profiling of multiple RNA modifications simultaneously at single-base resolution. Proc Natl Acad Sci U S A. 2019;116(14):6784–9.
Wang Y, Zhai Y, Ding Y, Zou Q. SBSM-Pro: support bio-sequence machine for proteins. Sci China Inf Sci. 2024;67(11):212106.
Feng P, Ding H, Yang H, Chen W, Lin H, Chou K-C. iRNA-psecoll: identifying the occurrence sites of different RNA modifications by incorporating collective effects of nucleotides into PseKNC. Molecular Therapy-Nucleic Acids. 2017;7:155–63.
Chen W, Feng P, Yang H, Ding H, Lin H, Chou K-C. iRNA-3typeA: identifying three types of modification at RNA’s adenosine sites. Molecular Therapy-Nucleic Acids. 2018;11:468–74.
Liu K, Chen W. IMRM: a platform for simultaneously identifying multiple kinds of RNA modifications. Bioinformatics. 2020;36(11):3336–42.
Chen Z, Zhao P, Li F, Wang Y, Smith AI, Webb GI, et al. Comprehensive review and assessment of computational methods for predicting RNA post-transcriptional modification sites from RNA sequences. Brief Bioinform. 2019;21(5):1676–96.
Tahir M, Hayat M, Chong KT. A convolution neural network-based computational model to identify the occurrence sites of various RNA modifications by fusing varied features. Chemom Intell Lab Syst. 2021;211:104233.
Song Z, Huang D, Song B, Chen K, Song Y, Liu G, et al. Attention-based multi-label neural networks for integrated prediction and interpretation of twelve widely occurring RNA modifications. Nat Commun. 2021;12(1):4011.
Liang S, Zhao Y, Jin J, Qiao J, Wang D, Wang Y, et al. Rm-lr: a long-range-based deep learning model for predicting multiple types of RNA modifications. Comput Biol Med. 2023;164:107238.
Chen Z, Zhao P, Li C, Li F, Xiang D, Chen Y-Z, et al. Ilearnplus: a comprehensive and automated machine-learning platform for nucleic acid and protein sequence analysis, prediction and visualization. Nucleic Acids Res. 2021;49(10):e60–e60.
Ma J, Zhang L, Chen J, Song B, Zang C, Liu H. M7GDisAI: N7-methylguanosine (m7G) sites and diseases associations inference based on heterogeneous network. BMC Bioinformatics. 2021;22:1–16.
Liu B. Bioseq-analysis: a platform for DNA, RNA and protein sequence analysis based on machine learning approaches. Brief Bioinform. 2019;20(4):1280–94.
Wei L, Zhou C, Chen H, Song J, Su R. ACpred-fl: a sequence-based predictor using effective feature representation to improve the prediction of anti-cancer peptides. Bioinformatics. 2018;34(23):4007–16.
Liu B, Liu F, Fang L, Wang X, Chou K-C. Repdna: a Python package to generate various modes of feature vectors for DNA sequences by incorporating user-defined physicochemical properties and sequence-order effects. Bioinformatics. 2014;31(8):1307–9.
Bailey TL. Streme: accurate and versatile sequence motif discovery. Bioinformatics. 2021;37(18):2834–40.
Gupta S, Stamatoyannopoulos JA, Bailey TL, Noble WS. Quantifying similarity between motifs. Genome Biol. 2007;8(2):1–9.
Nair S, Shrikumar A, Schreiber J, Kundaje A. FastISM: performant in silico saturation mutagenesis for convolutional neural networks. Bioinformatics. 2022;38(9):2397–403.
Lv H, Zhang Z-M, Li S-H, Tan J-X, Chen W, Lin H. Evaluation of different computational methods on 5-methylcytosine sites identification. Brief Bioinform. 2020;21(3):982–95.
Zhang X, Wang S, Xie L, Zhu Y. PseU-ST: a new stacked ensemble-learning method for identifying RNA pseudouridine sites. Front Genet. 2023. https://doi.org/10.3389/fgene.2023.1121694.
Wang Y, Wang X, Cui X, Meng J, Rong R. Self-attention enabled deep learning of dihydrouridine (D) modification on mRNAs unveiled a distinct sequence signature from tRNAs. Molecular Therapy - Nucleic Acids. 2023;31:411–20.
Feng P, Chen W. iRNA-m5U: a sequence based predictor for identifying 5-methyluridine modification sites in Saccharomyces cerevisiae. Methods. 2022;203:28–31.
Song B, Wang X, Liang Z, Ma J, Huang D, Wang Y, de Magalhães JP, Rigden DJ, Meng J, Liu G, et al. RMDisease V2.0: an updated database of genetic variants that affect RNA modifications with disease and trait implication. Nucleic Acids Res. 2022;51(D1):D1388–96.
Xuan J-J, Sun W-J, Lin P-H, Zhou K-R, Liu S, Zheng L-L, et al. RMbase v2.0: deciphering the map of RNA modifications from epitranscriptome sequencing data. Nucleic Acids Res. 2017;46(D1):D327–34.
Song BW, Tang YJ, Chen KQ, Wei Z, Rong R, Lu ZL, Su JL, de Magalhaes JP, Rigden DJ, Meng J. m7GHub: deciphering the location, regulation and pathogenesis of internal mRNA N7-methylguanosine (m(7)G) sites in human. Bioinformatics. 2020;36(11):3528–36.
Wang H, Zhao S, Cheng Y, Bi S, Zhu X. MTdeepm6a-2S: a two-stage multi-task deep learning method for predicting RNA N6-methyladenosine sites of Saccharomyces cerevisiae. Front Microbiol. 2022;13:999506.
Fu L, Niu B, Zhu Z, Wu S, Li W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics. 2012;28(23):3150–2.
Vaswani A. Attention is all you need. Advances in Neural Information Processing Systems. 2017.
Yang M, Huang L, Huang H, Tang H, Zhang N, Yang H, et al. Integrating convolution and self-attention improves language model of human genome for interpreting non-coding regions at base-resolution. Nucleic Acids Res. 2022;50(14):e81–e81.
Hou Q, Lu CZ, Cheng MM, Feng J. Conv2former: a simple transformer-style convnet for visual recognition. IEEE Trans Pattern Anal Mach Intell. 2024;46(12):8274–83.
Lin Z, Feng M, Santos CNd, Yu M, Xiang B, Zhou B, Bengio Y. A structured self-attentive sentence embedding. arXiv preprint arXiv:170303130. 2017.
Fu X, Yuan Y, Qiu H, Suo H, Song Y, Li A, Zhang Y, Xiao C, Li Y, Dou L, et al. AGF-ppis: a protein–protein interaction site predictor based on an attention mechanism and graph convolutional networks. Methods. 2024;222:142–51.
Li Y, Wei X, Yang Q, Xiong A, Li X, Zou Q, et al. Msbert-promoter: a multi-scale ensemble predictor based on BERT pre-trained model for the two-stage prediction of DNA promoters and their strengths. BMC Biol. 2024;22(1):126.
Acknowledgements
Not applicable.
Funding
This work was supported by the National Science and Technology Major Project (2022ZD0117700), the Natural Science Foundation of China (No. 62450002, 62472344, 62202081, and 32270786), the Sichuan Tianfu Emei Plan, the Shenzhen Science and Technology Program (Grant No. RCBS20231211090800004), the Zhejiang Provincial Natural Science Foundation of China (No. LD24F020004), and the Municipal Government of Quzhou (No. 2023D036).
Author information
Contributions
Y.W. and L.Y. conceived and designed the experiment. C.A. and M.N. performed the experiment. C.A. analyzed the results. C.A. and M.N. wrote and revised the manuscript. Q.Z., L.Y. and Y.W. approved the final version of the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
12915_2025_2372_MOESM2_ESM.docx
Additional file 2: Table S1. Hyperparameter search range. Table S2. Convolution kernel parameter optimization. Table S3. Computational efficiency metrics of YModPred.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Ao, C., Niu, M., Zou, Q. et al. YModPred: an interpretable prediction method for multi-type RNA modification sites in S. cerevisiae based on deep learning. BMC Biol 23, 272 (2025). https://doi.org/10.1186/s12915-025-02372-y
DOI: https://doi.org/10.1186/s12915-025-02372-y