Abstract
Protein malonylation is closely linked to many diseases, including diabetes and cancer, so accurate identification of malonylation sites is crucial for elucidating the molecular mechanisms underlying these diseases. Traditional experimental methods are costly, time-consuming, and technically demanding. With advances in artificial intelligence, computational prediction of protein post-translational modification sites has emerged as a vital complement to experimental approaches. In this paper, we present Catsoft_Kmalsite, a malonylation site prediction model whose core innovation is the integration of complementary information from protein three-dimensional structural features and sequence/physicochemical features, coupled with a soft voting ensemble of Bayesian-optimized base classifiers. Specifically, we use AlphaFold2 to obtain protein tertiary structure information and the CTDC, EAAC, and EGAAC methods to extract protein sequence and physicochemical features. Two base classifiers are then built with the CatBoost algorithm, one on each of these feature sets. After their parameters are fine-tuned by Bayesian optimization, the base classifiers are combined using a soft voting strategy. The ablation experiments show that Catsoft_Kmalsite has good robustness and generalization ability. For the six metrics AUC, ACC, Sen, Pre, F1, and MCC, the model achieved average values of 94.03%, 87.91%, 89.15%, 86.91%, 88.00%, and 0.7585, respectively, in fivefold cross-validation, and 95.18%, 89.55%, 90.87%, 88.79%, 89.82%, and 0.7912 on the independent test set. Catsoft_Kmalsite also outperformed other state-of-the-art methods on all evaluated metrics.
Furthermore, we have developed a web server for Catsoft_Kmalsite (http://1.94.102.146:8501/Catsoft_Kmalsite). The code and dataset of Catsoft_Kmalsite are available at https://github.com/flyinsky6/Catsoft_Kmalsite.
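The soft voting step and the evaluation metrics named in the abstract can be illustrated with a minimal sketch. It assumes each base classifier has already produced a positive-class probability per candidate lysine site; the probabilities, the equal weights, and the 0.5 decision threshold below are illustrative assumptions, not values from the paper, and the function names `soft_vote` and `binary_metrics` are hypothetical. AUC is omitted for brevity.

```python
import math


def soft_vote(prob_structure, prob_sequence, weights=(0.5, 0.5)):
    """Weighted average of the positive-class probabilities of two classifiers.

    prob_structure / prob_sequence stand in for the outputs of the
    structure-based and sequence-based base classifiers (assumed inputs).
    """
    w_a, w_b = weights
    total = w_a + w_b
    return [(w_a * a + w_b * b) / total
            for a, b in zip(prob_structure, prob_sequence)]


def binary_metrics(y_true, y_prob, threshold=0.5):
    """Compute ACC, Sen, Pre, F1, and MCC from thresholded probabilities."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    acc = (tp + tn) / len(y_true)
    sen = tp / (tp + fn) if (tp + fn) else 0.0   # sensitivity (recall)
    pre = tp / (tp + fp) if (tp + fp) else 0.0   # precision
    f1 = 2 * pre * sen / (pre + sen) if (pre + sen) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "Sen": sen, "Pre": pre, "F1": f1, "MCC": mcc}


# Toy example: six candidate sites with made-up labels and probabilities.
y_true = [1, 1, 1, 0, 0, 0]
prob_structure = [0.9, 0.4, 0.7, 0.2, 0.6, 0.1]
prob_sequence = [0.8, 0.7, 0.2, 0.3, 0.3, 0.2]

fused = soft_vote(prob_structure, prob_sequence)
print(binary_metrics(y_true, fused))
```

Averaging probabilities rather than hard labels lets a confident prediction from one feature view outweigh a marginal prediction from the other, which is the usual motivation for soft voting over majority voting.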
Data availability
The raw data were obtained from the CPLM database (https://cplm.biocuckoo.cn/index.php). The data generated during processing and the code for this paper are available at https://github.com/flyinsky6/Catsoft_Kmalsite.
Funding
This work was supported by the Jiangsu Training Program of Innovation and Entrepreneurship for Undergraduates (202410313030Z), the General Project of the Natural Science Foundation of Jiangsu Higher Education Institutions of China (22KJB520040), the Natural Science Research of Jiangsu Higher Education Institutions of China (22KJB310021), the Postdoctoral Science Foundation of Jiangsu Province (1701062B, 2017107001), and the Joint Project of Industry University Research of Jiangsu Province (BY20230198).
Author information
Contributions
LLX and YTQ contributed equally to the literature review and initial framework construction. JYY, XWX, ZQL, and YHW participated in data collection and pre-processing. EHL and XXK were responsible for the technical verification and algorithm optimization of the model. HWZ and YPL provided support in software development and data analysis tools. FW contributed industry-relevant insights and application-scenario suggestions. XL supervised the whole project, guided the research direction, and participated in the final paper revision and approval.
Ethics declarations
Conflict of interest
The authors declare no competing interests.
About this article
Cite this article
Xu, L., Qian, Y., Yang, J. et al. Enhancing the identification of malonylation sites using AlphaFold2 and ensemble learning. Mol Divers (2025). https://doi.org/10.1007/s11030-025-11357-6
DOI: https://doi.org/10.1007/s11030-025-11357-6