Advancing virulence factor prediction using protein language models
BMC Biology volume 23, Article number: 307 (2025)
Abstract
Background
Bacterial infections rank as the second leading cause of death globally, with virulence factors (VFs) being crucial to their pathogenicity. Accurate VF prediction can uncover mechanisms of bacterial diseases and suggest new treatments. Current machine learning (ML) methods face challenges such as outdated feature extraction, simplistic prediction frameworks, and a lack of differentiation between gram-positive (G +) and gram-negative (G −) bacteria.
Results
In this study, we introduced pLM4VF, a predictive framework that utilized ESM protein language models to extract VF characteristics of G + and G − bacteria separately and further integrated the models using the stacking strategy. Extensive benchmarking on the independent test demonstrated that pLM4VF outperformed state-of-the-art methods, improving accuracy by 0.088–0.320 and 0.063–0.307 for VF prediction of G + and G − bacteria, respectively. Biological validation through cytotoxicity and acute toxicity assays further corroborated the reliability of pLM4VF. Additionally, an online tool (https://compbiolab.hainanu.edu.cn) has been developed that enables researchers inexperienced in ML to obtain VFs of various bacteria at the whole-genome scale.
Conclusions
We believe that pLM4VF will offer substantial support in uncovering pathogenic mechanisms and developing novel antibacterial treatments and vaccines, thereby aiding in the prevention and management of bacterial diseases.
Background
Bacterial infections continue to be a significant public health issue faced by all countries, particularly developing countries. They pose a major threat to global health and rank as the second leading cause of death after ischemic heart disease. In 2019 alone, bacterial infections caused 13.7 million deaths [1]. Virulence factors (VFs) play a crucial role in the process of bacterial infection, facilitating invasion and colonization, evasion of the host immune response, acquisition of nutrients, and induction of host tissue damage and inflammation [2, 3]. Anti-virulence strategies targeting VFs offer effective approaches to treating bacterial infections: they can prevent bacterial pathogenesis without killing the bacteria, thereby reducing the selection pressure for resistance [4]. However, experimental identification of VFs is labor-intensive and time-consuming, so the number of known VFs remains far lower than their actual prevalence in pathogenic bacteria. Meanwhile, the advent of high-throughput sequencing technologies has led to the accumulation of vast amounts of genomic data from pathogenic bacteria [5]. It is therefore urgent to develop efficient and accurate methods for identifying VFs within these genomes, which can accelerate our understanding of the mechanisms underlying bacterial pathogenicity and aid in the treatment of bacterial infections.
Numerous machine learning (ML) methods have been proposed for VF prediction, such as SPAAN, VirulentPred, MP3, and DeepVF. SPAAN, an artificial neural network-based approach, focused on predicting adhesion factors, a specific type of VF [6]. It integrated five distinct features: amino acid composition (AAC), multiplet frequency, dipeptide composition (DPC), charge composition, and hydrophobic composition. VirulentPred, a support vector machine (SVM)-based model, was designed to predict a broader range of VFs [7]. In addition to AAC and DPC, VirulentPred used position-specific scoring matrix (PSSM) profiles generated through PSI-BLAST as part of its training features. MP3, another VF predictor, combined an SVM with a hidden Markov model (HMM); its training features included AAC, DPC, and Pfam domains [8]. DeepVF, a hybrid framework for VF prediction, integrated four ML algorithms and three deep learning (DL) algorithms [9], and its evaluation found that the DL algorithms were less effective than the traditional ML algorithms. The negative samples used in DeepVF were extracted from PBVF [10], which was constructed using a consistent and generalizable strategy for negative sample selection. While these methods have provided foundational insights, their inherent limitations impede practical applications in bacterial research and therapy. First, handcrafted feature representations often fail to capture complex functional patterns, leading to reduced sensitivity in identifying evolutionarily divergent VFs. This directly impacts pathogen surveillance, where false negatives may obscure emerging threats. Second, models reliant on static features exhibit poor cross-species generalizability due to training data bias, a shortcoming that can mislead therapeutic target identification in neglected bacterial pathogens. These constraints collectively hinder rapid response to outbreaks and rational drug design.
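As a concrete illustration of the two simplest descriptors named above, AAC and DPC can be computed directly from a protein sequence. The sketch below is a minimal, hypothetical implementation for illustration only, not the exact code used by SPAAN or VirulentPred; it assumes sequences contain only the 20 standard residues.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues

def aac(seq: str) -> list[float]:
    """Amino acid composition: frequency of each residue (20-dimensional)."""
    n = len(seq)
    return [seq.count(a) / n for a in AMINO_ACIDS]

def dpc(seq: str) -> list[float]:
    """Dipeptide composition: frequency of each ordered residue pair (400-dimensional)."""
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    counts = {p: 0 for p in pairs}
    for i in range(len(seq) - 1):
        counts[seq[i:i + 2]] += 1
    total = max(len(seq) - 1, 1)  # number of overlapping dipeptides
    return [counts[p] / total for p in pairs]

# Concatenating descriptors yields a fixed-length feature vector per protein
vec = aac("MKTAYIAK") + dpc("MKTAYIAK")  # 20 + 400 = 420 dimensions
```

Both descriptors discard residue order beyond adjacent pairs, which is precisely the limitation that contextual embeddings address later in the paper.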
Moreover, variations in cell wall structure and composition between gram-positive (G +) and gram-negative (G −) bacteria give rise to distinct characteristics in their VFs [11]. Regrettably, none of these methods have taken these differences into account when constructing predictive models.
Recently, inspired by the success of natural language processing, a series of protein language models (pLMs) have emerged. pLMs conceptualize the protein sequence as a language, with each amino acid regarded as a character. Compared to traditional descriptors, pLMs are pretrained on large-scale protein sequence datasets spanning the evolutionary tree of life to capture the intricate patterns and relationships within protein sequences [12]. Evolutionary scale modeling (ESM) models, including ESM-1 [13] and ESM-1b [14] released in 2019, as well as ESM-2 [12] in 2022, stand at the forefront of pLM advancements. They were developed upon the self-supervised learning method known as bidirectional encoder representations from transformers (BERT) and trained on the UniRef50 dataset. UniRef50, a clustered version of the UniProt Knowledgebase, groups protein sequences with ≥ 50% identity to reduce redundancy while maintaining evolutionary diversity [15]. This carefully curated dataset has become a gold standard for training large-scale pLMs due to its balanced representation of sequence space. Leveraging BERT’s bidirectional encoding capability and the advantages of transformer architecture, ESM pLMs can also acquire multi-scale representations of proteins encompassing biochemical properties, secondary and tertiary structures, and inherent functional patterns [14, 16]. It has been demonstrated that ESM pLMs deliver exceptional performance in various protein prediction tasks, such as identifying TCR-peptide-MHC binding sites [17], pinpointing protein variant effects [18] and predicting drug-target interaction [19]. However, ESM pLMs have not yet been applied to VF prediction.
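To make the embedding step concrete, the sketch below obtains per-residue ESM-2 representations and mean-pools them into one fixed-length protein vector, following the public interface of the `fair-esm` package. The model choice, layer index, and helper names are illustrative assumptions rather than the paper's exact pipeline, and the model download is large, so the heavy call is isolated in its own function.

```python
def mean_pool(residue_embeddings: list[list[float]]) -> list[float]:
    """Average per-residue vectors into a single fixed-length protein embedding."""
    n = len(residue_embeddings)
    dim = len(residue_embeddings[0])
    return [sum(row[j] for row in residue_embeddings) / n for j in range(dim)]

def embed_with_esm2(sequence: str) -> list[float]:
    """Mean-pooled final-layer ESM-2-650M representation of one protein.

    Requires the optional `fair-esm` package (and ~2.5 GB of weights),
    so the imports are deferred to keep the helper above dependency-free.
    """
    import torch
    import esm  # pip install fair-esm

    model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
    model.eval()
    batch_converter = alphabet.get_batch_converter()
    _, _, tokens = batch_converter([("query", sequence)])
    with torch.no_grad():
        out = model(tokens, repr_layers=[33])
    # Drop the BOS/EOS tokens; keep one 1280-dim vector per residue
    reps = out["representations"][33][0, 1:len(sequence) + 1]
    return mean_pool(reps.tolist())
```

The pooled vector can then be fed to any of the downstream classifiers described in the Results.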
In this study, we introduced pLM4VF for predicting VFs of G + and G − bacteria (Fig. 1). Results from tenfold cross-validation and independent tests demonstrated that ESM pLMs were more adept at capturing VF characteristics than traditional descriptors. Subsequently, based on the most effective ESM pLMs, ensemble models for G + and G − bacteria were separately trained using the stacking strategy. The top-performing ensemble models for the two types of bacteria collectively constituted pLM4VF, which significantly outperformed state-of-the-art methods; its reliability was further corroborated by cytotoxicity and acute toxicity assays. Notably, when using pLM4VF for cross-prediction, applying the G + bacteria model to predict VFs of G − bacteria and vice versa, model performance declined significantly, highlighting the necessity of constructing separate models for G + and G − bacteria. Additionally, a user-friendly online tool was developed to facilitate access to pLM4VF, enabling comprehensive VF exploration at a genome-wide scale.
Results
Superior predictive performance of ESM pLMs over traditional descriptors in predicting VFs
Protein embeddings enable ML models to extract meaningful features and patterns by converting the raw protein sequence into a numerical representation, which in turn significantly impacts the final model performance. To identify the most suitable sequence representation approaches for VF prediction, we conducted a comprehensive comparison of 8 ESM pLMs from ESM-1, ESM-1b, and ESM-2 (Additional file 1: Table S1), along with 11 traditional descriptors (Fig. 2). These descriptors encompassed four sequence-based, four evolutionary information-based, two physicochemical property-based, and the sequence similarity (SEQSIM) features. Additionally, six ML algorithms, including K-nearest neighbor (KNN), SVM, random forest (RF), eXtreme Gradient Boosting (XGBoost), Light Gradient Boosting Machine (LightGBM), and Categorical Boosting (CatBoost), were utilized to train models.
Performance comparison of traditional descriptors and ESM pLMs. Eleven traditional descriptors and eight evolutionary scale modeling (ESM) protein language models (pLMs) were individually trained using six machine learning algorithms. The accuracy (ACC) values of these models were evaluated and compared for A gram-positive and B gram-negative bacteria on the tenfold cross-validation test
The results of tenfold cross-validation revealed that the ESM pLMs consistently outperformed the traditional descriptors in predicting VFs for both G + and G − bacteria (Fig. 2; Additional file 1: Tables S2-S5). Among them, the esm2_t33_650M_UR50D pLM (hereafter abbreviated as ESM-2-650 M) exhibited superior suitability for VF prediction of G + bacteria, with the SVM model trained on this pLM outperforming all other models for G + bacteria (Fig. 2A). It achieved a sensitivity (SN) of 0.781, specificity (SP) of 0.801, accuracy (ACC) of 0.762, F1 score of 0.748, Matthew’s correlation coefficient (MCC) of 0.527, and AUC of 0.830 (Additional file 1: Table S2). Similarly, the esm1b_t33_650M_UR50S pLM (hereafter abbreviated as ESM-1b) showed remarkable performance in predicting VFs of G − bacteria (Fig. 2B), with SVM retaining its superiority and achieving an SN of 0.842, SP of 0.851, ACC of 0.822, F1 score of 0.816, MCC of 0.645, and AUC of 0.888 (Additional file 1: Table S3). For a more rigorous comparison, the 11 traditional descriptors were concatenated into a 2126-dimensional feature vector referred to as “EleTra.” Compared to individual traditional descriptors, “EleTra” notably enhanced the predictive performance of models, particularly when trained with the XGBoost algorithm (Additional file 1: Tables S4, S5). However, it still fell short of ESM-1b for VF prediction of G − bacteria.
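The metrics quoted here and throughout (SN, SP, ACC, F1, MCC) follow their standard definitions from the binary confusion matrix; a minimal sketch:

```python
import math

def vf_metrics(tp: int, tn: int, fp: int, fn: int) -> dict[str, float]:
    """Standard binary classification metrics from confusion-matrix counts."""
    sn = tp / (tp + fn)                          # sensitivity (recall on VFs)
    sp = tn / (tn + fp)                          # specificity
    acc = (tp + tn) / (tp + tn + fp + fn)        # overall accuracy
    precision = tp / (tp + fp)
    f1 = 2 * precision * sn / (precision + sn)   # harmonic mean of precision and SN
    mcc = ((tp * tn - fp * fn) /
           math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return {"SN": sn, "SP": sp, "ACC": acc, "F1": f1, "MCC": mcc}
```

For example, `vf_metrics(40, 45, 5, 10)` gives SN = 0.8, SP = 0.9, ACC = 0.85, and an MCC of about 0.70.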
Performance validation of ESM pLMs on the independent dataset
Although tenfold cross-validation is a common method for initially estimating model performance, it relies on the same dataset for training and validation. In contrast, the independent test dataset comprises unseen data, providing a more precise and unbiased performance evaluation. In this context, ESM-2-650 M and ESM-1b underwent further assessment and comparison with “EleTra” on the independent test, yielding results consistent with those obtained from the tenfold cross-validation. Among the six ML algorithms, SVM consistently demonstrated the best performance for both ESM pLMs, while XGBoost was found to be more compatible with “EleTra” (Fig. 3). A noticeable improvement of 0.065 in SN, 0.087 in SP, 0.036 in ACC, 0.026 in F1 score, and 0.074 in MCC was observed for the SVM model trained on ESM-2-650 M compared to the XGBoost model trained on “EleTra” for G + bacteria (Fig. 3A). Similar improvements were noted for G − bacteria, with the SVM model trained on ESM-1b showing enhancements of 0.046 in SN, 0.041 in SP, 0.042 in ACC, 0.043 in F1 score, and 0.113 in MCC compared to the XGBoost model trained on “EleTra” (Fig. 3B). These results indicated that ESM pLMs captured information that traditional descriptors may miss, such as the interdependencies between amino acids.
Performance comparison of the concatenated traditional descriptor with ESM-2-650 M and ESM-1b. The “EleTra” descriptor comprised a concatenation of 11 traditional descriptors. esm2_t33_650M_UR50D (ESM-2-650 M), esm1b_t33_650M_UR50S (ESM-1b), and “EleTra” were individually trained using six machine learning algorithms. Five metrics, ranging from 0 to 1, were employed to assess the performance of models for A gram-positive and B gram-negative bacteria on the independent test. SN, sensitivity; SP, specificity; ACC, accuracy; MCC, Matthew’s correlation coefficient
To delve deeper into the mechanisms underlying the enhanced predictive performance by ESM pLMs, we analyzed their internal processes of embeddings. We selected a complete protein sequence from G + bacteria and encoded each amino acid using ESM-2-650 M. Subsequently, the Pearson correlation coefficient was calculated for each pair of amino acid embedding vectors. The same analysis was conducted for G − bacteria, with the embedding method replaced by ESM-1b. The results revealed that the majority of amino acids exhibited strong correlations with nearby amino acids, particularly noticeable at both ends of protein sequences (Fig. 4). Moreover, even as the distance increased, the ESM pLMs demonstrated their capability to capture the association between amino acids, implying that ESM pLMs could balance local and long-term interdependencies.
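This correlation analysis can be outlined in code: encode every residue, then compute the Pearson correlation coefficient for each pair of residue embedding vectors. A minimal plain-Python sketch follows; the toy inputs stand in for the real 1280-dimensional ESM embeddings.

```python
import math

def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_map(residue_embeddings: list[list[float]]) -> list[list[float]]:
    """Pairwise Pearson r between all residue embedding vectors,
    i.e., the kind of heat map shown in Fig. 4."""
    n = len(residue_embeddings)
    return [[pearson(residue_embeddings[i], residue_embeddings[j])
             for j in range(n)] for i in range(n)]
```

High off-diagonal values between distant residues in such a map are what indicate long-range interdependencies captured by the pLM.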
Interpretability analysis of ESM-2-650 M and ESM-1b. Each amino acid in a complete protein sequence from A gram-positive bacteria was encoded using esm2_t33_650M_UR50D (ESM-2-650 M), with the embedding method replaced by esm1b_t33_650M_UR50S (ESM-1b) for B gram-negative bacteria. The Pearson correlation coefficient was calculated for each pair of amino acid embedding vectors, with values ranging from 0 to 1. A higher value indicates a stronger correlation between amino acids
After determining the most suitable embeddings for VF prediction of G + and G − bacteria, namely ESM-2-650 M and ESM-1b, we investigated the impact of sampling strategies on model performance. Initial experiments employed random under-sampling (RUS), which balances class distribution by randomly removing majority-class samples. While RUS is computationally efficient, it risks losing informative data. To mitigate this limitation, we tested the Synthetic Minority Over-sampling Technique (SMOTE), which generates synthetic minority-class samples through feature-space interpolation. However, SMOTE proved unsuitable for our independent testing datasets (Additional file 1: Tables S6, S7). For G + bacteria, SMOTE-based models (KNN, RF, SVM, CatBoost) exhibited perfect SN (1.000) but near-zero SP (0.000–0.020), while XGBoost and LightGBM showed marginally better but still inadequate performance (SN: 0.313 and SP: 0.720 for XGBoost; SN: 0.640 and SP: 0.453 for LightGBM) (Additional file 1: Table S6). Similarly, for G − bacteria, models like KNN/RF/SVM achieved perfect SP (0.990–1.000) but failed to detect VFs (SN: 0.000–0.073), with XGBoost/LightGBM/CatBoost performing poorly (SP: 0.150–0.436) (Additional file 1: Table S7). In contrast, RUS demonstrated a robust balance between SN and SP, achieving consistently high values for both metrics (Additional file 1: Tables S6, S7).
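RUS as described above can be sketched in a few lines; this is an illustrative stdlib implementation, not necessarily the one used in the study:

```python
import random

def random_under_sample(samples, labels, seed=42):
    """Balance a binary dataset by randomly discarding majority-class samples."""
    rng = random.Random(seed)           # fixed seed for reproducibility
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    majority, minority = (pos, neg) if len(pos) > len(neg) else (neg, pos)
    kept = rng.sample(majority, len(minority)) + minority
    rng.shuffle(kept)
    return [samples[i] for i in kept], [labels[i] for i in kept]
```

Because whole samples are dropped rather than synthesized, the resulting training set contains only real proteins, which is the trade-off (information loss versus no synthetic artifacts) weighed against SMOTE above.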
Furthermore, we extended our analysis to evaluate additional CD-HIT sequence clustering thresholds (0.6–0.8), building upon our initial results obtained at the 0.5 threshold. CD-HIT is a widely used bioinformatics tool for reducing sequence redundancy and improving downstream analyses [20]. Our results demonstrated that higher similarity thresholds (0.6–0.8) consistently led to imbalanced SN and SP metrics (Additional file 1: Tables S8, S9). Only at the more stringent 0.5 threshold did we observe optimal and balanced model performance across all evaluation criteria.
pLM4VF, constructed based on ESM pLMs, outperformed state-of-the-art methods
Given the optimal balance of performance metrics achieved by both the RUS approach and the CD-HIT 0.5 sequence clustering threshold, we adopted this combined methodology for base model development. The six ML algorithms (KNN, SVM, RF, XGBoost, LightGBM, and CatBoost) were applied to the most suitable embeddings, yielding six base models for each bacterial type. Among them, 3, 4, 5, or 6 base models were selected to generate 20 (C(6,3)), 15 (C(6,4)), 6 (C(6,5)), and 1 (C(6,6)) combinations of base models, respectively. The prediction results of each group of base models served as input data for constructing meta models, which were implemented using the logistic regression (LR), SVM, or XGBoost algorithm. This process yielded a total of 126 ensemble models (42 combinations multiplied by three meta models) for each bacterial type (Additional file 1: Tables S10, S11).
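The count of 126 ensemble configurations per bacterial type follows directly from enumerating base-model subsets and meta-learners:

```python
from itertools import combinations
from math import comb

BASE = ["KNN", "SVM", "RF", "XGBoost", "LightGBM", "CatBoost"]
META = ["LR", "SVM", "XGBoost"]

# Every subset of 3-6 base models, crossed with every meta model
configs = [(subset, meta)
           for k in range(3, 7)
           for subset in combinations(BASE, k)
           for meta in META]

# C(6,3) + C(6,4) + C(6,5) + C(6,6) = 20 + 15 + 6 + 1 = 42 subsets,
# times 3 meta models = 126 ensemble models
assert len(configs) == sum(comb(6, k) for k in range(3, 7)) * len(META) == 126
```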
Among these ensemble models, the one composed of the KNN, SVM, and RF base models along with the LR meta model demonstrated the most impressive predictive performance for G + bacteria, achieving an SN of 0.802, SP of 0.800, ACC of 0.803, F1 score of 0.804, MCC of 0.607, AUC of 0.807, and AUPRC of 0.795 on the independent test (Table 1 and Fig. 5A). Similarly, for G − bacteria, the ensemble model comprising all six base models and the SVM meta model exhibited outstanding predictive capability, attaining an SN of 0.881, SP of 0.895, ACC of 0.833, F1 score of 0.822, MCC of 0.671, AUC of 0.866, and AUPRC of 0.853 (Table 1 and Fig. 5B). These two top-performing ensemble models collectively formed the VF prediction framework termed “pLM4VF.”
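Assuming feature vectors have already been extracted, the winning G + configuration (KNN, SVM, and RF base models under an LR meta-model) can be sketched with scikit-learn's stacking API; the synthetic data below stands in for real ESM embeddings, so this is an illustrative skeleton rather than the study's actual training code.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Toy stand-in for mean-pooled pLM embeddings (real vectors are 1280-dim)
X, y = make_classification(n_samples=400, n_features=64, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# KNN + SVM + RF base models, logistic regression meta-model
stack = StackingClassifier(
    estimators=[("knn", KNeighborsClassifier()),
                ("svm", SVC(probability=True, random_state=0)),
                ("rf", RandomForestClassifier(random_state=0))],
    final_estimator=LogisticRegression(),
    stack_method="predict_proba",  # base-model probabilities feed the meta-model
)
stack.fit(X_tr, y_tr)
acc = stack.score(X_te, y_te)
```

`StackingClassifier` handles the cross-validated generation of base-model predictions internally, which is the essence of the two-level stacking strategy described above.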
When pLM4VF was employed for cross-prediction, namely using the G + bacteria model to predict VFs of G − bacteria, the values of all five performance metrics declined significantly, with SN dropping from 0.802 to 0.353, SP from 0.800 to 0.573, ACC from 0.803 to 0.463, F1 score from 0.804 to 0.397, and MCC from 0.607 to 0 (Additional file 1: Table S12). Similar outcomes were observed when the G − bacteria model predicted VFs of G + bacteria: SN decreased from 0.881 to 0.682, SP from 0.895 to 0.672, ACC from 0.833 to 0.677, F1 score from 0.822 to 0.679, and MCC from 0.671 to 0.355 (Additional file 1: Table S12). These results underscored the distinct characteristics of the two types of VFs, affirming the necessity of modeling them separately.
To assess the effectiveness of pLM4VF, we compared its predictive capability against three online VF prediction tools, namely MP3 (http://metagenomics.iiserb.ac.in/mp3/index.php), VirulentPred (https://bioinfo.icgeb.res.in/virulent/), and DeepVF (http://deepvf.erc.monash.edu/), on the independent test. Since both pLM4VF and DeepVF utilized the same negative sample dataset, the independent test dataset was adjusted to ensure a fair comparison by excluding the data used to train DeepVF. The results showed that pLM4VF surpassed VirulentPred, MP3, and DeepVF (Tables 1 and 2). In comparison to DeepVF, the most effective of the three, pLM4VF exhibited notable improvements in SN, ACC, F1 score, and MCC of 0.259, 0.088, 0.066, and 0.109, respectively, for predicting VFs of G + bacteria (Table 2). Similar results were observed in predicting VFs of G − bacteria: pLM4VF demonstrated significant enhancements in SN, SP, ACC, F1 score, and MCC, with values 0.219, 0.139, 0.063, 0.043, and 0.153 higher than those of DeepVF, respectively (Table 2). To facilitate wider accessibility, pLM4VF has been deployed as a user-friendly online tool, accessible at https://compbiolab.hainanu.edu.cn. This platform provides researchers and practitioners with a robust tool for bacterial VF prediction.
Biological validation experiments substantiated the predictive capacity of pLM4VF
pLM4VF was further applied to comprehensively investigate the VFs at the whole-genome level of Aeromonas veronii C4. A. veronii C4 is prevalent in aquatic environments and has the potential to infect humans, terrestrial animals, and aquatic animals, causing a variety of diseases and, in severe cases, fatalities. Its genome was previously sequenced in our study [21]. The top 100 VFs were identified based on the pLM4VF prediction scores, followed by functional annotation. The results indicated that these VFs were primarily involved in “transport,” “phosphorelay signal transduction system,” and “regulation of transcription, DNA-templated” (Additional file 1: Table S13). Furthermore, sequence alignments were performed between the proteins of A. veronii C4 and known VFs. A total of 1532 A. veronii C4 proteins exhibited at least 30% sequence identity with known VFs. Among these, 1250 proteins were predicted as VFs by pLM4VF, giving a recall rate of 81.59%. These findings provided additional evidence of the effectiveness of pLM4VF in accurately predicting VFs and contributed to our understanding of the VFs within A. veronii C4.
pLM4VF predicted the proteins AcrA, AcrB, SmpB, and BvgS as VFs, whereas MP3 and DeepVF identified only AcrA, AcrB, and BvgS, and VirulentPred recognized only SmpB and BvgS (Additional file 1: Table S14). To validate the virulence-associated roles of the four proteins, the corresponding genes were individually knocked out from the wild-type (WT) strain of A. veronii C4. The knockout strains were then used to infect mouse macrophages (RAW 264.7). In comparison to the WT strain, mouse macrophages exhibited a significantly elevated survival rate (P < 0.05) upon infection with the individual strains ΔacrA, ΔacrB, ΔbvgS, or ΔsmpB (Fig. 6A). This finding strongly suggested that the virulence of A. veronii C4 was attenuated following the deletion of acrA, acrB, smpB, or bvgS, thereby validating their roles as VFs, consistent with the predictions made by pLM4VF. The knockout of acrB had the most substantial effect on the survival rate of mouse macrophages (P < 0.01), highlighting the pivotal role of AcrB in the virulence of A. veronii C4. Although both MP3 and DeepVF identified AcrB as a potential VF, the scores they assigned to AcrB hovered close to their decision thresholds (Additional file 1: Table S14).
Virulence validation of AcrA, AcrB, BvgS, and SmpB. A Cell viability analysis of mouse macrophages infected with the wild-type (WT), ΔacrA, ΔacrB, ΔbvgS, or ΔsmpB strains (n = 5). B Quantification of colony-forming units (CFU) of WT and ΔacrB strains isolated from organs of infected mice (n = 5). Data in A and B are shown as mean ± standard deviation. Statistical significance was determined by t-test: *P < 0.05, **P < 0.01, and ***P < 0.001; “ns” indicates no significant difference. Histological examination of pathological alterations in the C kidney and D spleen of mice infected with the WT strain, the ΔacrB strain, or PBS at a magnification of 40×. Scale bars are shown with indicated lengths (2000 μm and 50 μm)
To delineate the specific role of AcrB, both the WT and ΔacrB strains of A. veronii C4 were injected intraperitoneally into mice. At 8 h post-injection, the survival rate of mice was 100%. Subsequently, internal organs were harvested, and colony numbers were counted. Compared to the WT strain, the ΔacrB strain demonstrated reduced colonization ability in all tissues, with particularly notable decreases observed in the liver, spleen, and kidney (Fig. 6B). Additionally, pathological changes of the kidney, liver, and spleen were assessed through hematoxylin and eosin (HE) staining. Kidneys infected with the WT strain displayed atrophy of glomeruli and tubules, rupture of glomerular cells, apoptosis, and pronounced hemorrhaging (Fig. 6C). In contrast, no discernible alterations were observed in glomerular and tubular morphology following infection with the ΔacrB strain. These findings indicated that the virulence of A. veronii C4 was compromised in the absence of AcrB. Examination of the spleen provided further insights. Spleen cells infected with the WT strain exhibited visible enlargement, with larger nuclei, resulting in widened interstitial spaces and a fluffy appearance, indicative of an ongoing immune response within the spleen (Fig. 6D). Conversely, spleen cells infected with the ΔacrB strain showed a denser cell arrangement and narrower interstitial spaces, closely resembling the pattern observed in uninfected spleen cells. Taken together, these results suggested that the virulence of A. veronii C4 in the mouse model decreased following acrB knockout, underscoring the influence of AcrB on virulence. Therefore, these findings offered further validation of pLM4VF’s performance in predicting VFs.
Discussion
Bacterial infections have emerged as a significant global healthcare concern, imposing a substantial economic burden [22]. The annual treatment cost for several bacterial infections reached US $4.6 billion in the USA [23]. An efficacious approach for disarming pathogenic bacteria is the anti-virulence strategy of blocking VFs [4]. However, the limited number of identified VFs severely impedes the development of drugs and vaccines. Additionally, existing methods for VF prediction suffer from various shortcomings, restricting the exploration of VFs. Therefore, in this study, we present pLM4VF, a meticulously designed predictive framework for VFs based on ESM pLMs. Compared to existing methods, the novelty of our study lies in several aspects. Firstly, separate models for G + and G − bacteria were constructed, tailored to accommodate the unique characteristics of their VFs. Secondly, a thorough and comprehensive comparison was conducted among ESM pLMs and traditional descriptors across six different ML algorithms. Thirdly, the stacking strategy was employed to train ensemble models based on the optimal ESM pLMs for VFs of G + and G − bacteria: six base models were combined in various forms and further integrated with one of three meta models, generating 126 ensemble models for each type of bacteria. Among them, the top-performing ensemble models for G + and G − bacteria formed pLM4VF. Finally, in addition to the independent test datasets, the reliability of pLM4VF was evaluated using cytotoxicity and acute toxicity assays.
Although the separation of G + and G − datasets resulted in reduced sample sizes for model training, which could theoretically impact prediction reliability, this limitation is substantially mitigated through our implementation of pretrained ESM models. These advanced models were trained on comprehensive protein sequence databases, encompassing millions of protein sequences across the evolutionary spectrum. This extensive pretraining enables the models to extract rich semantic and structural features, even from smaller datasets, thereby enhancing their ability to identify VFs effectively. The decision to use traditional ML methods instead of DL approaches reflects careful consideration of our data constraints. While DL has revolutionized the field of bioinformatics [24, 25], its application to VF prediction presents challenges in small-data regimes, as highlighted by DeepVF’s systematic evaluation [9]. Their findings indicate that DL models perform worse than traditional ML methods when trained on limited VF datasets. Based on these findings and considering our small-sized dataset, we strategically selected six well-established traditional ML algorithms spanning three complementary paradigms: (1) distance-based learning (KNN), (2) margin optimization (SVM), and (3) ensemble methods including both bagging (RF) and modern gradient boosting implementations (XGBoost, LightGBM, and CatBoost). The inherent heterogeneity in algorithmic principles and characteristics of these models enhances ensemble diversity, thereby improving overall predictive performance and stability.
Traditional descriptors typically face various challenges. For instance, AAC and DPC lack sequential information, while PSSM introduces additional noise and uncertainty due to the need for protein standardization to the same length during PSSM profile generation. Furthermore, although the concatenated descriptor “EleTra” potentially encompasses more comprehensive information compared to individual descriptors, it may lead to feature redundancy and the curse of dimensionality. Additionally, its fundamental reliance on shallow, hand-engineered representations constrained its ability to capture subtle but biologically critical patterns. In contrast, ESM pLMs can handle proteins of varying lengths without introducing irrelevant information, achieved by averaging the representation of each amino acid. They stand out in generating global contextual embeddings by considering intricate local and long-term interdependencies between amino acids, as demonstrated in both our study and Du et al.’s work [26]. This may explain why SVM is more compatible with ESM pLMs compared to decision tree algorithms such as RF, XGBoost, LightGBM, and CatBoost. SVM could effectively leverage these interdependencies by taking the entire protein sequence into account during training. It excels in high-dimensional pattern recognition by utilizing kernel functions to map feature vectors into high-dimensional space. This transformation resolves nonlinearly separable problems, facilitating the identification of the maximum margin separating hyperplane [27, 28]. ESM-2-650 M and ESM-1b exhibited exceptional capability in capturing VF characteristics. For the G + dataset (smaller sample size), ESM-2-650 M’s enhanced architecture (650 M parameters, 33 layers, and 1280-dimensional embeddings) provides greater modeling capacity to capture subtle patterns despite limited training examples.
Conversely, for the larger G − dataset, ESM-1b’s proven performance provides sufficient modeling power while maintaining computational efficiency. This stratified approach optimally leverages each model’s strengths, creating a balanced solution for comprehensive VF prediction across bacterial classes.
Our analysis revealed strong correlations between adjacent amino acids in the embedding space, with particularly pronounced effects at both termini of protein sequences. These findings align remarkably well with known principles of protein secondary structure formation, such as α-helices and β-sheets, where adjacent residues maintain specific dihedral angles [29]. The observed N-terminal correlations may reflect co-translational folding constraints, as the emerging polypeptide must adopt stable configurations during synthesis to prevent misfolding [30]. C-terminal correlations may reflect structural constraints for proper folding termination, such as β-strand pairing in β-barrels or salt bridges stabilizing the final folded state [31]. Systematic evaluation of 126 ensemble models for G + bacteria revealed that the optimal combination comprised the KNN, SVM, and RF base models with the LR meta-model. This configuration achieved superior performance by integrating the complementary strengths of each component. Specifically, KNN captured local sequence similarity patterns, SVM effectively handled the high-dimensional feature spaces using kernel-based nonlinear separation, while RF enhanced robustness through ensemble averaging of decorrelated decision trees. The LR meta-model optimally weighted these diverse predictions. Analogously, for G − bacteria, the inclusion of all six base models maximized feature space coverage, while an SVM meta-model further improved performance by capturing complex nonlinear interactions between base model outputs through its kernel transformation. Notably, the stacking ensemble’s superior performance comes with inherent interpretability limitations. The architecture’s heterogeneity (combining non-differentiable base classifiers with a meta-learner) precludes direct application of Shapley Additive exPlanations (SHAP) analysis.
Furthermore, the ESM-derived embeddings that form our input features, while biologically informative, lack explicit physicochemical interpretability, a recognized trade-off in modern protein representation learning.
We used pLM4VF to perform comprehensive VF prediction across the entire genome of A. veronii C4. The enrichment of predicted VFs in “transport”, “phosphorelay signal transduction system”, and “regulation of transcription, DNA-templated” aligns with established virulence mechanisms in pathogens. Transport can facilitate the uptake of nutrients, the efflux of antimicrobial compounds, or the secretion of virulence-associated molecules, all of which are crucial for survival and proliferation within the host [32]. Phosphorelay signal transduction systems serve as master regulators of virulence through a phosphorylation cascade that translates environmental cues into coordinated gene expression, directly influencing pathogenic behaviors [33]. Transcriptional regulation coordinates the expression of virulence genes in a temporally and spatially controlled manner, allowing the pathogen to fine-tune its infection strategy [34]. These findings not only demonstrate pLM4VF’s predictive performance but also highlight the complex regulatory mechanisms that underlie bacterial virulence.

Notably, pLM4VF successfully identified all four experimentally validated VFs, AcrA, AcrB, SmpB, and BvgS, whereas existing tools detected only partial subsets. This superior performance may stem from three synergistic factors: (1) the pLM embeddings, which enable more comprehensive capture of evolutionary and structural patterns compared to traditional descriptors; (2) the bacterial type-specific architecture, which more effectively recognizes the distinct virulence signatures of G + and G − bacteria; and (3) the two-level stacking architecture, which combines multiple complementary base models to maximize predictive performance. As more VFs are experimentally identified, the performance of pLM4VF will be further improved when trained on larger datasets.
In summary, pLM4VF emerges as a powerful tool for VF prediction, offering valuable insights into bacterial pathogenesis mechanisms and aiding in the development of novel antibacterial targets.
Conclusions
In this study, we present pLM4VF, an advanced predictive framework that integrates ESM pLMs with ensemble learning to accurately predict VFs in both G + and G − bacteria. Benchmarking experiments demonstrated that pLM4VF outperformed existing state-of-the-art methods. The superior predictive capability was further validated through cytotoxicity and acute toxicity assays. To maximize its utility, we have developed an online platform, facilitating seamless genome-wide VF analysis for researchers. We anticipate that pLM4VF will serve as a powerful tool to decipher bacterial pathogenicity, accelerate the discovery of novel antimicrobial agents and vaccines, and ultimately contribute to global efforts in combating bacterial infections.
Methods
Data preparation
A total of 1134 VFs from 16 G + bacterial species (Additional file 1: Table S15) and 3286 VFs from 39 G − bacterial species (Additional file 1: Table S16) were obtained as positive sample datasets, retrieved from the Victors database in July 2021 [3]. Analysis of VF distribution revealed that G + species exhibited 1–408 VFs per species (median = 19.5) (Additional file 1: Table S15), and G − species showed 1–559 VFs per species (median = 17) (Additional file 1: Table S16). The substantial interspecies variation reflects research bias in the literature, with clinically prominent pathogens (e.g., Streptococcus pneumoniae, Escherichia coli) being overrepresented in current VF databases. This bias underscores the importance of our computational approach for identifying potential VFs in less-studied species. For negative sample datasets, 2215 non-VFs from G + bacteria and 2695 non-VFs from G − bacteria were extracted from PBVF [10]. CD-HIT is a widely used bioinformatics tool for sequence clustering and redundancy reduction; it compares protein or nucleotide sequences and groups them into clusters based on user-defined identity thresholds [20]. In our study, CD-HIT was applied separately to the positive and negative sample datasets with a 0.5 identity threshold to ensure diversity while preserving functional relationships. To address data imbalance, an equal number of positive and negative samples were randomly selected through the RUS method. The balanced dataset was then partitioned into training and independent test sets at an 8:2 ratio. The independent test datasets were excluded from all training processes. Maximum sequence identity between the training and independent test datasets was kept at ≤ 50%, with the majority of sequence pairs (84.94% for G + bacteria and 83.47% for G − bacteria) exhibiting ≤ 40% identity (Additional file 1: Table S17).
For G + bacteria, the training dataset encompasses 598 VFs and 598 non-VFs, while the independent test dataset comprises 150 VFs and 150 non-VFs. Similarly, in the case of G − bacteria, the training dataset consists of 1244 VFs and 1244 non-VFs, with the independent test dataset including 311 VFs and 311 non-VFs.
Feature extraction and normalization
Feature extraction using ESM pLMs
To systematically evaluate the performance of different pLMs in VF prediction, we compared multiple ESM model variants (ESM-1, ESM-1b, and ESM-2). These models represent an evolutionary progression in pLMs, with each iteration demonstrating significant improvements in both architectural design and training methodology. The original ESM-1 model establishes the baseline transformer architecture trained on UniRef50 sequences, while its successor ESM-1b enhances representational capacity through increased model depth and broader training. The subsequently developed ESM-2 incorporates several key innovations, including optimized attention mechanisms, an expanded training corpus, and architectural refinements. ESM-1, ESM-1b, and ESM-2 comprise five, one, and six pLMs, respectively, ranging from 48 layers with 15 billion parameters (5120-dimensional output embeddings) to 6 layers with 8 million parameters (320-dimensional output embeddings). To keep memory consumption manageable, 8 of the 12 pLMs (Additional file 1: Table S1) were employed to extract VF features for both G + and G − bacteria. The final hidden state of each pLM represents the output embedding matrix (n × dim) for the input sequence, where n denotes the sequence length and dim is determined by the specific pLM; for example, the dim of ESM-1b is 1280. A 1 × dim feature vector is generated for each input sequence through average pooling, standardizing sequences of varying lengths to a fixed-length representation that can then be input into the model for VF prediction.
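A minimal sketch of this pooling step, assuming the per-residue embedding matrix has already been extracted from an ESM model; `mean_pool_embedding` is an illustrative helper, and the random matrices stand in for real embeddings:

```python
import numpy as np

def mean_pool_embedding(embedding: np.ndarray) -> np.ndarray:
    """Average-pool a per-residue embedding matrix (n x dim) into a
    fixed-length dim-vector, so sequences of different lengths yield
    features of identical dimensionality."""
    if embedding.ndim != 2:
        raise ValueError("expected an (n, dim) matrix")
    return embedding.mean(axis=0)

# Two sequences of different lengths map to vectors of the same size.
seq_a = np.random.rand(120, 1280)   # 120 residues, ESM-1b dim = 1280
seq_b = np.random.rand(305, 1280)
assert mean_pool_embedding(seq_a).shape == mean_pool_embedding(seq_b).shape == (1280,)
```

The pooled vectors can then be stacked row-wise into the feature matrix passed to the downstream classifiers.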
Feature extraction using traditional encoding methods
Group 1: Sequence-based features
AAC
AAC represents the frequency distribution of each of the 20 natural amino acids (namely, A, C, …, Y) within a given protein sequence [35] and is denoted by \({\text{f}}_{{\text{a}}_{\text{i}}}\):

$$ {\text{f}}_{{\text{a}}_{\text{i}}}=\frac{{\text{a}}_{\text{i}}}{\text{L}},\quad \text{i}=1,2,\ldots,20 $$

In this context, “L” signifies the length of the protein sequence, and “\({\text{a}}_{\text{i}}\)” represents the number of occurrences of the i-th amino acid out of the 20 natural amino acids present within the given protein sequence. Each protein sequence is encoded as a 20-dimensional feature vector.
DPC
DPC denotes the frequency distribution of each of the 400 dipeptides (namely, AA, AC, …, YY) within a given protein sequence [36]. It is represented by the following notation:

$$ {\text{f}}_{{\text{p}}_{\text{i}}}=\frac{{\text{p}}_{\text{i}}}{\text{L}-1},\quad \text{i}=1,2,\ldots,400 $$

In this context, “L − 1” corresponds to the number of dipeptides in the protein sequence, and “\({\text{p}}_{\text{i}}\)” signifies the number of occurrences of the i-th dipeptide out of the total 400 dipeptides in the protein sequence. Each protein sequence is encoded as a 400-dimensional feature vector.
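The two compositions above can be computed in a few lines of plain Python; `aac` and `dpc` are illustrative helper names, not functions from the study's codebase:

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]

def aac(seq: str) -> list:
    """20-dimensional amino acid composition: count of each residue
    divided by the sequence length L."""
    L = len(seq)
    return [seq.count(a) / L for a in AMINO_ACIDS]

def dpc(seq: str) -> list:
    """400-dimensional dipeptide composition: count of each dipeptide
    divided by the number of dipeptides, L - 1."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    return [pairs.count(d) / (len(seq) - 1) for d in DIPEPTIDES]

vec = aac("MKKLLPTAA")
assert len(vec) == 20 and abs(sum(vec) - 1.0) < 1e-9
assert len(dpc("MKKLLPTAA")) == 400
```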
PseAAC
Pseudo-amino acid composition (PseAAC) serves as an extension of AAC that incorporates sequence-order information [37]. This feature representation is depicted as a (20 + λ)-dimensional feature vector. The initial 20 dimensions account for the effect of AAC, while the remaining λ dimensions reflect the influence of sequence order. The feature vector, denoted as X and encoded using PseAAC, is expressed as follows:
Xu is calculated as follows:
In this equation, “L” represents the length of the protein sequence, “fu” denotes the frequency of the u-th amino acid out of the 20 natural amino acids within the protein sequence, “θj” stands for the j-th sequence correlation factor, and “ω” is the weight factor. In this study, we set ω = 0.05 and λ = 20. Each protein sequence is encoded as a 40 (20 + λ)-dimensional feature vector.
QSO
Quasi-sequence order (QSO) profiles the sequence order information, building upon the basis of AAC [38]. In the context of QSO, the Schneider-Wrede and Grantham distance matrices are employed to calculate the pairwise distance between amino acids, as outlined below:
The “L” signifies the length of the protein sequence, where “maxlag” represents the maximum lag width and is set at a value of 10. “fr” denotes the frequency of the r-th amino acid out of the 20 natural amino acids within the protein sequence, while “ω” represents the weight factor, with a default value of 0.1. Additionally, “\({\text{dist}}_{\text{j},\text{j}+\text{d}}\)” indicates the distance between the j-th and (j + d)-th amino acids within the protein sequence, computed based on the Schneider-Wrede or Grantham distance matrix. As a result, each protein sequence is encoded as a 60 (20 × 2 + maxlag × 2)-dimensional feature vector.
Group 2: Physicochemical property-based features
CTriad
Conjoint triad (CTriad) considers the properties of an individual amino acid and its neighboring amino acids within a protein sequence, grouping three consecutive amino acids as a single unit [39]. In this approach, the 20 natural amino acids are clustered into 7 distinct classes, primarily based on the dipoles and volumes of their side chains. This classification results in the formation of a 343 (7 × 7 × 7)-dimensional feature vector.
The “fi” denotes the frequency of the i-th tripeptide out of the total 343 tripeptides within the protein sequence.
CTDT
Composition, transition, and distribution-transition (CTDT) is a component of the CTD descriptor [40]. Within CTDT, each of the seven physicochemical properties divides the 20 natural amino acids into 3 distinct classes. As a result, CTDT is formulated as a 21 (7 × 3)-dimensional feature vector.
The “L” represents the length of the protein sequence, while “nAB” signifies the count of dipeptides formed by amino acids belonging to Class A and Class B.
Group 3: Evolutionary information-based features
Each row within the PSSM corresponds to a specific position in the protein sequence, with each column representing 1 of the 20 natural amino acids [36]. Consequently, a protein sequence with a length of L generates an L × 20 PSSM. The PSI-BLAST searches against the UniRef50 database were conducted with the parameters num_iterations = 3 and e-value = 0.001. In this study, four distinct forms of PSSM transformation were employed, including PSSM_composition, RPM_PSSM, S_FPSSM, and PsePSSM.
PSSM_composition
PSSM_composition involves a row-wise transformation of the PSSM. In this transformation, the row vectors associated with the same amino acid are summed and subsequently averaged [36].
Subject to
Here, “Ri” represents the i-th row of the PSSM_composition, “rk” denotes the k-th row of the PSSM, “pk” signifies the k-th amino acid in the protein sequence, “L” is the length of the protein sequence, and “ai” stands for the i-th amino acid out of the 20 natural amino acids. To create the feature representation, each protein sequence is encoded as a 400-dimensional feature vector, achieved by concatenating the rows of the 20 × 20 matrix in a sequential manner.
RPM_PSSM
To generate the PPSSM, negative values within the PSSM are substituted with 0. RPM_PSSM, based on the PPSSM, is calculated using a formula akin to that used for PSSM_composition [41].
Subject to
In this context, “Ri′” represents the i-th row of the RPM_PSSM, “rk′” stands for the k-th row of the PPSSM, “pk′” denotes the k-th amino acid in the protein sequence, “L” is the length of the protein sequence, and “ai” signifies the i-th amino acid out of the 20 natural amino acids. As a result, each protein sequence is encoded as a 400-dimensional feature vector.
S_FPSSM
To generate FPSSM, values exceeding a threshold, denoted as n, within the PPSSM are substituted with n. In this study, n was set to 7. Based on the FPSSM, S_FPSSM is calculated as follows [42]:
Subject to
In this equation, “Sj(i)” signifies the element located in the i-th row and j-th column of the S_FPSSM, “fpk,j” represents the element within the k-th row and j-th column of the FPSSM, “pk” denotes the k-th amino acid in the protein sequence, “L” corresponds to the length of the protein sequence, and “ai” stands for the i-th amino acid out of the 20 natural amino acids. Consequently, each protein sequence is encoded as a 400-dimensional feature vector.
PsePSSM
PsePSSM integrates the principles of both PseAAC and PSSM [43]. To begin, it is necessary to normalize the original PSSM.
The “Ei,j” denotes the element positioned in the i-th row and j-th column of the normalized PSSM, while “\({\text{P}}_{\text{i,j}}\)” signifies the element in the i-th row and j-th column of the original PSSM. Additionally, the calculation of PsePSSM is as follows:
The “L” represents the length of the protein sequence, with the constant value of “α” set to 1. Each protein sequence is consequently encoded as a 40-dimensional feature vector.
Group 4: SEQSIM feature
For a given protein sequence, we performed pairwise alignments against all sequences in the positive and negative training datasets using BLAST v2.12.0+. Each alignment generated a bit score; self-alignments were excluded. The highest bit scores against the positive and negative datasets were extracted and combined into a two-dimensional feature vector (“Posbit” and “Negbit”), forming the SEQSIM feature [9].
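A minimal sketch of assembling the SEQSIM feature, assuming the BLAST hits have already been parsed into (subject ID, bit score) pairs; `seqsim_feature` and the sequence IDs are hypothetical:

```python
def seqsim_feature(query_id, hits, positive_ids, negative_ids):
    """Build the two-dimensional SEQSIM feature (Posbit, Negbit): the
    highest BLAST bit score of the query against the positive and the
    negative training set, excluding the self-alignment.
    `hits` is a list of (subject_id, bit_score) pairs."""
    pos_bit = max((s for sid, s in hits
                   if sid in positive_ids and sid != query_id), default=0.0)
    neg_bit = max((s for sid, s in hits
                   if sid in negative_ids and sid != query_id), default=0.0)
    return [pos_bit, neg_bit]

hits = [("vf_001", 512.0), ("vf_017", 230.5), ("nonvf_003", 88.2)]
feat = seqsim_feature("vf_001", hits, {"vf_001", "vf_017"}, {"nonvf_003"})
assert feat == [230.5, 88.2]   # the self-hit to vf_001 is excluded
```

Queries with no hit against one of the sets fall back to a score of 0.0, an assumption not stated in the text.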
Feature normalization
Given the disparity in dimensionality among the original features, it is imperative to perform feature normalization to avoid potential biases in prediction outcomes. In this study, the Z-score normalization method was used to rescale the feature values, thereby achieving a standard normal distribution with a mean of 0 and a standard deviation of 1 [44]. The Z-score normalization is calculated as follows:

$$ \text{Z}=\frac{\text{X}-\upmu}{\upsigma} $$

The “Z” represents the normalized feature value, “X” stands for the original feature value, “μ” denotes the mean of the feature values, and “σ” signifies the standard deviation of the feature values.
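A short sketch of column-wise Z-score normalization as described above; the guard against zero-variance features is an added assumption, not stated in the text:

```python
import numpy as np

def z_score_normalize(X: np.ndarray) -> np.ndarray:
    """Column-wise Z-score normalization: Z = (X - mu) / sigma, giving
    each feature a mean of 0 and a standard deviation of 1."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0          # guard against constant features
    return (X - mu) / sigma

X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
Z = z_score_normalize(X)
assert np.allclose(Z.mean(axis=0), 0.0)
assert np.allclose(Z.std(axis=0), 1.0)
```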
Model training and optimization
Base model
In this study, six classical ML algorithms were utilized to build base models, including KNN [45], SVM [28], RF [46], XGBoost [47], LightGBM [48], and CatBoost [49]. The first five algorithms were implemented with the scikit-learn library in Python, while CatBoost was implemented using the CatBoost library in Python.
Hyperparameter optimization
For the KNN algorithm, hyperparameter tuning was performed using tenfold cross-validated grid search, where the number of neighbors (n_neighbors) was varied from 1 to 49 (approximately covering the range from 1 to the square root of the number of training samples) and the weighting strategy (weights) was set to either “uniform” or “distance.” In the case of SVM with the radial basis function (RBF) kernel, the hyperparameters C and γ were fine-tuned using tenfold cross-validated grid search; both parameters were explored over 21 logarithmically spaced values between \(2^{-10}\) and \(2^{10}\) on a base-2 log scale. Similarly, for RF, grid search with tenfold cross-validation was performed to optimize the number of decision trees (n_estimators), the maximum depth of each tree (max_depth), and the minimum number of samples required at a leaf node (min_samples_leaf). The search space included n_estimators ranging from 100 to 900 (in increments of 100), max_depth from 1 to 49 (step size 1), and min_samples_leaf from 1 to 20 (step size 1).

For XGBoost, hyperparameter optimization was likewise conducted using tenfold cross-validated grid search. The search space included learning_rate ranging from 0.01 to 0.2 in increments of 0.01, n_estimators ranging from 100 to 900 in increments of 100, max_depth and min_child_weight ranging from 1 to 9, and γ ranging from 0.0 to 0.5 in increments of 0.1. Other parameters, such as the booster type (gbtree), objective function (binary:logistic), number of parallel jobs, and random seed, were fixed throughout the tuning process. LightGBM and CatBoost, both of which involve a greater number of hyperparameters and more intricate dependencies, were instead optimized using the automatic hyperparameter optimization framework Optuna [50], which efficiently explores the search space via Bayesian optimization techniques.
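For illustration, a tenfold cross-validated grid search over the KNN hyperparameters can be set up with scikit-learn as below; the toy data and the reduced grid are stand-ins, not the study's actual feature matrix or full search space:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Toy stand-in for the VF feature matrix; the real search spans
# n_neighbors = 1..49, but a reduced grid keeps this sketch fast.
X, y = make_classification(n_samples=200, n_features=20, random_state=27)

param_grid = {
    "n_neighbors": [1, 5, 9, 13],
    "weights": ["uniform", "distance"],
}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=10, scoring="accuracy")
search.fit(X, y)
assert search.best_params_["n_neighbors"] in param_grid["n_neighbors"]
```

The same pattern (estimator, parameter grid, `cv=10`) extends to the SVM, RF, and XGBoost searches described above.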
For the LightGBM model, hyperparameter tuning was conducted using Optuna with tenfold stratified cross-validation. The search space included a fixed number of estimators set to 1000, a learning rate ranging from 0.01 to 0.3, and a maximum bin size (max_bin) from 32 to 1024. The number of leaves (num_leaves) was explored between 20 and 1000, and max_depth was varied from 3 to 10. Regularization parameters included reg_alpha and reg_lambda, both ranging from 0 to 0.1 in steps of 0.01. The minimum child weight (min_child_weight) was searched between 0 and 0.2, while the subsample and column sampling ratios (subsample and colsample_bytree) were explored from 0.2 to 0.95. The subsample frequency was fixed at 1, and the random seed was set to 27. This search space was designed to cover the key hyperparameters influencing model complexity and generalization, while Optuna’s Bayesian optimization approach ensured efficient and effective parameter exploration.

For the CatBoost model, to fully leverage its advantages in structured data classification tasks, several key hyperparameters were tuned within a carefully designed search space. The n_estimators was set to range from 1000 to 6000, balancing model capacity and training time. The learning rate was varied between 0.01 and 0.2 to control the step size of updates, avoiding overly rapid convergence or underfitting. The depth of the trees was explored in the range of 1 to 10, allowing adjustment of model complexity to balance expressive power and the risk of overfitting. To prevent unnecessary over-training, an early stopping mechanism was implemented with an overfitting detector wait period (od_wait) of 500 iterations, and evaluation was conducted every 500 iterations (metric_period). These settings were chosen to leverage CatBoost’s strengths in handling categorical features and preventing overfitting, particularly in structured and moderately sized biological datasets.
Stacking model
Stacking is a robust ensemble learning technique that leverages the outputs of multiple base models as additional input features for a meta-model, thereby creating a more resilient and accurate ensemble model [51]. In this study, every combination of N (N = 3, 4, 5, or 6) base models was drawn from the six base models, and their predicted labels were used as input features of the meta-model (LR, SVM, or XGBoost) to build ensemble models, yielding 126 candidate configurations in total.
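A minimal sketch of this two-level stacking scheme with scikit-learn, shown for the G + combination (KNN, SVM, and RF base models with an LR meta-model); the toy dataset and the use of out-of-fold labels via `cross_val_predict` are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Level 1: out-of-fold predicted labels from each base model become
# the input features of the meta-model.
base_models = [KNeighborsClassifier(), SVC(), RandomForestClassifier(random_state=0)]
meta_X = np.column_stack([cross_val_predict(m, X, y, cv=10) for m in base_models])

# Level 2: the LR meta-model learns how to weight the base predictions.
meta_model = LogisticRegression().fit(meta_X, y)
assert meta_model.score(meta_X, y) > 0.5
```

Using out-of-fold predictions at level 1 keeps the meta-model from seeing labels its base models were trained on, which is the usual safeguard against leakage in stacking.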
Performance evaluation
Based on the confusion matrix, several performance metrics were employed to assess model performance, including SN, SP, ACC, F1 score, and MCC, calculated as follows:

$$ \mathrm{SN}=\frac{TP}{TP+FN},\qquad \mathrm{SP}=\frac{TN}{TN+FP},\qquad \mathrm{ACC}=\frac{TP+TN}{TP+TN+FP+FN} $$

$$ \mathrm{F1}=\frac{2TP}{2TP+FP+FN},\qquad \mathrm{MCC}=\frac{TP\times TN-FP\times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}} $$
“TP” stands for true positives, “TN” for true negatives, “FP” for false positives, and “FN” for false negatives. Moreover, the ROC and PR curves were plotted, and the AUROC and AUPR were calculated to quantify the performance of the models.
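The metrics above can be computed directly from the confusion-matrix counts; `confusion_metrics` is an illustrative helper, not part of the study's codebase:

```python
import math

def confusion_metrics(tp, tn, fp, fn):
    """Compute SN, SP, ACC, F1, and MCC from confusion-matrix counts."""
    sn = tp / (tp + fn)                       # sensitivity (recall)
    sp = tn / (tn + fp)                       # specificity
    acc = (tp + tn) / (tp + tn + fp + fn)     # accuracy
    f1 = 2 * tp / (2 * tp + fp + fn)          # F1 score
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"SN": sn, "SP": sp, "ACC": acc, "F1": f1, "MCC": mcc}

m = confusion_metrics(tp=140, tn=135, fp=15, fn=10)
assert round(m["ACC"], 4) == round(275 / 300, 4)
```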
Construction of gene knockout strains
The homologous recombination-based method was performed to construct the knockout strains for the genes acrA, smpB, and bvgS. The genomic DNA from A. veronii C4 was used as the template for amplifying both the upstream and downstream fragments of the target gene, which were then joined by overlap PCR. The resulting fusion fragment was inserted, using the One Step CloneExpress enzyme, into the vector pRE112 previously digested with XbaI and SacI. The recombinant plasmid was then transformed into E. coli WM3064 competent cells via heat shock. The transformed cells were plated on LB solid medium containing DAP and Chl, followed by incubation at 37 °C for 12 h. The positive clones were validated by DNA sequencing. The recombinant plasmid was extracted and introduced into the WT strain of A. veronii C4 via biparental mating, and the gene knockout strain was selected on 8% sucrose plates.
Cytotoxicity assays
Mouse macrophages (RAW 264.7) were cultured in DMEM medium at 37 °C in a CO2 incubator. The RAW 264.7 cell line, derived from A-MuLV-induced tumors in BALB/c mice, is a well-established macrophage model that retains key functions such as phagocytosis and pinocytosis. It is widely used to investigate cellular responses to pathogens [52, 53]. A total of 5 × 104 cells were then inoculated into 96-well plates together with a bacterial population of 106 colony-forming units (CFU). Following a 2-h incubation, cell viability was assessed using trypan blue staining. To evaluate the contribution of the acrA, acrB, bvgS, and smpB genes of A. veronii C4 to cytotoxicity, the corresponding knockout strains were used to infect the mouse macrophages.
Acute toxicity test of mice
All the animal experiments were approved and conducted in compliance with the ethical guidelines and recommendations established by the ethical committee of Hainan University (approval number: HNUAUCC-2023–00001). Fifteen 4-week-old male Kunming mice were purchased from the Drug Research Centre of Hainan Province. After a 3-day acclimation period, the mice were randomly allocated into three groups, with each group containing five mice. Two of the groups received intraperitoneal injections of either the WT or ΔacrB strain at a concentration of 105 CFU/g, while the remaining group was mock-infected with PBS. Following an 8-h postinjection interval, the mice were sacrificed and subsequently dissected. The heart, liver, spleen, lung, and kidney were harvested, weighed, and homogenized using a tissue grinder with 1 mL 1 × PBS at 2000 rpm. The resulting mixture was then centrifuged for 3 min, and the supernatant was spread and cultured on plates containing ampicillin (50 μg/mL). After a 24-h incubation at 30 °C, the colonies were counted. The remaining organ samples were fixed in 4% paraformaldehyde for 24 h and sent to Wuhan Servicebio for formalin-fixed, paraffin-embedded (FFPE) sectioning. Subsequently, these sections were stained with HE for microscopic observation and photography.
Web server development
The web server interface was developed using the Node.js and Express.js frameworks for the backend, while the front end was implemented using the Angular framework. Users can select the G + or G − bacterial model to predict VFs, with the prediction results displayed on the job list page.
Data availability
All data and code generated or analyzed during this study are included in this published article, its supplementary information files, and publicly available repositories (Zenodo, DOI: https://doi.org/10.5281/zenodo.16741404; GitHub, https://github.com/Liuyt-Bio/pLM4VF).
Abbreviations
- VF: Virulence factor
- ML: Machine learning
- G+: Gram-positive
- G−: Gram-negative
- AAC: Amino acid composition
- DPC: Dipeptide composition
- SVM: Support vector machine
- PSSM: Position-specific scoring matrix
- HMM: Hidden Markov model
- DL: Deep learning
- pLM: Protein language model
- ESM: Evolutionary scale modeling
- BERT: Bidirectional encoder representations from transformers
- SEQSIM: Sequence similarity
- RF: Random forest
- KNN: K-nearest neighbor
- XGBoost: eXtreme Gradient Boosting
- LightGBM: Light Gradient Boosting Machine
- CatBoost: Categorical Boosting
- ACC: Accuracy
- SN: Sensitivity
- SP: Specificity
- MCC: Matthews correlation coefficient
- ESM-1b: Esm1b_t33_650M_UR50S
- ESM-2-650M: Esm2_t33_650M_UR50D
- RUS: Random under-sampling
- SMOTE: Synthetic Minority Over-sampling Technique
- WT: Wild type
- CFU: Colony-forming units
- HE: Hematoxylin and eosin
- SHAP: Shapley Additive exPlanations
- PseAAC: Pseudo-amino acid composition
- QSO: Quasi-sequence order
- CTriad: Conjoint triad
- CTDT: Composition, transition, and distribution-transition
- RBF: Radial basis function
- FFPE: Formalin-fixed, paraffin-embedded
References
Vos T, Lim SS, Abbafati C, Abbas KM, Abbasi M, Abbasifard M, et al. Global burden of 369 diseases and injuries in 204 countries and territories, 1990–2019: a systematic analysis for the global burden of disease study 2019. Lancet. 2020;396(10258):1204–22.
Leitão JH. Microbial virulence factors. Int J Mol Sci. 2020;21(15): 5320.
Sayers S, Li L, Ong E, Deng S, Fu G, Lin Y, et al. Victors: a web-based knowledge base of virulence factors in human and animal pathogens. Nucleic Acids Res. 2018;47(D1):D693–700.
Dickey SW, Cheung GYC, Otto M. Different drugs for bad bugs: antivirulence strategies in the age of antibiotic resistance. Nat Rev Drug Discov. 2017;16(7):457–71.
Zheng L-L, Li Y-X, Ding J, Guo X-K, Feng K-Y, Wang Y-J, et al. A comparison of computational methods for identifying virulence factors. PLoS One. 2012;7(8): e42517.
Sachdeva G, Kumar K, Jain P, Ramachandran S. Spaan: a software program for prediction of adhesins and adhesin-like proteins using neural networks. Bioinformatics. 2004;21(4):483–91.
Garg A, Gupta D. VirulentPred: a SVM based prediction method for virulent proteins in bacterial pathogens. BMC Bioinformatics. 2008;9(1):62.
Gupta A, Kapil R, Dhakan DB, Sharma VK. MP3: a software tool for the prediction of pathogenic proteins in genomic and metagenomic data. PLoS One. 2014;9(4): e93907.
Xie R, Li J, Wang J, Dai W, Leier A, Marquez-Lago TT, et al. DeepVF: a deep learning-based hybrid framework for identifying virulence factors using the stacking strategy. Brief Bioinform. 2021;22(3): bbaa125.
Rentzsch R, Deneke C. Predicting bacterial virulence factors – evaluation of machine learning and negative data strategies. Brief Bioinform. 2019;21(5): bbz076.
Sun J, Rutherford ST, Silhavy TJ, Huang KC. Physical properties of the bacterial outer membrane. Nat Rev Microbiol. 2022;20(4):236–48.
Lin Z, Akin H, Rao R, Hie B, Zhu Z, Lu W, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023;379(6637):1123–30.
Meier J, Rao R, Verkuil R, Liu J, Sercu T, Rives A. Language models enable zero-shot prediction of the effects of mutations on protein function. Adv Neural Inf Process Syst. 2021;34:29287–303.
Rives A, Meier J, Sercu T, Goyal S, Lin Z, Liu J, et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc Natl Acad Sci U S A. 2021;118(15): e2016239118.
Suzek BE, Huang H, McGarvey P, Mazumder R, Wu CH. UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics. 2007;23(10):1282–8.
Du Z, Ding X, Xu Y, Li Y. UniDL4BioPep: a universal deep learning architecture for binary classification in peptide bioactivity. Brief Bioinform. 2023;24(3):bbad135.
Yadav S, Vora DS, Sundar D, Dhanjal JK. TCR-ESM: employing protein language embeddings to predict TCR-peptide-MHC binding. Comput Struct Biotechnol J. 2024;23:165–73.
Qu Y, Niu Z, Ding Q, Zhao T, Kong T, Bai B, et al. Ensemble learning with supervised methods based on large-scale protein language models for protein mutation effects prediction. Int J Mol Sci. 2023;24(22): 16496.
Kalakoti Y, Yadav S, Sundar D. Transdti: transformer-based language models for estimating DTIs and building a drug recommendation workflow. ACS Omega. 2022;7(3):2706–17.
Fu L, Niu B, Zhu Z, Wu S, Li W. CD-hit: accelerated for clustering the next-generation sequencing data. Bioinformatics. 2012;28(23):3150–2.
Ma J, Zhao H, Mo S, Li J, Ma X, Tang Y, et al. Acquisition of type I methyltransferase via horizontal gene transfer increases the drug resistance of Aeromonas veronii. Microb Genom. 2023;9(9): 001107.
Lee J-S, Kim S, Excler J-L, Kim JH, Mogasale V. Global economic burden per episode for multiple diseases caused by group A Streptococcus. NPJ Vaccines. 2023;8(1):69.
Nelson RE, Hatfield KM, Wolford H, Samore MH, Scott RD, Reddy SC, et al. National estimates of healthcare costs associated with multidrug-resistant bacterial infections among hospitalized patients in the United States. Clin Infect Dis. 2021;72(Suppl 1):S17–26.
Krapp LF, Meireles FA, Abriata LA, Devillard J, Vacle S, Marcaida MJ, et al. Context-aware geometric deep learning for protein sequence design. Nat Commun. 2024;15(1):6273.
Wang Z, Xie D, Wu D, Luo X, Wang S, Li Y, et al. Robust enzyme discovery and engineering with deep learning using CataPro. Nat Commun. 2025;16(1):2736.
Du Z, Ding X, Hsu W, Munir A, Xu Y, Li Y. pLM4ACE: a protein language model based predictor for antihypertensive peptide screening. Food Chem. 2024;431: 137162.
Huang M-W, Chen C-W, Lin W-C, Ke S-W, Tsai C-F. SVM and SVM ensembles in breast cancer prediction. PLoS One. 2017;12(1): e0161501.
Song J, Li F, Takemoto K, Haffari G, Akutsu T, Chou KC, et al. Prevail, an integrative approach for inferring catalytic residues using sequence, structural, and network features in a machine-learning framework. J Theor Biol. 2018;443:125–37.
Li N, Lei Y, Song Z, Yin L. Helix-specific properties and applications in synthetic polypeptides. Curr Opin Solid State Mater Sci. 2023;27(5): 101104.
Komar AA. Unraveling co-translational protein folding: concepts and methods. Methods. 2018;137:71–81.
Shen C, Chang S, Luo Q, Chan KC, Zhang Z, Luo B, et al. Structural basis of BAM-mediated outer membrane β-barrel protein assembly. Nature. 2023;617(7959):185–93.
Kersey CM, Dumenyo CK. Regulation of corA, the magnesium, nickel, cobalt transporter, and its role in the virulence of the soft rot pathogen, Pectobacterium versatile strain Ecc71. Microorganisms. 2023;11(7):1747.
Kinch LN, Cong Q, Jaishankar J, Orth K. Co-component signal transduction systems: fast-evolving virulence regulation cassettes discovered in enteric bacteria. Proc Natl Acad Sci U S A. 2022;119(24): e2203176119.
Verma RK, Gondu P, Saha T, Chatterjee S. The global transcription regulator XooClp governs type IV pili system-mediated bacterial virulence by directly binding to TFP-Chp promoters to coordinate virulence associated functions. Mol Plant Microbe Interact. 2024;37(4):357–69.
Shalabi LA, Shaaban Z, Kasasbeh B. Data mining: a preprocessing engine. J Comput Sci. 2006;2(9):735–9.
Lingyun Z, Chonghan N, Fuquan H. Accurate prediction of bacterial type IV secreted effectors using amino acid composition and PSSM profiles. Bioinformatics. 2013;29(24):3135–42.
Chou K-C. Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins. 2001;43(3):246–55.
Liu T, Zheng X, Wang J. Prediction of protein structural class for low-similarity sequences using support vector machine and PSI-BLAST profile. Biochimie. 2010;92(10):1330–4.
Shen J, Zhang J, Luo X, Zhu W, Yu K, Chen K, et al. Predicting protein-protein interactions based only on sequences information. Proc Natl Acad Sci U S A. 2007;104(11):4337–41.
Chen Z, Zhao P, Li F, Leier A, Marquez-Lago TT, Wang Y, et al. Ifeature: a Python package and web server for features extraction and selection from protein and peptide sequences. Bioinformatics. 2018;34(14):2499–502.
Jeong JC, Lin X, Chen XW. On position-specific scoring matrix for protein function prediction. IEEE ACM Trans Comput Biol Bioinform. 2011;8(2):308–15.
Zahiri J, Yaghoubi O, Mohammad-Noori M, Ebrahimpour R, Masoudi-Nejad A. PPIevo: protein-protein interaction prediction from PSSM based evolutionary information. Genomics. 2013;102(4):237–42.
Chou K-C, Shen H-B. MemType-2L: a web server for predicting membrane proteins and their types by incorporating evolution information through Pse-PSSM. Biochem Biophys Res Commun. 2007;360(2):339–45.
García S, Luengo J, Herrera F. Data preprocessing in data mining. 1st ed. Cham: Springer; 2015.
Wang L-N, Shi S-P, Xu H-D, Wen P-P, Qiu J-D. Computational prediction of species-specific malonylation sites via enhanced characteristic strategy. Bioinformatics. 2016;33(10):1457–63.
Breiman L. Bagging predictors. Mach Learn. 1996;24(2):123–40.
Chen T, Guestrin C. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM; 2016. p. 785–94.
Ke G, Meng Q, Finley T, Wang T, Chen W, Ma W, et al. LightGBM: a highly efficient gradient boosting decision tree. Adv Neural Inf Process Syst. 2017;30:3146–54.
Prokhorenkova L, Gusev G, Vorobev A, Dorogush AV, Gulin A. CatBoost: unbiased boosting with categorical features. Adv Neural Inf Process Syst. 2018;31:6639–49.
Akiba T, Sano S, Yanase T, Ohta T, Koyama M. Optuna: a next-generation hyperparameter optimization framework. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM; 2019. p. 2623–31.
Liang X, Li F, Chen J, Li J, Wu H, Li S, et al. Large-scale comparative review and assessment of computational methods for anti-cancer peptide identification. Brief Bioinform. 2021;22(4):bbaa312.
Taciak B, Białasek M, Braniewska A, Sas Z, Sawicka P, Kiraga Ł, et al. Evaluation of phenotypic and functional stability of RAW 264.7 cell line through serial passages. PLoS One. 2018;13(6): e0198943.
Li N, Deshmukh MV, Sahin F, Hafza N, Ammanath AV, Ehnert S, et al. Staphylococcus aureus thermonuclease NucA is a key virulence factor in septic arthritis. Commun Biol. 2025;8(1):598.
Acknowledgements
We express our gratitude to Ziding Zhang from China Agricultural University, Yuan Zhou from Peking University and Xiaodi Yang from Peking University First Hospital for providing valuable suggestions.
Funding
This research was supported by the National Natural Science Foundation of China (32460244 and 32060153 to H. L.) and the Hainan Provincial Natural Science Foundation of China (225MS009 and 322RC589 to H. L.).
Author information
Contributions
Conceptualization, H.L. and Z.L.; Methodology, Y.L., X.C.1, J.L.1, and J.L.2; Software, T.L.; Investigation, Y.L., X.C.1, and J.L.1; Visualization, Y.L., X.C.1, and J.L.1; Writing – Original Draft, Y.L., X.C.1, and J.L.1; Writing – Review & Editing, H.L. and Z.L.; Resources, J.L.2, X.M., X.C.2, and Y.T.; Supervision, H.L. and Z.L. All authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
All the animal experiments were approved and conducted in compliance with the ethical guidelines and recommendations established by the ethical committee of Hainan University (approval number: HNUAUCC-2023–00001).
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
12915_2025_2374_MOESM1_ESM.xlsx
Additional file 1: Table S1. Protein language models used in this study. Table S2. Performance comparison of the models trained using protein language models for Gram-positive bacteria on the 10-fold cross-validation test. Table S3. Performance comparison of the models trained using protein language models for Gram-negative bacteria on the 10-fold cross-validation test. Table S4. Performance comparison of the models trained using traditional descriptors for Gram-positive bacteria on the 10-fold cross-validation test. Table S5. Performance comparison of the models trained using traditional descriptors for Gram-negative bacteria on the 10-fold cross-validation test. Table S6. Performance comparison of different sampling strategies on the Gram-positive bacterial datasets. Table S7. Performance comparison of different sampling strategies on the Gram-negative bacterial datasets. Table S8. Performance comparison of different CD-HIT thresholds for Gram-positive bacteria on the independent test. Table S9. Performance comparison of different CD-HIT thresholds for Gram-negative bacteria on the independent test. Table S10. Performance comparison of base models and different ensemble models for Gram-positive bacteria on the independent test. Table S11. Performance comparison of base models and different ensemble models for Gram-negative bacteria on the independent test. Table S12. Cross-validation of Gram-positive and Gram-negative models on the independent test. Table S13. Main biological processes involved in the virulence factors predicted by pLM4VF. Table S14. Performance comparison of pLM4VF and state-of-the-art methods. Table S15. The number of Gram-positive bacterial virulence factors from the Victors database. Table S16. The number of Gram-negative bacterial virulence factors from the Victors database.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Liu, Y., Cao, X., Li, J. et al. Advancing virulence factor prediction using protein language models. BMC Biol 23, 307 (2025). https://doi.org/10.1186/s12915-025-02374-w