Abstract
In a phylogeny, trustworthy branch support estimates are as important as the tree itself. We show that branch support values based on bootstrapping can be improved by combining sequence and structural information from proteins. Our approach relies on the systematic comparison of homologous intra-molecular structural distances. Variations in these distances exhibit less saturation than sequence-based Hamming distances and support the computation of tree-like distance matrices resolvable into phylogenetic trees using distance-based methods such as minimum evolution. These trees bear strong similarities to their sequence-based counterparts and allow the estimation of bootstrap support values, but they are sufficiently distinct for their information content to be combined. The combined sequence and structure bootstrap support values yield improved discrimination between correct and incorrect branches. In this work we show that our approach, named multistrap, is suitable for the improvement of bootstrap branch support values using both predicted and experimental 3D structures.
Introduction
The resilience of protein folds is well established and has routinely been used to infer homology across evolutionary timespans incompatible with sequence analysis1,2,3. This observation has led to the speculation that the quantitative comparison of protein folds could be used as a metric to resolve deep nodes in phylogenetic trees4,5,6.
The Root Mean Square Deviation (RMSD) has been designed to compare protein folds. RMSD measures require a rigid superposition of the considered structures4,7. The Template Modeling Score (TM-Score)8 is a normalized variation of the original RMSD, which is now routinely used when comparing alternative superpositions across structures of different lengths. The TM-Score does not, however, correct the RMSD’s high sensitivity to conformation changes9.
An obvious alternative to measuring global fold superpositions is the systematic comparison of all homologous intra-molecular distances (IMD). Several IMD-based comparison methods have been reported that do not require a superposition and are therefore less sensitive than RMSD to conformational variations. The first such method was the distance RMSD (dRMSD)10. Similar measures were later shown to be suitable for the identification of distantly related structural homologues using the DALI algorithm11,12. IMD comparisons were also used to quantify multiple sequence alignment (MSA) structural accuracy (O’Sullivan et al.13; Armougom et al.14). The Local Distance Difference Test (lDDT) score is the most recent IMD-based method. It leverages IMDs to assess the concordance of local structural features between two protein structures within specified regions, thus offering a detailed assessment of local similarity15. The lDDT has been extensively used to assess prediction accuracy in the last two CASP (Critical Assessment of Structure Prediction) challenges. In their original report, the authors15 used simulated data to demonstrate the increased robustness of IMD metrics over their RMSD-based counterparts when dealing with alternative conformations of the same protein.
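To make the idea of a superposition-free comparison concrete, the following minimal sketch computes a distance RMSD between two structures whose residues are already paired. It is an illustration in the spirit of dRMSD, not the IMD metric used in this study. The function names and the input format (one coordinate per residue, for instance the Cα atom, with homologous residues at the same list position) are assumptions made for the example.

```python
import math
from itertools import combinations


def intra_molecular_distances(coords):
    """Return the distance between every pair of residues (i < j) of one structure."""
    return [math.dist(coords[i], coords[j])
            for i, j in combinations(range(len(coords)), 2)]


def drmsd(coords_a, coords_b):
    """Distance RMSD between two structures.

    Both inputs are equal-length lists of (x, y, z) tuples in which position k
    of one list is homologous to position k of the other (hypothetical format).
    """
    if len(coords_a) != len(coords_b) or len(coords_a) < 2:
        raise ValueError("need two aligned structures with at least two residues")
    da = intra_molecular_distances(coords_a)
    db = intra_molecular_distances(coords_b)
    # Only internal distances are compared, so no superposition is required.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(da, db)) / len(da))
```

Because only internal distances enter the calculation, rotating or translating one of the two structures leaves the result unchanged, which is the property that distinguishes this family of measures from superposition-based RMSD.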
IMD-based metrics bear numerical properties sufficiently similar to sequence-based distances so that they can be used to resolve phylogenetic trees using standard distance-based phylogenetic methods such as Neighbour Joining16, or Minimum Evolution tree reconstruction methods (ME)17. In this study, we explore the potential of intra-molecular distances to be treated as evolutionary characters and set out to ask if these characters could either help the reconstruction of phylogenetic trees or provide new ways of estimating branch reliability. The use of structures for phylogeny reconstruction remains limited because it has been difficult to identify independent objective reference datasets that would allow the distances to be tested in a manner similar to other distance metrics18.
In this work, rather than trying to quantify the relative merits of various alternative phylogeny reconstruction methods, we focus our efforts on measuring the agreement between structure-based and sequence-based phylogenies in a controlled environment. We do so by systematically exploring the relationship between distance-based methods, using sequences or structures, and Maximum Likelihood (ML) reconstruction methods, which are generally considered, along with Bayesian methods, to deliver the most accurate phylogenies19,20,21,22,23. Our results show a significant level of congruence between sequence and structure-based phylogenetic reconstructions. We take advantage of this property to design a hybrid bootstrap support method named multistrap24, which combines sequence and structural information. Using a downsampling strategy we find this hybrid support able to better discriminate between correct and incorrect branches than similar methods based on sequence or structure alone.
Results
The IMD metric is tree-like and exhibits lower saturation than Hamming distances
A key property of evolutionary characters is to enable estimates of evolutionary distances in such a way that the direct measurements made between two taxa closely approximate their true evolutionary distance. For instance, it is generally accepted that the number of mutations separating two sequences provides an indication of their evolutionary distance. Yet, the exact quantification of this number is complicated by the fact that the same site can be mutated multiple times, thus making direct mismatch counts like Hamming distances (pdist) distorted estimates of the true distances. The pdist measure therefore yields underestimates of the true distance; corrected distances like LG + G25 provide a better estimate of the true number of substitutions. Yet, this correction can only be effective to a certain extent, as multiple mutations accumulated over time will ultimately lead to saturation. In practice, saturation results in the variance of the LG + G corrected measure increasing exponentially with pdist until it becomes so high that no meaningful distance can be estimated any longer. We set out to explore the suitability of TM and IMD to measure evolutionary distances by comparing their saturation properties with those of pdist and ME, the pdist metric corrected with the LG + G model.
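The underestimation described above can be illustrated with a short sketch. It computes pdist on a pair of aligned sequences and applies a Poisson correction, which is a deliberately simple stand-in and not the LG + G model used in this study. The function names and the gap handling (gapped positions are ignored) are assumptions made for the example.

```python
import math


def pdist(seq_a, seq_b, gap="-"):
    """Fraction of mismatching sites among aligned, ungapped positions."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != gap and b != gap]
    if not pairs:
        raise ValueError("no ungapped aligned positions to compare")
    return sum(a != b for a, b in pairs) / len(pairs)


def poisson_corrected(p):
    """Poisson correction d = -ln(1 - p).

    A simple multiple-hit correction used here for illustration only;
    it is NOT the LG + G model.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    return -math.log(1.0 - p)
```

The corrected distance is always at least as large as pdist, and it grows increasingly fast as pdist approaches 1, so that small sampling fluctuations in pdist translate into large fluctuations of the corrected value. This is the behaviour referred to as saturation.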
We carried out this analysis by assembling a reference benchmark featuring 508 datasets, each made from 10 to 133 homologous protein sequences with available experimental 3D structures. These datasets were obtained by collecting all the experimental PDBs associated with PFAM domain families v28 (see methods). We estimated an MSA on each dataset using mTM-align, the best-performing multiple sequence aligner on this benchmark (Supplementary Fig. S1). For each MSA, we estimated an ML tree with IQ-TREE (default command line using ModelFinder to identify the best substitution model)26. For each pair of sequences, we extracted the ML patristic distances which we used as references in the remainder of this study. We also extracted from each MSA all possible pairwise projections on which we measured the pdist, ME, TM and IMD metrics. Each value thus collected was normalized by its corresponding global median (e.g. the IMD measures were normalized by the median of all the combined IMD values). The purpose of this normalization was to make variations comparable across datasets and metrics.
We analyzed the saturation process by comparing, on each dataset, the variation of one of the four metrics (pdist, ME, TM and IMD) with its corresponding ML patristic distances. A selected example is displayed in Fig. 1a–d. On this well-powered dataset, one can see how the pdist signal tends to saturate on remote homologues, in contrast to IMD, TM and ME, whose variation with the ML patristic distance remains quasi-linear. To generalize this observation we collected all datasets sufficiently powered on both the close and the remote homology side. We did so by applying as a threshold the median of the ML patristic distances (3.09) and retained all the datasets featuring at least 20 data points (sequence pairs) on each side of this threshold. This left us with 320 datasets (89% of the data points). In each dataset, we collected the slopes of two linear regressions, one estimated on the close homologues and one on the entire dataset. In the absence of saturation (or amplification), the ratio between these two slopes should be equal to 1.
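The slope-ratio readout can be sketched as follows. This is a minimal illustration under stated assumptions: the ratio is taken as the close-homologue slope divided by the whole-dataset slope (inferred from the text, where values above 1 indicate saturation), pairs at or below the threshold are treated as close homologues, and the function names are hypothetical.

```python
def ols_slope(xs, ys):
    """Slope of the ordinary least-squares line of ys regressed on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if n < 2 or sxx == 0:
        raise ValueError("need at least two points with distinct x values")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def saturation_ratio(ml_patristic, metric, threshold):
    """Ratio of the close-homologue slope to the whole-dataset slope.

    ml_patristic and metric are parallel lists with one value per sequence
    pair. Assumption: a pair is 'close' when its ML patristic distance is
    at or below the threshold.
    """
    close = [(x, y) for x, y in zip(ml_patristic, metric) if x <= threshold]
    close_x, close_y = zip(*close)
    return ols_slope(close_x, close_y) / ols_slope(ml_patristic, metric)
```

With this convention a metric that grows linearly with the ML patristic distance gives a ratio of 1, a metric that flattens on remote homologues gives a ratio above 1, and a metric that steepens gives a ratio below 1.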
Hexagonal binned density plot showing the relationship between pairwise patristic distances measured on the ML trees (x-axes) and the corresponding pairwise distances (y-axes) estimated with (a) the pdist, (b) ME, (c) TM and (d) IMD metrics on a selected representative dataset (PF13378). The pdist, ME, TM and IMD metrics were normalized by their respective global median values. The dashed red line represents the global median value of the ML patristic distances separating close and remote homologues (3.09). The light blue line indicates the linear regression fitted on the close homologues and the green line shows the regression fitted on the entire dataset. e Density plot showing the distribution of ratios between the linear regression slopes measured on the close homologues (light blue linear regression line in panels a–d) and the linear regression slopes estimated on the complete datasets (green linear regression line in panels a–d) (n = 320). The dashed lines indicate the median values and are reported in Table 1. f Tree-likeness of the three metrics, as measured by the coefficient of determination (R²) between the input and patristic distance matrices estimated on each dataset (n = 508) by ME (pink), TM (purple), and IMD (green). The pdist metric is not displayed on this panel as it was not used to derive any tree. The boxplots show the median, whose values are reported in Table 2, the quartiles (25th and 75th percentiles) and whiskers that extend to 1.5 times the IQR.
We found these ratios to be significantly closer to 1 for the two structure-based metrics than for pdist (1.21 for TM and 1.42 for IMD as opposed to 2.21 for pdist, one-sided Wilcoxon signed-rank test all p-values < 2.2e-16), (Fig. 1e, Table 1, Supplementary Fig. S2A, B). These results suggest a stronger resilience to saturation for structure-based methods that are much less sensitive to multiple mutations than their sequence-based counterparts. The lower saturation observed in the structural metrics comes at the cost of a lower resolution as reflected by their lower coefficient of determination (R²) values on close homologues (0.80 for pdist as compared with 0.48 for TM and 0.58 for IMD), (Table 1, Supplementary Fig. S2C, D). This same analysis shows that the slopes of the linear regressions measured on close homologues vary much more for the structural metrics than they do for pdist as judged from their respective standard deviations (0.13 for pdist as compared to 0.22 for TM and 0.23 for IMD). Even though they outperform pdist, the structure-based metrics remain inferior to ME, the corrected pdist. Indeed, ME is less sensitive to saturation (median slope ratio of 0.97), and the ME ratios are much less dispersed than TM and IMD (SD of 0.23 for ME as compared to 0.67 for TM and 0.87 for IMD). ME distances become, however, more uncertain when incorporating remote homologues as reflected by the decreasing R² values (0.87 on close homologues as compared to 0.79 when incorporating all the data points). This is a rather different behaviour from structure-based metrics whose R² slightly increases when incorporating remote homologues (0.48 and 0.57 for TM, 0.58 and 0.60 for IMD).
We further explored saturation separately on a series of datasets selected to be representative of the R² (IMD vs ML patristic distances) distribution. We found the relative saturation patterns to be fairly stable and consistent across the entire analysis, particularly on datasets powered with the largest number of sequences (Supplementary Figs. S3–6). We also measured the fraction of datasets within 20% of non-saturation (i.e. those with slope ratios between 0.8 and 1.2), taking into account that when it comes to estimating true distances, amplification (i.e. a ratio below 1) is as problematic as saturation. While the uncorrected structural distances are inferior to ME by this criterion, we nonetheless found them to be dramatically superior to pdist (1.25% of datasets within 20% of 1 for pdist, 21.56% for IMD, 35.94% for TM and 73.75% for ME, Table 1). All in all, these results suggest that while they may be further improved, the uncorrected structure-based metrics constitute fairly accurate estimates of the true evolutionary distances and are clearly superior to sequence-based Hamming distances.
Aside from resilience to saturation, tree-likeness is another essential property of quantifiable evolutionary characters. Tree-likeness relates to the capacity of a metric to support the resolution of a distance matrix into a tree with limited distortion. It can be estimated by comparing the initial pairwise distances in the distance matrix with the patristic distances recovered from its corresponding tree. High tree-likeness results in an R² between the two sets of measures close to 1. We resolved the distance matrices into phylogenetic trees using FastME (LG + G model)27, a well-established software for distance-based phylogeny reconstruction. The IMD and TM distance matrices were kept uncorrected while pdist were replaced by the FastME distance matrices (ME) used above, in which multiple substitutions are corrected for. Overall, the tree-likeness of the two structure-based metrics is higher than that measured on the sequence-based metric (median R² of 0.970 for the IMD, 0.962 for the TM and 0.898 for ME trees) (Fig. 1f, Table 2). We obtained comparable results when repeating the analysis using four different MSA protocols that were either entirely based on structures (3D-Coffee, trimmed and untrimmed) or on sequences (T-Coffee, trimmed and untrimmed) (Supplementary Fig. S7, Supplementary Tables 1, 2).
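The tree-likeness readout can be sketched as follows. The example assumes that R² is computed as the squared Pearson correlation between input and patristic distances, and it uses a hypothetical tree encoding (each node mapped to its parent and branch length, with the root mapped to `None`). Neither choice is taken from the software used in this study.

```python
def patristic(tree, leaf_a, leaf_b):
    """Path length between two leaves.

    tree maps every node to (parent, branch_length); the root maps to
    (None, 0.0). This encoding is an assumption made for the example.
    """
    depth_from_a = {}
    node, travelled = leaf_a, 0.0
    while node is not None:  # record the distance from leaf_a to each ancestor
        depth_from_a[node] = travelled
        parent, length = tree[node]
        travelled += length
        node = parent
    node, travelled = leaf_b, 0.0
    while node not in depth_from_a:  # climb from leaf_b to the common ancestor
        parent, length = tree[node]
        travelled += length
        node = parent
    return depth_from_a[node] + travelled


def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy ** 2 / (sxx * syy)


def tree_likeness(input_dist, tree):
    """R² between input distances {(leaf_a, leaf_b): d} and tree patristic distances."""
    pairs = sorted(input_dist)
    return r_squared([input_dist[p] for p in pairs],
                     [patristic(tree, *p) for p in pairs])
```

A perfectly additive distance matrix is recovered without distortion by its tree and therefore scores 1, whereas any distance that cannot be accommodated by the tree lowers the score.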
Comparison of IMD and TM phylogenetic tree topologies
The above analyses stop short of quantifying the relative phylogenetic accuracies of IMD and TM trees. We addressed this issue by comparing the agreement of IMD and TM trees with their Maximum Likelihood counterparts (ML based on sequences) using the Robinson-Foulds (RF)28 topological distance (Fig. 2). Overall, we found that the IMD and TM trees had equal levels of topological agreement with their ML references in 156 out of 508 datasets, while they differed in the 352 remaining datasets. Among these 352 datasets, the IMD trees were more often the ones in best agreement with the ML references (199 for IMD as compared to 153 for TM, one-sided Sign test p-value = 0.008). These results are consistent with the observation that IMD and ML trees yield a higher level of topological agreement than that measured between TM and ML trees (congruence with ML trees is 0.40 for TM and 0.42 for IMD) (Table 2).
The topological differences of each of the IMD-based trees (x-axis) and TM-score-based trees (y-axis) with the corresponding ML trees were compared. The figure in the top left corner (199) indicates the number of datasets for which the IMD topologies are more similar to the ML reference than the corresponding TM topology. The figure in the bottom right corner (153) indicates the opposite situation, and the figure in the main diagonal indicates datasets whose IMD and TM topologies are equally different to their ML references. The difference between the top and bottom figures is significant (one-sided Sign test, p-value = 0.008).
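The sign test on the 199 versus 153 split can be reproduced in a few lines. The sketch below assumes the standard formulation, in which tied datasets are discarded and the number of wins is compared with a Binomial(n, 0.5) null distribution; the function name is hypothetical.

```python
from math import comb


def one_sided_sign_test(wins, losses):
    """P(X >= wins) for X ~ Binomial(wins + losses, 0.5); ties are excluded upstream."""
    n = wins + losses
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n


# Counts reported above: IMD closer to ML in 199 datasets, TM in 153.
p_value = one_sided_sign_test(199, 153)
```

Under these assumptions the computed p-value should be of the same order as the reported one.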
Sequence and structural bootstrap support values agree strongly on highly supported branches
Phylogenetic tree topologies are only considered conclusive when complemented with branch support values—commonly estimated by Felsenstein bootstrap support19. We implemented the IMD method in such a way that the columns can be sampled with replacement and used to generate bootstrap replicates suitable for a standard Felsenstein bootstrap procedure (see 'Methods'). As expected, we found IMD bootstrap support values to be in broad agreement with similar support measurements computed on ML or ME trees (R = 0.77 between IMD and ML, R = 0.72 between IMD and ME) (Fig. 3a, b, Supplementary Figs. S8–9). We also found a strong correlation between the average IMD bootstrap support values of individual trees and the fraction of branches they share with ME or ML trees (R = 0.67 for both IMD vs ME and IMD vs ML). The topological agreement is especially strong when considering branches with high support values (Fig. 3c, d). On average, the IMD and the ML trees share 42.10% of their branches, a figure to be compared with branches having 100% support in the IMD trees that occur 83.17% of the time within ML trees (1591 branches in total) (Table 2). We achieved similar results when comparing the IMD and the ME trees (Supplementary Figs. S10–11). Altogether these results further strengthen our observation that the variations of intra-molecular distances capture aspects of protein evolution relevant to phylogenetic reconstruction. Bootstrap support values are, however, generally higher on the IMD trees (43.47% for ME trees as compared to 66.37% for IMD trees). These differences should not be interpreted as suggesting a higher reliability for the underlying topologies, but rather as a consequence of the high level of redundancy in the structural information, obviously constrained by the 3D models they are collected from.
a For each dataset, the average bootstrap support value of the ME tree (y-axis) is plotted against its IMD counterpart (x-axis) and coloured by the RF topological distance (R = 0.72). The black overlaying line shows the running average RF (x-tree, y-tree, window size of 10) computed over the datasets sorted by IMD bootstrap support values (R = −0.67). b represents a similar analysis with ML (R = 0.77, R = −0.67). c The red line represents the fraction of branches that have an IMD bootstrap support value ≥X (x-axis) and occur in the ME trees (all datasets put together). The blue line indicates which fraction of the total number of IMD branches (8813 branches in total) are represented by each red point. d represents a similar analysis using ML trees as a reference.
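The column resampling at the heart of a Felsenstein bootstrap, and the way replicate trees are turned into a support value, can be sketched as follows. The tree-building step is deliberately left out. Trees are represented as sets of bipartitions, each bipartition being a frozenset of taxon names, which is an encoding chosen for this example and not the one used by the software in this study.

```python
import random


def bootstrap_msa(msa, rng):
    """Return one bootstrap replicate: columns drawn with replacement.

    msa is a list of equal-length aligned strings; rng is a random.Random.
    """
    n_columns = len(msa[0])
    columns = [rng.randrange(n_columns) for _ in range(n_columns)]
    return ["".join(seq[c] for c in columns) for seq in msa]


def branch_support(branch, replicate_trees):
    """Percentage of replicate trees that contain the given branch.

    Each replicate tree is a set of bipartitions (hypothetical encoding).
    """
    hits = sum(branch in tree for tree in replicate_trees)
    return 100.0 * hits / len(replicate_trees)
```

In this scheme the method used to build each replicate tree is interchangeable, which is what allows sequence-based and structure-based replicates to be scored against the same target branches.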
Structural variations carry denser evolutionary content than sequences
The lack of established reference datasets makes it difficult to objectively determine the relative accuracy of structure- and sequence-based trees. As an alternative to a reference-based analysis, we designed a controlled titration of the relative evolutionary information content within structures and sequences. Titration was achieved through a downsampling procedure. We first collected the 56 datasets (out of 508) featuring at least 200 nearly ungapped columns (less than 5% gaps in their mTM-align MSA). For each dataset, we sampled without replacement a sub-MSA of exactly 200 nearly ungapped columns, estimated the corresponding trees (IMD, ME, ML), and collected the shared branches to be used as a reference (44% of the branches in total). Each dataset was then progressively downsampled by removing columns five by five down to a minimum of ten columns. At each point, the trees were re-estimated and re-evaluated for the fraction of recapitulated reference branches. The complete procedure (i.e. starting from the sampling of 200 columns in each of the 56 datasets) was repeated ten times and the resulting data were combined into a single titration graph (Fig. 4a, Supplementary Fig. S12A-B). When computing the IMD trees, only 20 columns are needed to recapitulate 75% of the reference branches, in contrast to 55 and 50 columns for ME and ML respectively (Table 3). To estimate the impact of alignment difficulty on these readouts, we ranked the 56 datasets by their average MSA percentage identity, a well-known proxy for alignment difficulty29, and carried out the titration on each of the corresponding quantiles (Supplementary Fig. S13). We found that the trend of a higher reference branch recapitulation in IMD trees is robust and remains unaltered across different levels of alignment similarity (Table 3).
These observations support the usability of structural information when estimating phylogenies and also indicate that the information content within intra-molecular distances is overall similar to the one available in sequences, albeit slightly denser.
a Fraction of reference branches recovered on downsampled MSAs. The x-axis indicates the number of columns (out of a maximum of 200) kept in MSAs downsampled from 560 different MSAs (56 datasets sampled ten times each). The y-axis indicates the fraction of reference branches recovered. Data are presented as mean values ± SD (the envelope is the standard deviation). The gray dashed line indicates the number of columns required to recover 75% of the reference branches. b Represents a similar analysis carried out using the 49 datasets and their 490 MSA samples for which AlphaFold 2 predicted models could be obtained from the AF2 database.
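The downsampling schedule of the titration can be sketched as follows. The example assumes that successive column sets are nested (each step removes five columns from the previous set), which is one reading of the progressive downsampling described above. Tree estimation is not shown, and the function names are hypothetical.

```python
import random


def titration_schedule(columns, rng, step=5, minimum=10):
    """Yield nested column subsets, dropping `step` random columns per round."""
    order = list(columns)
    rng.shuffle(order)  # random removal order, fixed for the whole titration
    while len(order) >= minimum:
        yield sorted(order)
        order = order[:-step]


def recovered_fraction(reference_branches, tree_branches):
    """Fraction of the reference branches found in a re-estimated tree.

    Both arguments are sets of branches in any hashable encoding.
    """
    if not reference_branches:
        raise ValueError("the reference branch set is empty")
    return len(reference_branches & tree_branches) / len(reference_branches)
```

Starting from 200 columns, this schedule visits 200, 195, 190 and so on down to 10 columns; at each size, a tree would be re-estimated on the retained columns and scored with `recovered_fraction`.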
Multiple bootstrap methods provide a means to complement sequence data with structural information
The lack of an objective independent criterion to determine the relative merits of our tree reconstruction methods led us to explore the suitability of combining alternative bootstrap support values so as to increase the robustness of branch reliability estimators. We named this approach multistrap24. Given an initial MSA and its associated target phylogenetic tree, the multistrap process starts by separately generating bootstrap replicates using different tree methods. These replicates are then combined using their arithmetic mean to estimate the final bootstrap support values of each target tree branch (Fig. 5a). We benchmarked multistrap on the titration dataset. The broad strategy involved estimating trees and bootstrap support on the complete datasets (i.e. non-downsampled) using ML, ME and IMD tree reconstruction methods, collecting reference branches defined by their high support and agreement across these methods, re-estimating the bootstrap support values using downsampled datasets, and evaluating the capacity of the various combinations of these downsampled bootstrap support values to discriminate between reference and non-reference branches. The analysis was made using all possible combinations of methods. The purpose of this combinatorial approach was to determine whether incorporating structural information into bootstrap support through the IMD could improve the discrimination of reference branches that were defined without using structural data.
a Visual representation of the multistrap procedure (input and outputs). The provided input tree has its branches evaluated for support using three independent replicate collections based on the same MSA. The multistrap values are eventually estimated by averaging these values. b Definition of the reference branches in the downsampling analysis. Given a tree, branch supports are independently evaluated using three distinct replicate collections based on the same MSA (IMD, ME and ML), and the final reference sets are assembled by considering all 7 possible intersections of branch sets having 80% support or more within the IMD, ME and ML initial support sets.
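The combination step of multistrap reduces to a per-branch arithmetic mean. The sketch below assumes that the support values of each method have already been computed on the same target tree, and it stores them in nested dictionaries; this layout and the function name are choices made for the example, not the interface of the released software.

```python
from statistics import mean


def multistrap_support(supports_by_method):
    """Average the per-branch support values obtained with several methods.

    supports_by_method maps a method name (e.g. "ML", "IMD") to a dict
    {branch: support}; every method must cover the same target branches.
    """
    methods = list(supports_by_method.values())
    branches = set(methods[0])
    if any(set(m) != branches for m in methods):
        raise ValueError("all methods must score the same target branches")
    return {b: mean(m[b] for m in methods) for b in branches}
```

For example, a branch supported by 90% of the ML replicates and 100% of the IMD replicates receives a combined support of 95%.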
In practice, we first estimated an ML tree for each of the 56 non-downsampled datasets (200 columns). For each dataset, we generated 100 IMD, 100 ME and 100 ML tree replicates. Each replicate set was then used to estimate the bootstrap support value for each branch of the initial ML trees, thus yielding three distinct bootstrap support values for every branch of these ML trees (i.e. an IMD, an ME and an ML support). On the ML trees dressed with IMD support values, we collected all the branches having an IMD support greater than or equal to 80% and set these aside as the IMD reference branch set (IMD80). We did the same with ME and ML supports (ME80 and ML80). Subsequently, we built extra reference branch sets by collecting all possible pairwise and overall intersections of these three reference sets (e.g. branches supported by 80% or more of the IMD replicates AND 80% or more of the ME replicates, hereafter referred to as IMD80 ∩ ME80, etc.) (Fig. 5b). These combinations provided us with seven sets of reference branches containing between 37.8% (IMD80) and 24.9% (IMD80 ∩ ME80 ∩ ML80) of the 829 internal branches occurring in the ML trees (Supplementary Table 3). Since these sets and associated trees are designed for a discriminative analysis, we removed datasets featuring only reference branches (or none). This left us with 52 to 55 usable datasets per reference branch set.
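The construction of the seven reference branch sets can be sketched as follows, using the same nested-dictionary layout as an assumption (method name mapped to a dict of per-branch support values). The function name is hypothetical.

```python
from itertools import combinations


def reference_sets(supports_by_method, threshold=80):
    """Build every intersection of the high-support branch sets.

    Returns a dict keyed by tuples of method names, e.g. ("IMD", "ME"),
    whose values are the branches reaching the threshold in all of them.
    """
    high = {method: {b for b, s in supports.items() if s >= threshold}
            for method, supports in supports_by_method.items()}
    methods = sorted(high)
    return {combo: set.intersection(*(high[m] for m in combo))
            for size in range(1, len(methods) + 1)
            for combo in combinations(methods, size)}
```

With three methods this yields three single-method sets, three pairwise intersections and one three-way intersection, that is, seven sets in total.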
Given these reference branches, we set out to estimate the branch support metrics we wanted to test. To that effect, we computed new IMD, ME, and ML bootstrap replicate trees based on 25 columns only and used these downsampled replicates to re-estimate the branch supports in the initial ML trees (i.e. ML trees based on 200 columns had their branch supports re-estimated using replicates based on 25 columns). Given the bootstrap support values computed for each branch and the labels provided by one of the seven reference sets, it is possible to evaluate the discriminative capacity of the bootstrap method using the area under the receiver operating characteristic curve (ROC AUC). The AUC is equal to 1 if the bootstrap support values perfectly separate the references from the rest, or equal to 0.5 if the discrimination is random. The downsampled bootstrap support values were combined by averaging the alternative supports of a given branch using any of the seven possible bootstrap method combinations (e.g. ML, IMD + ML, IMD + ML + ME, etc.). We eventually used the seven possible reference branch sets estimated on the complete 200-column MSAs and the seven possible bootstrap metrics estimated on downsampled 25-column MSAs to do an exhaustive all-against-all analysis (Table 4).
This table, which spans the entire combinatorial space, is well-suited to quantify the added value associated with any specific bootstrap method. For instance, when considering the ML80 line (i.e. reference branches defined using ML non-downsampled replicates), one can see that the discriminative capacity of the ML + IMD bootstrap support values (ML + IMD column) is superior to the one measured by ML-only (0.880 for ML + IMD and 0.843 for ML alone). Since on this line, the reference branches definition does not rely on any IMD data, one can conclude that the IMD information (i.e. the structural signal) entirely accounts for the observed difference. On this same line, similar reasoning can be applied when comparing the ME + IMD column with ME alone (0.858 for ME + IMD and 0.794 for ME) or the ME + ML + IMD column with ME + ML (0.868 and 0.832 respectively). Overall, when considering all similar matched pairs of values based on reference branches derived independently from IMD (i.e. ME80, ML80, ME80 ∩ ML80 lines) we systematically found a net gain associated with the addition of IMD to the bootstrap protocol. As one would expect, the results are even more pronounced when considering experiments in which the reference branches definition also includes IMD information (IMD80, IMD80 ∩ ME80, IMD80 ∩ ML80, IMD80 ∩ ME80 ∩ ML80). A systematic comparison of the possible combinations shows that the differences between +/-IMD columns are all significant (Supplementary Table 4). Very similar results were obtained when basing the analysis on ME trees rather than ML trees (Supplementary Table 5) or when using alternative ways of combining the downsampled bootstrap support values such as geometric mean, minimum, or maximum (Supplementary Table 6).
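The AUC values discussed above measure how often a reference branch receives a higher support than a non-reference branch. A minimal sketch of this computation, using the pairwise (Mann-Whitney) formulation of the ROC AUC with ties counted as one half, is given below; the function name and input layout are assumptions made for the example.

```python
def roc_auc(supports, is_reference):
    """ROC AUC of support values for separating reference from other branches.

    supports and is_reference are parallel lists with one entry per branch.
    """
    positives = [s for s, r in zip(supports, is_reference) if r]
    negatives = [s for s, r in zip(supports, is_reference) if not r]
    if not positives or not negatives:
        # Mirrors the exclusion of datasets with only reference branches (or none).
        raise ValueError("need at least one reference and one non-reference branch")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))
```

The same function can be applied to the support values of a single method or to the averaged values of any method combination, which is what makes the columns of the table directly comparable.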
The AUC analysis supports the usefulness of combining sequence and structure-based bootstrap replicates but it does not indicate whether sensible general bootstrap support value thresholds exist that could be systematically applied to new datasets. We addressed this question by estimating for each dataset and each reference/bootstrap combination the bootstrap support value threshold providing an optimal separation between reference and non-reference branches. We defined the optimal threshold as the one yielding the highest possible Matthews Correlation Coefficient (MCC) and therefore providing a good compromise between the recovered fraction of true positives (sensitivity) and true negatives (specificity). The results (Supplementary Table 7) indicate that in the three reference branch sets derived without IMD information (ML80, ME80, ME80 ∩ ML80), the inclusion of IMD structural information predominantly results in increased sensitivity and specificity. Adding IMD information also led to an increase in optimal bootstrap values of 3–5%, but this increase cannot be directly interpreted as an improvement as it simply reflects the higher numerical values observed for the IMD supports. We also observed that the MCC optimal average bootstrap support values are associated with relatively high standard deviations (Supplementary Table 8) that suggest a large dispersion of the optimal threshold when evaluating bootstrap support values. Interestingly, the standard deviations were 3–6 percentage points lower when considering IMD-based combinations, a decreased dispersion consistent with the notion that structural information contributes towards increased robustness of bootstrap reliability estimates.
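The search for an MCC-optimal threshold can be sketched as follows. The example scans every observed support value as a candidate threshold, calls a branch positive when its support is at or above the threshold, and keeps the lowest threshold reaching the best MCC. These conventions, like the function names, are assumptions made for the illustration.

```python
import math


def mcc(tp, fp, tn, fn):
    """Matthews Correlation Coefficient; returns 0.0 when it is undefined."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


def best_threshold(supports, is_reference):
    """Return (threshold, mcc) maximizing the MCC over all observed supports."""
    best_t, best_score = None, -2.0
    for t in sorted(set(supports)):
        tp = fp = tn = fn = 0
        for support, reference in zip(supports, is_reference):
            if support >= t:  # branch predicted as reliable
                if reference:
                    tp += 1
                else:
                    fp += 1
            elif reference:
                fn += 1
            else:
                tn += 1
        score = mcc(tp, fp, tn, fn)
        if score > best_score:  # strict comparison keeps the lowest best threshold
            best_t, best_score = t, score
    return best_t, best_score
```

Because the optimum is searched independently on each dataset, the resulting thresholds can vary substantially from one dataset to the next, which is the dispersion quantified by the standard deviations reported above.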
Beyond experimental protein structures
At the time this manuscript is being written, the PDB contains about 230,000 experimentally determined protein structures, merely a fraction of the 250 million protein sequences contained in UniProt, and an even smaller fraction of the billions of proteins to be expected by the end of the Earth BioGenome Project. In such a context, the relevance of structural information when estimating phylogenies may be considered purely anecdotal. The recent report of highly accurate methods for protein structure prediction30 is, however, dramatically changing this situation and prompted us to explore the suitability of AlphaFold2 (AF2) models when doing IMD-based phylogenetic reconstructions. We therefore collected the available predicted 3D structures from the AlphaFold Protein Structure Database31 for the sequences used in the titration and reproduced the entire analysis using these models, including the MSA step. Since we relied on the AF2 structure database, not all the required data were available and only 49 out of the original 56 datasets could be analyzed. Nonetheless, the results achieved on AF2 models were very similar to those measured on experimental structures (Fig. 4b) and even slightly superior when comparing performances on the 49 datasets (Supplementary Fig. S12). This important observation suggests that there may be significant merits in systematically combining AF2 modelling information with sequence data when estimating phylogenetic tree reliability.
Discussion
The notion that structural variation could inform phylogenetic inference has long been intriguing and sometimes divisive within the scientific community4,6. On the one hand, fold resilience is expected to provide insights into the deepest branches of the tree of life, with some of the most ancient folds spanning billions of years of evolutionary history32, but on the other hand, this same resilience may lead to high levels of purifying selection and act as a confounding factor, as shown for instance in the reconstruction of ancient viral lineages33. On top of this, the very high selection pressure under which protein folds evolve can induce convergent evolution processes, another major confounding factor in phylogenetic tree reconstruction. At the level of complete protein folds, this issue remains disputed with diverging and converging hypotheses co-existing to account for the emergence of ancient structural motifs, like the αβα sandwich5. Yet, locally convergent evolutionary processes are more widely accepted and documented, especially in regions evolving under strong purifying selection like enzymatic active sites34. Despite its confounding nature, convergent evolution is not specific to structures and can equally impair sequence-based phylogeny reconstruction35.
While convergent evolution may occur both on structures and sequences, the issue of conformational changes is specific to structures and represents yet another challenge for structure-based phylogeny. For instance, protein kinase domains can be found in a closed or open form9, a variation that may dominate a naive structural classification if left uncorrected. In this context, one can intuitively understand why IMD methods based on the comparison of internal distances should be less affected by conformational variations than RMSD approaches that are based on global fold comparisons. Indeed, whenever they occur, conformational variations will only affect a subset of the internal distances but will lead to spurious averaged rigid superpositions affecting all the distances measured by an RMSD. This hypothesis was thoroughly tested in the original validation of the lDDT metric where the authors used extensive simulations to demonstrate the lDDT capacity to recover the domain-based structural similarities even when applied globally on multi-domain proteins in different conformations. In this same study, the authors were able to show the superiority of their IMD-based approach over the Global Distance Test used in previous CASP contests and based on rigid superpositions. Our observation of better performances when using IMD rather than TM-based trees supports these observations. It nonetheless remains that both TM and IMD measures are sensitive to conformational variations and that when using them to reconstruct phylogenies, one should probably favour procedures in which multi-domain proteins are pre-processed into single domains and analyzed accordingly. This pre-processing is common practice when computing multiple sequence alignments.
The suitability of structure-based metrics to provide trustworthy evolutionary distances is a key aspect of this study. Our results unambiguously establish the superiority of these metrics over a simple Hamming distance like pdist and they also show the superiority of corrected distances like LG + G25. Structural evolution is therefore not totally immune to saturation and would probably benefit from a correction of some sort. In sequences, multiple hit corrections are explicitly meant to account for non-visible mutations including reversions. This phenomenon is unlikely to occur at the structural level whose fluctuations may be more similar to a kind of constrained 3D random walk. Corrections, if they have to be applied, would probably have to model this process.
The results collected in our report do not offer a definite answer on the respective merits of the TM and IMD metrics. In the saturation analysis, the TM behaviour is on average better than IMD while the IMD exhibits a higher level of tree-likeness and generates topologies in better agreement with ML. We based most of this study on the IMD because it provided us with a more convenient analytic framework than TM. For instance, the IMD offers a better continuity between sequence and structural analysis by allowing the effect of multiple aligners to be measured. Yet, in all fairness, we cannot rule out that similar results could have been achieved if the bootstrap had been implemented using a TM-based procedure, for instance by sampling groups of homologous α-carbons rather than pairs of sites.
We show that phylogenetic trees based on the comparison of internal distances display many of the desirable properties of high-accuracy phylogenetic trees: they explain most of the variance (97% on average), and their highly supported branches are in strong agreement with the ML trees used as the standard of truth in this study. The prospect of being able to carry out structure-based phylogenetic analysis in a way comparable to sequence phylogeny is especially attractive. Of course, the most obvious motivation would be to use structures for the inference of very deep phylogenies. Unfortunately, our work casts little light on this possibility and any claim in this direction remains hampered by the lack of gold-standard reference phylogenetic trees. This limitation also makes it impossible to objectively compare the relative merits of these methods. We did, however, develop a controlled set-up that allowed us to unambiguously establish that sequences and structures contain similar amounts of evolutionary signal and that these signals can effectively be combined to yield improved branch support estimates.
The reasons why structural and sequence information proves to be complementary to one another when doing bootstrap analysis can only be speculated upon. In sequence-based phylogeny, epistatic interactions are systematically ignored even though they are known to weigh on evolutionary rates36. Structure-based metrics are different in the sense that they effectively reflect variations at the scale of the entire fold and therefore tap into the non-local information spread across multiple sites.
Another intriguing observation distinguishing structure and sequence-based metrics is that the linear regression slopes measured on close homologues exhibit significantly more variation across families than the corresponding slopes estimated on pdist (standard deviation of 14.52 for IMD, 10.65 for TM, as compared with 4.95 for pdist). By some approximation, these slopes may be considered rough molecular clock estimates37. This observation would imply that the tolerance for mutations affecting a fold varies much more across protein families than the tolerance for residue substitutions.
An important motivation to implement the IMD as a distance-based method was pragmatic. Using a distance matrix framework makes it possible to leverage a well-established corpus of distance-based methods such as NJ or FastME and therefore carry out systematic comparisons in a precisely controlled fashion. The choice of a distance framework also responds to scalability issues. Indeed, phylogenetic methods will soon have to respond to queries several orders of magnitude larger than the largest instances currently contemplated. This expectation is putting distance-based methods in the spotlight as reflected by recent studies exploring their potential as phylogeny estimators18,38,39,40.
Recent progress in structural biology makes the timing of our report especially relevant. Thanks to AF230 and linguistic models41 the prospect of collecting trustworthy high-resolution structures for each and every known protein is now realistic. In fact, one of the most unexpected aspects of our results has been the observation of slightly better readouts on the AF2 models than on the original PDBs. While this came as a surprise, this observation may simply reflect the much higher degree of consistency across automatically generated models as compared with PDB structures individually collected over decades by hundreds of groups, a phenomenon recently highlighted in a large-scale analysis42 and consistent with recent observations on the suitability of AF2 models for MSA reconstruction43.
If it holds its promises, this massive amount of structural data will help power another major project: the post-sequencing analysis of the genomes of all existing animal species44. This wealth of data will make it possible to explore very large homologous datasets in which the systematic estimation of gene tree topologies should cast a new light on the emergence of the most basic molecular functions. This scaled-up genomics will require unprecedented amounts of data integration. With models spanning millions of genomes and billions of homologous sequences, we anticipate reliability assessment to become a major bottleneck. Multistrap addresses this precise issue by providing a simple and effective solution.
Methods
Multiple sequence alignment methods computation, trimming, and evaluation
Multiple sequence alignments were computed using T-Coffee (v13.45.60.cd84d2a)45 with default parameters. The structure-based multiple sequence alignments were computed using either mTM-align (v20180725)46 or 3D-Coffee13 using the -method=sap_pair,TMalign_pair option. The trimmed alignments were produced using the automated1 option of TrimAl47. After evaluating the MSAs for their structure-based accuracy using the NiRMSD option of the T-Coffee package (Supplementary Fig. S1) we selected mTM-align as the default structural aligner for the main analysis.
Reference benchmark dataset
A total of 508 datasets featuring homologous protein sequences with experimental X-ray structures48 were assembled using PFAM (v28.0)49. The datasets were constructed so as to feature at least ten sequences and be successfully alignable (i.e. no computational abortion) by the three considered multiple aligners (T-Coffee, 3D-Coffee, mTM-align). The datasets contain between 10 and 133 sequences, with an average of 21 sequences per dataset for a total of 10,258 distinct structural segments. The average pairwise identity, as estimated on the T-Coffee MSAs, is 32%. The full dataset is available on Zenodo (https://zenodo.org/records/13123906).
Sequence-based distance matrix
Pairwise distances were estimated from the mTM-align MSAs using the LG model of FastME with a gamma shape parameter of 1.0 (ref. 25). Gamma values could also have been estimated from ML trees, yet, since the purpose of this analysis was to compare alternative ways of building distance matrices, including some that were not amenable to ML estimation, we decided to use the default value of 1.0. Furthermore, as demonstrated in ref. 50, using a slightly upward-biased gamma estimate (1.0) is often more effective than the true value.
TM-score
Given a pair of sequences A and B, the TM-Score is computed by mTM-align using the formula
\[\mathrm{TM}=\frac{1}{{L}_{S}}\sum_{i=1}^{{L}_{C}}\frac{1}{1+{\left(\frac{{d}_{i}}{{d}_{0}}\right)}^{2}}\]
where \({L}_{S}\) is the sequence length of the smallest structure, \({L}_{C}\) the number of considered ungapped columns in the sequence alignment of A and B, \({d}_{i}\) the structural distance between the two α-carbons of the residues in the ith considered column and \({d}_{0}\) a normalization constant set to 0.5 for sequences shorter than or equal to 21 amino acids or otherwise according to:
\[{d}_{0}=1.24\sqrt[3]{{L}_{S}-15}-1.8\]
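For readers who wish to check values by hand, the minimal Python sketch below evaluates the standard TM-Score expression, TM = (1/L_S) Σ 1/(1 + (d_i/d_0)²), with d_0 = 1.24·(L_S − 15)^(1/3) − 1.8 for L_S > 21 and d_0 = 0.5 otherwise. The function names and the input format (a plain list of α-carbon distances in Ångström, one per considered column) are assumptions made for this illustration; mTM-align computes these quantities internally from its own superposition.

```python
def tm_d0(l_s: int) -> float:
    """Normalisation constant d0 for a structure of l_s residues."""
    if l_s <= 21:
        return 0.5
    return 1.24 * (l_s - 15) ** (1.0 / 3.0) - 1.8


def tm_score(distances, l_s: int) -> float:
    """TM-Score from per-column alpha-carbon distances (hypothetical helper).

    distances: d_i for each considered ungapped column (length L_C).
    l_s: length of the smallest structure (L_S).
    """
    d0 = tm_d0(l_s)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / l_s
```

A perfect superposition over all residues (every d_i equal to 0 and L_C equal to L_S) gives a score of 1, and the score decreases as distances grow or as fewer columns are aligned.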
Intra-molecular distance-based metric
The IMD metric aggregates the differences of intra-molecular distances measured across pairs of homologous sites provided in the form of an MSA (i.e. pair of columns). Given the pairwise projection of two sequences A and B from an MSA and given all the possible pairs of ungapped columns x and y, with \({D}_{A}^{{xy}}\) being the distance between the α-carbons of the residues in column x and y of sequence A in Angström, the IMD metric is defined as follows:
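As a rough illustration of how a metric of this kind aggregates differences of intra-molecular distances, the Python sketch below assumes, purely for the example, that the aggregation is the mean absolute difference |D_A^{xy} − D_B^{xy}| over all pairs of ungapped columns. This aggregation rule, the function name and the input format are assumptions and should not be read as the T-Coffee implementation, which may weight, threshold or normalize the differences differently.

```python
import math
from itertools import combinations


def imd_like_distance(coords_a, coords_b):
    """Mean absolute difference of intra-molecular distances (illustrative only).

    coords_a, coords_b: one entry per MSA column, either an (x, y, z)
    alpha-carbon coordinate in Angstrom or None when the sequence has a gap.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("both sequences must span the same MSA columns")
    # Keep only columns where both sequences have a residue.
    ungapped = [i for i, (a, b) in enumerate(zip(coords_a, coords_b))
                if a is not None and b is not None]
    diffs = []
    for x, y in combinations(ungapped, 2):
        d_a = math.dist(coords_a[x], coords_a[y])  # D_A^{xy}
        d_b = math.dist(coords_b[x], coords_b[y])  # D_B^{xy}
        diffs.append(abs(d_a - d_b))
    if not diffs:
        raise ValueError("at least two shared ungapped columns are required")
    return sum(diffs) / len(diffs)
```

Because only internal distances are compared, the value is unchanged when either structure is rotated or translated, which is why no superposition is needed.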
Phylogenetic tree reconstruction and comparisons
The sequence-based ME trees (ME) were generated using FastME (v2.1.6.4, download instructions: https://github.com/l-mansouri/Phylo-IMD/blob/main/Dockerfiles/fastme_dockerfile) with the following command27:
fastme -i <alignment> -o <output filename> -m BioNJ -p LG -g 1.0 -s -n -z 5 -b 100 -B <replicates filename> -O <output distance matrix filename>
The structure-based trees (TM and IMD) were generated using FastME with the TM- or IMD-based distance matrix:
fastme -i <distance matrix> -s -n -z 5
The sequence-based maximum likelihood trees (ML) were estimated using IQ-TREE (v. 1.6.9)26:
iqtree -s <alignment> -b 100
The -m parameter is not explicitly used, therefore the best-fit model is automatically selected by ModelFinder, as specified in the corresponding IQ-TREE documentation.
Saturation analysis
For each dataset, pairwise distances were calculated using either IMD, TM or pdist, the Hamming distance measuring the fraction of mismatches among ungapped columns. The pairwise ML patristic distances were extracted from the ML trees. We measured the median ML patristic distance on the 508 datasets and used the resulting value (3.09) as a threshold to separate data points corresponding to close and remote homologue pairs in each dataset. In order to estimate the level of saturation affecting the measurements carried out on remote homologues, we compared the linear regression slopes and the corresponding R² values estimated on the close homologues with those measured on the entire dataset. The R² and the slope analyses were carried out on the IMD, TM and ME metrics normalized by their median value as estimated on the 508 datasets (i.e. IMD values normalized by the IMD global median, TM with the TM global median, and pdist with the pdist global median). Since this comparison requires sufficient power on each side of the threshold, we only considered datasets featuring at least 20 points on each side of the close homologues threshold. This selection left us with 320 datasets containing 89% of the total number of data points.
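The comparison of regression slopes and R² values can be sketched as follows. This is a minimal Python illustration, not the code used in the study: it assumes that the metric is regressed against the ML patristic distance, that pairs at or below the threshold count as close homologues, and that ordinary least squares is used. The function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit; returns (slope, intercept, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    return slope, intercept, r_squared


def saturation_readout(ml_patristic, metric, threshold, min_points=20):
    """Compare the fit on close homologues with the fit on all pairs.

    Returns None when either side of the threshold has fewer than
    min_points pairs, mirroring the dataset selection described above.
    """
    close = [(x, y) for x, y in zip(ml_patristic, metric) if x <= threshold]
    n_remote = len(ml_patristic) - len(close)
    if len(close) < min_points or n_remote < min_points:
        return None
    close_x, close_y = zip(*close)
    slope_close, _, r2_close = fit_line(close_x, close_y)
    slope_all, _, r2_all = fit_line(ml_patristic, metric)
    return {"slope_close": slope_close, "r2_close": r2_close,
            "slope_all": slope_all, "r2_all": r2_all}
```

Under these assumptions, a metric that saturates on remote homologues shows a lower slope and a lower R² on the entire dataset than on the close homologues alone.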
Tree comparisons
Topological comparisons between pairs of trees were carried out using the phangorn package51 implementation of the Robinson–Foulds (RF) distance measure28:
RF.dist(<tree1>, <tree2>, normalize=TRUE)
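To make the quantity being compared concrete, the following Python sketch computes a normalized Robinson–Foulds distance from the non-trivial bipartitions of two unrooted trees. It is an illustration only and does not reproduce the phangorn code: the nested-tuple tree representation, the function names and the normalization by the total number of non-trivial splits in both trees are assumptions made for the example.

```python
def leaves(tree):
    """Set of leaf names under a node (a leaf is a string, a clade a tuple)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def splits(tree):
    """Non-trivial bipartitions of a tree, each stored as the side that
    does not contain an arbitrary anchor taxon."""
    all_taxa = leaves(tree)
    anchor = min(all_taxa)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        clade = leaves(node)
        side = clade if anchor not in clade else all_taxa - clade
        # Keep splits with at least two taxa on each side.
        if 1 < len(side) < len(all_taxa) - 1:
            found.add(side)
        for child in node:
            visit(child)

    visit(tree)
    return found


def normalized_rf(tree1, tree2):
    """Splits found in only one tree, divided by the total number of splits."""
    s1, s2 = splits(tree1), splits(tree2)
    total = len(s1) + len(s2)
    return len(s1 ^ s2) / total if total else 0.0
```

With this normalization, identical topologies score 0 and two fully resolved trees sharing no non-trivial split score 1.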
IMD bootstrap replicates
The IMD metric is implemented in the T-Coffee package45. The following command line generates a distance matrix useable by FastME:
export THREED_TREE_MODE=10
t_coffee -other_pg seq_reformat -in <alignment file> -in2 <t_coffee template file> -action +replicates <number of bootstrap replicates, 100 in our case> +phylo3d +print_replicates -output dm
Here, the template file declares the PDB files associated with each sequence as follows:
> <name of the sequence in the alignment file> _P_ <PDB structure file>
This procedure supports the generation of both the complete distance matrix (first output matrix) and a total of 100 bootstrap replicates. For each replicate, a set of N columns is randomly drawn with replacement, thus generating a new set of N columns with potential duplicates. Given this set, all possible pairs (i.e. top half of the N vs N matrix excluding the main diagonal) are collected, thus generating a maximum of (N*(N-1))/2 pairs (fewer in practice, because pairs made of two copies of the same column are discarded). For instance, if columns 1-1-3-5 have been drawn, the considered pairs will be 1-1 (discarded), 1-3, 1-5, 1-3, 1-5, 3-5 (detailed algorithm in Supplementary Fig. S14). This two-step procedure makes it possible to avoid systematically sampling all possible sites, as would almost systematically happen if the N² pairs were directly sampled with replacement from the original MSA. The replicates can be fed to any distance-based tree method, like FastME, so as to generate Felsenstein-like bootstrap branch support values. Replicate trees were generated with the command:
fastme -i <distance matrix> -g 1.0 -s -n -z 5
The replicate trees were then concatenated and the support values for each branch were calculated using the function prop.clades() in the phangorn package51.
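The two-step column resampling described above can be sketched in a few lines of Python. The function names are hypothetical and the sketch only follows the worked example given in the text (columns 1-1-3-5); the procedure implemented in T-Coffee is the one detailed in Supplementary Fig. S14.

```python
import random
from itertools import combinations


def column_pairs(drawn_columns):
    """All pairs from the drawn columns (upper triangle, no diagonal),
    discarding pairs made of two copies of the same column."""
    return [(x, y) for x, y in combinations(drawn_columns, 2) if x != y]


def bootstrap_column_pairs(n_columns, rng):
    """Step 1: draw n_columns column indices (1-based) with replacement.
    Step 2: enumerate the pairs of drawn columns."""
    drawn = sorted(rng.choices(range(1, n_columns + 1), k=n_columns))
    return column_pairs(drawn)


# Worked example from the text: columns 1-1-3-5 have been drawn.
example = column_pairs([1, 1, 3, 5])
# example == [(1, 3), (1, 5), (1, 3), (1, 5), (3, 5)]
```

In this sketch a column drawn twice contributes twice to every pair it forms with other columns, as in the example, while the pair it forms with itself carries no distance information and is dropped.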
Titration
The full benchmark dataset was first filtered to retain datasets featuring at least 200 columns, each containing a gap in less than 5% of the sequences. This filtering returned 56 datasets, which were used to generate replicates. In each dataset, 200 columns were randomly drawn without replacement, and the resulting MSA was used to estimate ML, ME, and IMD trees. The branches common to all three trees were then collected and used as a reference. The procedure was repeated ten times, thus generating a total of 560 datasets containing 3371 reference branches in total (40.3% of the 8361 branches in the entire dataset). The titration was done by gradual downsampling, five columns at a time. At each downsampling step, the trees were estimated (ML, ME and IMD) and evaluated for the fraction of reference branches they contained. The quantile titration was carried out by sorting the original MSAs by average identity and dividing them into 5 quantiles of 11 to 12 datasets. The main data generated above was then separated according to these quantile bins, and titration was carried out on each quantile.
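The downsampling schedule can be pictured with the short Python sketch below. It is a hedged illustration rather than the pipeline code: it assumes that each step removes five randomly chosen columns from the previous subset (nested subsets) and that the titration runs down to five columns, neither of which is stated in the text. Tree estimation is left out, and the function names are hypothetical.

```python
import random


def titration_schedule(n_total_columns, start=200, step=5, rng=None):
    """Column subsets of decreasing size: start, start - step, ..., step."""
    rng = rng if rng is not None else random.Random()
    # Draw the starting columns without replacement, in random order.
    columns = rng.sample(range(n_total_columns), start)
    schedule = []
    while len(columns) >= step:
        schedule.append(sorted(columns))
        columns = columns[:-step]  # drop `step` randomly ordered columns
    return schedule


def fraction_recovered(reference_splits, tree_splits):
    """Fraction of the reference branches present in an estimated tree."""
    return len(reference_splits & tree_splits) / len(reference_splits)
```

At each entry of the schedule, the trees would be re-estimated on the retained columns and scored with `fraction_recovered` against the reference branches.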
Titration on AlphaFold2 (AF2) predicted structures
Protein structures were collected for the 56 datasets of the titration dataset from the AlphaFold Protein Structure Database v4 with the 'get_structures' pipeline, accessible at https://github.com/luisas/get_structures/tree/phyloimd with the command
nextflow run main.nf -profile phylo3d
The database search was set-up with foldseek version c7e4a37856b49438eaee03bbfcdf1588cbce0695 with the command
'foldseek databases afdb Alphafold/UniProt afdb tmp'
Then, for every collected sequence, mmseqs (v14.7e284) was used to search for structures that meet two criteria: full coverage of the query protein sequence and a minimum amino acid sequence identity of 99%. In total, we managed to collect all the AF2 structures in 49 out of the 56 initial datasets.
Multistrap ROC curve and MCC analysis
An ML tree was estimated on one non-downsampled TMalign MSA (200 columns) for each of the 56 datasets used in the titration. Three sets of 100 bootstrap replicate trees were then computed on these same MSAs using IMD, ME and ML. These bootstrap collections were used to estimate three support values for each ML tree branch (i.e. IMD, ME, and ML), and the combination of these values was used to define seven distinct sets of reference branches. These sets were estimated by combining the bootstrap support values in all possible ways and by keeping branches having support values greater than or equal to 80% (i.e. IMD80, ME80, ML80, ME80 ∩ IMD80, ML80 ∩ IMD80, ME80 ∩ ML80, ME80 ∩ ML80 ∩ IMD80). These sets were used as references for the benchmarking (Fig. 5B). Subsequently, three new bootstrap collections (IMD, ME, and ML) were estimated by using the downsampled TMalign MSAs (25 columns) to re-estimate the branch support bootstrap values of every branch of the initial ML trees (i.e. the bootstrap replicates estimated on 25 columns were used to dress up the trees estimated on 200 columns). A total of seven sets of multistrap branch support values were then produced by combining the support value collections in all possible ways (IMD, ME, ML, ME + IMD, ML + IMD, ME + ML, ME + ML + IMD). In each set, the multistrap values were defined as the arithmetic mean of the combined values. A ROC AUC analysis (R pROC library) was then carried out on each possible reference branch set/multistrap set (49 pairs in total) to quantify the capacity of the multistrap support values to discriminate between reference and non-reference branches. The AUCs were computed individually on every dataset and eventually averaged to yield the tabulated readouts. An optimal MCC analysis was carried out on the same data and used to identify the multistrap bootstrap support value thresholds yielding the highest MCCs on each dataset of each possible branch support/multistrap combination.
The corresponding values were collected along with their associated thresholds, sensitivities (TP/(TP + FN)), and specificities (TN/(TN + FP)). The entire analysis was also done using ME trees instead of ML trees in the first step. Three other protocols were tested for the multistrap combination: the geometric mean, the minimum, and the maximum of the combined values.
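The combination of support values and the search for the MCC-optimal threshold can be sketched in Python as follows. This is an illustrative reimplementation under stated assumptions, not the R code used in the study: the function names are hypothetical, a branch is called positive when its support is greater than or equal to the threshold, and candidate thresholds are taken from the observed support values.

```python
import math


def combine_support(values, how="mean"):
    """Combine the bootstrap supports of one branch (e.g. ME and IMD)."""
    if how == "mean":
        return sum(values) / len(values)
    if how == "geometric":
        return math.prod(values) ** (1.0 / len(values))
    if how == "min":
        return min(values)
    if how == "max":
        return max(values)
    raise ValueError(f"unknown combination rule: {how}")


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 when it is undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def best_mcc_threshold(supports, is_reference):
    """Scan observed support values and keep the threshold maximizing MCC."""
    best = None
    for threshold in sorted(set(supports)):
        tp = fp = tn = fn = 0
        for support, reference in zip(supports, is_reference):
            positive = support >= threshold
            if positive and reference:
                tp += 1
            elif positive:
                fp += 1
            elif reference:
                fn += 1
            else:
                tn += 1
        score = mcc(tp, tn, fp, fn)
        if best is None or score > best["mcc"]:
            best = {
                "threshold": threshold,
                "mcc": score,
                "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
                "specificity": tn / (tn + fp) if tn + fp else 0.0,
            }
    return best
```

For a branch with an ME support of 80 and an IMD support of 60, `combine_support([80, 60])` returns 70.0, the arithmetic mean used for the main multistrap analysis.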
Reproducing the multistrap analysis
The multistrap analysis presented here can be automatically deployed using the multistrap profile of the Phylo-IMD pipeline (https://github.com/l-mansouri/Phylo-IMD) with the command line:
`nextflow run main.nf -profile multistrap -fasta <id.fasta> -templates <id.template> -pdbs <id.seq1.pdb, id.seq2.pdb..>`
The whole procedure is extensively documented along with input files and example input datasets on: https://github.com/l-mansouri/Phylo-IMD/blob/main/README.md. The multistrap procedure is also available as a standalone executable within the T-Coffee package: https://www.tcoffee.org.
Computation
Computation was carried out on a cluster running Scientific Linux release 7.2. The alignments and the phylogenetic reconstructions were run within separate containers based on the Debian GNU/Linux operating system.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
The data used in this study are available in the Zenodo database under accession code 13123906. Source data are provided with this paper.
Code availability
The pipeline was implemented in Nextflow (Di Tommaso et al., 2017) and was run using Singularity containers. The code is available at https://github.com/l-mansouri/Phylo-IMD (ref. 24).
References
Chothia, C. & Lesk, A. M. The relation between the divergence of sequence and structure in proteins. EMBO J. 5, 823–826 (1986).
Rost, B. Twilight zone of protein sequence alignments. Protein Eng. Des. Sel. 12, 85–94 (1999).
Illergård, K., Ardell, D. H. & Elofsson, A. Structure is three to ten times more conserved than sequence—a study of structural response in protein cores. Proteins Struct. Funct. Bioinforma. 77, 499–508 (2009).
Johnson, M. S., Sutcliffe, M. J. & Blundell, T. L. Molecular anatomy: phyletic relationships derived from three-dimensional structures of proteins. J. Mol. Evol. 30, 43–59 (1990).
Longo, L. M., Petrović, D., Kamerlin, S. C. L. & Tawfik, D. S. Short and simple sequences favored the emergence of N-helix phospho-ligand binding sites in the first enzymes. Proc. Natl Acad. Sci. 117, 5310–5318 (2020).
Malik, A. J., Poole, A. M. & Allison, J. R. Structural phylogenetics with confidence. Mol. Biol. Evol. 37, 2711–2726 (2020).
Johnson, M. S., Šali, A. & Blundell, T. L. Phylogenetic relationships from three-dimensional protein structures. in Methods in Enzymology vol. 183 670–690 (Elsevier, 1990).
Zhang, Y. & Skolnick, J. Scoring function for automated assessment of protein structure template quality. Proteins Struct. Funct. Bioinforma. 57, 702–710 (2004).
Rabiller, M. et al. Proteus in the world of proteins: conformational changes in protein kinases. Arch. Pharm. 343, 193–206 (2010).
Levitt, M. A simplified representation of protein conformations for rapid simulation of protein folding. J. Mol. Biol. 104, 59–107 (1976).
Holm, L. & Sander, C. Protein structure comparison by alignment of distance matrices. J. Mol. Biol. 233, 123–138 (1993).
Holm, L., Kääriäinen, S., Wilton, C. & Plewczynski, D. Using dali for structural comparison of proteins. Curr. Protoc. Bioinforma. Chapter 5, Unit 5.5 (2006).
O’Sullivan, O., Suhre, K., Abergel, C., Higgins, D. G. & Notredame, C. 3DCoffee: combining protein sequences and structures within multiple sequence alignments. J. Mol. Biol. 340, 385–395 (2004).
Armougom, F., Moretti, S., Keduas, V. & Notredame, C. The iRMSD: a local measure of sequence alignment accuracy using structural information. Bioinformatics 22, e35–e39 (2006).
Mariani, V., Biasini, M., Barbato, A. & Schwede, T. lDDT: a local superposition-free score for comparing protein structures and models using distance difference tests. Bioinformatics 29, 2722–2728 (2013).
Saitou, N. & Nei, M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4, 406–425 (1987).
Rzhetsky, A. & Nei, M. Theoretical foundation of the minimum-evolution method of phylogenetic inference. Mol. Biol. Evol. 10, 1073–1095 (1993).
Braun, E. L. et al. Testing the mettle of METAL: a comparison of phylogenomic methods using a challenging but well-resolved phylogeny. Preprint at https://doi.org/10.1101/2024.02.28.582627 (2024).
Felsenstein, J. Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39, 783–791 (1985).
Felsenstein, J. Inferring Phylogenies. (Sinauer Associates, Sunderland, Mass, 2003).
Guindon, S. & Gascuel, O. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst. Biol. 52, 696–704 (2003).
Alfaro, M. E. Bayes or bootstrap? A simulation study comparing the performance of Bayesian Markov Chain Monte Carlo sampling and bootstrapping in assessing phylogenetic confidence. Mol. Biol. Evol. 20, 255–266 (2003).
Yang, Z. Molecular Evolution: A Statistical Approach. (Oxford University Press, Oxford, United Kingdom; New York, NY, United States of America, 2014).
Baltzis, A. et al. multistrap: boosting phylogenetic analyses with structural information, https://github.com/l-mansouri/Phylo-IMD, https://doi.org/10.5281/zenodo.14035502 (2024).
Le, S. Q. & Gascuel, O. An improved general amino acid replacement matrix. Mol. Biol. Evol. 25, 1307–1320 (2008).
Nguyen, N. D., Mirarab, S., Kumar, K. & Warnow, T. Ultra-large alignments using phylogeny-aware profiles. Genome Biol. 16, 124 (2015).
Lefort, V., Desper, R. & Gascuel, O. FastME 2.0: a comprehensive, accurate, and fast distance-based phylogeny inference. Mol. Biol. Evol. 32, 2798–2800 (2015).
Robinson, D. F. & Foulds, L. R. Comparison of phylogenetic trees. Math. Biosci. 53, 131–147 (1981).
Sievers, F., Dineen, D., Wilm, A. & Higgins, D. G. Making automated multiple alignments of very large numbers of protein sequences. Bioinformatics 29, 989–995 (2013).
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
Varadi, M. et al. AlphaFold protein structure database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res. 50, D439–D444 (2022).
Gruic-Sovulj, I., Longo, L. M., Jabłońska, J. & Tawfik, D. S. The evolutionary history of the HUP domain. Crit. Rev. Biochem. Mol. Biol. 57, 1–15 (2022).
Wertheim, J. O. & Kosakovsky Pond, S. L. Purifying selection can obscure the ancient age of viral lineages. Mol. Biol. Evol. 28, 3355–3365 (2011).
Holliday, G. L. et al. MACiE: a database of enzyme reaction mechanisms. Bioinformatics 21, 4315–4316 (2005).
Castoe, T. A. et al. Evidence for an ancient adaptive episode of convergent molecular evolution. Proc. Natl Acad. Sci. 106, 8986–8991 (2009).
Breen, M. S., Kemena, C., Vlasov, P. K., Notredame, C. & Kondrashov, F. A. Epistasis as the primary factor in molecular evolution. Nature 490, 535–538 (2012).
Zuckerkandl, E. & Pauling, L. Molecules as documents of evolutionary history. J. Theor. Biol. 8, 357–366 (1965).
Allman, E. S., Long, C. & Rhodes, J. A. Species tree inference from genomic sequences using the Log-Det distance. SIAM J. Appl. Algebra Geom. 3, 107–127 (2019).
Dasarathy, G., Nowak, R. & Roch, S. Data requirement for phylogenetic inference from multiple loci: a new distance method. IEEE/ACM Trans. Comput. Biol. Bioinform. 12, 422–432 (2015).
Braun, E. L. Phylogenomics using Compression Distances: Incorporating Rate Heterogeneity and Amino Acid Properties. in Proceedings of the 14th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics 1–6 (ACM, Houston TX USA, 2023). https://doi.org/10.1145/3584371.3612996.
Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130 (2023).
Sánchez Rodríguez, F., Simpkin, A. J., Chojnowski, G., Keegan, R. M. & Rigden, D. J. Using deep-learning predictions reveals a large number of register errors in PDB depositions. IUCrJ 11, 938–950 (2024).
Baltzis, A. et al. Highly significant improvement of protein sequence alignments with AlphaFold2. Bioinformatics 38, 5007–5011 (2022).
Lewin, H. A. et al. The Earth BioGenome Project 2020: starting the clock. Proc. Natl Acad. Sci. 119, e2115635118 (2022).
Notredame, C., Higgins, D. G. & Heringa, J. T-coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. 302, 205–217 (2000).
Dong, R., Peng, Z., Zhang, Y. & Yang, J. mTM-align: an algorithm for fast and accurate multiple protein structure alignment. Bioinformatics 34, 1719–1725 (2018).
Capella-Gutiérrez, S., Silla-Martínez, J. M. & Gabaldón, T. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics 25, 1972–1973 (2009).
Burley, S. K. et al. Protein Data Bank (PDB): The Single Global Macromolecular Structure Archive. in Protein Crystallography (eds. Wlodawer, A., Dauter, Z. & Jaskolski, M.) vol. 1607 627–641 (Springer New York, New York, NY, 2017).
Finn, R. D. et al. The Pfam protein families database: towards a more sustainable future. Nucleic Acids Res. 44, D279–D285 (2016).
Guindon, S. & Gascuel, O. Efficient biased estimation of evolutionary distances when substitution rates vary across sites. Mol. Biol. Evol. 19, 534–543 (2002).
Schliep, K. P. phangorn: phylogenetic analysis in R. Bioinformatics 27, 592–593 (2011).
Acknowledgements
The research leading to these results has received funding from the Spanish Ministry of Science and Innovation (PRE2018-085039 funds for predoctoral contract funded by MICIU/AEI/10.13039/501100011033 and by the FSE invest in your future (L.M.), PRE2021-097947 funds for predoctoral contract funded by MICIU/AEI/10.13039/501100011033 and by FSE+ (L.S.)). We acknowledge support of the Spanish Ministry of Science and Innovation through the Centro de Excelencia Severo Ochoa (CEX2020-001049-S, MCIN/AEI /10.13039/501100011033), and the Generalitat de Catalunya through the CERCA programme (C.N.). O.G. is supported by the Paris Artificial Intelligence Research Institute (PRAIRIE, ANR-19-P3IA-0001). We are grateful to the CRG Core Technologies Programme for their support and assistance in this work. The authors are also very grateful to Edward L. Braun who extensively reviewed the manuscript and proposed very useful edits and suggestions, especially with respect to the saturation analysis.
Author information
Contributions
L.M., A.B., L.S., C.N. and O.G. designed the analysis, A.B., L.M. and L.S. carried out the validation, C.M. and D.M.V contributed to the validation design. All the authors designed the validation procedure and L.M., A.B., C.N., L.S., B.L. and O.G. wrote the manuscript. All authors have read and approved the manuscript for publication.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks Edward Braun and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Baltzis, A., Santus, L., Langer, B.E. et al. multistrap: boosting phylogenetic analyses with structural information. Nat Commun 16, 293 (2025). https://doi.org/10.1038/s41467-024-55264-0
DOI: https://doi.org/10.1038/s41467-024-55264-0