Ancestral genomes reconstruction: An integrated, multi-disciplinary approach is needed
A major tenet of Darwin’s theory of evolution, which will soon celebrate its 150th anniversary, is that all extant species share common ancestors, which are more or less distant in time. Over the last half century the ascent of genetics has given us many new tools to investigate the evolution of species. Advances in molecular cytogenetics, sequencing, and bioinformatics now allow hypotheses about the origin of the human genome. Molecular cytogenetics provided the first reconstructions of ancestral genomes (Wienberg and Stanyon 1995, 1997; Chowdhary et al. 1998). Chromosome flow sorting followed by DOP-PCR led to reciprocal and multi-directional chromosome painting between all extant placental mammal orders. The results permitted hypotheses about the architecture and content of the ancestral placental mammalian karyotype, which have proved to be amazingly heuristic (Chowdhary et al. 1998; Glas et al. 1999; Froenicke et al. 2003; Murphy et al. 2003; Richard et al. 2003; Yang et al. 2003; Svartman et al. 2004). Bioinformatics provided an alternative approach to reconstructing ancestral genomes (Bourque and Pevzner 2002). In a recent paper in Science, Murphy et al. (2005) made systematic and comprehensive use of Bourque and Pevzner’s algorithm to reconstruct ancestral genomes, essentially from Radiation Hybrid (RH) map data of seven species. While confirming many of the conclusions from molecular cytogenetics about the Boreoeutherian ancestral genome, they proposed five additional syntenic associations that had apparently gone undetected by chromosome painting. These contrasting results led to the March 2006 Forum in this journal, which highlighted the differences between these two approaches (Bourque et al. 2006; Froenicke et al. 2006).
Now, in this issue of Genome Research, Ma et al. (2006) present a new bioinformatic algorithm for sequence analysis and reconstruct the Contiguous Ancestral Regions (CARs) of the Boreoeutherian ancestor at a 50-kb resolution. Their research takes full advantage of the “Comparative Genomics” tracks present in the UCSC Genome Database (http://genome.ucsc.edu) and exploits the complete sequence assemblies of human, dog, rat, and mouse, using opossum and chicken as outgroups. The Ma et al. (2006) algorithm provides an exceptionally valuable tool, especially in view of the increase in the number of sequenced mammalian genomes that will become available in the near future (http://www.genome.gov).
It is important to note that the reconstruction of ancestral genomes is not a mere jigsaw puzzle. Studies of phenomena affecting genome architecture such as chromosomal rearrangements, breakpoints, segmental duplications, and repositioning of centromeres are of crucial importance not only toward a full understanding of the forces that shaped our genome, but also in elucidating a growing number of pathologies directly or indirectly linked to features of genome architecture (Giglio et al. 2001; Armengol et al. 2003; Ventura et al. 2003; Bailey et al. 2004a; Lupski and Stankiewicz 2005; Murphy et al. 2005; Bailey and Eichler 2006; Kato et al. 2006). In this context, the Ma et al. (2006) work represents an important methodological improvement toward the reconstruction of the ancestral chromosomal architecture of Boreoeutherian mammals and understanding the origin of the human genome.
From the Forum discussion it was clear that cytogenetics and bioinformatics, unfortunately, do not communicate well with each other. It is our contention that the point is not which approach is better, but how a closer collaboration could be highly productive. Indeed, the Ma et al. (2006) results have removed many of the apparent differences between the two approaches and show a significant convergence with chromosome painting for the ancestral syntenies present in the Boreoeutherian ancestor. The advantage of the Ma et al. (2006) approach is that it presents a view of intrachromosomal genome architecture that chromosome painting lacks. It has a higher resolution for ancestral genome reconstruction and mitigates the potential inaccuracy of RH maps for close markers (Matise et al. 2002). However, it should be noted that three of the four species used by Ma et al. (2006), mouse, rat, and dog, are known to have the most rapidly evolving genomes of all mammals. These species are not optimal for phylogenomic reconstruction because high evolutionary rates make convergence more likely. High evolutionary rates often compound problems of discerning the ancestral states, because it is more difficult to distinguish homology from homoplasy. There are still only a small number of complete genome assemblies available, and, unfortunately, for the immediate future, the number of fully sequenced, phylogenetically appropriate genomes will continue to be limited. Another shortcoming of the Ma et al. (2006) method is that it assumes that the phylogeny is known. However, recent publications indicate that placental mammal phylogeny is still an open question.
Inconsistencies in results between cytogenetics and bioinformatics point out opportunities to reciprocally test hypotheses and improve ancestral genome reconstructions. For example, the ancestral associations predicted by Murphy et al. (2005) (1/22, 5/19, 2/18, 1/10, 20/2), not found in the painting reconstructions, are now rejected by Ma et al. (2006). However, Ma et al. (2006) did not recover the ancestral association 7/16 supported both by molecular cytogenetics and Murphy et al. (2005). Instead, their results show that the synteny of human 16 was conserved in the ancestor. This conclusion is not supported by any other publication. The most likely hypothesis is that 16p and 16q were still separate in the primate ancestor and only fused in the last common ancestor of Old World monkeys, apes, and humans (Misceo et al. 2003). Indeed, their first weakly supported join on CAR17 probably corresponds to the split on human 16. Ma et al. place gaps between CARs for various syntenies they considered likely, but which they could not confirm (CAR 3, 25, and 27 for the 4/8p association; CAR14 and CAR28 for chromosome 9; CAR 17, 24, and 26 for the 16/19 association). Further, bioinformatic reconstructions consistently hypothesize an ancestral synteny for chromosome 10, while chromosome painting strongly supports two independent chromosomes. These are some of the more outstanding problems and differences that should become the subject of future research.
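The notion of a syntenic association shared among species can be made concrete with a minimal sketch. It assumes that each karyotype is summarized as lists of human-chromosome segments in the order in which they are painted along each chromosome; the species names and segment orders below are hypothetical toy values, not published painting results, and the function names are ours.

```python
from collections import Counter

def associations(karyotype):
    """Return unordered pairs of distinct human chromosomes found adjacent on any chromosome."""
    pairs = set()
    for segments in karyotype:
        for a, b in zip(segments, segments[1:]):
            if a != b:
                pairs.add(frozenset((a, b)))
    return pairs

def shared_associations(karyotypes, min_species=2):
    """Keep associations observed in at least min_species species."""
    counts = Counter()
    for karyotype in karyotypes.values():
        counts.update(associations(karyotype))
    return {pair for pair, n in counts.items() if n >= min_species}

# Hypothetical toy data: each inner list is one chromosome of the species,
# written as the human chromosomes that paint it, in order.
toy = {
    "species_A": [["7", "16"], ["12", "22", "12"], ["3", "21"]],
    "species_B": [["16", "7"], ["3", "21"], ["4", "8"]],
    "species_C": [["7", "16", "19"], ["14", "15"]],
}
result = shared_associations(toy)
```

Counting shared associations is only a first filter. Deciding whether an association is ancestral or derived still requires appropriate outgroups and a phylogeny, which is exactly where the two approaches can test each other.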
Another limitation is the fidelity of sequence assembly as a consequence of the “shotgun” sequencing methodology. These problems are particularly evident around centromeric and pericentromeric regions, which are often sequencing black holes. Quite revealing is that Ma et al. (2006) report that the estimated number of chromosome breaks is only a little higher in mouse and dog than in the human lineage and that the number of breakpoints in the rat is more than seven times that of the mouse. They suspect that many predicted intrachromosomal breaks in rat are assembly artifacts. Such a suspicion needs to be thoroughly and independently tested.
The problem of breakpoint reuse also remains sticky. Murphy et al. (2005) found a high level of breakpoint reuse even uniting evolutionary breakpoints with cancer breakpoints, an important conclusion not supported by Ma et al. (2006). Bioinformatic simulations of Ma et al. (2006) suggest about equal frequency of breakpoints for any position in the genome. This result may be dependent on using the highly rearranged dog, mouse, and rat genome assemblies. It is not known if these conclusions would hold for conserved genomes. Here it is interesting to note that, in the analysis of the more detailed human genome sequence, these authors found 41.7% of human-specific breakpoints in segmental duplications.
Cytogenetics can help fill these gaps by rapidly accessing phylogenetically abundant data and testing bioinformatic reconstructions. For instance, appropriate co-hybridization FISH experiments of cloned DNA, essentially BACs (BAC-FISH), can independently test CAR orientation, adjacencies, and chromosomal breakpoints suggested by bioinformatics. This approach brings resolution and marker order definition (lacking when painting libraries are used) to the cytogenetic approach and extends the resolution, typical of the bioinformatic methodology, to a large number of species for which no RH or sequence data are available.
The number of BAC libraries from a good phylogenetic array of species has grown in recent years, thanks mostly to the extensive effort of P. de Jong’s laboratory (http://bacpac.chori.org/libraries.php). Further, BAC clones from one species can be efficiently used in other related species. Human BACs usually yield good FISH signals not only in Hominoidea, but also in Old World Monkeys (OWM) and in New World Monkeys (NWM), providing coverage over a phylogenetic interval in excess of 40 million years. As an additional example, bovine BAC clones have been used, in our laboratory, with good results on horse, pig, and whale. It is not necessary to have BAC libraries for all species, and one or two index species for each mammalian order will probably prove sufficient. Consequently, the already available libraries can cover most of the extant mammalian species. An alternative strategy might be to use bioinformatics to select human BACs with highly conserved content to be used with success across most mammalian orders.
The Ma et al. (2006) methodology mainly relies on fully sequenced genomes. Unfortunately, as stated, the already available, fully sequenced genomes are not very suitable for ancestral genome reconstruction. However, there is common ground where sequencing and cytogenetics can meet to generate high quality Synteny Block (SB) analyses: the end sequencing of appropriate BAC libraries. A BAC can be allocated to a SB by FISH, or, much more precisely, by placing its End Sequences (BES) on the human sequence, which is usually used as a reference. This method establishes an extremely precise connection between cytogenetic and bioinformatic data sets. The end sequencing of an entire library is relatively expensive, but it is mandatory when sequence contigs produced by the shotgun method are assembled in scaffolds. Indeed, an increasing number of end-sequenced BAC libraries are becoming available as genome projects expand (Everts-van der Wind et al. 2005; Leeb et al. 2006). We report, in the Supplemental material (available online at www.genome.org), two examples that illustrate the use of the BAC-FISH in this context. The first example shows the fine characterization of a chromosomal breakpoint in Nomascus leucogenys gibbon. The second one reports the SB reconstruction of macaque chromosome 6.
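The allocation of a BAC to a synteny block through its end sequences can be sketched as follows. This is a minimal illustration under stated assumptions: the block coordinates, block names, and BES positions are hypothetical, the blocks on each human chromosome are taken to be sorted and non-overlapping, and the alignment of each BES to the human sequence is assumed to have been done beforehand.

```python
from bisect import bisect_right

def assign_block(blocks, pos):
    """Return the id of the block containing pos, or None.

    blocks: list of (start, end, block_id), sorted by start and non-overlapping.
    """
    starts = [b[0] for b in blocks]
    i = bisect_right(starts, pos) - 1
    if i >= 0 and blocks[i][0] <= pos <= blocks[i][1]:
        return blocks[i][2]
    return None

def classify_bac(blocks_by_chrom, end_a, end_b):
    """Classify a BAC from the human positions of its two end sequences.

    end_a, end_b: (human_chromosome, position) of each BES.
    """
    block_a = assign_block(blocks_by_chrom.get(end_a[0], []), end_a[1])
    block_b = assign_block(blocks_by_chrom.get(end_b[0], []), end_b[1])
    if block_a is None or block_b is None:
        return ("unplaced", block_a, block_b)
    if block_a == block_b:
        return ("within_block", block_a, block_b)
    # Ends in different blocks: candidate to span a block boundary.
    return ("spans_boundary", block_a, block_b)

# Hypothetical synteny blocks on the human reference.
blocks = {
    "chr6": [(1, 1_000_000, "SB1"), (1_000_001, 5_000_000, "SB2")],
    "chr7": [(1, 2_000_000, "SB3")],
}
inside = classify_bac(blocks, ("chr6", 500_000), ("chr6", 650_000))
spanning = classify_bac(blocks, ("chr6", 950_000), ("chr7", 100_000))
unplaced = classify_bac(blocks, ("chr6", 6_000_000), ("chr7", 100_000))
```

A clone whose two ends fall in different blocks is a natural candidate for FISH testing, because it may span an evolutionary breakpoint in the species from which the library was made.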
As mentioned, there are several mammalian genomes being sequenced. All of them, however, will be sequenced using the shotgun methodology (see below) and, in most cases, at a relatively low resolution (∼2× coverage). The end sequencing of the BAC library of these species would be of extreme help in defining the SB organization of the species under study. In turn, a precise definition of SB arrangement will smooth the progress of sequence assembly and, indirectly, a correct reconstruction of ancestral genomes.
In addition to the SB definition, the BAC-FISH methodology is also crucial in dealing with two biological phenomena that have attracted attention in recent years and that cannot be approached using the sequencing method alone: centromere repositioning and segmental duplication.
Centromere repositioning (CR)
Centromere repositioning is the most evident example of the limitations of both the painting technique and bioinformatic reconstructions of ancestral genomes. Indeed, the Ma et al. (2006) paper, as well as the Murphy et al. (2005) work, did not consider centromeres in their chromosome reconstructions. It is also a striking example of the utility of marker order definition in nonsequenced species using BACs.
CR consists of the movement of a centromere along the chromosome without marker order alteration. It implies the inactivation of the old centromere. The evolutionary new centromere rapidly acquires the “normal” complexity characterized by centromeric satellite heterochromatin repeats. Several examples of CR events have been reported in primates (Montefalcone et al. 1999; Eder et al. 2003; Ventura et al. 2003, 2004) and other vertebrates: in cattle (Band et al. 2000), in equids (Carbone et al. 2006), in pig (M.F. Cardone, A. Alonso, P. Pazienza, M. Ventura, G. Montemurro, L. Carbone, P.J. de Jong, R. Stanyon, P. D’Addabbo, N. Archidiacono, et al. in prep.), and in birds (Kasai et al. 2003). It has been hypothesized in rat (Zhao et al. 2004), marsupials (Ferreri et al. 2005), and also in rice (Nagaki et al. 2004).
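The defining criterion of CR, a changed centromere position with unchanged marker order, can be expressed as a small check. This is a sketch under simplifying assumptions: each homologous chromosome is represented by an ordered list of mapped markers (for example, BACs placed by FISH), the centromere is given as the number of markers lying on one side of it, and the marker names are hypothetical.

```python
def centromere_interval(markers, cen_index):
    """Return the pair of markers flanking the centromere.

    cen_index is the number of markers preceding the centromere in the list.
    """
    left = markers[cen_index - 1] if cen_index > 0 else None
    right = markers[cen_index] if cen_index < len(markers) else None
    return frozenset((left, right))

def is_repositioning(markers_a, cen_a, markers_b, cen_b):
    """True if marker order is conserved but the centromere lies in a different interval."""
    if markers_a == markers_b:
        pass
    elif markers_a == markers_b[::-1]:
        # Same order read from the other telomere: flip species B.
        markers_b = markers_b[::-1]
        cen_b = len(markers_b) - cen_b
    else:
        # Marker order differs, so a rearrangement is involved.
        return False
    return centromere_interval(markers_a, cen_a) != centromere_interval(markers_b, cen_b)

order = ["A", "B", "C", "D", "E"]
moved = is_repositioning(order, 1, order, 3)
unchanged = is_repositioning(order, 2, order, 2)
flipped_same = is_repositioning(order, 2, order[::-1], 3)
inverted = is_repositioning(order, 1, ["A", "C", "B", "D", "E"], 3)
```

The resolution of such a test is set by marker density: a small inversion lying between two adjacent markers would go unnoticed, so a dense panel of BACs around both centromere positions is needed before a shift is attributed to repositioning.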
In humans, centromeres of chromosomes 3, 6, 11, 14, and 15 are repositioned centromeres with respect to the position of the centromere in the primate ancestor (Eder et al. 2003; Ventura et al. 2003, 2004). None of these new centromeres was envisaged by sequence analysis. The implications of the CR phenomenon are not trivial for our understanding of biological and clinical phenomena. The position of two 3q26 neocentromeres, reported in clinical cases, corresponds to the repositioned centromere in chromosome 3 of OWMs (Ventura et al. 2004). Clinical neocentromeres at 15q24–26 map to duplicons, which flanked an ancestral centromere at 15q25 (Ventura et al. 2003). This ancestral centromere was inactivated following the fission of the ancestral chromosome that generated Hominoidea chromosomes 14 and 15. The duplication cluster located at 6p22.1 is also the remnant of an inactivated centromere (Eder et al. 2003). The relaxation of the heterochromatic environment in these regions is potentially involved in chimeric gene creation (Jackson 2003).
Segmental duplications (SD)
SDs are DNA segments mapping to more than one locus in the genome. They represent ∼5% of the human sequence (Bailey et al. 2002). Initially regarded as “leftovers,” their deep involvement both in genome evolution and in triggering genomic disorders is now well established (Lupski and Stankiewicz 2005; Bailey and Eichler 2006). Their correct detection, however, is not an easy task in genome sequence assembly (Eichler 2001). This problem is particularly exacerbated when a “shotgun” sequencing methodology is used. The “hierarchical” or clone-ordered approach, in which the reciprocal position of individual large DNA fragments, essentially BAC clones, is firmly determined before sequencing, is definitely superior in correctly detecting SDs. This “hierarchical” approach, due to higher costs, has been used, however, only for the human genome. For all the other genomes, SD detection remains a problem. The method of Bailey et al. (2002), based on the “depth of coverage” and developed to identify SDs, is very efficient, but it is unable to predict their location. These difficulties can be greatly alleviated if this methodology is coupled with BAC-FISH testing. This combined approach, for instance, showed that most of the SDs in mouse are organized as tandem duplications (Bailey et al. 2004b). The FISH analysis of ∼1053 random non-human primate BACs demonstrated that great-ape species have been enriched for interspersed segmental duplications compared with OWM and NWM (She et al. 2006).
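The general idea behind a depth-of-coverage screen can be illustrated with a simplified sketch. It is not the published procedure of Bailey et al. (2002): the window depths are hypothetical, and the median baseline and 1.5-fold cutoff are assumptions chosen only for illustration. Reads from every copy of a duplicated segment pile up on the single copy present in the assembly, so windows with excess read depth become candidate SDs.

```python
from statistics import median

def excess_depth_windows(depths, fold=1.5):
    """Return indices of windows whose read depth is at least fold times the median depth."""
    baseline = median(depths)
    return [i for i, d in enumerate(depths) if d >= fold * baseline]

def merge_runs(indices):
    """Merge consecutive window indices into (first, last) intervals."""
    runs = []
    for i in indices:
        if runs and i == runs[-1][1] + 1:
            runs[-1][1] = i
        else:
            runs.append([i, i])
    return [tuple(r) for r in runs]

# Hypothetical read depths for 12 consecutive windows of an assembly.
toy_depths = [10, 11, 9, 10, 21, 19, 20, 10, 9, 11, 30, 10]
flagged = excess_depth_windows(toy_depths)
candidates = merge_runs(flagged)
```

Such a screen reports where the assembly is over-covered, but not where the additional copies reside in the genome. FISH with BACs from the flagged regions supplies that information, for example by showing whether the copies are arranged in tandem or are interspersed.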
Flexibility of the BAC-FISH approach
Appropriate BACs can easily verify inconsistencies among different data sets. For example, Murphy et al. (2005) reported that the carnivore ancestor and Cetartiodactyla ancestor share an identical chromosome 13 form, which differs from the human form by a small inversion. The cat form, however, was reported as identical to humans. To settle the inconsistency, we performed FISH co-hybridization experiments with appropriate cat BAC clones, mapping inside the suspected inversion. The results clearly showed that the inversion was also present in the cat (data not shown).
Conclusions
Both molecular cytogenetics and bioinformatics continue to make notable contributions to reconstructing ancestral genomes and tracing the origins of human chromosomes. These two methods provide independent data sets, which are highly complementary. Despite recent controversies, the schemes of ancestral genomes presented by researchers in these two fields are remarkably similar and convergent. Points of disagreement represent a rich vein for future research. With the publication of the Ma et al. (2006) report, it seems clear that the time is ripe for an integrated approach to research in phylogenomics. Such multidisciplinary research will bring increasing clarity to our hypotheses about the phylogeny of the genome.
Acknowledgments
MIUR (Ministero Italiano della Università e della Ricerca) and European Commission (INPRIMAT, QLRI-CT-2002-01325) are gratefully acknowledged for financial support. RS was supported by a grant “Mobility of Italian and foreign researchers residing abroad” from MIUR.
Footnotes
3 Corresponding authors.
3 E-mail rocchi{at}biologia.uniba.it; fax 39-080-544.3371.
3 E-mail roscoe.stanyon{at}unifi.it; fax 39-055-274.3017.
Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.5687906
Copyright © 2006, Cold Spring Harbor Laboratory Press