The technology for generating raw sequence data has advanced rapidly over the past 35 years, starting with Sanger sequencing in the 1970s, automated fluorescent Sanger sequencing in the 1980s and, recently, ultra-high-throughput methods based on the parallel sequencing platforms produced by 454, Illumina, and ABI. However, the scale of these advances has not been matched by new algorithms and tools for sequence assembly, particularly for large genomes. Common problems associated with large genomes have been repetitive sequences (generally around 50% of a vertebrate genome), gene families and genetic polymorphisms, all of which can cause errors in assembly. Genome assembly is still a problem, requiring a combination of parallel computing and hard work from teams of manual annotators, and there is a need for a step change in the algorithms and approaches used to assemble a sequence. The bovine genome is the latest in a series of large-scale sequencing projects based on the conventional automated Sanger methods. It illustrates many of these problems and provides some solutions [2, 3].
There are two bovine genome assemblies: BCM4 from Baylor College and UMD2 from the University of Maryland. Both assemblies are based on the sequence data generated by the Baylor genome sequencing center. How do they compare? Which is the more accurate?
BCM4 is the latest in a series of assemblies (BCM1 in 2004, BCM2 in 2005 and BCM3.1 in 2006) and is reported to be more accurate than its predecessors, with greater coverage and fewer misassemblies. The earlier inaccuracies arose because those assemblies were based largely on whole-genome shotgun (WGS) data alone: because of the sizes of the fragments generated in WGS sequencing, a WGS-only assembly is highly prone to errors caused by the repeated sequences that pose a significant problem in genome assemblies. BCM4, by contrast, was assembled by combining WGS reads (30 million reads) with the reads and fingerprinted contig (FPC) maps of large genomic inserts cloned into bacterial artificial chromosomes (BACs). The large inserts allow the smaller fragments to be assembled correctly, with fewer mistakes due to repetitive sequences [2]. The WGS reads ensure coverage of the whole genome. In addition, by developing a new assembler (Atlas), the Baylor team was able to integrate these sequences with other data from FPC BAC maps, genetic maps and chromosome assignments. The sequence data themselves came from a sire and his daughter, mostly from the daughter's DNA. The coverage of the sex chromosomes X and Y is therefore not as good as that of the autosomes. This is especially true of the Y chromosome: the sire's single Y chromosome provided only a small amount of DNA, whereas the two animals together provided three X chromosomes (and, of course, four copies of each autosome) [2, 3].
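The copy-number argument above can be made concrete with a little arithmetic. The following is a minimal sketch, not part of either assembly pipeline: the function name and the `daughter_fraction` parameter are hypothetical, and the model assumes that reads are sampled in proportion to the number of chromosome copies in each animal. The source says only that the data came "mostly" from the daughter, so the fraction used below is a placeholder, not a reported value.

```python
def expected_relative_depth(daughter_fraction):
    """Expected read depth per chromosome type, relative to an autosome.

    daughter_fraction: hypothetical share of all reads taken from the
    daughter (the remainder comes from the sire).
    """
    if not 0.0 <= daughter_fraction <= 1.0:
        raise ValueError("daughter_fraction must lie between 0 and 1")
    sire_fraction = 1.0 - daughter_fraction

    # Copies per animal: (daughter, sire). A diploid autosome has 2 copies.
    copies = {"autosome": (2, 2), "X": (2, 1), "Y": (0, 1)}

    depth = {}
    for chrom, (in_daughter, in_sire) in copies.items():
        # Normalize by the 2 copies of an autosome so that autosome == 1.0.
        depth[chrom] = (daughter_fraction * in_daughter / 2
                        + sire_fraction * in_sire / 2)
    return depth


# Equal contributions reproduce the 4 : 3 : 1 copy ratio in the text.
print(expected_relative_depth(0.5))
```

Under this simplified model, shifting more of the sequencing toward the daughter raises the expected X depth toward that of the autosomes but lowers the expected Y depth further, which is consistent with the weaker Y coverage described above.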
For BCM4, more than 90% of sequences have been assigned to a specific chromosome and the total sequence assembled is 2.54 gigabase pairs (Gbp). On the basis of overlaps with 1.04 million expressed sequence tags (ESTs), the gene coverage is estimated at 95%. Comparisons with 73 fully sequenced BAC clones showed few misassemblies and more than 92% coverage. Finally, 99.2% of 17,482 SNPs have been mapped correctly onto the BCM4 assembly. The sequence of the bovine MHC (BoLA) provides a critical test of accuracy [4], as it contains many polymorphic gene families densely clustered on chromosome 23 and automated genome assembly software is prone to errors of deletion and duplication in such regions. The paper by Brinkmeyer-Langford et al. [4] shows extremely good agreement between the BCM4 sequence assembly and the radiation hybrid (RH) map derived by mapping DNA markers from this region on RH panels.
The University of Maryland's assembly, UMD2, is based on the same raw data as BCM4 and integrates a wider range of external data to improve and validate the final sequence assembly [3]. In particular, it uses comparison between the cattle and human genome sequences to orientate or place cattle contigs when the data from the cattle genome alone cannot. It has therefore been able to assemble more sequence (2.86 Gbp, with 91% of sequences assigned to a specific chromosome and some of the Y), with fewer gaps (for example UMD2 assigned 136 Mb to the bovine X chromosome and BCM4 only 83 Mb), fewer misassemblies and with SNP errors corrected (BCM4 may have threefold more errors than UMD2).
Accuracy was also improved in the UMD2 assembly by using paired-end reads for regions containing segmental duplications, gene families and gene polymorphisms, where assembly is particularly error-prone. In a paired-end read, about 500 bp are sequenced at each end of a large BAC insert to place the insert on the genome map. If the length of the BAC insert fails to correspond to the distance between the sequences matching the two ends of the insert on the genome assembly, then a duplication or a deletion must have been introduced in the assembly. As a result of this analysis, the UMD2 group report only 662 segmental duplications compared with 3,098 for BCM4. Duplications can be due to copy-number variation, a focus of much current interest because of its association, in different cases, with genetic disease and with disease resistance. However, quantification of WGS reads in these regions did not suggest any over-representation that might indicate increased copy number. WGS reads should be over- or under-represented in the corresponding BCM4 sequences where the two assemblies disagree, and this should clearly be checked.
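The paired-end consistency test described above can be sketched as follows. This is an illustration of the general idea only, not the method used by the UMD2 group: the `EndPair` record, the `classify_pair` function, the category labels and the 20% tolerance are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class EndPair:
    """Two end reads of one BAC clone placed on the assembly (hypothetical record)."""
    clone_id: str
    left_pos: int         # assembly coordinate matched by one end read
    right_pos: int        # assembly coordinate matched by the other end read
    expected_insert: int  # estimated insert length of the clone, in bp


def classify_pair(pair, tolerance=0.2):
    """Compare the span on the assembly with the expected insert length.

    tolerance is an assumed fractional allowance for uncertainty in the
    insert-size estimate; it is not a value taken from either assembly.
    """
    observed = abs(pair.right_pos - pair.left_pos)
    low = pair.expected_insert * (1 - tolerance)
    high = pair.expected_insert * (1 + tolerance)
    if observed > high:
        # The ends lie too far apart: the assembly may hold extra sequence,
        # for example a spurious duplication.
        return "extra_sequence"
    if observed < low:
        # The ends lie too close together: sequence may be missing,
        # for example a deletion or a collapsed repeat.
        return "missing_sequence"
    return "consistent"


pairs = [
    EndPair("clone_a", 1_000, 151_000, 150_000),
    EndPair("clone_b", 1_000, 251_000, 150_000),
    EndPair("clone_c", 1_000, 61_000, 150_000),
]
for pair in pairs:
    print(pair.clone_id, classify_pair(pair))
```

In practice, a single discordant clone is weak evidence; a region would more plausibly be flagged only when several independent clones spanning it disagree with the assembly in the same direction.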
The use by the UMD2 assembly of comparative maps between cattle and human allowed more sequence to be assembled, but somewhat undermines conclusions based on human-bovine sequence comparisons. The data can, however, now be used to highlight potential problem areas or predict specific arrangements and guide more sequencing to generate bovine data to confirm these predictions. These studies will presumably go ahead in the coming months at Maryland, Baylor and elsewhere.
What these assemblies also illustrate is the benefit of and need for community support for the final success of a genome project. The cattle community provided DNA samples of breeds, chromosome assignments of specific contigs, genetic linkage maps, BAC and FPC BAC maps, EST libraries for gene prediction and genome annotations [1] for gene and protein predictions. However, the integration of datasets from multiple sources posed a substantial challenge for the bioinformaticians at Baylor College and Maryland in the absence of the genome sequence as a reference point.
Finally, we should ask what we can expect in the future. The availability of ultra-high-throughput sequence technologies will provide more raw sequence data, which could be used to fill in gaps, for example in regions not cloned in the current assembly. The extra reads would also increase the quality and number of SNPs detected by comparing several breeds, and increase the accuracy of sequence divergence and diversity estimates by providing some assurance that apparent SNPs are really SNPs and not sequencing errors.
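To illustrate why extra reads give more assurance that an apparent SNP is real, the sketch below computes how likely it is that sequencing errors alone would produce the observed number of reads carrying the same alternative base. The function name, the simple binomial model and the default per-base error rate of 1% are assumptions made for illustration; they do not describe a SNP caller used in either assembly.

```python
from math import comb


def prob_errors_alone(depth, alt_reads, error_rate=0.01):
    """Probability that at least alt_reads of depth reads show one specific
    alternative base purely through independent sequencing errors.

    error_rate is an assumed per-base error probability; an error is taken
    to be equally likely to produce any of the three other bases.
    """
    if not 0 <= alt_reads <= depth:
        raise ValueError("alt_reads must lie between 0 and depth")
    p = error_rate / 3  # chance that an error yields this particular base
    return sum(
        comb(depth, k) * p**k * (1 - p) ** (depth - k)
        for k in range(alt_reads, depth + 1)
    )


# The same 50% allele fraction is far harder to explain by error at higher depth.
print(prob_errors_alone(depth=2, alt_reads=1))
print(prob_errors_alone(depth=10, alt_reads=5))
```

The model ignores base quality scores, correlated errors and alignment artifacts in repeats, so it should be read as a rough intuition for the effect of depth and not as a calibrated error estimate.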