The cattle genome reveals its secrets
© BioMed Central Ltd 2009
Published: 24 April 2009
The domesticated cow is the latest farm animal to have its genome sequenced and deciphered. The members of the Bovine Genome Consortium have published a series of papers on the assembly and what the sequence reveals so far about the biology of this ruminant and the consequences of its domestication.
The Bovine Genome Project represents a complex collaborative effort among multiple groups, with funding from the United States, Canada, France, the United Kingdom, New Zealand and Australia.
Undoubtedly the current bovine genome sequence will be improved in both its sequence coverage and its annotation, but this draft sequence will form the basis for cattle genetics and genomics for the next 20 years or more.
So what have we learned?
The technology for generating raw sequence data has advanced rapidly over the past 35 years, starting with Sanger sequencing in the 1970s, followed by automated fluorescent Sanger sequencing in the 1980s and, recently, by ultra-high-throughput methods based on the parallel sequencing platforms produced by 454, Illumina, and ABI. However, the scale of these advances has not been matched by new algorithms and tools for sequence assembly, particularly for large genomes. Common problems associated with large genomes have been repetitive sequences (generally around 50% of a vertebrate genome), gene families and genetic polymorphisms, all of which can cause errors in assembly. Genome assembly is still a problem, requiring a combination of parallel computing and hard work from teams of manual annotators, and there is a need for a step change in the algorithms and approaches used to assemble a sequence. The bovine genome is the latest in a series of large-scale sequencing projects based on the conventional automated Sanger methods. It illustrates many of these problems and provides some solutions [2, 3].
There are two bovine genome assemblies: BCM4 from Baylor College and UMD2 from the University of Maryland. Both assemblies are based on the sequence data generated by the Baylor genome sequencing center. How do they compare? Which is the more accurate?
BCM4 is the latest assembly in a series – BCM1 (2004), BCM2 (2005), and BCM3.1 (2006) – and is claimed to be more accurate, with greater coverage and fewer misassemblies than its predecessors. The earlier inaccuracies arose because those assemblies were based largely on whole-genome shotgun (WGS) data alone: because of the sizes of the fragments generated in WGS sequencing, this approach is highly prone to errors caused by the repeated sequences that pose a significant problem in genome assemblies. BCM4, by contrast, was assembled by combining WGS reads (sequencing of 30 million reads) with the reads and fingerprinted contig (FPC) maps of large genomic inserts cloned into bacterial artificial chromosomes (BACs). The large inserts allow the smaller fragments to be correctly assembled with fewer mistakes due to repetitive sequences. The WGS reads ensure coverage of the whole genome. In addition, through the development of a new assembler (Atlas), the Baylor team was able to integrate these sequences with other data, from FPC BAC maps, genetic maps and chromosome assignments. The sequence data themselves were based on a sire and daughter, mostly on the daughter's DNA. The coverage of the sex chromosomes X and Y is therefore not as good as that of the autosomes. This is especially true of the Y chromosome, for which only a small amount of DNA was available, from the single Y chromosome of the sire, whereas the two animals together provided three X chromosomes (and of course four copies of each autosome) [2, 3].
For BCM4, more than 90% of sequences have been assigned to a specific chromosome and the total sequence assembled is 2.54 gigabase pairs (Gbp). On the basis of overlaps with 1.04 million expressed sequence tags (ESTs), the gene coverage is estimated at 95%. Comparisons with 73 fully sequenced BAC clones showed few misassemblies and more than 92% coverage. Finally, 99.2% of 17,482 SNPs have been mapped correctly onto the BCM4 assembly. The sequence of the bovine MHC (BoLA) provides a critical test of accuracy, as it contains many polymorphic gene families densely clustered on chromosome 23 and automated genome assembly software is prone to errors of deletion and duplication in such regions. The paper by Brinkmeyer-Langford et al. shows extremely good agreement between the BCM4 sequence assembly and the radiation hybrid (RH) map derived by mapping DNA markers from this region on RH panels.
The University of Maryland's assembly, UMD2, is based on the same raw data as BCM4 and integrates a wider range of external data to improve and validate the final sequence assembly. In particular, it uses comparison between the cattle and human genome sequences to orientate or place cattle contigs when the data from the cattle genome alone cannot. It has therefore been able to assemble more sequence (2.86 Gbp, with 91% of sequences assigned to a specific chromosome and some of the Y), with fewer gaps (for example UMD2 assigned 136 Mb to the bovine X chromosome and BCM4 only 83 Mb), fewer misassemblies and with SNP errors corrected (BCM4 may have threefold more errors than UMD2).
Accuracy was also improved in the UMD2 assembly by using paired-end reads for regions containing segmental duplications, gene families and gene polymorphisms, where assembly is particularly error-prone. In a paired-end read, about 500 bp are sequenced at each end of a large BAC insert to place the insert on the genome map. If the length of the BAC insert fails to correspond to the distance between the sequences matching the two ends of the insert on the genome assembly, then a duplication or a deletion must have been introduced in the assembly. As a result of this analysis, the UMD2 group report only 662 segmental duplications compared with 3,098 for BCM4. Duplications can be due to copy-number variation, a focus of much current interest because of its association, in different cases, with genetic disease and with disease resistance. However, quantification of WGS reads in these regions did not suggest any over-representation that might indicate increased copy number. Where the two assemblies disagree, WGS reads should be over- or under-represented in the corresponding BCM4 sequences, and this should clearly be checked.
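The paired-end consistency check described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used by the UMD2 group: the `EndPair` record, the `classify_pair` function, the 20% tolerance and the example coordinates are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class EndPair:
    """Where the two end reads of one BAC insert map on the assembly (bp)."""
    clone_id: str       # hypothetical clone identifier
    left_pos: int       # mapped position of the first end read
    right_pos: int      # mapped position of the second end read
    expected_len: int   # insert length known independently of the assembly


def classify_pair(pair: EndPair, tolerance: float = 0.2) -> str:
    """Compare the mapped span with the expected insert length.

    The 20% tolerance is an arbitrary illustrative value.
    """
    span = abs(pair.right_pos - pair.left_pos)
    low = pair.expected_len * (1 - tolerance)
    high = pair.expected_len * (1 + tolerance)
    if span > high:
        # Ends map too far apart: the assembly may contain extra
        # (falsely duplicated) sequence between them.
        return "stretched"
    if span < low:
        # Ends map too close together: sequence may have been
        # deleted or collapsed in the assembly.
        return "compressed"
    return "consistent"


if __name__ == "__main__":
    pairs = [
        EndPair("clone_A", 1_000, 151_000, 150_000),
        EndPair("clone_B", 1_000, 301_000, 150_000),
        EndPair("clone_C", 1_000, 61_000, 150_000),
    ]
    for p in pairs:
        print(p.clone_id, classify_pair(p))
```

In practice a single discordant clone is weak evidence; a region would normally be flagged only when several independent inserts spanning it disagree with the assembly in the same direction.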
The use by the UMD2 assembly of comparative maps between cattle and human allowed more sequence to be assembled, but somewhat undermines conclusions based on human-bovine sequence comparisons. The data can, however, now be used to highlight potential problem areas or predict specific arrangements and guide more sequencing to generate bovine data to confirm these predictions. These studies will presumably go ahead in the coming months at Maryland, Baylor and elsewhere.
What these assemblies also illustrate is the benefit of and need for community support for the final success of a genome project. The cattle community provided DNA samples of breeds, chromosome assignments of specific contigs, genetic linkage maps, BAC and FPC BAC maps, EST libraries for gene prediction and genome annotations for gene and protein predictions. However, the integration of datasets from multiple sources posed a substantial challenge for the bioinformaticians at Baylor College and Maryland in the absence of the genome sequence as a reference point.
Finally, we should ask what we can expect in the future. The availability of ultra-high-throughput sequence technologies will provide more raw sequence data, which could be used to fill in gaps, for example in regions not cloned in the current assembly. The extra reads would also increase the quality and number of SNPs detected by comparing several breeds, and increase the accuracy of sequence divergence and diversity estimates by providing some assurance that apparent SNPs are really SNPs and not sequencing errors.
The availability of a cattle genome sequence with more than 95% coverage is an excellent resource for comparative and evolutionary biologists. In addition, physiologists and biochemists will be interested in the unique biology of ruminants specialized for converting low-grade forage into energy-rich fat, milk and muscle.
Elsik and colleagues have led the way in annotating the genome, giving it meaning in terms of genomic structure, genes and proteins. This was achieved using a combination of automated pipelines and 4,000 manual annotations, which were made as part of a 'Bovine Annotation Jamboree' as well as by dedicated teams of annotators. Analysis predicted 26,835 genes, of which 82% were validated from external data sources. This suggests that the bovine genome encodes at least 22,000 genes, which is broadly in line with gene counts in all other mammals. In addition, 496 microRNAs were detected, including 135 novel sequences.
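The figure of at least 22,000 genes follows directly from the two numbers quoted above. A quick check in Python (the function name is illustrative only):

```python
def validated_gene_count(predicted: int, validated_fraction: float) -> int:
    """Number of predicted genes supported by external data, rounded."""
    return round(predicted * validated_fraction)


if __name__ == "__main__":
    # 26,835 predicted genes, 82% validated (figures from the text above)
    print(validated_gene_count(26_835, 0.82))  # → 22005
```

The validated count is treated as a lower bound because unvalidated predictions may still be real genes that simply lack supporting external data.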
Multiple species comparisons between the cow and other mammals define a core set of 14,345 orthologous genes, 1,217 of which are specific to placental mammals and missing in marsupials and monotremes. Comparative mapping with other mammalian genomes defines 124 evolutionary breakpoints, mostly associated with repetitive sequences and segmental duplications. Interestingly, genes associated with lactation and immune responses are also associated with these breakpoints. Does this suggest a selective advantage or simply a mechanism for expanding these gene families?
Comparisons between human and bovine coding regions aimed at identifying genes under strong selection define 2,210 genes with elevated dN/dS ratios (a measure of selective constraint on proteins). Seventy-one genes have dN/dS >1, and among these, not surprisingly, genes with roles in reproduction, lactation and fat metabolism are over-represented [1, 5–6]. More surprisingly, they include genes encoding proteins of the immune system. These are the genes that distinguish the ruminants from other mammals, and may reflect special needs of ruminants, which retain the low-grade food they ingest, along with any associated pathogens, for up to a day in the rumen before releasing it into the intestines from which infectious organisms are readily expelled.
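As a reminder of how the dN/dS ratio is read, the sketch below divides the nonsynonymous substitution rate (dN) by the synonymous rate (dS) and flags ratios above 1. It assumes dN and dS have already been estimated for each gene by some upstream method; the gene names and rate values are made up for illustration and are not taken from the bovine analysis.

```python
def dn_ds(dn: float, ds: float) -> float:
    """Ratio of nonsynonymous to synonymous substitution rates."""
    if ds <= 0:
        raise ValueError("dS must be positive to form a ratio")
    return dn / ds


def interpret(omega: float) -> str:
    """Conventional reading of dN/dS (often written omega)."""
    if omega > 1:
        return "dN/dS > 1: candidate for positive selection"
    if omega < 1:
        return "dN/dS < 1: consistent with purifying selection"
    return "dN/dS = 1: consistent with neutral evolution"


if __name__ == "__main__":
    # Hypothetical per-gene rate estimates: (dN, dS)
    example_genes = {
        "gene_A": (0.030, 0.020),
        "gene_B": (0.002, 0.020),
    }
    for name, (dn, ds) in example_genes.items():
        print(name, interpret(dn_ds(dn, ds)))
```

A gene-wide ratio above 1 is a stringent criterion, since most sites in most proteins are constrained; this is consistent with only 71 of the 2,210 genes with elevated ratios exceeding 1.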
One of the novel features of the Bovine Genome Project has been to use the sequence to examine the evolution and process of domestication of cattle. The aims of these studies were to uncover more about phylogenetic relationships amongst the Bovinae and the importance of natural and artificial selection, and to identify genes or genomic regions that have been critical in the domestication process – the so-called 'signatures of selection'.
From the analysis of ancestral mutations, it appears that domesticated cattle populations are able to maintain a high load of unfavorable mutations. This is probably a consequence of the domestication process itself. The selection of specific cattle breeds has been through many small populations, and thus bottlenecks, which may favor the chance survival of unfavorable alleles. Survival of potentially deleterious alleles will of course be further favored by strong artificial selection: for example, the double-muscling genes favored for beef production would almost certainly be lost in the wild through natural selection.
Like other genome projects, the cattle project also has a parallel SNP discovery pipeline. The reference Hereford genome has been compared with six other breeds, with the identification of 37,470 SNPs polymorphic in all breeds. An immediate practical outcome of this SNP project is the definition of a set of 50 SNPs that could be used for unique parentage assignment and proof of identity.
Recently (in the last 10,000 years), population sizes have fallen sharply to small numbers, with many bottlenecks due to domestication and artificial selection for milk and beef. The decline in diversity seen in some breeds is a matter for concern. But even in these contracted populations, the pattern of linkage disequilibrium suggests that cattle started from a very large base 1–2 Mya with ancestral populations of 90,000 or more.
Various measures of genomic selection have been used (iHS, FST and CLR) to map regions of selective sweep on chromosomes 2, 6 and 14. Selective sweep is the term used for the presence of genes on either side of a selected gene that are unusually conserved by virtue of their linkage to the selected gene. These regions in the bovine genome are, not surprisingly, associated with genes with a function in muscling (MSTN), milk yield and composition (ABCG2) and energy homeostasis (R3HDM1, LCT). The evidence of selection in these regions correlates with genes associated with efficiency of food utilization, immunity and behavior. It is possible that under domestication, mutations at these genes have been selected to produce animals more able to resist the infectious diseases prevalent in herds and showing the docile behavior suited to human husbandry.
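Of the three measures named above, FST is the simplest to illustrate. The sketch below uses the basic textbook form for one biallelic SNP in two equally weighted populations, FST = (HT − HS) / HT, where HS is the mean expected heterozygosity within populations and HT is the expected heterozygosity of the pooled population. It is an assumption-laden teaching example, not the estimator used in the bovine analyses, and the allele frequencies are invented.

```python
def expected_heterozygosity(p: float) -> float:
    """Expected heterozygosity 2p(1 - p) at a biallelic locus."""
    return 2.0 * p * (1.0 - p)


def fst_two_populations(p1: float, p2: float) -> float:
    """Basic FST for one SNP in two equally weighted populations."""
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError("allele frequencies must lie in [0, 1]")
    p_bar = (p1 + p2) / 2.0
    h_total = expected_heterozygosity(p_bar)
    if h_total == 0.0:
        # Locus is fixed for the same allele in both populations.
        return 0.0
    h_within = (expected_heterozygosity(p1) + expected_heterozygosity(p2)) / 2.0
    return (h_total - h_within) / h_total


if __name__ == "__main__":
    # Hypothetical allele frequencies in two breeds
    print(round(fst_two_populations(0.9, 0.1), 2))  # strongly differentiated
    print(round(fst_two_populations(0.5, 0.5), 2))  # no differentiation
```

Scans for selective sweeps typically look for runs of neighboring SNPs with unusually high values rather than single outliers, since linkage to a selected gene affects a whole region.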
Bovine genome coverage in BioMed Central:
Burt DW: The cattle genome reveals its secrets http://jbiol.com/content/8/4/36
Capuco AV, Akers RM: The origin and evolution of lactation http://jbiol.com/content/8/4/37
Church DM, Hillier LW: Back to Bermuda: how is science best served? http://genomebiology.com/2009/10/4/105
DWB is supported by the Biotechnology and Biological Sciences Research Council and the University of Edinburgh.