Of mice and men: phylogenetic footprinting aids the discovery of regulatory elements
© BioMed Central Ltd 2003
Published: 6 June 2003
Phylogenetic footprinting is an approach to finding functionally important sequences in the genome that relies on detecting their high degrees of conservation across different species. A new study shows how much it improves the prediction of gene-regulatory elements in the human genome.
Understanding the complicated and often myriad mechanisms of gene regulation has long been a great challenge for biologists. The recent success of genome sequencing projects [1, 2], combined with very effective gene-prediction algorithms, has generated abundant gene sequences, but our understanding of gene regulation has remained very limited. In humans and other higher eukaryotes, gene expression is modulated by the binding of various transcription factors to cis-regulatory regions of a gene. Binding of different combinations of transcription factors may result in a gene being expressed in different tissue types or at different developmental stages. To fully understand a gene's function, therefore, it is essential to identify the transcription factors that regulate the gene and the corresponding transcription-factor-binding sites (TFBSs) within the DNA sequence. Traditionally, these regulatory sites were determined by labor-intensive wet-lab techniques such as DNAse footprinting or gel-shift assays; several online databases, such as TRRD, COMPEL and TRANSFAC [4, 5], have been constructed to store experimentally determined TFBSs. Now, Lenhard and colleagues describe a new addition to the toolkit for TFBS prediction.
Given a position weight matrix (PWM) and a reliable scoring function, one can scan genomic DNA sequences and identify potential TFBSs. But because TFBSs are highly degenerate, the majority of predicted sites are 'false positives' that have no biological significance. Several strategies have therefore been developed to reduce the false-positive rate; these include combining predictions with gene-expression data or using prior knowledge of gene co-regulation. Another approach is to take advantage of the fact that genes are often regulated by multiple transcription factors, so potential TFBSs tend to be clustered or adjacent to each other. Alternatively, some researchers have tried to create more precise and sensitive tools for local sequence alignment and pattern discovery [12, 13].
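To make the scanning step concrete, the following is a minimal sketch in Python of how a matrix model might be applied to a sequence. It is not the procedure of any particular tool: the count matrix, the pseudocount, the uniform background frequency and the score threshold are all assumptions invented for illustration, and the function names are hypothetical.

```python
import math

# Hypothetical count matrix for a 4-bp motif with consensus TGAC
# (one dict per motif position; the counts are invented for illustration).
COUNTS = [
    {"A": 0, "C": 0, "G": 0, "T": 10},
    {"A": 0, "C": 0, "G": 10, "T": 0},
    {"A": 10, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 10, "G": 0, "T": 0},
]


def log_odds_pwm(counts, pseudocount=1.0, background=0.25):
    """Convert base counts to log2-odds scores against a uniform background.

    The pseudocount and the uniform background are simplifying assumptions.
    """
    pwm = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        pwm.append({
            base: math.log2(((column[base] + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm


def scan(sequence, pwm, threshold):
    """Return (start, site, score) for every window scoring at or above threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing ambiguous bases such as N
        score = sum(pwm[pos][base] for pos, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits


if __name__ == "__main__":
    pwm = log_odds_pwm(COUNTS)
    for start, site, score in scan("AATGACGGTGTCTGAC", pwm, threshold=3.0):
        print(start, site, round(score, 2))
```

With this toy matrix, lowering the threshold from 5.0 to 3.0 admits a one-mismatch site (TGTC) in addition to the two exact matches, which illustrates how degenerate matching inflates the number of predicted sites.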
Phylogenetic footprinting was first performed by visually examining the alignment of orthologous sequences; automated computer programs were later developed to assist the process. In this issue of Journal of Biology, Lenhard, Sandelin and colleagues describe their most recent success in predicting TFBSs by comparative genome analysis. They also introduce an interactive, web-based computational platform, ConSite, which allows users to do their own phylogenetic footprinting.
The power of any TFBS prediction algorithm that uses PWMs depends on the quality of the matrix models that it uses, since the matrices represent an abstraction of experimentally verified TFBSs. Lenhard and colleagues collected TFBSs from both in vivo and in vitro assays and used an improved motif discovery algorithm, ANN-Spec, to construct over 100 distinct and high-quality TFBS profile matrices. These comprehensive profiles were collected into an online database, JASPAR, which is freely available to the scientific community. Users of ConSite can either provide an existing alignment of two orthologous sequences or input just the sequences, in which case the program generates the alignment. The program then scans the individual sequences for potential TFBSs and compares the potential sites between the aligned sequences. Only those conserved sites that are present in both sequences and also, more importantly, are located in equivalent positions in the two aligned sequences, are selected and reported in the output. The remainder of the sites, which are not conserved between the two species, are considered to be false positives and are eliminated.
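The filtering logic described above can be sketched as follows. This is a simplified illustration rather than ConSite's actual implementation: an exact consensus match stands in for a PWM scan, 'equivalent position' is taken to mean that both sites start in the same alignment column, and the sequences and function names are hypothetical.

```python
def column_map(aligned):
    """Map each index of the ungapped sequence to its column in the alignment."""
    return [col for col, char in enumerate(aligned) if char != "-"]


def find_sites(sequence, motif):
    """Stand-in site finder: start positions of exact matches to a consensus."""
    width = len(motif)
    return [i for i in range(len(sequence) - width + 1)
            if sequence[i:i + width] == motif]


def conserved_sites(aligned_a, aligned_b, motif):
    """Keep only sites found in both sequences at the same alignment column.

    Returns (position in sequence A, position in sequence B) pairs, with
    positions given in ungapped coordinates.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    seq_a = aligned_a.replace("-", "")
    seq_b = aligned_b.replace("-", "")
    cols_a = column_map(aligned_a)
    cols_b = column_map(aligned_b)
    # Index the hits in sequence B by the alignment column where they start.
    hits_b = {cols_b[i]: i for i in find_sites(seq_b, motif)}
    return [(i, hits_b[cols_a[i]])
            for i in find_sites(seq_a, motif)
            if cols_a[i] in hits_b]


if __name__ == "__main__":
    # Invented toy alignment; '-' marks a gap.
    human = "CCTGACAA--GGTGACTTCCGAAA"
    mouse = "GCTGACTATTGGTCACTTTGACAA"
    print(conserved_sites(human, mouse, "TGAC"))
```

In this toy alignment each sequence contains two TGAC matches, but only one pair starts in the same alignment column; the other two matches are discarded as unconserved.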
This phylogenetic filtering procedure significantly improves the power of TFBS prediction, as is demonstrated by an example described in detail in the article by Lenhard et al. The authors compared the human β-globin promoter sequence with the orthologous sequences from mouse and cow; this dramatically reduced the false-positive prediction of TFBSs and they were able to identify a previously documented regulatory site. The authors also studied a larger set of human-mouse gene pairs and compared the results predicted by ConSite with the previously verified regulatory sites. On average, phylogenetic footprinting improved the selectivity of TFBS prediction by 85% compared to using matrix models alone, and could detect the majority of verified sites. When compared with other available systems, ConSite has a flexible and easy-to-use web interface. Users of the website can choose to search for binding sites for any number of transcription factors or can even provide their own defined PWMs. The entire procedure and the output graphs can be modulated by many user-specified parameters such as the extent of required conservation (cut-off), and the length of sequence to search (window size).
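One plausible way in which a conservation cut-off and a window size could interact is sketched below: percent identity is computed in a sliding window along the alignment, and only windows at or above the cut-off are treated as conserved. This reading of the two parameters is an assumption made for illustration, not a description of ConSite's internals, and the function names and toy alignment are hypothetical.

```python
def window_identity(aligned_a, aligned_b, window):
    """Fraction of identical, non-gap columns in each sliding window."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # 1 where both sequences carry the same base, 0 for mismatches and gaps.
    matches = [int(a == b and a != "-") for a, b in zip(aligned_a, aligned_b)]
    if window <= 0 or window > len(matches):
        return []
    running = sum(matches[:window])
    identities = [running / window]
    for start in range(1, len(matches) - window + 1):
        # Slide the window by one column: add the new column, drop the old one.
        running += matches[start + window - 1] - matches[start - 1]
        identities.append(running / window)
    return identities


def conserved_windows(aligned_a, aligned_b, window, cutoff):
    """Start columns of windows whose identity is at or above the cut-off."""
    return [start
            for start, identity in enumerate(
                window_identity(aligned_a, aligned_b, window))
            if identity >= cutoff]


if __name__ == "__main__":
    # Invented toy alignment; '-' marks a gap.
    seq_a = "ACGTACGT--AC"
    seq_b = "ACGTTCGTAAAC"
    print(conserved_windows(seq_a, seq_b, window=4, cutoff=0.75))
```

Raising the cut-off or widening the window makes the filter stricter, trading sensitivity for fewer spurious sites; predicted sites could then be retained only if they fall within a conserved window.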
It is becoming evident that comparative genome analysis is very powerful and will be of use not only for genome annotation but also as an adjunct to more traditional disciplines, such as molecular biology and genetics. Just like the sequence-alignment programs that emerged in the early 1990s, ConSite and other similar programs [23, 24] will prove very valuable and timely research tools for the scientific community. Many new research directions are currently being pursued in this area; for example, pair-wise sequence comparisons can be expanded to include multiple species and to make use of additional information, such as evolutionary distance and phylogenetic relationships. More precise and effective sequence alignment programs have been created to handle genome-scale sequences [26, 27]. In addition to the human-mouse comparisons, some researchers are also proposing cross-species comparison between human and other primates, which has been described as 'phylogenetic shadowing'. This approach complements human-rodent comparisons and will detect primate-specific regulatory elements (see Figure 2). On the 'wet' experimental front, recent developments include microarray-based technologies such as 'ChIP-chip', which combines chromatin immunoprecipitation (ChIP) with analysis of the precipitated DNA on a microarray (chip), to detect TFBSs within a whole genome. With the emergence of more mammalian genome sequences in the near future, it may finally become possible to identify all the gene regulatory elements in the human genome and use them as a blueprint for understanding the mysteries of gene regulation.
- Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, Fitzhugh W, et al: Initial sequencing and analysis of the human genome. Nature. 2001, 409: 860-921. 10.1038/35057062.
- Waterston RH, Lindblad-Toh K, Birney E, Rogers J, Abril JF, Agarwal P, Agarwala R, Ainscough R, Alexandersson M, An P, et al: Initial sequencing and comparative analysis of the mouse genome. Nature. 2002, 420: 520-562. 10.1038/nature01262.
- Galas DJ, Schmitz A: DNAse footprinting: a simple method for the detection of protein-DNA binding specificity. Nucleic Acids Res. 1978, 5: 3157-3170.
- Heinemeyer T, Wingender E, Reuter I, Hermjakob H, Kel AE, Kel OV, Ignatieva EV, Ananko EA, Podkolodnaya OA, Kolpakov FA, et al: Databases on transcriptional regulation: TRANSFAC, TRRD and COMPEL. Nucleic Acids Res. 1998, 26: 362-367. 10.1093/nar/26.1.362.
- Wingender E, Chen X, Fricke E, Geffers R, Hehl R, Liebich I, Krull M, Matys V, Michael H, Ohnhauser R, et al: The TRANSFAC system on gene expression regulation. Nucleic Acids Res. 2001, 29: 281-283. 10.1093/nar/29.1.281.
- Lenhard B, Sandelin A, Mendoza L, Engström P, Jareborg N, Wasserman WW: Identification of conserved regulatory elements by comparative genome analysis. J Biol. 2003, 2: 13. 10.1186/1475-4924-2-13.
- Frech K, Herrmann G, Werner T: Computer-assisted prediction, classification, and delimitation of protein binding sites in nucleic acids. Nucleic Acids Res. 1993, 21: 1655-1664.
- Fickett JW, Wasserman WW: Discovery and modeling of transcriptional regulatory regions. Curr Opin Biotechnol. 2000, 11: 19-24. 10.1016/S0958-1669(99)00049-X.
- Zhu Z, Pilpel Y, Church GM: Computational identification of transcription factor binding sites via a transcription-factor-centric clustering (TFCC) algorithm. J Mol Biol. 2002, 318: 71-81. 10.1016/S0022-2836(02)00026-8.
- Kielbasa SM, Korbel JO, Beule D, Schuchhardt J, Herzel H: Combining frequency and positional information to predict transcription factor binding sites. Bioinformatics. 2001, 17: 1019-1026. 10.1093/bioinformatics/17.11.1019.
- Berman BP, Nibu Y, Pfeiffer BD, Tomancak P, Celniker SE, Levine M, Rubin GM, Eisen MB: Exploiting transcription factor binding site clustering to identify cis-regulatory modules involved in pattern formation in the Drosophila genome. Proc Natl Acad Sci USA. 2002, 99: 757-762. 10.1073/pnas.231608898.
- Quandt K, Frech K, Karas H, Wingender E, Werner T: MatInd and MatInspector: new fast and versatile tools for detection of consensus matches in nucleotide sequence data. Nucleic Acids Res. 1995, 23: 4878-4884.
- Hertz GZ, Stormo GD: Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics. 1999, 15: 563-577. 10.1093/bioinformatics/15.7.563.
- Batzoglou S, Pachter L, Mesirov JP, Berger B, Lander ES: Human and mouse gene structure: comparative analysis and application to exon prediction. Genome Res. 2000, 10: 950-958. 10.1101/gr.10.7.950.
- Gumucio DL, Heilstedt-Williamson H, Gray TA, Tarle SA, Shelton DA, Tagle DA, Slightom JL, Goodman M, Collins FS: Phylogenetic footprinting reveals a nuclear protein which binds to silencer sequences in the human gamma and epsilon globin genes. Mol Cell Biol. 1992, 12: 4919-4929.
- Duret L, Bucher P: Searching for regulatory elements in human noncoding sequences. Curr Opin Struct Biol. 1997, 7: 399-406. 10.1016/S0959-440X(97)80058-9.
- Oeltjen JC, Malley TM, Muzny DM, Miller W, Gibbs RA, Belmont JW: Large-scale comparative sequence analysis of the human and murine Bruton's tyrosine kinase loci reveals conserved regulatory domains. Genome Res. 1997, 7: 315-329.
- Hardison RC: Conserved noncoding sequences are reliable guides to regulatory elements. Trends Genet. 2000, 16: 369-372. 10.1016/S0168-9525(00)02081-3.
- Wasserman WW, Palumbo M, Thompson W, Fickett JW, Lawrence CE: Human-mouse genome comparisons to locate regulatory sites. Nat Genet. 2000, 26: 225-228. 10.1038/79965.
- ConSite. [http://phylofoot.org/]
- Workman CT, Stormo GD: ANN-Spec: a method for discovering transcription factor binding sites with improved specificity. Pac Symp Biocomput. 2000, 467-478.
- JASPAR. [http://forkhead.cgb.ki.se/JASPAR/]
- Loots GG, Ovcharenko I, Pachter L, Dubchak I, Rubin EM: rVista for comparative sequence-based discovery of functional transcription factor binding sites. Genome Res. 2002, 12: 832-839. 10.1101/gr.225502. Article published online before print in April 2002.
- Aerts S, Thijs G, Coessens B, Staes M, Moreau Y, De Moor B: Toucan: deciphering the cis-regulatory logic of coregulated genes. Nucleic Acids Res. 2003, 31: 1753-1764. 10.1093/nar/gkg268.
- Blanchette M, Schwikowski B, Tompa M: Algorithms for phylogenetic footprinting. J Comput Biol. 2002, 9: 211-223. 10.1089/10665270252935421.
- Schwartz S, Zhang Z, Frazer KA, Smit A, Riemer C, Bouck J, Gibbs R, Hardison R, Miller W: PipMaker – a web server for aligning two genomic DNA sequences. Genome Res. 2000, 10: 577-586. 10.1101/gr.10.4.577.
- Bray N, Dubchak I, Pachter L: AVID: A global alignment program. Genome Res. 2003, 13: 97-102. 10.1101/gr.789803.
- Boffelli D, McAuliffe J, Ovcharenko D, Lewis KD, Ovcharenko I, Pachter L, Rubin EM: Phylogenetic shadowing of primate sequences to find functional regions of the human genome. Science. 2003, 299: 1391-1394. 10.1126/science.1081331.
- Horak CE, Mahajan MC, Luscombe NM, Gerstein M, Weissman SM, Snyder M: GATA-1 binding sites mapped in the β-globin locus by using mammalian ChIP-chip analysis. Proc Natl Acad Sci USA. 2002, 99: 2924-2929. 10.1073/pnas.052706999.