Comparison of orthologous genomic sequences is an effective method for identifying segments likely to mediate a sequence-specific biological function. The performance of phylogenetic footprinting methods for the detection of TFBSs depends on multiple factors, including the alignment algorithm, the available binding profiles and the evolutionary distance between the target sequences. Two key data resources are introduced in this study: a novel collection of transcription-factor binding profiles compiled from the biological research literature and a reference test set for phylogenetic footprinting methods. The ConSite web interface to the system facilitates user control, an essential feature for users studying diverse genomes.
The binding profile collection is an important resource for bioinformatics projects. Like the TFBS programming system, the JASPAR profile collection is available freely to the research community. The profiles are non-redundant and are restricted to those cases for which sufficient binding data were available to generate a meaningful representation of the binding specificity of a TF. Continuing expansion of the collection is anticipated, given the strong research progress in modeling DNA binding sites.
The new phylogenetic footprinting reference collection of TFBSs allows for quantitative assessment of the performance of new methods. This is the largest collection of its kind available for broad use. In our study, we could detect around 68% of the experimentally defined TFBSs in conserved segments (at 65% relative matrix score threshold; see Figure 2). This differs slightly from the outcome of a study of conservation properties proximal to TFBSs, which indicated that only around 50% of sites are situated in conserved regions. There are several key factors that may account for this difference. The procedures for defining the collections were different. For instance, the amount of flanking sequence used for mapping the locations of the sites onto genome sequences was lower in the previous study. These short fragments were mapped onto a commercial human genome assembly and the mapped regions compared to shotgun-generated fragments of mouse genomes from multiple strains. The alignment procedures were also different, with the older set aligned by BLAST and assessed by a stringent similarity threshold (>80% identity over 40 bp). There was no exclusion of pseudogenes or paralogous genes indicated in the previous study, which would result in decreased sensitivity due to the erroneous application of phylogenetic footprinting to genes evolving under distinct evolutionary pressures.
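To make the "relative matrix score threshold" concrete, the following minimal Python sketch scans a sequence with a position weight matrix and keeps windows scoring at or above 65% of the matrix's score range. It assumes the common definition of relative score, (raw - min) / (max - min); this section does not spell out the definition, so that formula is an assumption. The matrix `TOY_PWM` and the function names are hypothetical and serve only as illustration.

```python
from typing import Dict, List, Tuple

# A position weight matrix: base -> list of per-position scores.
Matrix = Dict[str, List[float]]

# Hypothetical 3 bp matrix favouring the site "ACG" (illustration only).
TOY_PWM: Matrix = {
    "A": [2.0, -1.0, -1.0],
    "C": [-1.0, 2.0, -1.0],
    "G": [-1.0, -1.0, 2.0],
    "T": [-1.0, -1.0, -1.0],
}


def score_range(pwm: Matrix) -> Tuple[float, float]:
    """Return the lowest and highest raw scores the matrix can produce."""
    width = len(pwm["A"])
    lowest = sum(min(pwm[b][i] for b in "ACGT") for i in range(width))
    highest = sum(max(pwm[b][i] for b in "ACGT") for i in range(width))
    return lowest, highest


def relative_score(pwm: Matrix, site: str) -> float:
    """Rescale the raw score of `site` to the 0..1 range of the matrix.

    Assumed definition: (raw - min) / (max - min).
    """
    raw = sum(pwm[base][i] for i, base in enumerate(site))
    lowest, highest = score_range(pwm)
    return (raw - lowest) / (highest - lowest)


def scan(pwm: Matrix, seq: str, threshold: float = 0.65) -> List[Tuple[int, str, float]]:
    """Return (start, site, relative score) for each window at or above threshold."""
    width = len(pwm["A"])
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing ambiguous bases
        rel = relative_score(pwm, window)
        if rel >= threshold:
            hits.append((start, window, rel))
    return hits


if __name__ == "__main__":
    for start, site, rel in scan(TOY_PWM, "TTACGTT"):
        print(start, site, round(rel, 3))
```

Lowering `threshold` admits weaker matches, raising sensitivity at the cost of specificity; this sketch scans the forward strand only, whereas a practical search would also scan the reverse complement.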
While the work presented here focuses on mammalian sequence comparisons, there is no limitation within the ConSite system precluding studies of other organisms (the ConSite website includes samples with insect and nematode sequences). In the future it will be important to develop methods capable of analyzing multiple genomic sequences in parallel, but this is a non-trivial task. Such a system must allow for weighting based on evolutionary distances to preserve sensitivity, and requires advances in multiple sequence alignment algorithms. Some steps in this direction are beginning to emerge [31, 32].
No single resource offers the same set of functions or integration as ConSite. The only similarly scoped resource is the recently published rVista, which searches for TFBSs in a reference sequence and filters the results for sites in regions of high conservation with respect to a second genomic sequence. Unlike rVista, ConSite searches both sequences for TFBSs, for better specificity, and enables easy modification of the parameters for interactive analysis, as well as providing different output formats to aid the design and interpretation of experiments in molecular biotechnology. ConSite's publicly available collection of transcription-factor profiles allows users to access information about the TFs associated with the predicted sites. Given that many users focus on a specific TF and have developed high-quality models of their own, ConSite also allows for user-defined profiles.
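The idea of searching both sequences and keeping only sites in conserved regions can be sketched as follows. This is an illustrative simplification, not the ConSite implementation: it assumes a precomputed pairwise alignment given as two gapped strings, represents hits as hypothetical `(tf, start, length)` tuples in ungapped coordinates, and measures identity over the site's own alignment columns rather than over a wider conservation window. The 0.8 cutoff is likewise an arbitrary example value.

```python
from typing import List, Tuple

Hit = Tuple[str, int, int]  # (TF name, ungapped start, site length)


def column_index(aligned: str) -> List[int]:
    """Map each ungapped sequence position to its alignment column."""
    return [col for col, ch in enumerate(aligned) if ch != "-"]


def identity(aln_a: str, aln_b: str, start_col: int, end_col: int) -> float:
    """Fraction of columns in [start_col, end_col) with identical, ungapped bases."""
    matches = sum(
        1
        for col in range(start_col, end_col)
        if aln_a[col] == aln_b[col] and aln_a[col] != "-"
    )
    return matches / (end_col - start_col)


def footprint_filter(
    aln_a: str,
    aln_b: str,
    hits_a: List[Hit],
    hits_b: List[Hit],
    min_identity: float = 0.8,
) -> List[Hit]:
    """Keep hits from sequence A that (1) have a hit for the same TF in
    sequence B starting at the same alignment column and (2) lie in
    columns whose identity meets the cutoff."""
    cols_a = column_index(aln_a)
    cols_b = column_index(aln_b)
    # Index the hits of B by TF name and alignment column of the site start.
    b_index = {(tf, cols_b[start]) for tf, start, _length in hits_b}

    kept = []
    for tf, start, length in hits_a:
        first_col = cols_a[start]
        end_col = cols_a[start + length - 1] + 1
        if (tf, first_col) not in b_index:
            continue  # no equivalent prediction in the orthologous sequence
        if identity(aln_a, aln_b, first_col, end_col) >= min_identity:
            kept.append((tf, start, length))
    return kept


if __name__ == "__main__":
    aln_a = "TTACGTT--GGA"
    aln_b = "TTACGAATTCCA"
    hits_a = [("TF1", 2, 3), ("TF1", 7, 3), ("TF2", 0, 2)]
    hits_b = [("TF1", 2, 3), ("TF1", 9, 3)]
    print(footprint_filter(aln_a, aln_b, hits_a, hits_b))
```

In the example, the second `TF1` hit is predicted in both sequences at aligned positions but is dropped because the underlying columns are poorly conserved, and the `TF2` hit is dropped because it has no counterpart in the second sequence. Requiring both conditions is what trades a small loss in sensitivity for a larger gain in specificity.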
We present an algorithm that uses phylogenetic footprinting to identify potential TFBSs. The approach to identifying regulatory elements presented here yields greater specificity than previous approaches that were based purely on profile searches of single genomic sequences. In short, using phylogenetic footprinting to filter the computational predictions significantly reduces noise at the price of a slight decrease in sensitivity. The web application we present enables researchers to utilize this approach in a straightforward manner. With the culmination of the human and mouse genome sequencing efforts [34, 35], we believe this new algorithm will be of significant use in the ongoing efforts to ascribe function to non-coding sequences.