Identification of conserved regulatory elements by comparative genome analysis

Lenhard, Boris; Sandelin, Albin; Mendoza, Luis; Engström, Pär; Jareborg, Niclas; Wasserman, Wyeth W

doi:10.1186/1475-4924-2-13

Research article
Open access
Published: 22 May 2003

Journal of Biology volume 2, Article number: 13 (2003)

Abstract

Background

For genes that have been successfully delineated within the human genome sequence, most regulatory sequences remain to be elucidated. The annotation and interpretation process requires additional data resources and significant improvements in computational methods for the detection of regulatory regions. One approach of growing popularity is based on the preferential conservation of functional sequences over the course of evolution by selective pressure, termed 'phylogenetic footprinting'. Mutations are more likely to be disruptive if they appear in functional sites, resulting in a measurable difference in evolution rates between functional and non-functional genomic segments.

Results

We have devised a flexible suite of methods for the identification and visualization of conserved transcription-factor-binding sites. The system reports those putative transcription-factor-binding sites that are both situated in conserved regions and located as pairs of sites in equivalent positions in alignments between two orthologous sequences. An underlying collection of metazoan transcription-factor-binding profiles was assembled to facilitate the study. This approach results in a significant improvement in the detection of transcription-factor-binding sites because of an increased signal-to-noise ratio, as demonstrated with two sets of promoter sequences. The method is implemented as a graphical web application, ConSite, which is at the disposal of the scientific community at http://www.phylofoot.org/.

Conclusions

Phylogenetic footprinting dramatically improves the predictive selectivity of bioinformatic approaches to the analysis of promoter sequences. ConSite delivers unparalleled performance using a novel database of high-quality binding models for metazoan transcription factors. With a dynamic interface, this bioinformatics tool provides broad access to promoter analysis with phylogenetic footprinting.

Introduction

The information in genes generally flows from static DNA sequences to active proteins via an RNA intermediary. Depending upon the cellular context of physiological, developmental and environmental inputs, genes are selectively activated via regulatory sequences in the DNA. At their foundation, transcriptional regulatory regions in the human genome are characterized by the presence of target binding sites for transcription factors (TFs). Knowledge of the identity of a mediating TF can give important insights into the function of a gene via inference of the processes or conditions that lead to expression. Research in bioinformatics has developed reliable methods to model the DNA binding specificity of individual TFs. As most eukaryotic TFs tolerate considerable sequence variation in their target sites, simple consensus sequences fail to represent the specificity of binding factors. This realization led to the development of the quantitative representation of binding specificity with position weight matrices [1]. Such matrices can be highly accurate in identifying in vitro target sequences [2], but are insufficiently specific in the identification of sites with in vivo function to provide meaningful predictions [3]. The in vivo binding specificity of a TF depends upon additional properties not modeled by a weight matrix, such as protein-protein interactions, chromatin superstructures and TF concentrations.

Comparison of orthologous gene sequences has emerged as a powerful tool in genome analysis. 'Phylogenetic footprinting' [4] provides complementary data to computational predictions, as sequence conservation over evolution highlights segments in genes likely to mediate biological function. The utility of phylogenetic footprinting extends to a broad array of annotation challenges, but it is particularly suited to the identification of sequences with a functional role in the regulation of gene transcription [5, 6]. Despite specific successes [7] in studies of gene regulation, the central algorithms for phylogenetic footprinting remain to be optimized and are thus the focus of continuing research. In particular, new algorithms based on phylogenetic footprinting have been presented for the alignment of genomic sequences, data visualization and the identification of exons [8, 9]. Algorithms for the analysis of regulatory sequences have addressed the detection of over-represented patterns in the promoters of co-regulated genes [10], and the improved discrimination of regulatory modules [11], as well as comparative studies of orthologous promoters across collections of microbial genomes [12, 13].

Here, we introduce a highly specific algorithm, ConSite, for the detection of transcription-factor-binding sites (TFBSs) that is based on phylogenetic footprinting. Three central components underlie the advance: first, a non-redundant set of transcription-factor binding models; second, a suitable alignment algorithm for orthologous non-coding genomic sequences; and third, modular software for the integration of binding-site predictions with analysis of sequence similarity. We show that our approach results in an increased specificity of predicted TFBSs as a result of a significant reduction of noise. The ConSite algorithm is thus particularly suited to the analysis of pairs of orthologous genomic sequences with limited or no experimental annotation of regulatory elements.

Results

A non-redundant set of high-quality transcription-factor binding models

Potential TFBSs can be identified within a genomic sequence by well-studied computational approaches based on quantitative profiles describing the binding-site characteristics of TFs. The quality of a matrix model depends upon the number of biochemically determined target sites used to build it. While the binding specificities of only a few eukaryotic TFs are richly described in the literature by multiple in vivo functional sites, a significant number of TF binding profiles have been produced through the application of in vitro target-site detection assays [14]. We collected available data of both types from the biological literature to construct 108 non-redundant high-quality profiles [15]. The profiles are drawn from the super-classes vertebrates, insects and plants, but the majority (65%) of matrices model the binding of human or rodent factors. As the majority of the profiles originate from site-selection assays, the average number of TFBSs contributing to each profile is a robust 31.2 sites per model. Information content, measured in bits, is commonly used within bioinformatics to describe the overall specificity of a profile. The models in the collection range in information content from 5.6 to 26.2 bits, with an average of 12.1 bits. All models are hyperlinked to corresponding sequence accession numbers and the PubMed abstract for the article describing the binding study.
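As an illustration of the information-content measure, the following minimal sketch computes the total number of bits for a count matrix. It assumes a uniform background (a maximum of 2 bits per position) and applies no small-sample correction; the function name `information_content` and the toy counts are hypothetical and are not taken from the JASPAR collection.

```python
import math


def information_content(counts):
    """Total information content (bits) of a count matrix.

    counts: list of columns, each a dict mapping base -> observed count.
    Assumes a uniform background (2 bits maximum per column) and
    applies no small-sample correction.
    """
    total = 0.0
    for column in counts:
        n = sum(column.values())
        entropy = 0.0
        for count in column.values():
            if count > 0:
                p = count / n
                entropy -= p * math.log2(p)
        total += 2.0 - entropy
    return total


# Toy three-column profile (hypothetical counts).
example = [
    {"A": 10, "C": 0, "G": 0, "T": 0},  # fully conserved position
    {"A": 5, "C": 5, "G": 0, "T": 0},   # two equally frequent bases
    {"A": 3, "C": 3, "G": 3, "T": 3},   # uninformative position
]
print(information_content(example))
```

Under these assumptions a fully conserved position contributes 2 bits and an uninformative position contributes none, so longer and more specific profiles accumulate higher totals.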

Integrating binding-site prediction with analysis of sequence conservation in orthologous genomic sequences

Phylogenetic footprinting provides data that complement binding-site predictions in the analysis of gene regulation. The simple hypothesis that motivates phylogenetic footprinting is that important functional sequences will be under selective pressure to be retained over moderate periods of evolution. The classification of sequences as conserved or freely evolving (as proposed by Kimura [16]) is not yet a quantitative process. It should be noted that evolutionary rates vary dramatically between genes and that the choice of species is an important consideration in phylogenetic footprinting studies. Too great an evolutionary distance can result in regulatory alterations or difficulty in aligning short patches of similarity between long sequences. Conversely, too small an evolutionary distance does not significantly improve the overall specificity of predictions, because non-functional sequence has had little time to diverge. We have developed the ConSite method to integrate phylogenetic footprinting with profile-based predictions of TFBSs, in order to achieve specific predictions of functional regulatory elements in genes. As an example of the influence of species selection on the qualitative performance of the system, the human β globin promoter was compared to a diverse range of orthologs (Figure 1).

In this report, we focus on human-rodent comparisons, as several studies have suggested that only a small portion (17–20%) of non-coding regions are conserved (on average) at this evolutionary distance [10, 17]. Furthermore, similarity is punctuated, with distinguishable segments of high similarity flanked by regions of apparently random sequence (roughly 33% nucleotide identity is observed between random genomic sequences, with wide variations dependent upon the applied alignment algorithm, settings, and sequence characteristics [18]). This compartmentalized pattern of similarity is consistent with the emerging emphasis on multiple TFs binding to locally dense site clusters termed regulatory modules [19], which suggests that distinct blocks of sequence are required for transcriptional regulation. In order to identify segments of preferential conservation in orthologous genomic sequences, a suitable set of classification criteria must be defined. As similarity or rates of evolution vary widely across genomic sequences, no single threshold will be perfectly suited. We elected to focus the algorithm on segments of high similarity, identified by sliding a window of fixed size over the alignment and retaining only those windows in which the sequence identity exceeds a default or user-specified threshold. If a cDNA sequence is available, the analysis program can exclude from consideration binding-site predictions situated within exons present in an alignment of genomic sequences.

Assessing the impact of phylogenetic footprinting on the specificity of binding-site predictions

In order to assess quantitatively the contribution of comparative sequence analysis to the specificity of TFBS predictions, a reference collection of 14 well-studied genes was assembled. We compared the selectivity and sensitivity of the TFBS predictions between those generated with isolated human sequences and those generated with the same human genes filtered by comparative analysis with orthologous mouse gene sequences (Table 1). The sequence pairs ranged in length between 680 and 2,900 base-pairs (bp), but all included the region -500 to +100 relative to the transcription start site. Within the 14 paired sequences are 40 experimentally defined TFBSs (Table 1) for 13 distinct TFs within the set of available matrices. For clarity, these binding sites were not utilized in the construction of the matrix models. A conservation cutoff was set to 70% for all tests, while the window size for conservation analysis was set to 50 bp.

Table 1. The reference collection of 14 gene pairs and 40 verified transcription-factor-binding sites used for testing

Selectivity

Insufficient experimental data are available to confidently classify predictions as false, because many functional sites remain to be discovered. As the population of true TFBSs within a genomic sequence is anticipated to be small, we define the false-positive rate as the total number of predictions from all models divided by the length of the query sequence. The number of predicted TFBSs was determined for incrementally increasing relative matrix score thresholds (described in the Materials and methods section) between 65% and 90% for both single sequences and the corresponding orthologous pairs:

$$\mathrm{Selectivity}(c) = \frac{\sum_{m \in M} P_{m,c}}{L}$$

where $M$ is the set of 108 models, $P_{m,c}$ the number of predicted sites using model $m$ and relative matrix score threshold $c$, and $L$ the length of the analyzed sequence in base-pairs (Figure 2a).
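The selectivity measure (total predictions over all models, divided by sequence length) can be illustrated with a short sketch. The function name `predictions_per_100bp`, the model labels and the counts below are hypothetical; they are chosen only to show how a ratio between footprinting and single-sequence analysis is formed, and they are not results from this study.

```python
def predictions_per_100bp(predicted_counts, sequence_length):
    """Number of predicted sites per 100 bp, summed over all models.

    predicted_counts: dict mapping model name -> number of predicted
    sites at one relative matrix score threshold.
    sequence_length: length of the analyzed sequence in bp.
    """
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    return 100.0 * sum(predicted_counts.values()) / sequence_length


# Hypothetical counts for one 1,000 bp promoter at a single threshold.
single_sequence = {"model_a": 12, "model_b": 8, "model_c": 20}
with_footprinting = {"model_a": 2, "model_b": 1, "model_c": 3}

single = predictions_per_100bp(single_sequence, 1000)
filtered = predictions_per_100bp(with_footprinting, 1000)
ratio = filtered / single
print(single, filtered, ratio)
```

A ratio below 1 indicates that fewer sites are predicted per unit length once the conservation filter is applied.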

Predictive selectivity (measured by the average number of predicted TFBSs per 100 bp of promoter sequence when scanning with all models) improved by 85% (average ratio: 0.15) when phylogenetic footprinting is applied. The ratios of the observed selectivity scores using phylogenetic footprinting to those obtained using single-sequence analysis modes are shown in Figure 2c.

Sensitivity

Sensitivity measures the ability to correctly detect known sites (that is, cases in which a prediction and an annotated TFBS overlap by at least 50% of the width of the narrower pattern), given a corresponding transcription-factor binding-profile model. Analyses were performed with incrementally increasing relative matrix score thresholds between 65% and 90%. The overall sensitivity (the fraction of known sites detected) was reduced slightly under the conservation requirement: 65.5% were detected with phylogenetic footprinting (settings of 75% relative matrix score threshold, 70% identity cut-off, 50 bp window) as compared to 72.5% when analyzing single sequences (Figure 2b). The fact that a few sites were not detected with the stringent requirements for both regional sequence and specific-site conservation can be attributed to multiple causes. For instance, TFBSs may not be conserved, or may be present but not detected by the profile under the thresholds. We conclude that most experimentally annotated binding sites are located within conserved regions, as we can correctly detect 82.5% of the TFBSs with a score threshold of 60%, using orthologous gene pairs (data not shown). Ratios of the sensitivity results obtained using single-sequence analysis to those obtained using phylogenetic footprinting are shown in Figure 2c.
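The overlap criterion used for sensitivity can be sketched as follows. This is an illustrative reading of the rule, assuming half-open coordinates and requiring that the prediction and the annotated site belong to the same factor; the names `is_detected` and `sensitivity`, the factor labels and the coordinates are hypothetical.

```python
def is_detected(prediction, known):
    """True if a prediction counts as a hit on a known site.

    Both arguments are (factor, start, end) with half-open coordinates.
    A hit requires the same factor and an overlap of at least 50% of
    the width of the narrower of the two intervals.
    """
    if prediction[0] != known[0]:
        return False
    overlap = min(prediction[2], known[2]) - max(prediction[1], known[1])
    narrower = min(prediction[2] - prediction[1], known[2] - known[1])
    return overlap > 0 and 2 * overlap >= narrower


def sensitivity(predictions, known_sites):
    """Fraction of known sites hit by at least one prediction."""
    detected = sum(
        1 for k in known_sites if any(is_detected(p, k) for p in predictions)
    )
    return detected / len(known_sites)


# Hypothetical annotated sites and predictions.
known_sites = [("TF_A", 100, 110), ("TF_B", 200, 210)]
predictions = [("TF_A", 104, 116), ("TF_B", 208, 218)]
print(sensitivity(predictions, known_sites))
```

In this toy case the first prediction overlaps its annotated site by 6 of 10 bp and is counted, whereas the second overlaps by only 2 of 10 bp and is not.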

Performance assessment with an extended phylogenetic footprinting TFBS reference collection

Assessment of comparative genome analysis methods requires a broad collection of reference data to ensure that algorithms and settings are not overly oriented towards a few genes or factors. A phylogenetic footprinting reference collection was assembled on the basis of the TRANSFAC database [20, 21] (as described in the Materials and methods section). For the identification of orthologous genes, only intragenic regions (exons and introns) were used (that is, no potential promoters were included). In any such large-scale mapping, it is of critical importance to find truly orthologous sequences, as opposed to pseudogenes or homologs which have no selective pressure to retain functional binding sites. Our selection process resulted in 110 uniquely mapped TFBSs in 57 promoters of human-mouse orthologous gene pairs (available at [22]). The reference collection does not overlap with the initial set of 14 reference genes described above.

The promoter regions from the reference set were analyzed using the same procedures as were applied above (Figure 2d,e,f). In spite of the likelihood that the new reference collection will have greater noise than the small set collected by detailed literature analysis, the performance results are comparable between sets. The sensitivity is slightly lower for the large collection (Figure 2e,f), which in addition to the potential difference in annotation standards could be attributable to the TFs associated with the sites. The average information content of the models for TFs linked to sites in the reference collection is lower than that for the factors associated with the small test set (median information content: 9.7, as compared to 15.3 bits in the first test set). Selectivity performance is virtually identical to that of the first test set (Figure 2d,f).

Web implementation

The algorithm described for the identification of regulatory regions by comparative sequence analysis has been implemented as an intuitive and easy to use web service named ConSite [23]. The implementation allows for three analysis modes: first, alignment and conserved-site analysis of two orthologous genomic sequences applying one or more TF profiles; second, conserved site analysis on a submitted alignment, which allows users to generate alignments from their preferred tools and allows for the analysis of longer genomic sequences; and third, a single-sequence analysis tool. The single-sequence service is functionally comparable to the TESS system [24], but utilizes the JASPAR profile collection [15]. Alignment submission accepts the de facto standard CLUSTALW format [25]. In all operating modes, users are allowed to submit a cDNA sequence to define exon locations. Users may also submit new matrix profiles of their own construction.

Results can be obtained in three distinct report formats. Graphical view (Figure 3a) displays an alignment overview and conservation plots with an x-axis reference for each submitted sequence. Positions of conserved TFBSs are indicated above the plot. The transcription-factor labels are equipped with a mouse-over function that displays additional data (the name and structural class of the factor, and the absolute and relative site scores), and are hyperlinked to further information on the TF and its binding profile (Figure 3b). The pop-up windows provide data summaries, including a sequence logo (a graphical representation of the specificity of the profile based on position-specific information content [26]) with the corresponding profile from the database. Alignment view (Figure 3c) provides a detailed overview of the detected potential TFBSs displayed on the sequence. The numbering indicates positions in the actual sequences, and the predicted TFBSs are marked. For convenience, a tabular output of detected sites with associated details is also provided in Table view.

Discussion

Comparison of orthologous genomic sequences is an effective method for the identification of segments likely to mediate a sequence-specific biological function. The performance of phylogenetic footprinting methods for the detection of TFBSs is dependent upon multiple factors, including the alignment algorithm, the available binding profiles and the evolutionary distance between the target sequences. Two key data resources are introduced in this study: a novel collection of transcription-factor binding profiles compiled from the biological research literature and a reference test set for phylogenetic footprinting methods. The ConSite web interface to the system facilitates user control, an essential feature for users studying diverse genomes.

The binding profile collection is an important resource for bioinformatics projects. Like the TFBS programming system [27], the JASPAR profile collection is available freely to the research community [15]. The profiles are non-redundant and are restricted to those cases for which sufficient binding data were available to generate a meaningful representation of the binding specificity of a TF. Continuing expansion of the collection is anticipated, given the strong research progress in modeling DNA binding sites [28].

The new phylogenetic footprinting reference collection of TFBSs allows for quantitative assessment of the performance of new methods. This is the largest collection of its kind available for broad use. In our study, we could detect around 68% of the experimentally defined TFBSs in conserved segments (at 65% relative matrix score threshold; see Figure 2). This differs slightly from the outcome of a study of conservation properties proximal to TFBSs [29], which indicated that only around 50% of sites are situated in conserved regions. There are several key factors that may account for this difference. The procedures for defining the collections were different. For instance, the amount of flanking sequence used for mapping the locations of the sites onto genome sequences was lower in the previous study. These short fragments were mapped onto a commercial human genome assembly and the mapped regions compared to shotgun-generated fragments of mouse genomes from multiple strains. The alignment procedures were also different, with the older set aligned by BLAST [30] and assessed by a stringent similarity threshold (> 80% identity over 40 bp). There was no exclusion of pseudogenes or paralogous genes indicated in the previous study, which would result in decreased sensitivity due to the erroneous application of phylogenetic footprinting to genes evolving under distinct evolutionary pressures.

While the work presented here focuses on mammalian sequence comparisons, there is no limitation within the ConSite system precluding studies of other organisms (the ConSite website includes samples with insect and nematode sequences). In the future it will be important to develop methods capable of analyzing multiple genomic sequences in parallel, but this is a non-trivial task. Such a system must allow for weighting based on evolutionary distances to preserve sensitivity, and requires advances in multiple sequence alignment algorithms. Some steps in this direction are beginning to emerge [31, 32].

No single resource offers the same set of functions or integration as ConSite. The only similarly scoped resource is the recently published rVista [33], which searches for TFBSs in a reference sequence and filters the results for sites in regions of high conservation with respect to a second genomic sequence. Unlike rVista, ConSite searches both sequences for TFBSs, for better specificity, and enables easy modification of the parameters for interactive analysis, as well as providing different output formats to aid the design and interpretation of experiments in molecular biotechnology. ConSite's publicly available collection of transcription-factor profiles allows users to access information about the TFs associated with the predicted sites. Given that many users focus on a specific TF and have developed high-quality models of their own, ConSite also allows for user-defined profiles.

We present an algorithm that uses phylogenetic footprinting to identify potential TFBSs. The approach to identifying regulatory elements presented here yields greater specificity than previous approaches that were based purely on profile searches of single genomic sequences. In short, using phylogenetic footprinting to filter the computational predictions significantly reduces noise at the price of a slight decrease in sensitivity. The web application we present enables researchers to utilize this approach in a straightforward manner. With the culmination of the human and mouse genome sequencing efforts [34, 35], we believe this new algorithm will be of significant use in the ongoing efforts to ascribe function to non-coding sequences.

Materials and methods

Genomic sequence alignment

As a result of the low overall similarity of non-coding regions across moderate evolutionary distances (for example, between human and mouse), many alignment algorithms will fail to produce biologically meaningful alignments or will require an arduous process to tune the algorithm parameters. In order to obtain high-quality global alignments, we utilized the DPB algorithm (L.M. and W.W., unpublished; see [23]), which is optimized for the global alignment of long genomic sequences containing short, colinear segments of similarity.

Measurement of local similarity in global alignments

The most common approach used to measure local similarity between two globally aligned orthologous sequences utilizes a fixed-size sliding window to scan an alignment and identify segments containing a minimum number of identical nucleotides. The difficulties that arise with sliding-window approaches are related to the treatment of edges and gaps in the alignment. Sliding a window along the alignment itself will assign a low identity score to short regions of high identity flanked by long regions of greater variation (for example, a large gap or insertion in one of the sequences). We elected to collapse the gaps in the alignment (that is, to remove the positions containing gaps in the sequence in question) and to calculate a separate conservation profile for each orthologous sequence.
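The gap-collapsing step can be sketched as follows. This is a minimal illustration rather than the ConSite implementation: the function name `conservation_profile`, the treatment of edges (only complete windows are scored) and the case-insensitive comparison are assumptions.

```python
def conservation_profile(seq_a, seq_b, window=50):
    """Percent identity in a sliding window along seq_a.

    seq_a, seq_b: aligned strings of equal length; '-' marks a gap.
    Alignment columns where seq_a has a gap are removed (collapsed)
    before windowing, so the profile is in ungapped seq_a coordinates.
    Returns one value per complete window.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    # 1 for an identical column, 0 otherwise (including gaps in seq_b).
    matches = [
        1 if a.upper() == b.upper() else 0
        for a, b in zip(seq_a, seq_b)
        if a != "-"
    ]
    if len(matches) < window:
        return []
    current = sum(matches[:window])
    profile = [100.0 * current / window]
    for i in range(window, len(matches)):
        current += matches[i] - matches[i - window]
        profile.append(100.0 * current / window)
    return profile


# Toy alignment with a two-column gap in the first sequence.
print(conservation_profile("ACGT--ACGT", "ACGTTTACGA", window=4))
```

Calling the function a second time with the arguments swapped gives the separate profile for the other sequence, in which the inserted bases count as mismatches rather than being dropped.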

Classification of motif-match conservation within aligned genomic sequences

Within the conserved segments, conserved sites are detected by, firstly, scanning each of the two orthologous sequences with position-specific weight matrices [1] for the TFs of interest, and secondly, retaining only those predicted sites (for each given TF model) that are in equivalent positions in the alignment. The scores for matches to the position-specific weight matrix models must exceed the user-defined relative matrix score threshold.
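The second step, retaining only sites found in equivalent alignment positions, might be sketched as below. The sketch assumes that "equivalent positions" means identical alignment start columns for hits of the same model on the same strand, and that score thresholds and the conserved-window filter have already been applied to the inputs; the function names and toy data are hypothetical.

```python
def column_index(aligned_seq):
    """Map each ungapped sequence position to its alignment column."""
    return [col for col, ch in enumerate(aligned_seq) if ch != "-"]


def conserved_sites(sites_a, sites_b, aligned_a, aligned_b):
    """Keep sites from sequence A that have a partner in sequence B.

    sites_a, sites_b: lists of (model, start, strand), with start as a
    0-based position in the ungapped sequence.
    A partner is a hit of the same model on the same strand that starts
    at the same alignment column (an assumed definition of equivalence).
    """
    cols_a = column_index(aligned_a)
    cols_b = column_index(aligned_b)
    partners = {(model, cols_b[start], strand) for model, start, strand in sites_b}
    return [
        (model, start, strand)
        for model, start, strand in sites_a
        if (model, cols_a[start], strand) in partners
    ]


# Toy alignment: the first sequence has a two-column gap.
aligned_a = "ACG--TTGA"
aligned_b = "ACGCCTTGA"
sites_a = [("TF_A", 3, "+"), ("TF_B", 0, "+")]
sites_b = [("TF_A", 5, "+"), ("TF_B", 2, "+")]
print(conserved_sites(sites_a, sites_b, aligned_a, aligned_b))
```

Here the TF_A hits fall on the same alignment column despite different sequence coordinates and are retained, while the TF_B hits do not line up and are discarded.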

Collection and annotation of binding models

All profiles are derived from published collections of experimentally defined TFBSs for multicellular eukaryotes. The database, named JASPAR [15], represents a curated collection of target sequences. The motif-detection program ANN-Spec [36] was used to align each binding site set. The ANN-Spec alignments were performed with a range of motif widths, using three random seeds and 80,000 iterations. The profile matrices and associated information are stored in a relational database (MySQL); a flat file representation of the data is available for academic use [22]. Users may also submit their own profiles for private use within the ConSite system.

Identification of relative matrix score thresholds

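Candidate TFBSs in individual sequences have a score as determined by the position weight matrix for the given sequence, which has been reviewed elsewhere [1]. The score ranges are unique for each binding model, so it is advantageous to convert the score range to a common, relative unit scale as given by

$$S_{\mathrm{rel}} = \frac{S - S_{\min}}{S_{\max} - S_{\min}} \times 100\%$$

where $S$ is the matrix score of the candidate site, and $S_{\min}$ and $S_{\max}$ are the lowest and highest scores attainable with the model.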

Score ranges are used for defining relative matrix score thresholds. The applied scoring method is in direct relation to the protein-DNA binding energy [1], and it therefore does not take into account statistical significance of an observed motif in relation to the local nucleotide composition (for example, GC-rich regions). The influence of the background distribution on the protein-DNA interaction is poorly understood. This is recognized as an open problem within the field, as it is highly controversial whether the surrounding base composition could have any influence on the thermodynamics of binding [37]. For these reasons, we choose to score the matrix profiles using a uniform base composition.
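A relative score of this kind can be sketched in a few lines, assuming it is the usual min-max rescaling of the raw matrix score onto a 0-100% scale. The function names and the toy weight matrix are hypothetical and do not correspond to any profile in the collection.

```python
def score_range(matrix):
    """Lowest and highest scores attainable with a weight matrix.

    matrix: list of columns, each a dict mapping base -> weight.
    """
    lowest = sum(min(column.values()) for column in matrix)
    highest = sum(max(column.values()) for column in matrix)
    return lowest, highest


def relative_score(matrix, site):
    """Score of `site` rescaled to 0-100% of the model's score range."""
    if len(site) != len(matrix):
        raise ValueError("site length must equal matrix width")
    score = sum(column[base] for column, base in zip(matrix, site))
    lowest, highest = score_range(matrix)
    return 100.0 * (score - lowest) / (highest - lowest)


# Toy three-column weight matrix (hypothetical weights).
matrix = [
    {"A": 2, "C": -2, "G": -2, "T": -2},
    {"A": -1, "C": 1, "G": -1, "T": -1},
    {"A": -3, "C": -3, "G": 3, "T": 1},
]
for site in ("ACG", "ACT", "CAA"):
    print(site, relative_score(matrix, site))
```

Because every model is mapped onto the same scale, a single threshold such as 80% can be applied across profiles whose raw score ranges differ.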

Parameter settings and manipulation

In all three analysis modes the user can choose relative matrix score thresholds (default 80%). In alignment analysis modes, one can also choose the size of the sliding window (default 50 nucleotides) and the conservation cutoff (percentage sequence identity within the window for the definition of conserved regions). There is no fixed default value for the latter parameter; instead, the conservation cutoff is set to retain the top 10% of conserved windows (based on nucleotide identity within a window of sequence in the alignment). This latter mechanism was motivated by the different rates of evolution across genomes.
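The data-dependent conservation cutoff can be illustrated with the following sketch, which picks the identity value that retains approximately the top 10% of windows. The function name `top_percent_cutoff`, the rounding rule and the handling of ties (all windows at the cutoff value are kept) are assumptions, as these details are not specified here.

```python
def top_percent_cutoff(window_identities, percent=10):
    """Identity cutoff that retains roughly the top `percent` of windows.

    window_identities: percent-identity values, one per window.
    The number of windows to keep is rounded up and is at least one;
    windows tied with the cutoff value are also retained.
    """
    if not window_identities:
        raise ValueError("no windows supplied")
    ranked = sorted(window_identities, reverse=True)
    keep = max(1, (len(ranked) * percent + 99) // 100)
    return ranked[keep - 1]


# Twenty hypothetical window identities: 40, 43, ..., 97.
identities = list(range(40, 100, 3))
cutoff = top_percent_cutoff(identities)
conserved = [value for value in identities if value >= cutoff]
print(cutoff, conserved)
```

A cutoff derived in this way adapts to each alignment, so a slowly evolving gene and a rapidly evolving gene both yield a comparable fraction of windows labeled as conserved.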

Matrix manipulation, site detection and phylogenetic footprinting

For matrix manipulation, TFBS detection and some other actions (such as sequence 'logo' drawing) we made extensive use of the 'TFBS software', a set of object-oriented Perl modules (with extensions in C and C++) developed for the acceleration of promoter analysis scripting [38].

The phylogenetic footprinting TFBS reference collection

An initial set of annotated binding sites was identified from TRANSFAC (version 4.0) [20, 21] for human (662 sites) and mouse (376 sites). Each binding site was extended with 50 bp of flanking sequence in both directions from the respective promoter to allow unambiguous mapping onto the corresponding genome assembly (human version hg13 and mouse version mm2 [39, 40]). Only sites bound by a TF with a corresponding matrix model in the JASPAR collection were kept.

In order to define orthology without regard to the sequences flanking the binding sites (which would introduce circularity problems), we defined human-mouse pairings on the basis of cDNA sequences. The mappings of GenBank [41] and RefSeq [42, 43] cDNAs to the assemblies were obtained from the UCSC Genome Browser Database [39, 40]. In addition 50,821 mouse cDNAs from the RIKEN project [44] were mapped to the mouse genome assembly using the client/server version of BLAT [45] with default settings. In brief, for all mappings of a given cDNA, we consider only those with cDNA coverage > 75% and with > 99% sequence identity to the genomic sequence, then sort the set by (number of matches)*(cDNA coverage), and finally take the first mapping in the sorted set.
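The selection of a representative mapping for each cDNA can be sketched as below. The `Mapping` record and its field names are hypothetical, and the sort is assumed to be in descending order of (number of matches) × (cDNA coverage); the example values are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Mapping:
    chrom: str
    start: int
    matches: int      # number of matching bases
    coverage: float   # fraction of the cDNA aligned, 0 to 1
    identity: float   # fraction identical to the genomic sequence, 0 to 1


def best_mapping(mappings: Sequence[Mapping]) -> Optional[Mapping]:
    """Pick one representative genomic mapping for a cDNA.

    Keeps mappings with coverage > 75% and identity > 99%, then returns
    the one with the largest matches * coverage (assumed descending
    sort), or None if no mapping passes the filters.
    """
    eligible = [m for m in mappings if m.coverage > 0.75 and m.identity > 0.99]
    if not eligible:
        return None
    return max(eligible, key=lambda m: m.matches * m.coverage)


# Hypothetical mappings of a single cDNA.
candidates = [
    Mapping("chr1", 1000, 1800, 0.95, 0.995),
    Mapping("chr7", 5000, 1900, 0.70, 0.999),  # fails the coverage filter
    Mapping("chr2", 2000, 1500, 0.80, 0.992),
]
print(best_mapping(candidates))
```

Weighting the match count by coverage favors mappings that account for most of the transcript over longer but partial hits, such as those to processed pseudogenes or paralogs.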

Each promoter fragment was mapped to its corresponding genome assembly using BLAT, as above. Extended site sequences that unambiguously mapped to the promoter region of the TRANSFAC annotated gene were kept. For each mapped TRANSFAC binding site, the nearest downstream cDNA mapping was located and the GeneLynx record containing that cDNA retrieved. cDNAs with mouse-human ortholog pairs defined in the GeneLynx Mouse [46] database were retained.

For a pair of cDNA sequences thus identified, the genomic sequences spanning representative mappings were extracted and aligned, using BLASTZ [47] (default settings). For each aligned sequence pair, the alignment coverage and the similarities in gene structure as indicated by the mappings were manually evaluated to select not more than one orthologous region per initial TFBS-cDNA-GeneLynx identifier 'triplet'. Promoter-region pairs corresponding to 1,000 bp upstream of the binding site and 100 bp into the first exon were extracted, using the BLASTZ alignment as reference.

References

Stormo GD: DNA binding sites: representation and discovery. Bioinformatics. 2000, 16: 16-23. 10.1093/bioinformatics/16.1.16.
Article CAS PubMed Google Scholar
Tronche F, Ringeisen F, Blumenfeld M, Yaniv M, Pontoglio M: Analysis of the distribution of binding sites for a tissue-specific transcription factor in the vertebrate genome. J Mol Biol. 1997, 266: 231-245. 10.1006/jmbi.1996.0760.
Article CAS PubMed Google Scholar
Fickett JW: Quantitative discrimination of MEF2 sites. Mol Cell Biol. 1996, 16: 437-441.
Article PubMed Central CAS PubMed Google Scholar
Gumucio DL, Heilstedt-Williamson H, Gray TA, Tarle SA, Shelton DA, Tagle DA, Slightom JL, Goodman M, Collins FS: Phylogenetic footprinting reveals a nuclear protein which binds to silencer sequences in the human gamma and epsilon globin genes. Mol Cell Biol. 1992, 12: 4919-4929.
Article PubMed Central CAS PubMed Google Scholar
Pennacchio LA, Rubin EM: Genomic strategies to identify mammalian regulatory sequences. Nat Rev Genet. 2001, 2: 100-109. 10.1038/35052548.
Article CAS PubMed Google Scholar
Fickett JW, Wasserman WW: Discovery and modeling of transcriptional regulatory regions. Curr Opin Biotechnol. 2000, 11: 19-24. 10.1016/S0958-1669(99)00049-X.
Article CAS PubMed Google Scholar
Loots GG, Locksley RM, Blankespoor CM, Wang ZE, Miller W, Rubin EM, Frazer KA: Identification of a coordinate regulator of interleukins 4, 13, and 5 by cross-species sequence comparisons. Science. 2000, 288: 136-140. 10.1126/science.288.5463.136.
Article CAS PubMed Google Scholar
Batzoglou S, Pachter L, Mesirov JP, Berger B, Lander ES: Human and mouse gene structure: comparative analysis and application to exon prediction. Genome Res. 2000, 10: 950-958. 10.1101/gr.10.7.950.
Article PubMed Central CAS PubMed Google Scholar
Jareborg N, Durbin R: Alfresco – a workbench for comparative genomic sequence analysis. Genome Res. 2000, 10: 1148-1157. 10.1101/gr.10.8.1148.
Article PubMed Central CAS PubMed Google Scholar
Wasserman WW, Palumbo M, Thompson W, Fickett JW, Lawrence CE: Human-mouse genome comparisons to locate regulatory sites. Nat Genet. 2000, 26: 225-228. 10.1038/79965.
Krivan W, Wasserman WW: A predictive model for regulatory sequences directing liver-specific transcription. Genome Res. 2001, 11: 1559-1566. 10.1101/gr.180601.
Gelfand MS, Novichkov PS, Novichkova ES, Mironov AA: Comparative analysis of regulatory patterns in bacterial genomes. Brief Bioinform. 2000, 1: 357-371.
McCue L, Thompson W, Carmack C, Ryan MP, Liu JS, Derbyshire V, Lawrence CE: Phylogenetic footprinting of transcription factor binding sites in proteobacterial genomes. Nucleic Acids Res. 2001, 29: 774-782. 10.1093/nar/29.3.774.
Pollock R, Treisman R: A sensitive method for the determination of protein-DNA binding specificities. Nucleic Acids Res. 1990, 18: 6197-6204.
JASPAR database. [http://www.phylofoot.org/consite/download]
Kimura M: Preponderance of synonymous changes as evidence for the neutral theory of molecular evolution. Nature. 1977, 267: 275-276.
Shabalina SA, Ogurtsov AY, Kondrashov VA, Kondrashov AS: Selective constraint in intergenic regions of human and mouse genomes. Trends Genet. 2001, 17: 373-376. 10.1016/S0168-9525(01)02344-7.
Duret L, Bucher P: Searching for regulatory elements in human noncoding sequences. Curr Opin Struct Biol. 1997, 7: 399-406. 10.1016/S0959-440X(97)80058-9.
Arnone MI, Davidson EH: The hardwiring of development: organization and function of genomic regulatory systems. Development. 1997, 124: 1851-1864.
Matys V, Fricke E, Geffers R, Gossling E, Haubrock M, Hehl R, Hornischer K, Karas D, Kel AE, Kel-Margoulis OV, et al: TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Res. 2003, 31: 374-378. 10.1093/nar/gkg108.
TRANSFAC – The Transcription Factor Database. [http://transfac.gbf.de/TRANSFAC/]
Extended TFBS test set. [http://www.phylofoot.org/consite/testset]
Phylofoot.org tools for phylogenetic footprinting. [http://www.phylofoot.org/]
TESS: Transcription Element Search System. [http://www.cbil.upenn.edu/tess/]
Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994, 22: 4673-4680.
Schneider TD, Stephens RM: Sequence logos: a new way to display consensus sequences. Nucleic Acids Res. 1990, 18: 6097-6100.
Lenhard B, Hayes WS, Wasserman WW: GeneLynx: a gene-centric portal to the human genome. Genome Res. 2001, 11: 2151-2157. 10.1101/gr.199801.
Bulyk ML, Huang X, Choo Y, Church GM: Exploring the DNA-binding specificities of zinc fingers with DNA microarrays. Proc Natl Acad Sci USA. 2001, 98: 7158-7163. 10.1073/pnas.111163698.
Levy S, Hannenhalli S: Identification of transcription factor binding sites in the human genome sequence. Mamm Genome. 2002, 13: 510-514. 10.1007/s00335-002-2175-6.
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25: 3389-3402. 10.1093/nar/25.17.3389.
Blanchette M, Schwikowski B, Tompa M: Algorithms for phylogenetic footprinting. J Comput Biol. 2002, 9: 211-223. 10.1089/10665270252935421.
Boffelli D, McAuliffe J, Ovcharenko D, Lewis KD, Ovcharenko I, Pachter L, Rubin EM: Phylogenetic shadowing of primate sequences to find functional regions of the human genome. Science. 2003, 299: 1391-1394. 10.1126/science.1081331.
Loots GG, Ovcharenko I, Pachter L, Dubchak I, Rubin EM: rVista for comparative sequence-based discovery of functional transcription factor binding sites. Genome Res. 2002, 12: 832-839. 10.1101/gr.225502. Article published online before print in April 2002.
Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, et al: Initial sequencing and analysis of the human genome. Nature. 2001, 409: 860-921. 10.1038/35057062.
Waterston RH, Lindblad-Toh K, Birney E, Rogers J, Abril JF, Agarwal P, Agarwala R, Ainscough R, Alexandersson M, An P, et al: Initial sequencing and comparative analysis of the mouse genome. Nature. 2002, 420: 520-562. 10.1038/nature01262.
Workman CT, Stormo GD: ANN-Spec: a method for discovering transcription factor binding sites with improved specificity. Pac Symp Biocomput. 2000, 467-478.
Schneider TD: Measuring molecular information. J Theor Biol. 1999, 201: 87-92. 10.1006/jtbi.1999.1012.
Lenhard B, Wasserman WW: TFBS: Computational framework for transcription factor binding site analysis. Bioinformatics. 2002, 18: 1135-1136. 10.1093/bioinformatics/18.8.1135.
Karolchik D, Baertsch R, Diekhans M, Furey TS, Hinrichs A, Lu YT, Roskin KM, Schwartz M, Sugnet CW, Thomas DJ, et al: The UCSC Genome Browser Database. Nucleic Acids Res. 2003, 31: 51-54. 10.1093/nar/gkg129.
Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D: The human genome browser at UCSC. Genome Res. 2002, 12: 996-1006. 10.1101/gr.229102. Article published online before print in May 2002.
GenBank. [http://www.ncbi.nlm.nih.gov/Genbank/index.html]
Pruitt KD, Maglott DR: RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res. 2001, 29: 137-140. 10.1093/nar/29.1.137.
RefSeq. [http://www.ncbi.nlm.nih.gov/RefSeq/]
Okazaki Y, Furuno M, Kasukawa T, Adachi J, Bono H, Kondo S, Nikaido I, Osato N, Saito R, Suzuki H, et al: Analysis of the mouse transcriptome based on functional annotation of 60,770 full-length cDNAs. Nature. 2002, 420: 563-573. 10.1038/nature01266.
Kent WJ: BLAT – the BLAST-like alignment tool. Genome Res. 2002, 12: 656-664. 10.1101/gr.229202. Article published online before print in March 2002.
Lenhard B, Wahlestedt C, Wasserman W: GeneLynx Mouse: integrated portal to the mouse genome. Genome Res. 2003,
Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W: Human-mouse alignments with BLASTZ. Genome Res. 2003, 13: 103-107. 10.1101/gr.809403.
MEDLINE. [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi]
Cao A, Moi P: Regulation of the globin genes. Pediatr Res. 2002, 51: 415-421.

Acknowledgements

This project was supported by funds from the Karolinska Institute and the Pharmacia Corporation.

Author information

Luis Mendoza
Present address: Serono Research and Development, CH-1121, Geneva 20, Switzerland
Niclas Jareborg
Present address: AstraZeneca Research and Development, S-151 85, Södertälje, Sweden
Wyeth W Wasserman
Present address: Centre for Molecular Medicine and Therapeutics, University of British Columbia, Vancouver, BC, V5Z 4H4, Canada

Authors and Affiliations

Center for Genomics and Bioinformatics, Karolinska Institutet, 171 77, Stockholm, Sweden
Boris Lenhard, Albin Sandelin, Luis Mendoza, Pär Engström, Niclas Jareborg & Wyeth W Wasserman

Authors

Boris Lenhard
Albin Sandelin
Luis Mendoza
Pär Engström
Niclas Jareborg
Wyeth W Wasserman

Corresponding author

Correspondence to Wyeth W Wasserman.

Additional information

Boris Lenhard and Albin Sandelin contributed equally to this work.

About this article

Cite this article

Lenhard, B., Sandelin, A., Mendoza, L. et al. Identification of conserved regulatory elements by comparative genome analysis. J Biol 2, 13 (2003). https://doi.org/10.1186/1475-4924-2-13

Received: 12 December 2002
Revised: 21 March 2003
Accepted: 08 April 2003
Published: 22 May 2003
DOI: https://doi.org/10.1186/1475-4924-2-13
