All systems GO for understanding mouse gene function

It is widely supposed that the tissue specificity of gene expression indicates gene function. Now, an extensive analysis of gene expression in the mouse reveals that quantitative measurement of expression levels in different tissues can contribute powerfully to the prediction of gene function.

One approach to discovering gene function in a mammal is to mutate the gene in a mouse, and a number of methods are available for introducing mutations into mouse genes. We are likely to see, over the next few years, a systematic effort that aims to obtain a variety of mutant alleles for every gene in the mouse genome [4,5]. But this is the easy part, and determining the phenotypic consequences of each mutation represents an effort many times greater than the generation of the mutations themselves. In addition, determining the phenotype of a mutant gene often begins with making some assumptions about the likely function of the gene on the basis of a number of observations. One common starting point for the curious molecular geneticist is to ask, "Where is the gene expressed?" Tissue-restricted patterns of expression might be expected to tell us something about a gene's function and where to look for phenotypes when examining mutants, but this is fraught with pitfalls. For example, we might assume straightforwardly that expression in a particular tissue indicates that a gene plays some physiological role there. But many mutants fail to reveal phenotypes in at least some of the tissues in which the wild-type version of the gene is normally expressed. Revealing a function for the gene being studied in those tissues may be contingent on perturbations in other molecules or pathways, adding an extra layer of complexity to the analysis. Moreover, many genes are widely expressed, effectively nullifying expression patterns as a predictor. Thus, although they are widely used, it is clear that tissue-specific expression patterns are a very blunt tool in the molecular geneticist's armory. Now, in an article in Journal of Biology, Zhang and colleagues [6] have tackled this problem head-on.
Beginning with the fact that analyses of gene-expression patterns have successfully been used in yeast and the nematode Caenorhabditis elegans to determine gene function, they surmised that similar approaches would be applicable in mammals and that comparison of quantitative gene-expression patterns would uncover co-regulated genes that may represent functional categories. If this were the case then a systematic determination of expression patterns for the bulk of genes across a wide variety of tissues in the mouse would be one route to determining novel gene function. In a tour de force that represents one of the most extensive analyses of mammalian gene expression published to date, they analyzed the expression patterns for 40,000 known and predicted mRNAs across 55 diverse tissues. Their analyses provide quite startling conclusions revealing that, in contrast to the simple binary output (expressed/not expressed) that is the usual representation of tissue-specific expression data, quantitative measurements contain critical information that is powerfully predictive of function.
The analysis [6] is based on data generated from a single dye-swap cDNA microarray experiment [7]. On the face of it, this appears a very small number of samples to support such an ambitious study. But some impressive quality-control checks were put in place to ensure robustness of results, including comparison of measurements of known tissue-specific genes, cross-referencing of related studies, and reverse-transcription-coupled real-time PCR. Moreover, by constructing an empirical null distribution for differential expression from 'negative control' transcripts (non-coding, randomly generated and yeast transcripts), they were able to filter the 40,000 measurements down to 21,622 genes that could be confidently said to exhibit differential expression in at least some tissues.
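The filtering idea can be illustrated with a minimal sketch. It is not the procedure of Zhang et al. [6]: the scores, the cutoff `alpha`, the gene names and both function names are hypothetical, and the negative-control scores are simulated. The sketch takes a high quantile of the negative-control scores as a threshold and keeps only genes that exceed it in at least one tissue.

```python
import random

def empirical_threshold(null_scores, alpha=0.01):
    """Return the score exceeded by roughly a fraction alpha of negative controls."""
    ordered = sorted(null_scores)
    # Index of the (1 - alpha) empirical quantile, clamped to the last element.
    idx = min(len(ordered) - 1, int((1.0 - alpha) * len(ordered)))
    return ordered[idx]

def differentially_expressed(profiles, threshold):
    """Keep genes whose largest absolute score in any tissue exceeds the threshold."""
    return [gene for gene, scores in profiles.items()
            if max(abs(s) for s in scores) > threshold]

# Simulated negative-control scores (assumption: standard normal noise).
rng = random.Random(0)
null_scores = [abs(rng.gauss(0.0, 1.0)) for _ in range(1000)]
threshold = empirical_threshold(null_scores, alpha=0.01)

# Hypothetical per-tissue scores for two genes across three tissues.
profiles = {
    "gene_flat": [0.1, -0.2, 0.05],
    "gene_tissue_specific": [0.1, 9.0, -0.3],
}
kept = differentially_expressed(profiles, threshold)
print(kept)
```

Because the threshold comes from the controls themselves, no parametric form for the noise has to be assumed; the Gaussian here is only used to generate toy data.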
In terms of scope, the study is similar to the work of Su et al. [8], who considered 46 and 45 human and mouse tissue lines, respectively, measured using Affymetrix gene chips. But a key departure is that Zhang and colleagues [6] chose to investigate the relationship between gene function, as specified by Gene Ontology 'Biological Process' (GO-BP) categories [9], and quantitative gene-expression measurements. The controlled hierarchical vocabulary that comprises the Gene Ontology includes one 'layer' describing the biological process, such as signaling or RNA processing, in which a gene functions; other layers indicate cellular component (or localization) and molecular function. Gene Ontology thus provides a rich source of information that will become increasingly integrated into analysis of experimental data derived from the emerging '-omics' platforms, including transcriptomics. The combining of qualitative ontology models and quantitative gene expression in mouse functional genomics is a powerful and original approach that is likely to prove fruitful in other mammals and in cross-species comparative studies. In comparison, the conventional approach of examining tissue-specific expression clearly loses resolution to the point that the geneticist may miss many interesting functional profiles. Indeed, as Zhang and colleagues show [6], tissue specificity alone can be a poor predictor of gene function.
With this in mind, Zhang et al. [6] put forward their central hypothesis that the pattern of gene expression across tissues provides a multivariate discriminative signature of gene function: that is, knowing the expression level in several tissues at once provides a more detailed description of gene function. Visual examination of the gene-expression profiles appears to support this strongly (see Figure 1), but the eye can be deceived. To test the assertion more rigorously, machine-learning (pattern-recognition) algorithms were used to infer a model to predict function for 7,387 genes labeled by Gene Ontology, using the expression measurements for the genes across the 55 tissues. If the authors' hypothesis is correct then the vector of 55 tissue-specific expression measurements should contain discriminative information on gene function and the algorithm should be able to classify correctly the corresponding GO-BP annotations. The results show that there is indeed significant predictive information, as compared to a control experiment using randomized gene labels. Zhang et al. [6] then proceeded to use the model to predict the physiological function for 12,123 unannotated genes (in other words, those with no associated GO-BP label). Of the 12,000 or so tested, Zhang et al. concentrated on a subset of 1,092 genes that had predicted precision scores above 50%; this represents the subset of unannotated genes about which the algorithm is most confident in making a prediction. In order to see whether this confidence is warranted, supporting literature, protein-domain information and de novo functional analysis were used, all of which largely validated the predictions. Put together, these findings constitute conclusive evidence that cross-tissue patterns of gene expression can provide signatures of gene function.
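The logic of the randomized-label control can be sketched on toy data. This is only an illustration under stated assumptions: a simple nearest-centroid classifier stands in for the authors' method, the data are simulated with five tissues rather than 55, and the category names and all function names are hypothetical. If expression profiles carry information about function, accuracy with the true labels should exceed accuracy after the labels are shuffled.

```python
import random

def centroid(rows):
    """Mean expression profile of a list of genes."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def sq_dist(a, b):
    """Squared Euclidean distance between two profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def loo_accuracy(profiles, labels):
    """Leave-one-out accuracy of a nearest-centroid classifier."""
    correct = 0
    for i, (held_out, true_label) in enumerate(zip(profiles, labels)):
        # Group every other gene by its label, then build one centroid per label.
        by_label = {}
        for j, (row, lab) in enumerate(zip(profiles, labels)):
            if j != i:
                by_label.setdefault(lab, []).append(row)
        predicted = min(by_label,
                        key=lambda lab: sq_dist(held_out, centroid(by_label[lab])))
        correct += predicted == true_label
    return correct / len(profiles)

rng = random.Random(1)
n_tissues = 5  # toy value; the study used 55 tissues

def simulate(peak_tissue):
    """Simulated profile: high expression in one tissue plus Gaussian noise."""
    return [(5.0 if t == peak_tissue else 0.0) + rng.gauss(0.0, 0.5)
            for t in range(n_tissues)]

profiles = [simulate(0) for _ in range(20)] + [simulate(3) for _ in range(20)]
labels = ["adhesion"] * 20 + ["rna_processing"] * 20

true_acc = loo_accuracy(profiles, labels)

# Control: same profiles, labels randomly reassigned.
shuffled = labels[:]
rng.shuffle(shuffled)
control_acc = loo_accuracy(profiles, shuffled)

print(true_acc, control_acc)
```

Holding each gene out before building the centroids matters here: without it, the classifier would be scored on genes it had already seen, and the comparison with the shuffled control would be less meaningful.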
In adopting a machine-learning approach, Zhang et al. chose support vector machines (SVMs) for making predictions, but it might have been just as fruitful to have used less sophisticated methods. Often simpler models, such as linear discriminant analysis, can work nearly as well, and moreover they are more interpretable [10]. For instance, it would be interesting to examine the discrimination profiles for the various GO-BP categories, for example by investigating the signature of tissue-specific expression that distinguishes, say, a 'cell-cell adhesion' gene from a non-cell-cell adhesion gene (see Figure 1). Furthermore, it would be interesting to investigate the weight given to each tissue in classifying a particular functional category, and to report for each functional category which tissue's measurements appeared most informative for function classification. It seems apparent that not all of the 55 tissues would be necessary for every GO-BP classification: some functional categories are likely to be characterized by a small subset of tissues, while for others we may need a wider profile in order to reach reasonable precision. Such an analysis may well reveal interesting structure within the data.
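As a rough illustration of what such per-tissue weights might look like, the sketch below computes a diagonal linear discriminant: for each tissue, the difference in mean expression between genes inside and outside a category, divided by the pooled variance. This is a simplification of full linear discriminant analysis that ignores covariance between tissues, and the tissue names, the simulated data and the function name are all hypothetical.

```python
import random
from statistics import mean, variance

def diagonal_lda_weights(in_category, out_category):
    """Per-tissue weight: mean difference divided by pooled variance.

    Assumes tissues are independent, so the covariance matrix is diagonal.
    """
    weights = []
    for pos, neg in zip(zip(*in_category), zip(*out_category)):
        pooled = (((len(pos) - 1) * variance(pos) + (len(neg) - 1) * variance(neg))
                  / (len(pos) + len(neg) - 2))
        weights.append((mean(pos) - mean(neg)) / pooled)
    return weights

rng = random.Random(2)
tissues = ["brain", "liver", "testis", "heart"]  # hypothetical tissue panel

def simulate(shift_in_liver):
    """Simulated profile with unit-variance noise and an optional shift in liver."""
    return [rng.gauss(shift_in_liver if t == "liver" else 0.0, 1.0) for t in tissues]

in_category = [simulate(3.0) for _ in range(30)]   # genes annotated to the category
out_category = [simulate(0.0) for _ in range(30)]  # all other genes

weights = diagonal_lda_weights(in_category, out_category)
most_informative = max(zip(tissues, weights), key=lambda tw: abs(tw[1]))[0]
print(dict(zip(tissues, weights)), most_informative)
```

A weight near zero suggests that a tissue contributes little to separating the category, so ranking tissues by absolute weight gives one simple view of which measurements a linear classifier would rely on.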
Speculating a little further, the work opens up the intriguing prospect of using quantitative information to help refine and further validate the qualitative models themselves. For example, if a particular Gene Ontology category is poorly discriminated by gene-expression data, does this suggest that the ontological level is perhaps too coarse? Expression profiles may suggest a refinement of ontological categories; for example, as noted by Zhang et al. [6], the category 'cell-cell adhesion' appears to contain three distinct sub-groups. Does this suggest that this GO-BP category may need refinement? In conclusion, the work by Zhang et al. [6] provides us with a clear message: a carefully designed study using Gene Ontology and quantitative expression profiles can reveal functional relationships and can be a powerful predictor of gene function. In addition, the study provides an important resource for the genetics community, one that will be built upon in the future as we attempt to provide a comprehensive picture of the roles and functional discriminations behind every gene in the mammalian genome.