Co-regulation of mouse genes predicts function
© BioMed Central Ltd 2004
Published: 6 December 2004
Large-scale microarray analyses reveal that transcriptional co-regulation patterns can be remarkably helpful in predicting the function of novel mouse genes.
Hughes became something of a microarray aficionado during his postdoc at Rosetta Inpharmatics, LLC in Seattle, USA. He and his colleagues there demonstrated that a careful combination of genome-wide microarray analysis of gene expression patterns and sophisticated statistical methods could be used to predict gene function. Specifically, they showed that patterns of transcriptional co-regulation could effectively predict the biological function of novel genes. But those impressive studies were performed in a unicellular yeast, which has around 6,000 genes in total. It wasn't clear how well the approach would fare with larger mammalian genomes and the complexity of multicellular organisms. When Hughes moved to the University of Toronto, Canada, he was eager to give it a try. Mark Gerstein of Yale University says that the Hughes study has tackled an important problem in functional genomics: "That is, translating ideas that were found applicable in simple unicellular organisms to more complicated mammalian systems."
A mountain of microarray data
The online genome-annotation and gene-listing resources described in this article:
- NCBI XM sequences (from the non-redundant (NR) database): non-redundant protein entries from a variety of sources, including translations from annotated coding regions in GenBank and RefSeq
- A comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products, for major model organisms
- Annotated metazoan genomes
- RIKEN FANTOM cDNA database: functional annotation of mouse full-length cDNA clones
- Genes annotated according to three structured networks of defined terms
Mining the resulting data mountain required a sophisticated bioinformatic approach. "You have to know what you are looking for and be able to formulate questions mathematically and execute them on a computer," notes Hughes. Hughes teamed up with computational colleagues in Brendan Frey's team and applied some fancy statistical tricks, such as 'variance stabilizing normalization', to allow comparison across the tissues, and implemented a learning algorithm called a support vector machine (SVM). "If you have a bunch of points in two- or three-dimensional space, an SVM looks for ways to distinguish between the ones that have a given feature and the ones that don't. No one had used SVMs before on this scale. If we have 55 tissues, then we are looking at 21,000 objects in a 55-dimensional space and trying to separate the ones that have a function from those that don't."
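To make Hughes' description concrete, here is a minimal Python sketch of the idea, not the pipeline used in the study. It treats each gene as a point whose coordinates are its expression levels across tissues, applies an arsinh transform as a simplified stand-in for variance stabilizing normalization (the real method fits its parameters to the data), and trains a linear SVM by sub-gradient descent on the hinge loss. The toy data, the five-tissue layout, and every function name and parameter value are assumptions made for illustration; the kernel and settings used by the authors are not given in this article.

```python
import math
import random

def arsinh_transform(intensity, scale=1.0):
    # Assumed stand-in for variance stabilization: close to a log for
    # large intensities, but finite and roughly linear near zero.
    return math.asinh(intensity / scale)

def center_columns(rows):
    # Subtract each tissue's mean so that tissues are comparable.
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]

def train_linear_svm(X, y, lam=0.01, lr=0.01, epochs=100, seed=0):
    # Stochastic sub-gradient descent on the L2-regularized hinge loss.
    # X: one feature vector per gene; y: labels in {-1, +1}.
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            score = sum(wj * xj for wj, xj in zip(w, X[i])) + b
            # The L2 penalty shrinks the weights on every step.
            w = [wj * (1.0 - lr * lam) for wj in w]
            if y[i] * score < 1.0:
                # Point is misclassified or inside the margin:
                # move the boundary away from it.
                w = [wj + lr * y[i] * xj for wj, xj in zip(w, X[i])]
                b += lr * y[i]
    return w, b

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

def make_toy_data(n_per_class=20, n_tissues=5, seed=1):
    # Hypothetical data: genes "with the function" (+1) are strongly
    # expressed in tissue 0; all other values sit at a background level.
    rng = random.Random(seed)
    raw, labels = [], []
    for label in (1, -1):
        for _ in range(n_per_class):
            base = [100.0] * n_tissues
            if label == 1:
                base[0] = 2000.0
            raw.append([v * math.exp(rng.gauss(0.0, 0.2)) for v in base])
            labels.append(label)
    return raw, labels

raw, labels = make_toy_data()
X = center_columns([[arsinh_transform(v) for v in row] for row in raw])
w, b = train_linear_svm(X, labels)
accuracy = sum(predict(w, b, x) == y for x, y in zip(X, labels)) / len(labels)
print(f"training accuracy: {accuracy:.2f}")
```

In the setting Hughes describes, each row would hold 55 values rather than five, and a separate classifier of this kind could be trained for each functional category, with annotated genes as the positive examples.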
The Canadian group is not the first to carry out such large-scale analyses of mammalian gene expression [4–6]. "But what I like about this paper is that it's really rock solid," says Stuart Kim of the Stanford University Medical Center, USA. "This is really believable stuff. It is really well grounded in the statistics, avoiding simplistic non-mathematical concepts like 'on and off' or 'two-fold up and two-fold down'. They did fairly sophisticated statistical analyses to make sure that the trends they were seeing were really valid. It's important to get better and better datasets published." John Hogenesch of Novartis Research Foundation Genomics Institute in San Diego, California, notes that "[Hughes'] application of SVMs and Gene Ontology to provide preliminary functional annotation for thousands of genes of unknown function is a major advance." The Hogenesch group is also creating an atlas of mammalian genes. "This approach had been used in yeast and worms, but it hadn't yet been applied to mammalian gene expression. Hughes' paper now provides testable hypotheses for the roles of thousands of genes in the genome."
An open resource at the click of a mouse
Hughes' analysis revealed that the results from the extensive mouse tissue-specific dataset correlate very well with the results of studies from other laboratories. One notable feature of the Hughes dataset is that it has been made openly accessible to the research community [1, 7]. The additional data with the published article, and the Hughes lab website, provide information about the microarray oligonucleotide sequences, the SVM predictions, gene annotation, and so on, all of which can be downloaded without restriction and free of charge.
Kim points out that this is really important. "I think that every person that works on mice should now go to this study and type in the name of their favorite gene(s) and see where it is expressed in 55 tissues. It will cost nothing and then you will know where it is expressed strongly. You can make sure there are no hidden surprises [in your experiments] or find out what the hidden surprises are." Hogenesch concurs: "Most users will use the database to see where their gene of interest is expressed and what pathway it might participate in. Others will use the dataset itself to ask questions using other methodologies (tissue-specific gene expression, regulatory-element analysis, functional classification, and so on). The types of things you can do with a dataset like this are numerous, which is why it's important that the data are available."
Kim's group is building large genetic networks based on microarray datasets. "We use more than just tissue specificity to build our networks – we use everything that we can grab. So, we will go and grab these data and fold them into ours. Our next paper will include 1,700 mouse microarrays folded into the human-yeast-fly-worm networks. In worms, many labs have used our resource and published some pretty awesome papers based on the genetic network." Kim thinks that the networks will be even more powerful in accelerating the pace of research in mammalian systems, where classical experimental approaches are slow and expensive. Mark Gerstein agrees: "This is an important advance in helping to unravel the functions of the tens of thousands of human genes using functional genomics approaches."
Hughes has enjoyed the transition from studying yeast to working on mice, and is eager to collaborate with mouse geneticists to test some of the predictions that come out of the current study. And he wants to understand more about the correlation between co-regulation patterns and gene function. "As a yeast researcher the thing that blows my mind is how many things animal cells do. I learned a lot just looking at all the functional categories and Gene Ontology," admits Hughes. "The correlation between transcriptional co-regulation and function is very strong. It's much, much higher than you would get if genes were just expressed at random. But it's not absolute either. So, annotating function is a hard problem to crack and that gives us plenty to work on."
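One simple way to picture the relationship Hughes describes is to compare how similar the expression profiles are for genes that share a functional category and for genes that do not. The Python sketch below does this with Pearson correlation on a handful of invented profiles. The gene names, category labels and expression values are all hypothetical, and this is not the analysis performed in the study.

```python
import math
from itertools import combinations

def pearson(u, v):
    # Pearson correlation between two expression profiles of equal length.
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (norm_u * norm_v)

# Hypothetical genes: (expression across four tissues, functional category).
genes = {
    "geneA": ([9.0, 2.0, 2.0, 8.0], "category_1"),
    "geneB": ([8.0, 3.0, 2.0, 9.0], "category_1"),
    "geneC": ([2.0, 9.0, 8.0, 2.0], "category_2"),
    "geneD": ([3.0, 8.0, 9.0, 1.0], "category_2"),
}

same, different = [], []
for (_, (prof_1, cat_1)), (_, (prof_2, cat_2)) in combinations(genes.items(), 2):
    r = pearson(prof_1, prof_2)
    (same if cat_1 == cat_2 else different).append(r)

mean_same = sum(same) / len(same)
mean_diff = sum(different) / len(different)
print(f"mean r, same category: {mean_same:.2f}")
print(f"mean r, different category: {mean_diff:.2f}")
```

With real data the gap between the two averages would be much less clear-cut than in this toy example, which is the point Hughes makes when he says the correlation is strong but not absolute.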
1. Zhang W, Morris QD, Chang R, Shai O, Bakowski MA, Mitsakakis N, Mohammad N, Robinson MD, Zirnglibl R, Somogyi E, Laurin N, Eftekharpour E, Sat E, Grigull J, Pan Q, Peng WT, Krogan N, Greenblatt J, Fehlings M, van der Kooy D, Aubin J, Bruneau BG, Rossant J, Blencowe BJ, Frey BJ, Hughes TR: The functional landscape of mouse gene expression. J Biol. 2004, 3: 21.
2. Wu LF, Hughes TR, Davierwala AP, Robinson MD, Stoughton R, Altschuler SJ: Large-scale prediction of Saccharomyces cerevisiae gene function using overlapping transcriptional clusters. Nat Genet. 2002, 31: 255-265. 10.1038/ng906.
3. Brown MP, Grundy WN, Lin D, Cristianini N, Sugnet CW, Furey TS, Ares M, Haussler D: Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc Natl Acad Sci USA. 2000, 97: 262-267. 10.1073/pnas.97.1.262.
4. Bono H, Yagi K, Kasukawa T, Nikaido I, Tominaga N, Miki R, Mizuno Y, Tomaru Y, Goto H, Nitanda H: Systematic expression profiling of the mouse transcriptome using RIKEN cDNA microarrays. Genome Res. 2003, 13: 1318-1323. 10.1101/gr.1075103.
5. Su AI, Wiltshire T, Batalov S, Lapp H, Ching KA, Block D, Zhang J, Soden R, Hayakawa M, Kreiman G: A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci USA. 2004, 101: 6062-6067. 10.1073/pnas.0400782101.
6. Schadt EE, Edwards SW, GuhaThakurta D, Holder D, Ying L, Svetnik V, Leonardson A, Hart KW, Russell A, Li G: A comprehensive transcript index of the human genome generated using microarrays and computational approaches. Genome Biol. 2004, 5: R73. 10.1186/gb-2004-5-10-r73.
7. The functional landscape of mouse gene expression. [http://hugheslab.med.utoronto.ca/Zhang]
8. Stuart JM, Segal E, Koller D, Kim SK: A gene-coexpression network for global discovery of conserved genetic modules. Science. 2003, 302: 249-255. 10.1126/science.1087447.