Mouse mRNA isolation
Mouse tissues were isolated from the following strains: ICR (whole brain, testis, skeletal muscle, heart, lung, liver, embryo at 15 days, embryo at 12.5 days, embryo at 9.5 days, mammary gland, placenta at 9.5 days and placenta at 12.5 days); CD1 (Charles River Laboratory, Wilmington, USA; cortex, cerebellum, striatum, hindbrain, midbrain, bone marrow, knee, teeth, mandible, calvaria, femur (bone marrow flushed diaphyses), tongue surface, snout, large intestine, thyroid, aorta, brown fat, lymph node, olfactory bulb, adrenal gland, prostate, digits, trachea, trigeminal nucleus); C3H (The Jackson Laboratory, Bar Harbor, USA; salivary gland, thymus, ovary, uterus, tongue, stomach, small intestine, spleen, colon, uterus, pancreas, epididymis, eye, bladder, skin); C57BL/6 (The Jackson Laboratory, Bar Harbor, USA; spinal cord); Black Swiss (NTac:NIHBS; embryonic heads); and R1 (ES cells). With the exception of embryonic tissues and ES cells, tissues were harvested from 3- to 6-month-old mice. Following recommended University of Toronto protocols, mice were euthanized by barbiturate injection and tissues were dissected as quickly as possible (within 10 minutes), snap-frozen in liquid nitrogen, and stored at -80°C until use. RNA was extracted using homogenization and Trizol reagent (Invitrogen, Carlsbad, USA) following the manufacturer's instructions, and mRNA was purified as described previously.
Microarray probe design
A FASTA file of 42,192 known and predicted mRNAs (XM sequences) was obtained from Deanna Church at NCBI on July 9, 2002 and is posted as Additional data file 1. Interspersed repeats and low-complexity DNA sequences were masked with RepeatMasker. The 500 nucleotides from the 3' end of each mRNA were extracted and 10 non-overlapping Tm-balanced probes were generated using PrimerX with default settings. The most unique among the 10 was identified on the basis of having the highest ΔG difference between the first (identical) and second most significant BLAST hits among the 42,192 initial XM mRNAs. Then, 41,699 probe sequences (those for which probes could be designed using this procedure) were submitted for oligonucleotide microarray production (Agilent Technologies, Palo Alto, USA). These arrays are manufactured using an ink-jet process, in which oligonucleotides are synthesized on the array by direct deposition of phosphoramidites. The specificity, sensitivity, and reproducibility of these 60-mer arrays have been described in detail elsewhere.
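The two selection steps described above (taking the 3' region, then choosing the most unique candidate) can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the ΔG gaps are assumed to have been computed beforehand from the BLAST hits, and the Tm balancing performed by PrimerX is not reproduced.

```python
def three_prime_region(mrna, length=500):
    """Return the last `length` nucleotides (the 3' end) of an mRNA sequence.

    Sequences shorter than `length` are returned whole.
    """
    return mrna[-length:]


def most_unique_probe(candidates):
    """Pick the candidate probe with the largest delta-G gap.

    `candidates` is a list of (probe_sequence, dg_gap) pairs. `dg_gap` is a
    hypothetical, precomputed value: the delta-G difference between the best
    (identical) BLAST hit and the second most significant hit.
    """
    best_sequence, _ = max(candidates, key=lambda pair: pair[1])
    return best_sequence
```

A larger gap means the second-best target binds much less favourably than the intended one, which is why the maximum is taken.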
Among the probes on the array, 40,822 were unique; those that were not unique can be attributed primarily to gene duplications, predominantly pseudogenes of GAPDH, ribosomal proteins, and retrovirus-like sequences. To minimize the impact of redundancy on statistical analyses, we collapsed the data from 1,928 duplicated probes and XM sequences that were in these sequence families (including 100 probes duplicated between the two array designs) into 525 groups that shared an identical probe sequence and/or were both annotated and regulated in the same way. We also mapped all of the XM sequences to the current version of the mouse genome (Build 32) and to three cDNA databases (UniGene, RefSeq, and FANTOM II; see below) and identified 1,991 XM sequences in which XM sequences adjacent on the chromosome also mapped to the same cDNA; these were collapsed into 904 groups. The Additional data files include a table mapping the 41,699 probes against the 39,309 presumed distinct transcripts.
Labeling and hybridization
The mRNA (1–2 μg) was reverse-transcribed with random nonamer primers (1 μg per reaction) and T18VN (0.25 μg per reaction) to synthesize cDNA. The reaction contained a 1:1 mixture of 5-(3-aminoallyl) thymidine 5'-triphosphate (Sigma, St. Louis, USA) and thymidine triphosphate (TTP) in place of TTP alone. The cDNA products were bound to QIAquick PCR Purification columns (Qiagen, Hilden, Germany) following the manufacturer's instructions, washed three times with 80% ethanol, and eluted with water. Purified cDNA was reacted with N-hydroxy succinimide esters of Cy3 or Cy5 (Amersham Pharmacia Biotech, Piscataway, USA) following the manufacturer's instructions. Hydroxylamine-quenched Cy-labeled cDNAs were separated from free dye molecules using QIAquick columns. Mixed labeled cDNAs were added to hybridization buffer containing 1 M NaCl, 0.5% sodium sarcosine, 50 mM methyl ethane sulfonate (MES), pH 6.5, 33% formamide and 40 μg herring sperm DNA (Invitrogen, Carlsbad, USA). Hybridizations were carried out in a final volume of 0.5 ml, injected into an Agilent hybridization chamber, at 42°C on a rotating platform in a hybridization oven (Robbins Scientific Corporation, Sunnyvale, USA) for 16–24 h. Slides were then washed (rocking for approximately 30 seconds in 6 × SSPE, 0.005% sarcosine, then rocking for approximately 30 seconds in 0.06 × SSPE) and scanned with a 4000A microarray scanner (Axon Instruments, Union City, USA). Hybridizations were performed in duplicate with fluor reversal: that is, each mRNA sample was examined in duplicate, once in the Cy3 channel and once in the Cy5 channel, on separate arrays. Each array was hybridized with two samples simultaneously, each from an individual tissue. Essentially identical results were obtained from single-channel data from the same mRNA sample analyzed on different arrays, which were distinct from individual channels on the same arrays analyzed with a different mRNA.
The organization of the hybridizations, and the data for individual channels, are given in the Additional data files.
Image processing and normalization
TIFF images were quantitated with GenePix (Axon Instruments). Individual channels were spatially detrended (that is, overall correlations between spot intensity and position on the slide were removed) by high-pass filtering using 10% outliers. We applied variance stabilizing normalization (VSN) using 25% of the genes to normalize all single channels to each other. We manually identified and removed measurements that were inconsistent between dye-swaps, by either removing data from residual artifacts apparent on microarray images or removing the higher of the two disparate intensity measurements (in order to minimize false-positive detections). Measurements were transformed to arcsinh values (which are similar to natural log values but are also defined for the negative numbers that can emerge from VSN), and for each measurement the median across all arrays was subtracted to obtain relative expression ratios for each gene in each tissue compared to all tissues. Remaining inconsistencies between dye-swaps were addressed by removing the higher of any two measurements that differed by more than two arcsinh units (in order to further minimize false-positive detections). The dye-swap arcsinh values were then averaged between replicates and among multiple probes detecting the same sequence. Clustering and manual analysis indicated that ratios below zero were generally not biologically meaningful (and probably stem largely from measurement error among low-intensity spots); hence ratios below zero were set to zero for all analyses using median-subtracted arcsinh values (Figures 1, 2, 4, 5, 6, 7 and SVM analyses). Missing values (fewer than 0.01% of all data points) were set to zero. Median-subtracted arcsinh values correspond approximately to the following ratios (arcsinh = linear): 0 = 1/1; 1 = 2.7/1; 2 = 7.5/1; 3 = 20/1; 4 = 55/1; 5 = 155/1; 6 = 405/1.
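The transformation, median subtraction, dye-swap reconciliation and zero-flooring steps described above can be sketched for a single gene as follows. This is a minimal illustration, not the code used in the study: the function names and the input layout (one Cy3 and one Cy5 value per tissue, already VSN-normalized) are assumptions, and the median is taken here over all single-channel values of the gene.

```python
import math
from statistics import median


def resolve_dye_swap(a, b, max_gap=2.0):
    """Combine two dye-swap values (arcsinh scale).

    If they differ by more than `max_gap` units, the higher one is discarded
    and the lower kept; otherwise the two are averaged.
    """
    if abs(a - b) > max_gap:
        return min(a, b)
    return (a + b) / 2.0


def relative_expression(gene_intensities):
    """Median-subtracted, zero-floored arcsinh values for one gene.

    `gene_intensities` maps tissue -> (cy3, cy5) normalized intensities
    (a hypothetical input layout chosen for this sketch).
    """
    transformed = {
        tissue: (math.asinh(x), math.asinh(y))
        for tissue, (x, y) in gene_intensities.items()
    }
    # Median over every single-channel measurement of this gene.
    centre = median(v for pair in transformed.values() for v in pair)
    result = {}
    for tissue, (a, b) in transformed.items():
        value = resolve_dye_swap(a - centre, b - centre)
        result[tissue] = max(value, 0.0)  # ratios below zero are set to zero
    return result
```

Because arcsinh behaves like a natural logarithm for large values, a median-subtracted value of d corresponds to a linear ratio of roughly e to the power d, which is the basis of the conversion table given above.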
Mouse GO-BP annotations were downloaded from the Gene Ontology website and the European Bioinformatics Institute (EBI), and both were mapped to XM gene sequences by sequence identity to the annotated source sequences. The full annotation database is on our website. Fewer than 0.01% of these annotations were derived from gene expression (IEP code); we confirmed that removal of these genes had no appreciable impact on statistical analysis or the SVM analysis, and hence the use of these annotations to analyze gene expression is not circular. The Mouse Genome Informatics (MGI) annotations are reported to be manually compiled, whereas the EBI annotations include automated sequence-based annotations (for example, potassium channels are annotated as being in 'ion transport' and the mouse homolog of the yeast Tim8 protein, which is a translocase of the inner mitochondrial membrane, is annotated as being in 'mitochondrial translocation'). All GO-BP annotations were propagated up all possible edges of the GO graph. Redundant GO-BP categories were excluded. Categories with fewer than three genes among the 21,622 expressed genes were excluded from our analysis since they are not appropriate for the statistical tests we used, and those with more than 500 genes were excluded because they are not specific to distinct physiological processes.
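Propagating annotations up the GO graph and filtering categories by size can be sketched as below. The data structures are assumptions made for illustration (a dictionary of direct parent terms and a dictionary of direct gene annotations), the function names are hypothetical, and the removal of redundant categories is not covered.

```python
def ancestors(term, parents):
    """Return every term reachable from `term` by following parent edges."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate_annotations(gene_terms, parents):
    """Map each category to the genes annotated to it or to any descendant."""
    category_genes = {}
    for gene, terms in gene_terms.items():
        full = set()
        for term in terms:
            full.add(term)
            full |= ancestors(term, parents)
        for term in full:
            category_genes.setdefault(term, set()).add(gene)
    return category_genes


def filter_by_size(category_genes, min_genes=3, max_genes=500):
    """Keep categories with at least `min_genes` and at most `max_genes` genes."""
    return {
        category: genes
        for category, genes in category_genes.items()
        if min_genes <= len(genes) <= max_genes
    }
```

In the analysis described above, the size filter would be applied after restricting each category to the expressed genes.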
False-discovery rate analysis
Each gene was associated with a co-regulated group consisting of the 50 annotated genes with the highest Pearson correlation coefficient relative to it. Annotation enrichment of this group in each GO-BP category was scored using the hypergeometric P value. The minimum value of this score across all GO-BP categories was used as the measure for significant enrichment in any GO-BP category. P values were assigned to these measures using a permutation scheme on the gene labels. The statistical significance of the P values was evaluated using the Benjamini-Hochberg (BH) linear step-up procedure to ensure a false discovery rate (FDR) of less than 1%. For annotated genes, a second measure was computed: the minimum among its annotated categories of the hypergeometric P values of its co-regulated group. A gene-specific permutation scheme associated P values with these scores and the FDR was also controlled at 1%.
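The two statistical building blocks named above, the hypergeometric enrichment P value and the BH linear step-up procedure, can be sketched with the standard library alone. The function names and argument order are illustrative choices, and the permutation scheme that produces the P values fed into the BH step is not reproduced.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """Upper-tail hypergeometric P value.

    Probability of seeing at least `k` category members in a group of `n`
    genes drawn from `N` annotated genes, of which `K` are in the category.
    """
    total = comb(N, n)
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


def benjamini_hochberg(pvalues, fdr=0.01):
    """Return a list of booleans marking which P values are significant.

    Linear step-up: find the largest rank r with p_(r) <= r * fdr / m and
    call every P value of rank <= r significant.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        if pvalues[index] <= rank * fdr / m:
            cutoff_rank = rank
    significant = [False] * m
    for rank, index in enumerate(order, start=1):
        if rank <= cutoff_rank:
            significant[index] = True
    return significant
```

For a co-regulated group of 50 genes, `n` would be 50 and `k` the number of those genes annotated to the category being scored.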
Starting with an initial hierarchical clustering (agglomerative, average linkage, based on Pearson correlation coefficient), rows were divided into groups by removing a small number of links at the highest levels of the tree and grouping together all rows contained within the same disconnected subtree. Each row group was then associated with the column that contained the maximum expression value averaged over all the profiles in the group. The row groups were then sorted in increasing order of their associated column numbers.
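The final ordering step, assigning each row group to the column where its mean profile peaks and then sorting the groups by that column, can be sketched as below. The sketch assumes the groups have already been obtained by cutting the hierarchical clustering tree; the function names and the list-of-lists layout are illustrative.

```python
def peak_column(group):
    """Index of the column with the highest expression averaged over the group.

    `group` is a list of expression profiles (rows) of equal length.
    """
    n_cols = len(group[0])
    means = [sum(profile[c] for profile in group) / len(group) for c in range(n_cols)]
    return max(range(n_cols), key=means.__getitem__)


def order_row_groups(groups):
    """Sort row groups in increasing order of their peak column."""
    return sorted(groups, key=peak_column)
```

Since Python's sort is stable, groups that peak in the same column keep the relative order they had in the clustering tree.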
Support vector machines
We used the SVM software package Gist version 2.0.8 under Linux with the parameter settings '-radial -zeromeanrow -diagfactor 0.5'. Precision was established by three-fold cross-validation.
Identification of corresponding clones in cDNA and EST databases
We identified the closest corresponding mouse mRNAs in FANTOM II (60,770 sequences); RefSeq (16,601 sequences); UniGene (87,495 sequences); and Ensembl (32,911 sequences) using BLASTN with a threshold of E-60. We identified corresponding mouse mRNAs in dbEST (3,939,961 sequences) using BLASTN with a threshold of E-20.
Identification of genes common to other microarray data and Spearman rank correlations
For Figure 2a, mRNA sequences were downloaded from  (for Su et al. data) and  (for Bono et al. data). The Su et al. gene expression data were downloaded from  (9,977 sequences represented on the array) and the Bono et al. data, from  (54,005 sequences represented on the array). The selected 41,699 NCBI mRNAs were used in a BLAST search against these two mRNA databases; a BLAST comparison between the two databases was also performed to retain only genes for which the closest sequence to each XM gene is also the closest sequence between the two other databases. All BLAST searches were performed with threshold E-60, and the best hit was selected from multiple BLAST results. The 1,109 genes that had common hits in all the BLAST results and for which gene expression data were available were selected for the gene expression analysis. The 1,109 genes from all three datasets were normalized to make them comparable. To facilitate comparison, in the Bono et al. dataset, each gene was median-centered in each tissue by subtracting its median expression value across all 13 common tissues. The Su et al. data were arcsinh-transformed before median-centering. The data from the present study used in the comparison were not zeroed, as they were in other analyses, and were median-centered using the median calculated only on the 13 common tissues, rather than all 55. The Spearman rank correlation coefficients of each pair of tissues among all three studies were transformed to Z-scores by multiplication by sqrt(1108) (that is, sqrt(n - 1) for n = 1,109 genes) and then converted to P values using the cumulative distribution function of a standard normal distribution.
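The conversion from a Spearman rank correlation to a Z-score and then to a P value can be sketched as follows. The function names are illustrative, ties are handled with average ranks, and the P value is computed as a one-sided upper tail; the text does not say whether one-sided or two-sided values were used, so that choice is an assumption.

```python
import math


def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def spearman_z(rho, n):
    """Z-score for a Spearman coefficient over n genes: rho * sqrt(n - 1)."""
    return rho * math.sqrt(n - 1)


def upper_tail_p(z):
    """One-sided P value from the standard normal distribution."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))
```

With n = 1,109 genes, `spearman_z(rho, 1109)` multiplies the coefficient by sqrt(1108), matching the factor quoted above.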
For Figure 6c, an alternative mapping strategy was employed: our probe sequences, the Bono et al. clone sequences, and the Su et al. probe sequences were associated with 30,832 MGI sequences by mapping directly to corresponding MGI/GenBank sequences; 1,800 genes were identified for which there was a reciprocal best match between the probe sequences and the MGI sequence in all three studies.
Primer pairs were designed to have a matching Tm (59°C) and sequences are listed in the Additional data files. RT-PCR assays were performed using the OneStep RT-PCR Kit (Qiagen). Reactions were performed in 25 μl volumes containing 0.5 ng polyA+ mRNA, 7.5 units porcine RNAguard (Amersham) and 300 pM each of the forward and reverse primers. After 30 rounds of amplification, the reaction products were separated on 2% agarose gels stained with ethidium bromide. Inverted black-and-white images of the gels were recorded using a Syngene gel documentation system and GeneSnap software (Synoptics, Frederick, USA). In total, 107 primer pairs were tested. Of the 57 XM genes tested that corresponded to a known cDNA, 42 were amplified (74%). Of the 25 tested that corresponded to an EST but not to a known cDNA, 12 were amplified (48%). However, of the 25 tested that did not correspond to a cDNA or EST, only one was amplified (4%).
Identification of genes associated with gene traps
Six different gene-trap resources were searched to identify genes associated with gene-trap ES cell lines. For BayGenomics, Centre for Modeling Human Disease (CMHD), University of California Resource of Gene Trap Insertions, and Fred Hutchinson Cancer Research Center (FHCRC), the gene-trap sequence tags were downloaded from each resource's website and searched against the selected 41,699 mRNA sequences using BLASTN. For the German Genetrap Consortium (GGTC) and Mammalian Functional Genomics Centre (MFGC), the web-based BLAST servers were used to search the 41,699 mRNA sequences against their gene-trap sequence databases. mRNAs with hits at least 50 nucleotides long and with at least 98% identity were considered to be associated with gene-trap ES cell lines.
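The hit-filtering rule (alignment length of at least 50 nucleotides and identity of at least 98%) can be sketched as below. The record layout with `mrna`, `tag`, `length` and `identity` keys is hypothetical, standing in for whatever fields are parsed from the BLASTN output.

```python
def gene_trap_associations(hits, min_length=50, min_identity=98.0):
    """Group qualifying gene-trap tags by the mRNA they hit.

    `hits` is an iterable of dicts with the hypothetical keys 'mrna', 'tag',
    'length' (alignment length in nucleotides) and 'identity' (percent).
    """
    associated = {}
    for hit in hits:
        if hit["length"] >= min_length and hit["identity"] >= min_identity:
            associated.setdefault(hit["mrna"], set()).add(hit["tag"])
    return associated
```

Both thresholds are inclusive, matching the "equal to or larger than" wording of the criterion.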
RNA extraction, northern blotting, affinity purification, and mass spectrometry
7-PWP1 and isogenic wild-type control strains were created and analyzed as previously described for other essential yeast genes. Briefly, strains were exposed to 10 μg/ml doxycycline (Sigma) for a total of 24 h before harvesting for RNA extraction. RNA extraction and northern blotting were performed using standard protocols and oligonucleotide probes as described previously. TAP purification of Pwp1p was performed as previously described using 1.3 l culture volumes; gel-purified proteins were identified by MALDI-TOF mass spectrometry.