Robust Estimation of cDNA Microarray Intensities with Replicates

Raphael Gottardo, Adrian E. Raftery, Ka Yee Yeung, and Roger E. Bumgarner

We consider robust estimation of gene intensities from cDNA microarray data with replicates. Several statistical methods for estimating gene intensities from microarrays have been proposed, but there has been little work on robust estimation of the intensities. This is particularly relevant for experiments with replicates, because even one outlying replicate can have a disastrous effect on the estimated intensity for the gene concerned. Because of the many steps involved in the experimental process from hybridization to image analysis, cDNA microarray data often contain outliers. For example, an outlying data value could occur because of scratches or dust on the surface, imperfections in the glass, or imperfections in the array production. We develop a Bayesian hierarchical model for robust estimation of cDNA microarray intensities. Outliers are modeled explicitly using a t-distribution, and our model also addresses classical issues such as design effects, normalization, transformation, and nonconstant variance. Parameter estimation is carried out using Markov chain Monte Carlo.
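
To make the effect of a single outlying replicate concrete, here is a minimal sketch in Python. It is not the Bayesian hierarchical model described above and does not use MCMC: it is a plain EM (iteratively reweighted) fit of the location of a Student t-distribution with fixed degrees of freedom, applied to one hypothetical gene. The function name `t_location`, the choice `nu = 4`, and the replicate values are all assumptions made for illustration.

```python
from statistics import mean, median


def t_location(values, nu=4.0, max_iter=1000, tol=1e-9):
    """EM estimate of the location of a Student t-distribution with fixed nu.

    Observations far from the current location get small weights, so a
    single outlying replicate has limited influence on the estimate.
    """
    n = len(values)
    mu = median(values)
    # Initial squared scale, guarded against zero for identical replicates.
    sigma2 = max(mean((v - mu) ** 2 for v in values), 1e-12)
    for _ in range(max_iter):
        # E-step: expected precision weights given the current mu and sigma2.
        weights = [(nu + 1.0) / (nu + (v - mu) ** 2 / sigma2) for v in values]
        # M-step: weighted mean, then weighted squared scale.
        new_mu = sum(w * v for w, v in zip(weights, values)) / sum(weights)
        sigma2 = max(
            sum(w * (v - new_mu) ** 2 for w, v in zip(weights, values)) / n,
            1e-12,
        )
        converged = abs(new_mu - mu) < tol
        mu = new_mu
        if converged:
            break
    return mu


# Hypothetical log ratios for one gene: four consistent replicates, one outlier.
replicates = [1.02, 0.95, 1.10, 0.98, 6.0]

plain_mean = mean(replicates)      # pulled strongly toward the outlier
robust = t_location(replicates)    # stays close to the four consistent values

print(f"mean = {plain_mean:.3f}, t-based estimate = {robust:.3f}")
```

In this toy example the sample mean lands roughly halfway between the consistent replicates and the outlier, while the t-based estimate remains near the bulk of the data, which is the behavior the t-distribution is meant to provide in the full model.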

The method is illustrated using two publicly available gene expression data sets. The between-replicate variability of the intensity estimates is reduced by 64% in one case and by 83% in the other compared to raw log ratios. The method is also compared to the ANOVA normalized log ratio, the removal of outliers based on Dixon's test, and the lowess normalized log ratio, and the between-replicate variation is reduced by more than 55% relative to the best of these methods for both data sets.

We also address the issue of whether the image background should be removed when estimating intensities. It has been argued that one should not do so because it increases variability, while the arguments for doing so are that there is a physical basis for the image background, and that not doing so will bias the estimated log-ratios of differentially expressed genes downwards. We show that the arguments on both sides of this debate are correct for our data, but that by using our model one can have the best of both worlds: one can subtract the background without greatly increasing variability.
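
Both sides of the background argument can be seen in a small arithmetic sketch. It assumes a simple additive background that is the same in both channels, and every intensity below is hypothetical; neither the numbers nor the helper `log_ratio` come from the paper or from RAMA.

```python
from math import log2


def log_ratio(red, green, background=0.0):
    """log2 ratio of two channel intensities after subtracting a common background."""
    return log2((red - background) / (green - background))


background = 100.0

# A differentially expressed gene: true signals 800 and 200 (a 4-fold change).
observed_red, observed_green = 800.0 + background, 200.0 + background
uncorrected = log_ratio(observed_red, observed_green)            # log2(900 / 300)
corrected = log_ratio(observed_red, observed_green, background)  # log2(800 / 200)

# A weakly expressed gene measured twice: (red, green) pairs close to background.
weak_replicates = [(130.0, 120.0), (115.0, 125.0)]
weak_uncorrected = [log_ratio(r, g) for r, g in weak_replicates]
weak_corrected = [log_ratio(r, g, background) for r, g in weak_replicates]

# Spread between the two replicates, with and without background subtraction.
spread_uncorrected = abs(weak_uncorrected[0] - weak_uncorrected[1])
spread_corrected = abs(weak_corrected[0] - weak_corrected[1])

print(f"strong gene: uncorrected {uncorrected:.3f}, corrected {corrected:.3f}")
print(f"weak gene spread: uncorrected {spread_uncorrected:.3f}, "
      f"corrected {spread_corrected:.3f}")
```

Leaving the background in shrinks the log ratio of the strongly expressed gene toward zero, while subtracting it widens the gap between the two replicates of the weak gene, because small measurement differences become large relative to what remains after subtraction.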

  • Technical Report: Technical Report 438, Department of Statistics, University of Washington. pdf file

  • Software: Our implementation, RAMA (Robust Analysis of MicroArrays), is freely available as a Bioconductor package.

  • Datasets:
    • Intensities were extracted using the SpotOn Image software. They are stored in the SPT files, which are tab-delimited text files (and hence can be viewed using any spreadsheet program such as Excel). Documentation of the SPT file column headers is also available.
    • Like and like data: archived data
    • HIV data
      • HIV1 data: CEM mock vs LAI 01 hr #1
      • HIV2 data: CEM mock vs LAI 01 hr #2
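
Because the SPT files are tab-delimited text, they can also be read programmatically rather than in a spreadsheet. The sketch below uses only the Python standard library and assumes the first row of the file holds the column headers. The column names `spot_id`, `signal_a`, and `signal_b` are hypothetical stand-ins; the real names are those given in the SPT column header documentation.

```python
import csv
import io


def read_tab_delimited(handle):
    """Return the rows of a tab-delimited file as dicts keyed by the header row."""
    return list(csv.DictReader(handle, delimiter="\t"))


# Hypothetical stand-in for the contents of an SPT file (not real data).
sample = io.StringIO(
    "spot_id\tsignal_a\tsignal_b\n"
    "1\t812.0\t640.5\n"
    "2\t95.2\t101.7\n"
)

rows = read_tab_delimited(sample)

# Values are read as strings, so convert the columns of interest to numbers.
signal_a = [float(row["signal_a"]) for row in rows]
print(signal_a)
```

For a file on disk, the same function can be given an open file handle, for example `open(path, newline="")`, where `path` points to a downloaded SPT file.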