From Co-expression to Co-regulation: How many microarray experiments do we need?


Ka Yee Yeung, Mario Medvedovic, and Roger E. Bumgarner


Abstract
Background: Cluster analysis is often used to infer regulatory modules or biological function by associating unknown genes with other genes that have similar expression patterns and known regulatory elements or functions. However, clustering results may not have any biological relevance.

Results: We applied various clustering algorithms to microarray datasets of different sizes, and we evaluated the clustering results by determining the fraction of gene pairs from the same clusters that share at least one known common transcription factor. We used both yeast transcription factor databases (SCPD, YPD) and Chromatin Immunoprecipitation (ChIP) data to evaluate our clustering results. We showed that the ability to identify co-regulated genes from clustering results is strongly dependent on the number of microarray experiments used in cluster analysis, and that the accuracy of these associations plateaus at between 50 and 100 experiments on yeast data. Moreover, on standardized data, the model-based clustering algorithm MCLUST consistently outperforms more traditional methods in accurately assigning co-regulated genes to the same clusters.
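The evaluation criterion described above (the fraction of same-cluster gene pairs that share at least one known transcription factor) can be sketched as follows. This is a minimal illustration, not our Perl implementation: the function name, the dictionary layout, and the toy gene and transcription factor labels are hypothetical, and restricting the count to genes that have at least one annotated transcription factor is an assumption made for this sketch.

```python
from itertools import combinations


def coregulated_pair_fraction(clusters, gene_to_tfs):
    """Fraction of same-cluster gene pairs sharing at least one known TF.

    clusters: dict mapping a cluster id to a list of gene names.
    gene_to_tfs: dict mapping a gene name to the set of TFs known to regulate it.
    Assumption: genes without any annotated TF are skipped.
    """
    shared = 0
    total = 0
    for genes in clusters.values():
        # Keep only genes with at least one known transcription factor.
        annotated = sorted(g for g in genes if gene_to_tfs.get(g))
        for a, b in combinations(annotated, 2):
            total += 1
            # A pair counts as co-regulated if the TF sets intersect.
            if gene_to_tfs[a] & gene_to_tfs[b]:
                shared += 1
    return shared / total if total else 0.0


# Toy data (hypothetical, for illustration only).
clusters = {
    "c1": ["g1", "g2", "g3"],
    "c2": ["g4", "g5"],
}
gene_to_tfs = {
    "g1": {"TF_A"},
    "g2": {"TF_A", "TF_B"},
    "g3": {"TF_C"},
    "g4": {"TF_B"},
    "g5": {"TF_B"},
}

fraction = coregulated_pair_fraction(clusters, gene_to_tfs)
print(fraction)
```

In the toy data, two of the four same-cluster pairs (g1/g2 and g4/g5) share a transcription factor, so the fraction is 0.5.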

Conclusions: Our results are consistent across independent evaluation criteria, which strengthens our confidence in them. However, when one compares ChIP data to YPD, the false negative rate is approximately 80% using the recommended p-value of 0.001. In addition, we showed that even with large numbers of experiments, the false positive rate may exceed the true positive rate. In particular, even when all experiments are included, the best results produce clusters with only a 28% true positive rate using known gene-transcription factor interactions.
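To make the ChIP versus YPD comparison concrete, the following sketch shows one way a false negative rate could be computed at a given p-value threshold. It assumes that a false negative is a transcription factor-gene interaction listed in the reference database that ChIP does not call as bound at the threshold, and that an interaction is called bound when its p-value is at most the threshold. The function name, data layout, and toy p-values are hypothetical and do not reproduce the data used in this work.

```python
def false_negative_rate(reference_pairs, chip_pvalues, threshold=0.001):
    """Fraction of reference (TF, gene) pairs that ChIP misses at the threshold.

    reference_pairs: set of (tf, gene) tuples taken as known interactions.
    chip_pvalues: dict mapping (tf, gene) to a ChIP binding p-value.
    Assumption: a pair absent from chip_pvalues is treated as not bound.
    """
    if not reference_pairs:
        return 0.0
    missed = sum(
        1
        for pair in reference_pairs
        if chip_pvalues.get(pair, 1.0) > threshold
    )
    return missed / len(reference_pairs)


# Toy data (hypothetical, for illustration only).
reference_pairs = {
    ("TF_A", "g1"),
    ("TF_A", "g2"),
    ("TF_B", "g2"),
    ("TF_B", "g4"),
}
chip_pvalues = {
    ("TF_A", "g1"): 0.0002,  # called bound at 0.001
    ("TF_A", "g2"): 0.04,    # missed
    ("TF_B", "g2"): 0.2,     # missed
    ("TF_B", "g4"): 0.003,   # missed at 0.001, would be found at 0.005
}

fnr = false_negative_rate(reference_pairs, chip_pvalues)
print(fnr)
```

Loosening the threshold lowers the false negative rate at the cost of more false positives; in the toy data, a threshold of 0.005 recovers one additional interaction.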



  • Supplementary Materials (PDF file)

  • Software
    • Our implementations of the heuristic-based clustering algorithms (written by Ka Yee Yeung) were developed on Red Hat 7.1 using Java SDK 1.4. The executables and documentation are available from the Supplementary Website of our previous work.
    • Mclust was written in Fortran and Splus by Chris Fraley and Adrian Raftery.
    • IMM was written in C++ by Mario Medvedovic.
    • Our implementations for evaluating the proportions of co-regulated genes from clustering results were written by Ka Yee Yeung in Perl 5 on Linux. In principle, the Perl scripts should also run on Windows if you change the first line of each script (the interpreter path) and install ActivePerl.

  • Microarray data sets used:

  • Transcription factor resources:

  • Related Previous Work:
    • Medvedovic et al. 2004: Bayesian mixture model based clustering of replicated microarray data. Bioinformatics 2004 20:1222-1232.
    • Yeung et al. 2003: Clustering gene expression data with repeated measurements

  • Other publicly available software packages used in this work: