Clustering gene expression data with repeated measurements


Ka Yee Yeung, Mario Medvedovic and Roger E. Bumgarner


Abstract
Background: Clustering is a common methodology for the analysis of array data, and numerous clustering algorithms are in common use. In addition, many research laboratories are generating array data with repeated measurements. While there are proposals in the literature making use of replicate-derived statistics to improve the selection of differentially expressed genes, there has been limited effort to improve clustering algorithms by incorporating repeated measurements. Moreover, the biologist who wishes to make use of cluster analysis is faced with a plethora of algorithmic options and often has no basis on which to select a methodology for the analysis of their data set.

Results: Our main contributions are extensions of clustering techniques to take advantage of repeated measurements and an empirical study comparing the performance of different clustering approaches for array data. We evaluated three approaches: weighting expression levels with variability estimates in similarity measures, assigning repeated measurements to the same subtrees in hierarchical agglomerative clustering algorithms, and an infinite mixture model-based approach with built-in error models for repeated measurements. We employ two assessment criteria to evaluate clustering results: accuracy with respect to external knowledge and cluster stability (reproducibility of clustering results on re-measured data).

Conclusions: We show that array data with repeated measurements yield more accurate and more stable clusters. Our study also provides guidance to the user who wishes to select one "good" algorithm for cluster analysis of gene expression data. In particular, we show that the model-based clustering approaches produce superior clusters.
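The idea of weighting expression levels by variability estimates can be illustrated with a small sketch. This is not the formula from the manuscript or its erratum; it is a minimal example under an assumed weighting scheme, in which each experiment's squared difference is weighted by the inverse of the two genes' summed variances. The function names `summarize` and `sd_weighted_distance`, and the `eps` parameter, are hypothetical.

```python
import math
from statistics import mean, stdev


def summarize(replicates):
    """Collapse repeated measurements into per-experiment means and SDs.

    replicates: one list per experiment, each holding the repeated
    measurements of a single gene (at least two repeats per experiment).
    """
    means = [mean(r) for r in replicates]
    sds = [stdev(r) for r in replicates]
    return means, sds


def sd_weighted_distance(gene_a, gene_b, eps=1e-9):
    """Euclidean-style distance that down-weights noisy experiments.

    Assumed weighting (not the published formula): experiment i gets
    weight 1 / (sd_a[i]^2 + sd_b[i]^2). eps avoids division by zero
    when both genes have zero variance in an experiment.
    """
    means_a, sds_a = summarize(gene_a)
    means_b, sds_b = summarize(gene_b)
    num = 0.0
    den = 0.0
    for ma, mb, sa, sb in zip(means_a, means_b, sds_a, sds_b):
        w = 1.0 / (sa * sa + sb * sb + eps)
        num += w * (ma - mb) ** 2
        den += w
    return math.sqrt(num / den)


# Two genes, two experiments, two repeats each. The second experiment
# is much noisier, so its large mean difference contributes little.
gene_a = [[0.9, 1.1], [0.0, 4.0]]
gene_b = [[1.9, 2.1], [4.0, 8.0]]
distance = sd_weighted_distance(gene_a, gene_b)
```

Dropping the weights (setting every `w` to 1) reduces this to a root-mean-square distance on the averages over repeated measurements, which corresponds to the simpler "average over repeated measurements" option.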



  • Manuscript: Genome Biology 2003 4(5):R34

  • Erratum: Corrected formula for the error-weighted correlation (PDF file)

  • Additional results (PDF file)

  • Executables of our software are available as command-line implementations.
    • Averaging over repeated measurements, the (SD/CV)-weighted similarity approach, and FITSS were implemented in Java. The Java code (written by Ka Yee Yeung) was developed on Red Hat 7.1 using Java SDK 1.4. In principle, these bytecode files should also run on Windows or MacOS with the Java VM, but we provide no guarantee and offer no support for our software. (You need to install the Java VM before you can run these bytecode files. Download SDK 1.4)
      • Documentation (pdf)
      • Bytecode files for hierarchical agglomerative algorithms (average linkage, complete linkage, centroid linkage, single linkage) using either average over repeated measurements or variability-weighted similarity
      • Bytecode files for k-means using either average over repeated measurements or variability-weighted similarity
      • Bytecode files for hierarchical agglomerative algorithms (average linkage, complete linkage, centroid linkage, single linkage) using FITSS (Force Into The Same Subtrees)
    • IMM (infinite mixture model-based approach)

  • Other publicly available software used in this work:

  • Gene Expression data sets used:

  • Completely synthetic data sets used (5 randomly generated data sets):
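
For synthetic data, where the true class of every gene is known, accuracy with respect to external knowledge can be scored by comparing a clustering against the true partition. The sketch below uses the adjusted Rand index for this comparison; this page does not name the index used in the study, so that choice is an assumption here, and the function name `adjusted_rand_index` is hypothetical. The same function could compare two clusterings of re-measured data as a stability score.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Agreement between two partitions of the same items.

    Returns 1.0 for identical partitions and values near 0 for
    chance-level agreement. Assumed here as the agreement measure;
    this page does not state which index the study used.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    n = len(labels_a)
    if n < 2:
        raise ValueError("need at least two items")

    # Pairs of items placed together in both partitions.
    together_both = sum(
        comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values()
    )
    # Pairs placed together within each partition separately.
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = together_a * together_b / comb(n, 2)
    max_index = (together_a + together_b) / 2
    if max_index == expected:
        # Degenerate case: both partitions are all-in-one or all singletons.
        return 1.0
    return (together_both - expected) / (max_index - expected)


true_classes = [0, 0, 1, 1]
same_up_to_renaming = adjusted_rand_index(true_classes, [1, 1, 0, 0])
crossed = adjusted_rand_index(true_classes, [0, 1, 0, 1])
```

The index depends only on which items are grouped together, so cluster labels do not need to match the names of the true classes.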