
Bayesian mixture model based clustering of replicated microarray data 

Mario Medvedovic, Ka Yee Yeung and Roger E. Bumgarner 

Abstract 
Motivation:
Identifying patterns of coexpression in microarray data by cluster
analysis has been a productive approach to uncovering molecular mechanisms
underlying biological processes under investigation. Using experimental
replicates can generally improve the precision of the cluster analysis by
reducing the experimental variability of measurements made under different
experimental conditions. In such situations, Bayesian mixtures allow for
an efficient use of information by precisely modeling between-replicate
variability.
Results:
We developed a Bayesian mixture-based procedure for clustering
gene expression data with experimental replicates. In this approach,
a Bayesian mixture model is extended to accommodate experimental
replicates. Clusters of coexpressed genes are created from the
posterior distribution of clusterings, which is estimated by a
Gibbs sampler. We previously established that this approach to
clustering microarray data with experimental replicates outperforms
alternatives based on traditional clustering methods.
The utility of both finite and infinite mixture models in this setting
was investigated. By analyzing synthetic and real-world
datasets, we established that precise modeling of intra-gene
variability is important for accurate identification of
coexpressed genes. Such modeling is possible only when replicated
data is available. We also introduce a heuristic modification to the
Gibbs sampler based on the "reverse annealing" principle. This
modification effectively overcame the tendency of the Gibbs
sampler to converge to different modes of the posterior distribution
when started from different initial positions in situations with high
experimental variability. Finally, we demonstrate that the Bayesian
infinite mixture model with "elliptical" variance structure
is capable of identifying the underlying structure of the data
without knowing the "correct" number of clusters.
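
As an illustration of how a posterior distribution of clusterings can be
turned into clusters, the sketch below summarizes a set of sampled
clusterings by the fraction of samples in which each pair of genes falls in
the same cluster. This is a minimal sketch and not the authors' C++
implementation: the function name `coclustering_frequencies`, the input
layout (one list of cluster labels per retained Gibbs iteration), and the
toy labels are all assumptions made for illustration.

```python
from itertools import combinations


def coclustering_frequencies(samples):
    """Fraction of sampled clusterings in which each gene pair shares a cluster.

    samples[s][g] is the cluster label of gene g in sample s
    (hypothetical input layout, assumed for this sketch).
    """
    n_genes = len(samples[0])
    counts = {pair: 0 for pair in combinations(range(n_genes), 2)}
    for labels in samples:
        for i, j in counts:
            # Only label equality matters, so arbitrary relabeling of
            # clusters between samples does not affect the result.
            if labels[i] == labels[j]:
                counts[(i, j)] += 1
    return {pair: c / len(samples) for pair, c in counts.items()}


# Toy example: 4 genes, 4 sampled clusterings (made-up labels, not real output).
samples = [
    [0, 0, 1, 1],
    [0, 0, 1, 2],
    [1, 1, 0, 0],
    [0, 1, 2, 2],
]
freq = coclustering_frequencies(samples)

# One minus the co-clustering frequency can serve as a pairwise distance.
distance = {pair: 1.0 - p for pair, p in freq.items()}
```

Because the summary depends only on whether two genes share a label, it is
unaffected by cluster labels being permuted from one sample to the next. The
resulting distances could then be passed to a standard hierarchical
clustering routine to form the final clusters.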


 Manuscript: Bioinformatics 2004 20:1222-1232.
 Executables of our software are available as
command-line implementations.
The IMM
code was written in C++ by Mario Medvedovic.
 Web Supplement (pdf)
 Yeast galactose data:
 Full data (Ideker et al. 2001)
 Subset of 205 genes from the yeast galactose data used in the paper:
 Completely synthetic data sets used: (5 randomly generated
synthetic data sets)
 Readme (pdf)
 4 repeated measurements (400 genes, 20 experiments):
 sporadic data with 4 repeated measurements (400 genes, 20 experiments):
 1 (or no) repeated measurement (400 genes, 20 experiments):
 2 repeated measurements (400 genes, 20 experiments):
 3 repeated measurements (400 genes, 20 experiments):
 large datasets with 4 repeated measurements (10,000 genes, 20 experiments):
 Related previous work:
 Publicly available software packages used in this work:
   