Bayesian mixture model based clustering of replicated microarray data


Mario Medvedovic, Ka Yee Yeung and Roger E. Bumgarner


Abstract
Motivation: Identifying patterns of co-expression in microarray data by cluster analysis has been a productive approach to uncovering the molecular mechanisms underlying the biological processes under investigation. Using experimental replicates can generally improve the precision of the cluster analysis by reducing the experimental variability of measurements made under different experimental conditions. In such situations, Bayesian mixtures allow for an efficient use of information by precisely modeling between-replicate variability.

Results: We developed a Bayesian mixture based clustering procedure for clustering gene expression data with experimental replicates. In this approach, a Bayesian mixture model is extended to accommodate experimental replicates. Clusters of co-expressed genes are created from the posterior distribution of clusterings, which is estimated by a Gibbs sampler. Previously, we established that this approach to clustering microarray data with experimental replicates outperforms alternatives based on traditional clustering methods. Here, we investigated the utility of both finite and infinite mixture models in this setting. By analyzing synthetic and real-world datasets, we established that the precise modeling of intra-gene variability is important for accurate identification of co-expressed genes. Such modeling is possible only when replicated data is available. We also introduce a heuristic modification to the Gibbs sampler based on the "reverse annealing" principle. This modification effectively overcame the tendency of the Gibbs sampler to converge to different modes of the posterior distribution when started from different initial positions under high experimental variability. Finally, we demonstrate that the Bayesian infinite mixture model with "elliptical" variance structure is capable of identifying the underlying structure of the data without knowing the "correct" number of clusters.