Over the past few years, our primary work on data analysis tools has focused on improving clustering and classification algorithms. In particular, we have concentrated on tools that take advantage of replicate measurements or error estimates. Cluster analysis is a common approach to discovering patterns in complex data sets. Applied to microarray data, it is typically used in an exploratory mode to identify genes that share common expression patterns over multiple experiments, experiments (or samples) that share common expression patterns over multiple genes, or both. For a brief tutorial on cluster analysis, see Dr. KaYee Yeung's presentation.
Regardless of the method used for cluster analysis, it is important to consider whether objects (genes or experiments) that cluster together do so by chance - that is, would the same cluster be created again with a replicate data set or a subset of the current data set? Before our work, a great deal of research focused on bootstrap or jackknife approaches to evaluating the robustness of a given clustering result. In these approaches, a given number of experiments (or genes) are left out, the cluster analysis is repeated, and one checks whether the relationships between genes (or experiments) are maintained. If this is repeated a number of times, the frequency with which a given relationship is maintained indicates how much confidence one should have in that relationship. In most cases, even when replicate measurements were available, only the average values were provided to the cluster analysis. When only average values are fed to a cluster analysis, valuable information - the variability of each measurement - is lost.
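The leave-out procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in the work described here: the function names (`kmeans_labels`, `jackknife_coclustering`), the choice of a small k-means routine as the clustering method, and the leave-one-experiment-out scheme are all hypothetical choices made for the example.

```python
import numpy as np

def kmeans_labels(X, k, n_iter=50):
    """Tiny k-means (Lloyd's algorithm), used here only as a stand-in clusterer.

    Centers are seeded deterministically: the first row, then repeatedly
    the row farthest from all centers chosen so far.
    """
    X = np.asarray(X, dtype=float)
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        # Squared Euclidean distance from every row to every center.
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def jackknife_coclustering(X, k):
    """Leave out each experiment (column) in turn, re-cluster the genes (rows),
    and return the fraction of runs in which each pair of genes shares a cluster."""
    X = np.asarray(X, dtype=float)
    n_genes, n_exps = X.shape
    together = np.zeros((n_genes, n_genes))
    for leave_out in range(n_exps):
        keep = [j for j in range(n_exps) if j != leave_out]
        labels = kmeans_labels(X[:, keep], k)
        together += labels[:, None] == labels[None, :]
    return together / n_exps

# Synthetic example (made-up data): five rising and five falling expression profiles.
rng = np.random.default_rng(0)
up = np.arange(6, dtype=float)
X = np.vstack([up + rng.normal(0, 0.1, 6) for _ in range(5)] +
              [up[::-1] + rng.normal(0, 0.1, 6) for _ in range(5)])
freq = jackknife_coclustering(X, k=2)
print(np.round(freq, 2))
```

Entry `freq[i, j]` is the fraction of leave-one-out runs in which genes `i` and `j` fell in the same cluster. Values near 1 suggest a relationship that does not depend on any single experiment, while intermediate values flag pairs whose grouping is fragile.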
Our work on cluster analysis takes advantage of repeated measurements either by using the variance as a weighting factor in the distance measure or, better yet, by explicitly modeling the repeated measures in a Bayesian, model-based clustering approach. In addition, we tested a number of clustering algorithms and distance measures on both real and synthetic data to evaluate which methods are better able to recover known patterns in the data. We have also investigated how often genes that are co-expressed across a number of experiments are co-regulated by a common transcription factor. Finally, along similar lines, we have developed classification tools that take advantage of repeated measures.
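As a rough illustration of using the variance as a weighting factor in a distance measure, the sketch below down-weights experiments whose replicates are noisy. The exact weighting used in the work described here is not given in this text, so the inverse-summed-variance weights, the function name `variance_weighted_distance`, and the `eps` parameter are assumptions made for this example.

```python
import numpy as np

def variance_weighted_distance(reps_a, reps_b, eps=1e-8):
    """Distance between two genes from replicate measurements.

    reps_a, reps_b: arrays of shape (n_experiments, n_replicates), with at
    least two replicates per experiment.
    Assumed weighting (illustrative only): each experiment is weighted by
    1 / (var_a + var_b), normalised so the weights sum to 1.
    """
    reps_a = np.asarray(reps_a, dtype=float)
    reps_b = np.asarray(reps_b, dtype=float)
    mean_a, mean_b = reps_a.mean(axis=1), reps_b.mean(axis=1)
    # Sample variance of the replicates in each experiment.
    var_a = reps_a.var(axis=1, ddof=1)
    var_b = reps_b.var(axis=1, ddof=1)
    weights = 1.0 / (var_a + var_b + eps)  # eps guards against zero variance
    weights = weights / weights.sum()
    return float(np.sqrt(np.sum(weights * (mean_a - mean_b) ** 2)))

# Made-up data: experiment 0 is noisy and the means differ a lot;
# experiment 1 is tightly replicated and the means agree.
gene_a = np.array([[0.0, 4.0], [1.0, 1.1]])
gene_b = np.array([[8.0, 12.0], [1.0, 1.1]])

weighted = variance_weighted_distance(gene_a, gene_b)
# Same formula with equal weights, i.e. what clustering on averages alone would see.
unweighted = float(np.sqrt(np.mean((gene_a.mean(axis=1) - gene_b.mean(axis=1)) ** 2)))
print(weighted, unweighted)
```

In this example the large difference between the two genes comes entirely from the noisy experiment, so the weighted distance is much smaller than the equal-weight distance computed from averages alone.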
In addition to the above stand-alone code, we have also adopted MeV as a vehicle for distributing our methods. At present, the classification program we have developed is being added to MeV to give our code a graphical user interface. We have also modified MeV to allow it to connect to our in-house
and to our public .