|
Background: Cluster analysis is often used to infer regulatory
modules or biological function by associating unknown genes with
other genes that have similar expression patterns and known regulatory
elements or functions. However, clustering results may not have any
biological relevance.
Results: We applied various clustering algorithms to microarray
datasets of different sizes and evaluated the clustering results
by determining the fraction of gene pairs from the same clusters that
share at least one known common transcription factor. We used both yeast
transcription factor databases (SCPD, YPD) and Chromatin Immunoprecipitation
(ChIP) data to evaluate our clustering results. We showed that the ability
to identify co-regulated genes from clustering results is strongly dependent
on the number of microarray experiments used in cluster analysis, and that the
accuracy of these associations plateaus at between 50 and 100 experiments
on yeast data. Moreover, the model-based clustering algorithm MCLUST
consistently outperforms more traditional methods in accurately assigning
co-regulated genes to the same clusters on standardized data.
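The evaluation measure described above, the fraction of same-cluster gene pairs that share at least one known transcription factor, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `shared_tf_fraction`, the gene and factor labels, and the choice to skip genes without any annotation are all assumptions made for the example.

```python
from itertools import combinations


def shared_tf_fraction(clusters, gene_to_tfs):
    """Fraction of same-cluster gene pairs sharing at least one known TF.

    clusters    -- dict mapping gene name -> cluster label
    gene_to_tfs -- dict mapping gene name -> set of known transcription factors
    Genes with no annotation are skipped (an assumption of this sketch).
    """
    # Group the annotated genes by their cluster label.
    by_cluster = {}
    for gene, label in clusters.items():
        if gene in gene_to_tfs:
            by_cluster.setdefault(label, []).append(gene)

    shared = 0
    total = 0
    for genes in by_cluster.values():
        # Every unordered pair of genes within one cluster is counted once.
        for a, b in combinations(genes, 2):
            total += 1
            if gene_to_tfs[a] & gene_to_tfs[b]:
                shared += 1

    # With no same-cluster pairs the fraction is undefined.
    return shared / total if total else float("nan")


# Hypothetical toy data: five genes in two clusters.
clusters = {"g1": 0, "g2": 0, "g3": 0, "g4": 1, "g5": 1}
gene_to_tfs = {
    "g1": {"TF_A"},
    "g2": {"TF_A", "TF_B"},
    "g3": {"TF_C"},
    "g4": {"TF_B"},
    "g5": {"TF_B"},
}
fraction = shared_tf_fraction(clusters, gene_to_tfs)
print(fraction)  # 2 of the 4 same-cluster pairs share a factor -> 0.5
```

In this toy case, cluster 0 contributes three pairs (one sharing a factor) and cluster 1 contributes one pair (sharing a factor), so the fraction is 2/4. The same function could be applied with annotations drawn from a curated database or from ChIP data to compare the two evaluation criteria.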
Conclusions: Our results are consistent across independent
evaluation criteria, which strengthens our confidence in
them. However, when one compares ChIP data to YPD, the false
negative rate is approximately 80% using the recommended p-value of 0.001.
In addition, we showed that even with large numbers of experiments, the
false positive rate may exceed the true positive rate. In particular,
even when all experiments are included, the best results produce
clusters with only a 28% true positive rate, as judged by known
gene-transcription factor interactions.
|