Center for Pathway Inference
The UW Center for Pathway Inference
In January of 2005, Dr. Bumgarner and a large team of investigators from the UW submitted a $2.6M/year grant application to the NIH National Centers for Biological Computation (NCBC) Program. The proposed center is intended to bring together a group of computational scientists and biologists to create and test tools to infer biological pathways. While we won’t know if the Center is funded until late in 2005, this site is intended to provide a brief description of the center and updated links to research efforts of the assembled team.
Overview and Description of the Center
A fundamental issue facing modern biology is the development of methods and tools to convert genomic and functional genomics data into mechanistic understanding. Genome sequences and computational methods have provided us with tools to identify and annotate genes and other functional sequences with varying degrees of accuracy. This sequence information in turn enables tools such as microarrays, 2-D gels, protein mass spectrometry and yeast 2-hybrid assays to measure RNA and protein levels and physical interactions on genome-wide or near genome-wide scales. Genome information has also enabled the development of high-throughput genetic mapping technologies that allow one to rapidly trace a given phenotype or trait to specific genetic regions. Empowered by the wealth of available genome information and the new tools of functional genomics, biologists in almost every field are now producing genome-wide data of a variety of types (transcriptomics, proteomics, deletion and knock-down screens).

However, in most cases there is still a major gap between the collection of large-scale genomic/genetic data and the inference of biological mechanism for a particular state, disease, or clinical outcome. This gap has become extremely apparent in the last several years, during which an increasing number of microarray- and proteomics-based studies have been published that contribute little to new biological understanding.

The proposed center seeks to bridge this gap by creating a series of publicly available algorithms, tools, databases and methodologies that will assist in the inference of biological pathways via the integration of researchers’ data sets with predicted physical interaction networks and functional annotations. In recent years, there has been an emerging paradigm in which it has become apparent that to obtain mechanistic understanding of large-scale expression and other genomics data, it is necessary to view this data in the context of an underlying model of the interactions between the biomolecules whose levels/properties are being measured and other biomolecules in the system that influence these levels/properties[1-6]. Hence, there is a growing need to gather, predict and represent genome-wide physical interaction networks (for many genomes) and to make these networks and associated analytical and visualization tools publicly available to a broad cross section of researchers. In addition, there are still a large number of genes whose function is completely unknown. Hence, tools to more accurately predict functional annotations on a genome wide scale are sorely needed for research to progress.

The UW "Center for Pathway Inference" will bring together researchers from Computer Science, several basic biology departments (Medicine, Microbiology, Lab Medicine, Genome Sciences and others), Statistics and Biostatistics and Biomedical Health Informatics to build a team of scientists and instructors who will not only create and distribute the software tools and databases that will enable "Systems Biology" but who are also committed to creating and training scientists capable of working in this emerging discipline. As required by the request for applications, the center will consist of two principal computational cores that interact tightly with several "driving biological projects". In addition, four other cores (Infrastructure, Training and Education, Dissemination and Administration) will support the research and training mission of the center. Each of the Core activities is described briefly below.

Core 1 - Structural and Functional Predictions
This core, headed by Ram Samudrala (Assistant Professor, Microbiology), will focus on improving algorithms and methods for predicting protein-protein, protein-RNA and protein-DNA interactions and for predicting gene functional annotations. A major activity of Core 1 will be to expand and improve on Dr. Samudrala’s pre-existing "Bioverse", a database and web server of predicted/measured functional, structural and interaction annotations for all genes in more than 50 genomes. At present, it is the only publicly available source of protein-protein interactions for the human genome. In addition, it is the only available source of predicted physical interaction networks for 45 of the other 49 genomes represented in Bioverse, including several versions of the rice genome and numerous pathogenic bacteria. The methods and underlying data structure of Bioverse will be modified to incorporate protein-RNA and protein-DNA interaction predictions from Core 1 and measurements gathered in Core 2 and the DBP’s.

Dr. Larry Ruzzo (Professor, Computer Science and Engineering, Co-Investigator) will focus on the development of computational tools that assist in key steps in the discovery and characterization of non-coding RNAs (ncRNAs) and on tools for sequence-based RNA motif and RNA-protein interaction predictions. In the past several years, dramatic discoveries have greatly expanded both the number of known ncRNAs and the breadth of their biological roles, including significant regulatory and developmental roles. In short, ncRNAs are much more biologically significant than previously realized. While there have been computational advances in this area, the state of the art in computational tools to describe and search for ncRNAs still lags considerably behind many other sequence analysis tasks. In particular, available models for ncRNA families are limited, and current tools for ncRNA model inference and sequence annotation are often challenging to use, inaccurate, infeasibly slow or all three. In addition, tools to accurately predict protein binding motifs in RNA need to be developed.

The overall goal of Core 1 is to produce improved tools and methods to accurately predict structures of, functions of and physical interactions between bio-molecules on a genome-wide scale and to represent those interactions in a web accessible database.

Core 2 - Expression Informatics
Roger Bumgarner, Associate Professor Microbiology, PI.
Co-Investigators Josh Akey, Assistant Professor Genome Sciences and Ka Yee Yeung, Research Assistant Professor, Microbiology

The expression informatics core will focus on:
  1. The development of tools and methods to analyze expression QTL data. We will develop open source software for the analysis of eQTL data, with an emphasis on methods applicable to outbred populations such as humans. The analysis of eQTL data presents unique computational challenges given the size and complexity of the experiments. However, we believe that with the appropriate methodological and statistical tools, the inherent complexity of eQTL data can be used to extract biologically meaningful information and provide important insights into the genetics of gene expression.
  2. Data mining of public data sources to supplement the physical interaction networks of Core 1 with putative protein-DNA interactions derived from ChIP-chip data and eQTL/expression annotation of these putative interactions.
  3. The development of publicly accessible tools and algorithms to map "omics data" (expression array, proteomics, large-scale screening data, etc.) onto the physical interaction networks generated in Core 1.
  4. Tools to assist in the identification of relevant transcriptional modules and regulatory pathways from the integration of researchers’ data with the networks in Bioverse and other sources of public data (e.g. related expression data).
Specifically, Core 2 will focus on tools that will allow researchers to view their data in the context of the predicted interaction networks and other public data sources. This data integration coupled with a web server will allow external researchers to generate testable transcriptional regulatory hypotheses. For example, for any gene or collection of genes of interest, it will be possible to ask:
  1. What other genes are co-expressed with this(these) gene(s) in a given subset of experiments (both from the researcher and in our database) AND
  2. Do the genes identified in (1) share (a) common expression QTL(s)? AND
  3. What are the predicted transcription factors within the QTL’s? AND
  4. Do any of these transcription factors bind upstream of any of the co-expressed genes? AND
  5. Are the genes of interest closely interconnected in a predicted or known physical interaction network?
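As a rough illustration, the chained questions above could be expressed over toy in-memory data as follows. This is a minimal sketch and not the Center's software: every function name, data structure, gene/QTL/transcription factor identifier and the correlation threshold is a hypothetical placeholder chosen for the example.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def coexpressed(expr, seeds, threshold=0.8):
    """Question 1: genes correlated with any seed gene above the threshold."""
    return {
        gene
        for gene, profile in expr.items()
        if gene not in seeds
        and any(pearson(profile, expr[s]) >= threshold for s in seeds)
    }


def shared_qtls(eqtl, genes):
    """Question 2: expression QTLs linked to every gene in the set."""
    sets = [eqtl.get(g, set()) for g in genes]
    return set.intersection(*sets) if sets else set()


def binding_tfs(tfs_in_qtl, tf_targets, qtls, genes):
    """Questions 3 and 4: TFs inside the shared QTLs that bind upstream
    of at least one gene in the set."""
    candidates = set().union(*(tfs_in_qtl.get(q, set()) for q in qtls))
    return {tf for tf in candidates if tf_targets.get(tf, set()) & genes}


def connectivity(edges, genes):
    """Question 5: fraction of gene pairs joined by a direct network edge."""
    pairs = list(combinations(sorted(genes), 2))
    if not pairs:
        return 0.0
    return sum(frozenset(p) in edges for p in pairs) / len(pairs)


# Hypothetical toy data (not taken from Bioverse or any real data set).
expr = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.1, 6.0, 8.2],
    "geneC": [4.0, 1.0, 3.5, 2.0],
}
eqtl = {"geneA": {"qtl1", "qtl2"}, "geneB": {"qtl1"}}
tfs_in_qtl = {"qtl1": {"tfX", "tfY"}}
tf_targets = {"tfX": {"geneB"}, "tfY": {"geneZ"}}
edges = {frozenset(("geneA", "geneB"))}

seeds = {"geneA"}
module = seeds | coexpressed(expr, seeds)
qtls = shared_qtls(eqtl, module)
tfs = binding_tfs(tfs_in_qtl, tf_targets, qtls, module)
score = connectivity(edges, module)
```

Each step narrows the result of the previous one, mirroring the AND between the questions; a real implementation would presumably query a database rather than Python dictionaries.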
Core 2 will collaborate with Core 1 to design and implement a web server that allows external users to map expression data or gene lists onto any of the parameters in Bioverse. Core 2 will also develop tools for improved pattern recognition and for the analysis of eQTL data in outbred populations. Core 2 will work in close collaboration with the DBP’s and external researchers to develop and test our methods and visualization tools to assure that the software we develop is meeting the biological needs.

Core 3 - Driving Biological Projects
In order to adequately test our ability to infer pathways and to create testable hypotheses, and to drive the development of the software tools and user interfaces, Cores 1 and 2 of the Center will collaborate with a highly selected set of thematically integrated driving biological projects (DBP’s) in Core 3. These DBP’s were specifically selected to meet the following criteria.
  1. Each DBP has strong computational requirements to accomplish the biology of interest
  2. Each DBP will generate data and requirements that will test Core 1 and Core 2’s predictions and software.
  3. Each DBP’s biological problem requires an improved understanding of the host innate immune response and will generate data that will shed light on the host innate immune response
While the tools developed in Cores 1 and 2 are generally applicable to a broad range of biological problems, we have intentionally chosen DBP’s that are thematically focused in the general area of understanding the host innate immune response. The creation of a core of thematically and scientifically integrated DBP’s offers a number of advantages relative to a more arbitrary selection of DBP’s. The proposed DBP’s are:
From the Center’s standpoint, the primary goal of Core 3 is to work with Cores 1 and 2 to drive the development of software and databases needed to understand the biology of interest and to assure that such tools are generalizable to other areas of study. From the researcher’s standpoint, the DBP projects are asking fundamental questions about the nature of the host innate immune response that require the computational approaches and tools developed in Cores 1 and 2.

In addition to the science in Cores 1-3, the center will be supported by a strong infrastructure core (Core 4), a training and education core (Core 5), a dissemination core (Core 6) and an administrative core (Core 7).

Core 4 - Infrastructure
Ram Samudrala, PI

Core 4 will consist predominately of a systems administrator and a database administrator. The infrastructure core will be primarily responsible for maintaining production computer systems, databases and software developed in Cores 1 and 2 (the cores will maintain their own development systems). In addition, in years 02 and on, Core 4 will provide a technical writer whose role will be to document code and to produce (in collaboration with Core 6) professional training materials and user manuals.

The primary goal of Core 4 is to provide the systems and technical writing support necessary to deploy software that is useable by external researchers and researchers within the DBP’s.

Core 5 - Education
Ira Kalet, PI

The overall goal of the Education Core is to build on the existing strong educational programs at the University of Washington in Biomedical and Health Informatics, Computer Science and Engineering and the biological sciences, by adding key components that will enhance cross-disciplinary education between the information and computing sciences and the biomedical sciences.
Core 5 will:
The primary goal of Core 5 is to enhance cross-disciplinary education between the information and computing sciences and the biomedical sciences with a heavy emphasis on principles of software design, use and dissemination.

Core 6 - Dissemination
Peter Tarczy-Hornoch, PI

Core 6 will be focused on providing support for the dissemination of our software tools and databases. This will consist of a combination of web development, focused training and tutorials and professional software documentation.

The primary goal of Core 6 is to assure that the tools developed within the center are readily available to external researchers in the most usable form (e.g. easily distributed, well documented and associated with good training materials and courses) and that the software development efforts are responsive to user needs and inputs.

Core 7 - Administration
Roger Bumgarner, PI

Core 7 will provide the basic fiscal and secretarial support to run the center, to organize internal and external advisory board meetings and to assist in recruiting new DBP’s and other interacting scientific proposals.

The overall goal of Core 7, simply put, is to assure that the Center meets its scientific and public objectives in a timely and cost effective manner.

The center itself will serve as a core computational biology resource to researchers worldwide by providing tools, databases, methodologies and training to allow researchers to take a “systems biology” approach to their projects. We will create a web-based toolkit that will allow biologists to integrate and visualize their data in the context of predicted and measured physical interactions, functional annotations and internally and externally generated data sources.

Hence, the overall goal of the Center is to assist biologists in the rapid generation of testable, mechanistic hypotheses that will provide a crucial bridge between their functional genomics data and biological understanding.

Why will the Center for Pathway Inference be a national resource?

As stated at the outset of this introduction, a fundamental issue facing modern biology is the development of methods and tools to convert genomic and functional genomics data into mechanistic understanding. We, and many others [2, 4, 6-12], are of the opinion that this is best accomplished by creating tools that allow one to visualize and analyze functional genomics data in the context of known and predicted networks and pathways. Indeed, a number of researchers outside of this center are taking similar approaches and are creating related open source tools for this kind of analysis (for example, Cytoscape [13], GenMAPP [14, 15], PathMAPA [16] and others). Our center's approach builds upon this growing body of literature to create user-friendly tools that are both web accessible and, most importantly, linked to a hosted database of predicted structures, physical interaction networks and functional annotations for more than 50 genomes. In addition, the center will continue to create and improve upon algorithms and methods to produce these predictions and functional annotations and will provide improved analysis tools for ncRNA sequence data, eQTL data, and for mining expression data in the context of other data sets.
The advantages of UW Center for Pathway Inference methods relative to other approaches are:
In addition to the software and databases we will create, the center is committed to training existing and new scientists in both the principles of robust software design and the emerging field of systems biology. Core 5 is designed with the teaching mission of the University in mind and has a strong emphasis on training both internal and external researchers in the principles of good software design. Through a combination of course development, seminars and workshops, Core 5 will greatly expand the biocomputation educational opportunities available at the UW and elsewhere. In addition to these more "formal" educational efforts, the center will (of course) train graduate students and postdoctoral fellows via direct participation in the research mission.

Across all the cores, the center will fund a total of 6 graduate students and 6 postdoctoral fellows, with the majority of these in Cores 1 and 2. Current and recent students of the faculty in the Center have been exceptionally productive. For example, Ka Yee Yeung (now a Research Assistant Professor but most recently a postdoctoral fellow in Dr. Bumgarner’s lab and prior to that, a Ph.D. student with Dr. Ruzzo) has 13 publications within the last four years including eight first author efforts (two of which were amongst the most highly cited papers in computer science in 2002 and 2004). Kai Wang, a second year graduate student in Dr. Samudrala’s lab, published six papers in 2004 alone (of which five were first author efforts). Ultimately, the students and postdoctoral fellows that we train within the center (such as the two mentioned above) may be the center's biggest contribution as a national resource, as these individuals will go on to establish their own labs and train others.

Summary
Today, an ever-increasing number of researchers apply the tools of functional genomics (array and proteomic measurements) to their biological systems. As this type of data continues to pour into the literature, it is becoming increasingly apparent that, in many cases, very little new biological understanding is being developed [17]. The UW Center for Pathway Inference (UW-CFPI) is designed to bridge the gap between the collection of large-scale functional genomics data and mechanistic understanding of the underlying biology that is responsible for the observations. Our primary efforts are focused around the creation of tools that will allow researchers to mine their data in the context of predicted physical interaction networks, functional annotations and other related data sets. We feel that tools such as the ones proposed in this center are critical if we are to fully reap the rewards of the effort and resources that have been put into the development and application of functional genomics tools such as expression arrays and proteomics methodologies. By building on existing software, database and human resources at the University of Washington, the proposed center is well positioned to become an important national resource for computational biology.

References
  1. Ideker, T., A systems approach to discovering signaling and regulatory pathways--or, how to digest large interaction networks into relevant pieces. Adv Exp Med Biol, 2004. 547: p. 21-30.
  2. Yeang, C.H., T. Ideker, and T. Jaakkola, Physical network models. J Comput Biol, 2004. 11(2-3): p. 243-62.
  3. Ideker, T., Systems biology 101--what you need to know. Nat Biotechnol, 2004. 22(4): p. 473-5.
  4. Ideker, T., T. Galitski, and L. Hood, A new approach to decoding life: systems biology. Annu Rev Genomics Hum Genet, 2001. 2: p. 343-72.
  5. Ideker, T., et al., Integrated genomic and proteomic analyses of a systematically perturbed metabolic network. Science, 2001. 292(5518): p. 929-34.
  6. Ideker, T.E., V. Thorsson, and R.M. Karp, Discovery of regulatory interactions through perturbation: inference and experimental design. Pac Symp Biocomput, 2000: p. 305-16.
  7. Spivey, A., Systems biology: the big picture. Environ Health Perspect, 2004. 112(16): p. A938-43.
  8. Dhar, P.K., H. Zhu, and S.K. Mishra, Computational approach to systems biology: from fraction to integration and beyond. IEEE Trans Nanobioscience, 2004. 3(3): p. 144-52.
  9. Butcher, E.C., E.L. Berg, and E.J. Kunkel, Systems biology in drug discovery. Nat Biotechnol, 2004. 22(10): p. 1253-9.
  10. Ishii, N., et al., Toward large-scale modeling of the microbial cell for computer simulation. J Biotechnol, 2004. 113(1-3): p. 281-94.
  11. Covert, M.W., et al., Integrating high-throughput and computational data elucidates bacterial networks. Nature, 2004. 429(6987): p. 92-6.
  12. Huang, S., Back to the biology in systems biology: what can we learn from biomolecular networks? Brief Funct Genomic Proteomic, 2004. 2(4): p. 279-97.
  13. Shannon, P., et al., Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res, 2003. 13(11): p. 2498-504.
  14. Doniger, S.W., et al., MAPPFinder: using Gene Ontology and GenMAPP to create a global gene-expression profile from microarray data. Genome Biol, 2003. 4(1): p. R7.
  15. Dahlquist, K.D., et al., GenMAPP, a new tool for viewing and analyzing microarray data on biological pathways. Nat Genet, 2002. 31(1): p. 19-20.
  16. Pan, D., et al., PathMAPA: a tool for displaying gene expression and performing statistical tests on metabolic pathways at multiple levels for Arabidopsis. BMC Bioinformatics, 2003. 4(1): p. 56.
  17. Quackenbush, J., Genomics. Microarrays--guilt by association. Science, 2003. 302(5643): p. 240-1.