Bioinformatics Seminar: Computational Phenotype Prediction by Querying Public Microarray Data Repositories
| When | Mar 20, 2007 from 11:00 am to 12:00 pm |
|---|---|
| Where | LT-415 |
Speaker
Jianjun Hu
Postdoc Research Fellow
Molecular and Computational Biology Section
University of Southern California
Abstract
One fundamental goal of genomics research is to elucidate the molecular underpinnings of complex phenotypic traits such as human diseases and to develop novel methods for diagnosing diseases. High-throughput genomic databases, gene expression microarray data repositories, and biomedical knowledge bases are currently growing rapidly. However, few methods are available to effectively integrate these sources of information across different biological scales for the purposes of phenotype prediction and genotype-phenotype relationship discovery.
In this talk, we present a framework that exploits the large-scale National Center for Biotechnology Information Gene Expression Omnibus (NCBI GEO) gene expression microarray repository and the Unified Medical Language System (UMLS) biomedical ontology to predict phenotypes from gene expression profiles of unknown traits or diseases. For each reference dataset in NCBI GEO, we extract descriptive keywords from the MeSH headings of its related publication and from its GEO dataset summary, and map them to UMLS concepts. The phenotype of a dataset is thus encoded as a list of UMLS concepts. We then define a cellular-state-switch signature as the list of genes ranked by their expression change between two conditions or states. Given a query gene expression dataset, we search the NCBI GEO datasets by the similarity of their cellular-state-switch signatures, computing ordered-list similarity scores between two profile pairs. We then use kernel logistic regression to predict the phenotypes of the query dataset in terms of UMLS concepts, an approach that has the advantage of considering the relationships among all the reference datasets. To link phenotypic concepts to their underlying genes, we propose an order-statistics-based ranking algorithm to screen for highly differentially expressed genes related to a given UMLS concept. Our experiments on a selected set of GEO datasets demonstrate the utility of public high-throughput biological datasets for phenotype prediction and for elucidating genotype-phenotype relationships.
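As an illustration of the signature and similarity steps described above, the following is a minimal Python sketch. It is not the speaker's implementation: the abstract does not give the scoring formula, so the log2 fold-change ranking, the prefix-overlap score, the function names, and the toy expression values are all assumptions chosen for clarity.

```python
from math import log2


def state_switch_signature(expr_a, expr_b):
    """Rank genes by log2 fold change from condition A to condition B.

    expr_a, expr_b: dicts mapping gene name -> positive mean expression.
    Returns gene names ordered from most up-regulated to most down-regulated.
    (Assumed definition; the abstract only says genes are ranked by change.)
    """
    shared = sorted(set(expr_a) & set(expr_b))  # sorted so ties break deterministically
    return sorted(shared, key=lambda g: log2(expr_b[g] / expr_a[g]), reverse=True)


def top_k_overlap(sig_x, sig_y, k):
    """Average overlap of the top-i prefixes for i = 1..k, a value in [0, 1]."""
    k = min(k, len(sig_x), len(sig_y))
    if k == 0:
        return 0.0
    seen_x, seen_y = set(), set()
    total = 0.0
    for i in range(k):
        seen_x.add(sig_x[i])
        seen_y.add(sig_y[i])
        total += len(seen_x & seen_y) / (i + 1)
    return total / k


def signature_similarity(sig_x, sig_y, k=3):
    """Hypothetical ordered-list score: compare both ends of the two rankings."""
    up = top_k_overlap(sig_x, sig_y, k)                # most up-regulated genes
    down = top_k_overlap(sig_x[::-1], sig_y[::-1], k)  # most down-regulated genes
    return (up + down) / 2


# Toy data (made up): mean expression per gene in two conditions.
query = state_switch_signature(
    {"TP53": 8.0, "MYC": 2.0, "EGFR": 4.0, "GAPDH": 10.0},
    {"TP53": 2.0, "MYC": 8.0, "EGFR": 8.0, "GAPDH": 10.0},
)
print(query)
print(signature_similarity(query, query, k=3))
```

In a setting like the one the abstract describes, a score of this kind would be computed between the query signature and each reference signature, and the resulting similarities could then serve as kernel values for the prediction step.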

