Our lab develops novel computational methods to study cellular biological systems from a global, data-driven perspective. Our main focus is deciphering the regulation of gene expression, in terms of both the molecular networks that implement regulatory processes and the regulatory sequence information encoded in the genome. We work closely with experimental collaborators so that computationally derived hypotheses can be validated in the lab.
Transcriptional Regulatory Networks
We are interested in learning gene regulatory programs that accurately predict genome-wide differential mRNA expression under different cellular conditions, and in extracting testable hypotheses about transcriptional regulatory networks from them. We have developed algorithmic approaches that integrate promoter sequence, mRNA expression data, and ChIP-on-chip binding data to learn gene regulatory programs and discover transcription factor binding motifs. We have used one such approach, the MEDUSA algorithm, to model the oxygen and heme regulatory network in yeast, and we discovered novel candidate oxygen regulators that our experimental collaborators were able to validate biochemically.
Gene Silencing by microRNAs
Most current computational methods for predicting microRNA targets rely on searching for sites complementary to the "seed" region of the microRNA and filtering results based on cross-species conservation or other requirements. However, these approaches still appear to return large numbers of false positives while failing to achieve reasonable sensitivity; for example, current target prediction programs can account for only a small fraction of the genes that are downregulated after microRNA transfection. To improve upon these rule-based approaches, we are developing a supervised learning method to predict microRNA target site efficiency using data from genome-wide expression profiling following microRNA transfection experiments. We are also extending this algorithm to integrate cell-specific endogenous microRNA profiles in order to perform cell-specific target prediction. Finally, we are beginning to explore models of microRNA-mediated silencing that account quantitatively for the concentrations of RISC and other protein machinery in the microRNA pathway, including the possibility that microRNAs compete for limited protein resources.
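As a rough illustration of the rule-based seed search described above (the baseline, not our supervised method), the following minimal sketch scans a 3' UTR for sites complementary to a microRNA seed. It assumes the common convention that the seed spans nucleotides 2-8 of the microRNA; the function names and the toy UTR sequence are hypothetical, and no conservation filtering is applied.

```python
# Watson-Crick complement for the RNA alphabet.
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def seed_match_site(mirna, start=2, end=8):
    """Return the mRNA site (5'->3') complementary to the microRNA seed.

    Assumption: the seed is nucleotides start..end (1-based, inclusive)
    of the microRNA, here 2-8 by default.
    """
    seed = mirna[start - 1:end]
    # Reverse complement: the target pairs antiparallel to the seed.
    return seed.translate(COMPLEMENT)[::-1]


def find_seed_sites(utr, mirna):
    """Return 0-based start positions of all seed matches in the UTR."""
    site = seed_match_site(mirna)
    positions = []
    i = utr.find(site)
    while i != -1:
        positions.append(i)
        i = utr.find(site, i + 1)  # step by 1 so overlapping hits are kept
    return positions


if __name__ == "__main__":
    let7a = "UGAGGUAGUAGGUUGUAUAGUU"        # mature let-7a sequence
    toy_utr = "AAACUACCUCAGGGCUACCUCUU"     # hypothetical UTR fragment
    print(seed_match_site(let7a))
    print(find_seed_sites(toy_utr, let7a))
```

Every exact match is reported regardless of context, which is one reason a search of this kind tends to produce many false positives.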
Remote Protein Homology Detection
Recognizing a protein's fold from its primary sequence of amino acids is a long-standing problem in computational biology. Traditional approaches use pairwise sequence comparison or protein family models based on multiple sequence alignments to infer structural relationships from sequence similarity. However, these methods may not perform well in the remote homology detection setting, where the protein sequence to be classified is only remotely homologous to known protein families. Our lab introduced the use of biologically motivated, k-mer-based "string kernels" for support vector machine (SVM) classification of protein sequences into structural categories, achieving state-of-the-art performance for remote homology detection.
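To make the idea of a k-mer-based string kernel concrete, here is a minimal sketch of the simplest such kernel, in which two sequences are compared by the inner product of their k-mer count vectors. This is an illustrative assumption rather than a description of any specific kernel used in our work; the function names and the choice k=3 are hypothetical, and sequences are assumed to be at least k residues long.

```python
import math
from collections import Counter


def kmer_counts(seq, k):
    """Count every contiguous length-k substring (k-mer) of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(a, b, k=3):
    """Inner product of the k-mer count vectors of sequences a and b."""
    counts_a = kmer_counts(a, k)
    counts_b = kmer_counts(b, k)
    # Counter returns 0 for k-mers absent from b, so only shared k-mers add.
    return sum(n * counts_b[kmer] for kmer, n in counts_a.items())


def normalized_kernel(a, b, k=3):
    """Cosine-normalized kernel, so that K(x, x) = 1.

    Assumption: both sequences have length >= k; otherwise the
    self-kernel is 0 and the division below is undefined.
    """
    return spectrum_kernel(a, b, k) / math.sqrt(
        spectrum_kernel(a, a, k) * spectrum_kernel(b, b, k)
    )


if __name__ == "__main__":
    x = "MKVMKV"   # toy amino acid sequences, not real proteins
    y = "MKVLLA"
    print(spectrum_kernel(x, y))
    print(normalized_kernel(x, y))
```

A matrix of such kernel values over a set of training sequences can be passed to any SVM implementation that accepts a precomputed kernel, so no explicit alignment is required.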



