We strongly believe in the importance of reproducible computational results and in open source software. We make all our software tools freely available for research, education, and non-profit use as soon as they are ready for public release, and we provide source code wherever possible. In a few cases, our code depends on commercial software packages or proprietary code of collaborators; in these situations, we can usually still provide our portion of the source code.
-
MEDUSA is an integrative method for learning motif models of transcription factor binding sites by incorporating promoter sequence and gene expression data. We use a modern large-margin machine learning approach, based on boosting, to enable feature selection from the high-dimensional search space of candidate binding sequences while avoiding overfitting.
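The feature-selection role of boosting can be illustrated with a minimal sketch. It is a generic discrete AdaBoost over k-mer presence rules and is not the MEDUSA implementation. It assumes that expression data have already been reduced to hypothetical up/down labels (+1 / -1) per gene, and it leaves out any pairing of motifs with regulator expression. The function names, the toy promoters, and the labels are all invented for illustration.

```python
import math


def stump_output(kmer, sign, promoter):
    """Weak rule: vote `sign` if the k-mer occurs in the promoter, else `-sign`."""
    return sign if kmer in promoter else -sign


def boost_kmers(promoters, labels, k=3, rounds=3):
    """Generic discrete AdaBoost over k-mer presence stumps (labels are +1 / -1)."""
    n = len(promoters)
    weights = [1.0 / n] * n
    # Candidate features: every k-mer seen in any promoter.
    candidates = sorted({p[i:i + k] for p in promoters for i in range(len(p) - k + 1)})
    model = []
    for _ in range(rounds):
        # Feature selection: keep the stump with the lowest weighted error.
        best = None
        for kmer in candidates:
            for sign in (1, -1):
                err = sum(w for w, p, y in zip(weights, promoters, labels)
                          if stump_output(kmer, sign, p) != y)
                if best is None or err < best[0]:
                    best = (err, kmer, sign)
        err, kmer, sign = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the logarithm finite
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, kmer, sign))
        # Up-weight the examples this stump got wrong, then renormalize.
        weights = [w * math.exp(-alpha * y * stump_output(kmer, sign, p))
                   for w, p, y in zip(weights, promoters, labels)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model


def predict(model, promoter):
    """Sign of the weighted vote of all selected stumps."""
    score = sum(alpha * stump_output(kmer, sign, promoter) for alpha, kmer, sign in model)
    return 1 if score >= 0 else -1


# Toy data (hypothetical): the three "up" promoters share the k-mer TGA.
promoters = ["AATGACC", "CTGACTT", "GGTGAAA", "AACCGGT", "CCTTAAG", "GGCCTTA"]
labels = [1, 1, 1, -1, -1, -1]
model = boost_kmers(promoters, labels)
```

Each boosting round adds a single k-mer rule, so the number of rounds bounds the number of sequence features the final model can use, however large the candidate set is.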
-
Rankprop is a publicly available web server that searches a protein database for proteins similar to a query sequence. At its core, Rankprop is a ranking algorithm that exploits the global network structure of similarity relationships among the proteins in the database: it performs a diffusion operation on a protein similarity network with weighted edges.
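As a rough illustration of diffusion on a weighted similarity network, the sketch below spreads activation outward from a query node and ranks the other nodes by the score they accumulate. It is a generic diffusion scheme written under our own assumptions, not the Rankprop server code. The row normalization, the damping parameter `alpha`, the iteration count, and the toy weights are all hypothetical choices.

```python
def diffusion_scores(weights, query, alpha=0.5, iterations=50):
    """Diffuse activation from `query` over a weighted graph.

    weights[i][j] is a (hypothetical) similarity weight for the edge i -> j.
    """
    n = len(weights)
    # Normalize each row to sum to 1 so that the iteration converges for alpha < 1.
    norm = []
    for row in weights:
        total = sum(row)
        norm.append([w / total if total > 0 else 0.0 for w in row])
    # Initial activation: direct (normalized) similarity to the query.
    base = norm[query][:]
    scores = base[:]
    for _ in range(iterations):
        # Each node keeps its direct score and receives a damped share
        # of the current scores of its neighbors.
        scores = [base[i] + alpha * sum(norm[j][i] * scores[j] for j in range(n))
                  for i in range(n)]
    return scores


def rank_hits(weights, query, **kwargs):
    """Database indices other than the query, best score first."""
    scores = diffusion_scores(weights, query, **kwargs)
    hits = [i for i in range(len(scores)) if i != query]
    return sorted(hits, key=lambda i: scores[i], reverse=True)


# Toy network: protein 0 is the query; protein 2 has no direct edge to it
# but is strongly linked to protein 1.
similarity = [
    [0.0, 0.9, 0.0, 0.0],
    [0.9, 0.0, 0.8, 0.0],
    [0.0, 0.8, 0.0, 0.1],
    [0.0, 0.0, 0.1, 0.0],
]
ranking = rank_hits(similarity, query=0)
```

In this toy network, protein 2 receives a positive score only through its link to protein 1, which is the kind of indirect relationship a purely pairwise comparison would miss.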
-
SVM-Fold is a publicly available web server that uses SVMs to predict family, superfamily, and fold-level classifications for a query protein sequence based on the Structural Classification of Proteins (SCOP).
SVM-Fold detects subtle protein sequence similarities by learning from all available annotated proteins and by making use of potential hits identified by PSI-BLAST. The SVMs can therefore still predict a class even when none of its known members has a significant pairwise PSI-BLAST E-value against the query.
-
String Kernels collects our lab's pioneering work on "k-mer" based string kernels for support vector machine classification of protein sequences into structural categories. These novel and efficient-to-compute string kernels incorporate biologically motivated notions of inexact string matching, based on shared approximate occurrences of short subsequences ("k-mers"). More recently, we introduced profile kernels, which leverage evolutionary information in the form of sequence "profiles" estimated from multiple alignments and achieve state-of-the-art performance for remote homology detection.
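To make the idea of shared approximate k-mer occurrences concrete, here is a minimal brute-force sketch of an exact-match k-mer kernel and a variant that tolerates up to `m` mismatches per k-mer. It enumerates mismatch neighborhoods explicitly, so it is far less efficient than the kernels described above and is only an illustration. The function names, the default alphabet string, and the toy sequences are our own assumptions.

```python
from collections import Counter
from itertools import combinations, product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed 20-letter protein alphabet


def spectrum_features(seq, k):
    """Count exact occurrences of every k-mer in the sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def mismatch_neighborhood(kmer, m, alphabet=AMINO_ACIDS):
    """All strings within Hamming distance m of `kmer` (brute force)."""
    neighbors = {kmer}
    for positions in combinations(range(len(kmer)), m):
        for letters in product(alphabet, repeat=m):
            chars = list(kmer)
            for pos, letter in zip(positions, letters):
                chars[pos] = letter
            neighbors.add("".join(chars))
    return neighbors


def mismatch_features(seq, k, m, alphabet=AMINO_ACIDS):
    """Each k-mer occurrence contributes to every string in its neighborhood."""
    features = Counter()
    for i in range(len(seq) - k + 1):
        for neighbor in mismatch_neighborhood(seq[i:i + k], m, alphabet):
            features[neighbor] += 1
    return features


def kernel(fx, fy):
    """Inner product of two sparse k-mer count vectors."""
    return sum(count * fy[kmer] for kmer, count in fx.items())


x, y = "ACDE", "ACDF"
exact = kernel(spectrum_features(x, 3), spectrum_features(y, 3))
inexact = kernel(mismatch_features(x, 3, 1), mismatch_features(y, 3, 1))
```

With exact matching, only the shared k-mer ACD contributes to the kernel value. Allowing one mismatch also lets CDE and CDF count as approximate matches, so the two sequences score as more similar. Either kernel value can be supplied to an SVM as a precomputed similarity.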