We strongly believe in the importance of reproducible computational results and in open source software. We make all our software tools freely available for research, education, and non-profit use as soon as they are ready for public release, and we provide source code wherever possible. In a few cases, our code depends on commercial software packages or proprietary code of collaborators; in these situations, we can usually still provide our portion of the source code.



  • MEDUSA

    MEDUSA is an integrative method for learning motif models of transcription factor binding sites by incorporating promoter sequence and gene expression data. We use a modern large-margin machine learning approach, based on boosting, to enable feature selection from the high-dimensional search space of candidate binding sequences while avoiding overfitting.

  • Rankprop


    Rankprop is a publicly available web server that searches a database for proteins similar to a query protein sequence. At its core, Rankprop is a ranking algorithm that exploits the global network structure of similarity relationships among the proteins in the database by performing a diffusion operation on a protein similarity network with weighted edges.

  • SVM-Fold


    SVM-Fold is a publicly available web server that uses SVMs to predict family, superfamily, and fold-level classifications for a query protein sequence based on the Structural Classification of Proteins (SCOP).

    SVM-Fold detects subtle protein sequence similarities by learning from all available annotated proteins and by using potential hits identified by PSI-BLAST. Even when a class has no known example with a significant pairwise PSI-BLAST E-value against the query, the SVMs can still predict that class.

  • String Kernels

    String Kernels is our lab's pioneering work on "k-mer" based string kernels for support vector machine classification of protein sequences into structural categories. These novel and efficient-to-compute string kernels incorporate biologically motivated notions of inexact string matching, based on shared approximate occurrences of short subsequences ("k-mers"). More recently, we introduced profile kernels, which leverage evolutionary information in the form of sequence "profiles" estimated from multiple alignments and achieve state-of-the-art performance for remote homology detection.
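
The k-mer idea behind the string kernels described above can be illustrated with a small sketch. This is not the lab's released code: the function names `kmer_features` and `string_kernel`, the `AMINO_ACIDS` alphabet constant, and the restriction to at most one mismatch per k-mer are all assumptions made for illustration. With `allow_mismatch=False` the sketch computes a plain exact-match k-mer kernel; with `allow_mismatch=True` each k-mer also contributes to every k-mer that differs from it at exactly one position, which is one simple form of inexact matching.

```python
from collections import Counter

# Assumed alphabet: the 20 standard amino acids (illustrative choice).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_features(seq: str, k: int, allow_mismatch: bool = False) -> Counter:
    """Map a sequence to k-mer feature counts.

    Each length-k window adds 1 to its own k-mer. If allow_mismatch is
    True, it also adds 1 to every k-mer at Hamming distance exactly 1.
    """
    features = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        features[kmer] += 1
        if allow_mismatch:
            for pos in range(k):
                for letter in AMINO_ACIDS:
                    if letter != kmer[pos]:
                        features[kmer[:pos] + letter + kmer[pos + 1:]] += 1
    return features


def string_kernel(a: str, b: str, k: int = 3, allow_mismatch: bool = False) -> int:
    """Kernel value: inner product of the two k-mer feature vectors."""
    fa = kmer_features(a, k, allow_mismatch)
    fb = kmer_features(b, k, allow_mismatch)
    return sum(count * fb[kmer] for kmer, count in fa.items())


if __name__ == "__main__":
    # Two short, made-up peptide strings used only to exercise the functions.
    print(string_kernel("MKVL", "MKVLA", k=3))
    print(string_kernel("AAA", "AAC", k=3, allow_mismatch=True))
```

A kernel matrix built from such values can be passed to any SVM implementation that accepts precomputed kernels. Enumerating mismatch neighbors explicitly, as done here, is only practical for small k; it is meant to show the feature map, not an efficient algorithm.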