Our lab pioneered the use of "k-mer" based string kernels for support vector machine classification of protein sequences into structural categories. These novel and efficient-to-compute string kernels incorporate biologically motivated notions of inexact string matching, based on shared approximate occurrences of short subsequences ("k-mers"). More recently, we introduced profile kernels, which leverage evolutionary information in the form of sequence "profiles" estimated from multiple alignments and achieve state-of-the-art performance for remote homology detection.
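To make the k-mer idea concrete, the following is a minimal Python sketch of the spectrum and mismatch feature maps described above. It is not the code in the distribution, which uses a trie data structure for efficiency; the function names, the 20-letter amino acid alphabet constant, and the brute-force neighborhood enumeration are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations, product

# Assumption: the standard 20-letter amino acid alphabet.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def spectrum_features(seq, k):
    """Count every contiguous k-mer in seq (the k-spectrum feature map)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def mismatch_neighborhood(kmer, m, alphabet=AMINO_ACIDS):
    """Return all k-mers that differ from kmer in at most m positions."""
    neighbors = {kmer}
    # Choose m positions and try every letter there; picking the original
    # letter at a position covers the "fewer than m mismatches" cases.
    for positions in combinations(range(len(kmer)), m):
        for letters in product(alphabet, repeat=m):
            candidate = list(kmer)
            for pos, letter in zip(positions, letters):
                candidate[pos] = letter
            neighbors.add("".join(candidate))
    return neighbors


def mismatch_features(seq, k, m):
    """Each k-mer in seq contributes a count to every k-mer within m mismatches."""
    features = Counter()
    for i in range(len(seq) - k + 1):
        for neighbor in mismatch_neighborhood(seq[i:i + k], m):
            features[neighbor] += 1
    return features


def kernel(fx, fy):
    """Inner product of two sparse k-mer count vectors."""
    return sum(count * fy[kmer] for kmer, count in fx.items())


if __name__ == "__main__":
    x, y = "ACAC", "CACA"
    print(kernel(spectrum_features(x, 2), spectrum_features(y, 2)))
    print(kernel(mismatch_features(x, 2, 1), mismatch_features(y, 2, 1)))
```

This brute-force version enumerates the full mismatch neighborhood of each k-mer, so it is only practical for small k and m. It is meant to show what the kernels compute, not how to compute them quickly.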

Related Papers

  1. Leslie C, et al. Spectrum kernel: A string kernel for SVM protein classification. Procs. of the Pacific Symposium on Biocomputing, January 2-7, 2002. [PubMed]
  2. Leslie C, et al. Mismatch String Kernels for Discriminative Protein Classification. Bioinformatics, 2004. [PubMed]
  3. Kuang R, et al. Profile-based string kernels for remote homology detection and motif extraction. Procs. of the IEEE Computational Systems Bioinformatics Conference, August 2004. [PubMed]
  4. Leslie C, et al. Fast Kernels for Inexact String Matching. Procs. of the Conference on Learning Theory and Kernel Workshop, 2003. [ACM Portal]

This page provides information on downloading the string kernel code for the spectrum/mismatch kernel [1,2] and the profile kernel [3]. Code for other variants [4] of the string kernels will be available at a later date. The code for the spectrum/mismatch kernel and the profile kernel is packaged together with sample data files and the motif extraction software (which applies specifically to the profile kernel). The PSIBLAST profiles for 7329 sequences (computed using 5 iterations) are included, as well as the 54 experimental setups for the profile kernel experiments. You can also design your own experiments and create your own set of profiles. Also included are license files and a number of README files that explain how to use the software.

Note: A version of SPIDER is included in the distribution. SVM training and testing are done with SPIDER, which requires MATLAB. For more information about SPIDER, please see http://www.kyb.tuebingen.mpg.de/bs/people/spider/.

Release Notes

  • Version 1.2 - September 26, 2004. Fixed a bug in the trie data structure traversal in the profile kernel code. The package now also uses SPIDER for SVM training and testing, which requires MATLAB.
  • Version 1.1 - July 30, 2004. Fixed a bug in run_scripts/normalize_matrix.pl.
  • Version 1.0 - March 30, 2004. Original release.

Download

Please note that by downloading the code below, you are implicitly agreeing to use it for academic purposes only. If you would like to use it for other purposes, please contact .

Download the string kernels software