Supervised learning of oncogenic pathway signatures
I initially helped create this tutorial (along with my adviser, Christina Leslie, and my lab-mate Xuejing Li) for a lab session presented at the Integrative Statistical Analysis of Genome Scale Data (2008) workshop at Cold Spring Harbor Labs. The goal of the lab was to use SVMs to build different classifiers that extract oncogenic pathway signatures from microarray expression data. The signatures are then used to classify different types of cancer in several mouse models.
All code is written in R and uses several Bioconductor packages.
I have since added more details that were not originally relevant for the lab in order to fulfill the requirements for a class project in Jason Banfelder's Quantitative Understanding in Biology course.
I'm posting it here in hopes that people may find it useful and educational. Topics covered include:
- The need for data normalization when using microarray data
- A short exposition on the theory and application of support vector machines (SVMs) in a biological setting
- Comparison of SVMs to simpler analysis methods
- Brief discussion of SVM recursive feature elimination (SVM-RFE)
- Use of principal components analysis (PCA) to visualize high dimensional data
Although it is mentioned in the documents, I should say here that I do not intend for this to be a rigorous presentation of any of the aforementioned topics. Rather, I intend for the material to give the reader an intuitive sense of what these techniques do and how they might be useful.
Please contact me with any questions, comments, or suggestions you might have.
The Goods
- The manuscript that covers the topics and analysis described here. This is also accessible using the openVignette() command after loading the svmlab library that you can download below.
- The R package (~86 MB) that provides the code and datasets necessary to follow along. Once the package is downloaded, you can install it on a "unixy" system via the command line by invoking R CMD INSTALL svmlab_1.2.tgz from the terminal. I'm not sure what the process is on Windows, but I think you can install local packages from some menu selection. It may be easier, however, to replace your Windows OS with Ubuntu and then install the package "the real way".
- The scripts. In various places in the manuscript I reference the names of scripts you can run (source) to generate the analysis/figures shown in the text. These scripts are supposed to be in a qbio directory that is installed in the svmlab directory in your R library path after you install the svmlab package. You can also download these scripts from here. You can just setwd into the directory after decompressing the download. source-ing them should give you the same analysis as the sections indicated below:
  - Section 3.1 and 3.2: svm.nocheating.R
  - Section 4: signature.pca.R
  - Section 5: classify.mouse.R
- The slides I used for my qbio presentation.
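
For those who would rather stay inside an R session, the install-and-run steps above can be sketched roughly as follows. This is a minimal, untested sketch and not the procedure from the manuscript: it assumes svmlab_1.2.tgz has been downloaded into the current working directory, it swaps the terminal command for install.packages() with repos = NULL (whether that works for this particular archive on your platform is an assumption), and the names pkg_file, scripts, and qbio_dir are made up for illustration.

```r
# Assumption: the downloaded archive sits in the current working directory.
pkg_file <- "svmlab_1.2.tgz"

# Install from the local file; repos = NULL tells R not to query a repository.
# Skipped silently when the archive is not present.
if (file.exists(pkg_file)) {
  install.packages(pkg_file, repos = NULL)
}

# Manuscript sections and the script names listed above.
scripts <- c(
  "Sections 3.1 and 3.2" = "svm.nocheating.R",
  "Section 4"            = "signature.pca.R",
  "Section 5"            = "classify.mouse.R"
)

# system.file() returns "" when the package (or its qbio directory) is missing.
qbio_dir <- system.file("qbio", package = "svmlab")

if (nzchar(qbio_dir)) {
  # Move into the scripts directory, run one script, then restore the old directory.
  old_wd <- setwd(qbio_dir)
  source(scripts[["Section 4"]])
  setwd(old_wd)
}
```

If qbio_dir comes back empty, the scripts were not installed where expected, and downloading them separately and using setwd on the decompressed directory is the fallback.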
Miscellany
I think I've stressed the point enough in the manuscript, but just in case: the mathematical presentation of some of the concepts here (like the SVM) is not exactly correct and is only presented in a manner that I think provides intuition for pedagogical purposes.
I have references in the manuscript to resources one should read to get a complete and correct overview of the concepts presented.
Lastly, I've left out a section in the preliminaries that I had initially intended to write providing an overview of principal components analysis. If you are after a light and intuitive explanation of this technique, there is a great tutorial that you can find here (PDF).
Last updated on: July 1, 2008
