
Motif Element Detection Using Sequence Agglomeration
MEDUSA is an integrative method for learning motif models of transcription factor binding sites by incorporating promoter sequence and gene expression data.
Details
MEDUSA is a machine learning algorithm that integrates mRNA expression, promoter sequence, and ChIP-chip occupancy data to learn gene regulatory programs that predict the differential expression of target genes. MEDUSA does not rely on clustering or correlation of expression profiles to infer regulatory relationships. Instead, the algorithm learns to predict up/down expression of target genes by identifying condition-specific regulators and by discovering, de novo from the promoter sequences, DNA motifs that may mediate the regulation of those targets. We use boosting, a technique from machine learning, to avoid overfitting as the algorithm searches through a high-dimensional feature space of potential regulators and sequence motifs.
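To make the boosting idea concrete, here is a minimal sketch, not MEDUSA's actual learner. It runs plain discrete AdaBoost over hand-made binary features, where each feature fires when a given motif is present in a gene's promoter and a given regulator is active in the experiment. The function names (`pair_features`, `adaboost`, `predict`), the toy data, and the use of simple stumps are all assumptions made for illustration. MEDUSA itself learns an alternating decision tree and discovers its motifs de novo.

```python
import itertools
import math


def pair_features(motif_hits, regulator_states):
    """One binary feature per (motif, regulator) pair: motif present AND regulator active."""
    return [int(m and r) for m in motif_hits for r in regulator_states]


def stump(x, j, polarity):
    """Weak rule: vote `polarity` when feature j fires, `-polarity` otherwise."""
    return polarity if x[j] else -polarity


def adaboost(features, labels, rounds):
    """Discrete AdaBoost; returns a list of (alpha, feature_index, polarity)."""
    n = len(labels)
    weights = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        best = None
        for j in range(len(features[0])):
            for polarity in (1, -1):
                # Weighted error of this stump under the current example weights.
                err = sum(w for w, x, y in zip(weights, features, labels)
                          if stump(x, j, polarity) != y)
                if best is None or err < best[0]:
                    best = (err, j, polarity)
        err, j, polarity = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the log finite
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, j, polarity))
        # Up-weight the examples this stump got wrong, then renormalize.
        weights = [w * math.exp(-alpha * y * stump(x, j, polarity))
                   for w, x, y in zip(weights, features, labels)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model


def predict(model, x):
    """Sign of the weighted vote: +1 (up) or -1 (down)."""
    total = sum(alpha * stump(x, j, polarity) for alpha, j, polarity in model)
    return 1 if total >= 0 else -1


# Toy data (hypothetical): 2 motifs x 2 regulators, every presence/activity combination.
# A (gene, experiment) example is "up" only when motif 0 is present and regulator 1 is active.
combos = list(itertools.product([0, 1], repeat=4))
features = [pair_features((m0, m1), (r0, r1)) for m0, m1, r0, r1 in combos]
labels = [1 if (m0 and r1) else -1 for m0, m1, r0, r1 in combos]

model = adaboost(features, labels, rounds=3)
predictions = [predict(model, x) for x in features]
```

On this toy problem the first round already selects the feature for (motif 0, regulator 1), which is the pair that generates the labels. Real data is far noisier and the feature space far larger, which is where the weighting scheme of boosting matters.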
We used MEDUSA to uncover detailed information about the heme and oxygen regulatory network in yeast, using genome-wide expression changes measured in perturbation experiments. We used a novel margin-based score to extract significant condition-specific regulators and assemble a global map of the oxygen sensing and regulatory network. This network includes known oxygen and heme regulators as well as many new candidate regulators. MEDUSA also identified many DNA motifs that are consistent with previously reported, experimentally identified transcription factor binding sites. Since MEDUSA's regulatory program associates regulators with target genes through their promoter sequences, we directly tested the predicted regulators for OLE1, a gene specifically induced under hypoxia, by experimental analysis of the activity of its promoter. In each case, deletion of the candidate regulator resulted in the predicted effect on promoter activity, confirming that several novel regulators identified by MEDUSA are indeed involved in oxygen regulation.
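The exact definition of the margin-based score is not given on this page. The sketch below shows one plausible reading, offered as an assumption only. It takes the margin of an example to be its label times the ensemble's vote, and scores a regulator by how much the average margin drops when all rules that use that regulator are removed. The rule format, the names `REG_A`, `REG_B`, `M1`, `M2`, and the toy weights are hypothetical.

```python
def vote(rules, example):
    """Sum the weights of rules whose regulator is active and whose motif is present."""
    return sum(r["weight"] for r in rules
               if r["regulator"] in example["active_regulators"]
               and r["motif"] in example["motifs"])


def margin(rules, example):
    """Label times vote: positive when the ensemble agrees with the observed label."""
    return example["label"] * vote(rules, example)


def regulator_margin_score(rules, examples, regulator):
    """Average loss of margin over `examples` when the regulator's rules are dropped."""
    kept = [r for r in rules if r["regulator"] != regulator]
    drops = [margin(rules, ex) - margin(kept, ex) for ex in examples]
    return sum(drops) / len(drops)


# Hypothetical two-rule ensemble and two (gene, experiment) examples.
rules = [
    {"regulator": "REG_A", "motif": "M1", "weight": 1.5},   # votes "up"
    {"regulator": "REG_B", "motif": "M2", "weight": -0.5},  # votes "down"
]
examples = [
    {"label": 1, "active_regulators": {"REG_A", "REG_B"}, "motifs": {"M1"}},
    {"label": -1, "active_regulators": {"REG_A", "REG_B"}, "motifs": {"M2"}},
]

score_a = regulator_margin_score(rules, examples, "REG_A")
score_b = regulator_margin_score(rules, examples, "REG_B")
```

Restricting `examples` to the genes and experiments of one condition gives a condition-specific ranking under this reading: a regulator scores highly only where removing it would hurt the predictions.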
Code
MEDUSA is implemented in MATLAB and is free for academic use. Any questions on its use should be directed to .
Please note that by downloading the code below, you are implicitly agreeing to use it for academic purposes only. If you would like to use it for other purposes, please contact .
Datasets
The raw data used for the analysis in our PLoS Computational Biology hypoxia paper can be found on GEO under accession number GSE8343. Further processed data can be found below:
- Supplemental data and results
- Scatter plot of gene expression data in replicate experiments of the Aerobic (HAP1) condition. [444K]
- Discretized gene expression data for all experiments, as used in the MEDUSA learning procedure. [227K]
- List of differentially expressed genes (UP – upregulated, DOWN – downregulated) in each of the conditions discussed in the paper. [52K]
- Properties of the alternating decision tree (ADT) learned by MEDUSA. The file lists the regulators and motifs learned by MEDUSA at each iteration, along with the nodes that precede and follow each node in the ADT. [28K]
- 1000 bp upstream sequences for all target genes in FASTA format, source: SGD. [2.6M]
- MATLAB data for the hypoxia dataset [836K] containing:
  - cexp: discretized target gene expression
  - exptnames: experiment names
  - foldchange: real-valued log_2(foldchange) target gene expression
  - parents_names: names of candidate set of regulators
  - pexp: discretized regulator expression
  - targets_names: names of target genes
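As an illustration of how real-valued log_2 fold changes relate to discretized expression and to the UP/DOWN gene lists above, here is a minimal sketch. The threshold of 1.0 (a two-fold change), the +1/0/-1 encoding, and the gene and experiment names are assumptions. The variable names mirror the fields of the MATLAB data, but the values are toy numbers rather than the hypoxia dataset, and the actual encoding used in the files may differ.

```python
def discretize(log2_fc, threshold=1.0):
    """Map a log2 fold change to +1 (up), -1 (down), or 0 (no call)."""
    if log2_fc >= threshold:
        return 1
    if log2_fc <= -threshold:
        return -1
    return 0


def differential_genes(cexp, targets_names, exptnames):
    """List UP and DOWN genes per experiment from a genes x experiments matrix."""
    result = {}
    for j, expt in enumerate(exptnames):
        result[expt] = {
            "UP": [g for g, row in zip(targets_names, cexp) if row[j] == 1],
            "DOWN": [g for g, row in zip(targets_names, cexp) if row[j] == -1],
        }
    return result


# Toy stand-ins for the fields described above (hypothetical values).
targets_names = ["GENE_A", "GENE_B", "GENE_C"]
exptnames = ["cond_1", "cond_2"]
foldchange = [
    [2.3, -0.2],   # GENE_A
    [-1.7, 0.4],   # GENE_B
    [0.1, 1.0],    # GENE_C
]

# Rows are genes, columns are experiments, as in foldchange.
cexp = [[discretize(v) for v in row] for row in foldchange]
de_genes = differential_genes(cexp, targets_names, exptnames)
```

Entries that fall between the two thresholds are encoded as 0, so a gene with only small changes in a condition appears in neither list for that condition.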
Acknowledgements
This work has been partially funded by:
- The National Science Foundation (IIS-0835494)
- An NIH NCBC award to the MAGNet Center