Motif Element Detection Using Sequence Agglomeration

MEDUSA is an integrative method for learning motif models of transcription factor binding sites by incorporating promoter sequence and gene expression data.

Details

MEDUSA is a machine learning algorithm that integrates mRNA expression, promoter sequence, and ChIP-chip occupancy data to learn gene regulatory programs that predict the differential expression of target genes. MEDUSA does not rely on clustering or correlation of expression profiles to infer regulatory relationships. Instead, the algorithm learns to predict whether target genes are up- or down-regulated by identifying condition-specific regulators and discovering, de novo from the promoter sequences, DNA motifs that may mediate their regulation of targets. We use boosting, a technique from machine learning, to avoid overfitting as the algorithm searches through a high-dimensional feature space of potential regulators and sequence motifs.
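To make the boosting idea concrete, here is a minimal Python sketch. It is not the MATLAB implementation and simplifies the method considerably: it assumes each example is a (gene, condition) pair described by the set of motifs in the gene's promoter and the set of regulators active in the condition, and it uses simple abstaining rules of the form "motif present AND regulator active" as weak learners. The names `train_boosted_rules` and `score`, and the toy motifs and regulators, are hypothetical.

```python
import math
from itertools import product

def train_boosted_rules(examples, labels, motifs, regulators, rounds=4, eps=1e-3):
    """AdaBoost-style training with abstaining (motif, regulator) rules.

    examples: list of (motifs_in_promoter, active_regulators) pairs of sets
    labels:   +1 (up-regulated) or -1 (down-regulated) for each example
    Returns a list of (motif, regulator, alpha) rules.
    """
    n = len(examples)
    weights = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        best = None
        for motif, reg in product(motifs, regulators):
            w_pos = w_neg = 0.0
            for (gene_motifs, active), y, w in zip(examples, labels, weights):
                if motif in gene_motifs and reg in active:
                    if y > 0:
                        w_pos += w
                    else:
                        w_neg += w
            # Boosting loss for a rule that abstains where it does not fire.
            w_abstain = max(0.0, 1.0 - w_pos - w_neg)
            z = w_abstain + 2.0 * math.sqrt(w_pos * w_neg)
            if best is None or z < best[0]:
                best = (z, motif, reg, w_pos, w_neg)
        _, motif, reg, w_pos, w_neg = best
        # Smoothed confidence: positive votes "up", negative votes "down".
        alpha = 0.5 * math.log((w_pos + eps) / (w_neg + eps))
        model.append((motif, reg, alpha))
        # Down-weight examples the new rule gets right, up-weight its mistakes.
        updated = []
        for (gene_motifs, active), y, w in zip(examples, labels, weights):
            fires = motif in gene_motifs and reg in active
            updated.append(w * math.exp(-y * alpha) if fires else w)
        total = sum(updated)
        weights = [w / total for w in updated]
    return model

def score(model, gene_motifs, active):
    """Sum the votes of all rules that fire; the sign is the up/down call."""
    return sum(a for m, r, a in model if m in gene_motifs and r in active)

# Toy data (invented for illustration, not from the hypoxia study).
motifs = ["m1", "m2"]
regulators = ["R1", "R2"]
examples = [({"m1"}, {"R1"}), ({"m1"}, {"R2"}),
            ({"m2"}, {"R1"}), ({"m2"}, {"R2"})]
labels = [+1, -1, -1, +1]

model = train_boosted_rules(examples, labels, motifs, regulators)
predictions = [1 if score(model, g, c) > 0 else -1 for g, c in examples]
```

On this toy set each round adds one rule, and after four rounds the sign of `score` agrees with every label. The point of the sketch is only that the same gene can be predicted up or down depending on which regulators are active, without any clustering of expression profiles.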

We used MEDUSA to uncover detailed information about the heme and oxygen regulatory network in yeast, using genome-wide expression changes measured in perturbation experiments. We used a novel margin-based score to extract significant condition-specific regulators and assemble a global map of the oxygen sensing and regulatory network. This network includes known oxygen and heme regulators as well as many new candidate regulators. MEDUSA also identified many DNA motifs that are consistent with previously experimentally identified transcription factor binding sites. Since MEDUSA's regulatory program associates regulators with target genes through their promoter sequences, we directly tested the predicted regulators for OLE1, a gene specifically induced under hypoxia, by experimental analysis of the activity of its promoter. In each case, deletion of the candidate regulator resulted in the predicted effect on promoter activity, confirming that several novel regulators identified by MEDUSA are indeed involved in oxygen regulation.
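As a rough illustration of what a margin-based score can look like, the Python sketch below measures how much the total prediction margin over a chosen set of (gene, condition) examples drops when all rules involving one regulator are removed from a model. This is one plausible reading under stated assumptions, not the exact definition used in the paper: the rule format (motif, regulator, weight), the function names, and the toy values are all hypothetical.

```python
def total_margin(model, examples, labels):
    """Sum of y * F(x) over the examples, where F is the summed rule votes."""
    total = 0.0
    for (gene_motifs, active), y in zip(examples, labels):
        f = sum(a for m, r, a in model if m in gene_motifs and r in active)
        total += y * f
    return total

def regulator_margin_score(model, examples, labels, regulator):
    """Margin lost on these examples when the regulator's rules are dropped."""
    reduced = [rule for rule in model if rule[1] != regulator]
    return total_margin(model, examples, labels) - total_margin(reduced, examples, labels)

# Hand-written toy model: (motif, regulator, weight); positive weight votes "up".
toy_model = [("m1", "R1", 1.5), ("m2", "R2", -0.5)]
# Toy target set: (motifs in promoter, active regulators) with up/down labels.
toy_examples = [({"m1"}, {"R1"}), ({"m2"}, {"R2"}), ({"m1"}, {"R2"})]
toy_labels = [+1, -1, +1]

score_r1 = regulator_margin_score(toy_model, toy_examples, toy_labels, "R1")  # 1.5
score_r2 = regulator_margin_score(toy_model, toy_examples, toy_labels, "R2")  # 0.5
```

Here R1 scores higher than R2 because removing its rule costs more margin on this target set. Restricting the examples to one condition would make such a score condition-specific.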

Code

MEDUSA is implemented in MATLAB and is free for academic use. Any questions on its use should be directed to .

Please note that by downloading the code below, you are implicitly agreeing to use it for academic purposes only. If you would like to use it for other purposes, please contact .

Download MEDUSA

Datasets

The raw data used for the analysis in our PLoS Computational Biology hypoxia paper can be found on GEO under accession number GSE8343. Further processed data can be found below:

Acknowledgements

This work has been partially funded by:

  • The National Science Foundation (IIS-0835494)
  • An NIH NCBC award to the MAGNet Center