Sequence and chromatin determinants of cell-type specific transcription factor binding
Aaron Arvey1, Phaedra Agius1, William Stafford Noble2 and Christina Leslie1
1. Computational Biology Program, Memorial Sloan-Kettering Cancer Center, New York, NY
2. Department of Genome Sciences, University of Washington, Seattle, WA

Overview of study

Gene regulatory programs in distinct cell types are maintained in large part through the cell-type specific binding of transcription factors (TFs). The determinants of TF binding include its own DNA sequence preferences, DNA sequence preferences of co-factors, and the local cell-dependent chromatin context. To explore the contribution of DNA sequence preference, histone modifications, and DNase accessibility to cell-type specific binding, we analyzed over 250 ChIP-seq experiments performed by the ENCODE Consortium. This analysis included experiments for 70 transcription factors, 12 of which were profiled in both the GM12878 (lymphoblastoid) and K562 (erythroleukemic) human hematopoietic cell lines. To model DNA sequence preferences, we used support vector machines (SVMs) that use flexible k-mer patterns to model sequence preferences more accurately than traditional motif approaches. In addition, we used SVM-based chromatin signature models to capture the spatial distribution of histone modifications and DNase accessibility, obtaining significantly more accurate predictions than simpler approaches. Consistent with previous studies, we find that DNase accessibility can explain cell-line specific binding for many factors. However, in contrast to these studies, we find that some TFs display distinct cell-dependent sequence preferences that can be learned by training simultaneously on ChIP-seq data from multiple cell types. Moreover, we identify cell-specific binding sites that are accessible in both cell types but bound only in one. For these sites, cell-type specific sequence models, rather than DNase accessibility, are able to explain differential binding. Our results suggest that using a single motif for each TF and filtering for chromatin accessible loci is not always sufficient to accurately account for cell-type specific binding profiles.

Code and Data sets

  • Here is the MATLAB code for training and testing SVR models on PBM arrays or ChIP-seq data. Be sure to download the LIBSVM MATLAB interface, as this code depends on LIBSVM.
  • The training and test sequences that we extracted for GM12878, HeLa-S3, and K562 are available for download here.
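
As a rough illustration of the kind of k-mer features that sequence-based SVM models operate on, the sketch below turns a DNA sequence into a vector of k-mer counts pooled over both strands. It is a simplified, exact-match representation written in Python for readability; it is not the MATLAB code linked above and does not reproduce the flexible k-mer patterns used in the study. The function names (`reverse_complement`, `kmer_counts`, `kmer_feature_vector`) are hypothetical.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement; unknown bases become 'N'."""
    return "".join(COMPLEMENT.get(base, "N") for base in reversed(seq.upper()))


def kmer_counts(seq, k):
    """Count exact k-mers on one strand, skipping windows with non-ACGT bases."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set(ALPHABET):
            counts[kmer] += 1
    return counts


def kmer_feature_vector(seq, k):
    """Return counts for all 4**k k-mers (lexicographic order), both strands pooled."""
    counts = kmer_counts(seq, k)
    counts.update(kmer_counts(reverse_complement(seq), k))  # adds counts
    return [counts["".join(p)] for p in product(ALPHABET, repeat=k)]


# Example: a 2-mer feature vector (16 entries) for a short sequence.
features = kmer_feature_vector("ACGT", 2)
```

Vectors of this form, computed for bound and unbound sequences, could then be passed to any SVM or SVR implementation; the choice of k and the handling of mismatches are assumptions that would need tuning for real data.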

Related papers

See also our paper, "High resolution models of transcription factor-DNA affinities improve in vitro and in vivo binding predictions."