A ChIC Solution For ChIP-seq Quality Assessment - BioRxiv

INTRODUCTION

Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) is a widely adopted technique for genome-wide mapping of transcription factor (TF) binding sites and the distribution of chromatin marks [1]. Variability in data quality can be introduced by several factors, such as antibody quality [2] or various experimental protocol steps [3, 4], and may cause substantial differences in the enrichment of the signal compared to the background [5–7]. Quality control (QC) is therefore an important analysis step. Several metrics and guidelines have been proposed [3, 4, 8], but there is still no consensus on QC procedures. Among the many tools for ChIP-seq analysis that have been introduced over the years [6, 9, 10], including peak callers [11–15] and complete analysis pipelines [16, 17], only a few report quality metrics. QC is often limited to generic read-level statistics on the FASTQ file [18] or to selected metrics [10, 19, 20] that do not provide a completely generalizable solution.

Indeed, a critical challenge in quality assessment is the variety of ChIP targets, which yield enrichment profiles with distinct characteristics. In particular, different chromatin marks have characteristic distributions along the genome, most notably resulting in peaks with a sharp or broad shape in the ChIP-seq enrichment profile [1]. For this reason, experiments on some targets may not comply with general guidelines and often have to be reconsidered on a case-by-case basis, as explicitly discussed in the reference article on this topic by the ENCODE consortium [3]. Consequently, quality assessment is still a partially subjective operation, influenced by the experience of the data curator.

Moreover, a common view shared by the literature in the field [4, 7, 21] is that a single score providing a reliable summary of the quality of a ChIP-seq sample would be convenient for end users. Although a few solutions have been proposed [4, 7, 21, 22], there is not yet a consensus in the literature on an unbiased single score that reliably discriminates between good and poor-quality samples. Thus, a broad ensemble of parameters must be considered [3].

To address these limitations, we propose a new ChIP-seq quality Control framework named “ChIC”. The key rationale of our framework is that the expected shape of the enrichment profile must be considered when assessing ChIP-seq data quality. For this reason, we introduce a new set of metrics, the Local enrichment profile Metrics (LM), by defining quantitative scores that describe the shape of ChIP-seq enrichment profiles (see Methods). These metrics do not replace but extend previously proposed quantitative metrics for scoring ChIP-seq data, which are also computed as part of a comprehensive set of quantitative QC metrics that includes the ENCODE-proposed metrics (EM) and metrics describing the global enrichment (GM) in the ChIP-seq experiment. This comprehensive set of metrics is used to build a machine learning classifier that yields a single score summarizing the data quality and discriminating between good and bad quality ChIP-seq samples.

To the best of our knowledge, this is the first ChIP-seq QC method that directly considers the shape of enrichment, the first that translates the shape of enrichment into quantitative metrics, and the first based on machine learning. We also extensively benchmark our method against previously proposed solutions, achieving consistently high performance. Finally, ChIC is implemented as a user-friendly R package, compliant with Bioconductor's strict standards of stability and documentation. The latest stable version is available at https://bioconductor.org/packages/devel/bioc/html/ChIC.html.
