Review of “Statistical Challenges associated with detecting copy number variations with next-generation sequencing”

With the deluge of data streaming from Next Generation Sequencing (NGS) technologies, scientists are scrambling to find the best methods for Copy Number Variation (CNV) analysis. Of the two main types of NGS data used for CNV analysis, whole-genome and exome sequencing, the paper “Statistical Challenges associated with detecting copy number variations with next-generation sequencing” (Bioinformatics 2012;28(21):2711-8) reviews CNV analysis of germ-line whole-genome sequencing data (not tumors) and focuses principally on depth of coverage (DOC) methods.

Next Generation Sequencing reads vary in length from 40 to 500 bp. Read files (e.g., in FASTQ format) are mapped to a reference genome with alignment programs such as MAQ or mrsFAST to create files (e.g., in BAM format) that contain the aligned locations and quality scores. The paper reviews various approaches to detecting CNVs from these aligned reads.
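To make the input format concrete, here is a minimal sketch (Python with pysam; the file name "sample.bam" and the region queried are illustrative, not from the paper) of pulling aligned locations and mapping qualities out of a coordinate-sorted, indexed BAM:

```python
import pysam

# Open an indexed BAM and inspect a 1 kb region of chr1.
with pysam.AlignmentFile("sample.bam", "rb") as bam:
    for read in bam.fetch("chr1", 1_000_000, 1_001_000):
        if read.is_unmapped:
            continue
        # reference_start is the 0-based aligned location;
        # mapping_quality is the aligner's Phred-scaled confidence.
        print(read.query_name, read.reference_name,
              read.reference_start, read.mapping_quality)
```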

There are four main methods for CNV detection from NGS data, along with others that combine these principal approaches: assembly-based (AS), paired-end mapping (PEM), split-read (SR) and depth of coverage (DOC). Except for AS, these methods require aligned files for processing. Unfortunately, with the current state of the art, the CNVs detected by these methods have only a 30-60% overlap (in size and location) with one another, and even less overlap with CNVs detected by array studies (one common overlap criterion is sketched below). The 1000 Genomes pilot project detected smaller CNVs from NGS data than those reported by array studies.
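A common way to quantify agreement between two call sets is a 50% reciprocal-overlap criterion. The sketch below is a hypothetical illustration; the call lists and the threshold are made up, not taken from the paper:

```python
def reciprocal_overlap(a, b, min_frac=0.5):
    """a and b are (chrom, start, end) calls; True if the shared
    interval covers at least min_frac of each call."""
    if a[0] != b[0]:
        return False
    shared = min(a[2], b[2]) - max(a[1], b[1])
    if shared <= 0:
        return False
    return (shared / (a[2] - a[1]) >= min_frac and
            shared / (b[2] - b[1]) >= min_frac)

doc_calls = [("chr1", 100_000, 150_000), ("chr2", 500_000, 520_000)]
pem_calls = [("chr1", 110_000, 160_000), ("chr3", 700_000, 705_000)]
shared = [a for a in doc_calls
          if any(reciprocal_overlap(a, b) for b in pem_calls)]
print(f"{len(shared)}/{len(doc_calls)} DOC calls confirmed by PEM")
```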

AS methods are not suitable for the human genome because short read lengths make de novo assembly difficult in repeat regions, though combined assembly methods can resolve repeats by comparison to a reference genome.

SR methods (e.g., Pindel) can pinpoint breakpoints accurately but have to be combined with other means of estimating copy number.

PEM methods (e.g., PEMer, BreakDancer, VariationHunter) can detect inversions and translocations but are limited by the library insert (fragment) size when detecting insertions, as the toy example below illustrates.
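A toy illustration of the PEM signal, with made-up library parameters: a pair spanning a deletion maps farther apart than the insert size predicts, while an insertion can only shrink the apparent span, so insertions larger than the insert size are invisible:

```python
MEAN_INSERT, SD = 400, 40  # illustrative library parameters (bp)

def classify_pair(observed_span):
    """Flag read pairs whose mapped span deviates > 3 SD from the
    expected insert size."""
    if observed_span > MEAN_INSERT + 3 * SD:
        return "possible deletion between the mates"
    if observed_span < MEAN_INSERT - 3 * SD:
        return "possible insertion (only detectable if < insert size)"
    return "concordant"

print(classify_pair(900))  # -> possible deletion between the mates
```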

Depth of coverage (DOC) methods detect larger CNVs. They are based on the premise that the number of reads in a fixed window along the genome is a measure of gains or losses, or can serve as a proxy for a CGH probe intensity at the midpoint of the window. This premise appears too simplistic, owing to issues with both the complexity of the genome and the limitations of current molecular technology. Read counts are biased by GC content (through PCR), read quality, alignment of repeats (multi-reads) and library preparation. In addition, some regions report highly variable read counts across different normal samples, and others show invariant counts (too low or too high) inconsistent with the expected ploidy; neither can be used for CNV analysis. Harismendy et al. (2009) found that unique sequences at equimolar quantities show DOC varying by two orders of magnitude.
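A minimal sketch of that premise (Python with pysam; the file name and the 10 kb window size are illustrative choices): count read starts in fixed, non-overlapping windows so that each count can stand in for a probe intensity at the window midpoint:

```python
import pysam

WINDOW = 10_000  # bp; the choice trades resolution against noise

def window_counts(bam_path, chrom):
    """Count read starts per fixed window along one chromosome."""
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        length = bam.get_reference_length(chrom)
        counts = [0] * (length // WINDOW + 1)
        for read in bam.fetch(chrom):
            if read.is_unmapped or read.is_duplicate:
                continue
            counts[read.reference_start // WINDOW] += 1
    return counts

counts = window_counts("sample.bam", "chr20")
```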

The authors evaluate sensitivity in detecting high-confidence gains and losses (high gain/big loss) from the list of CNVs reported by Conrad et al. (2009) for the HapMap sample NA12891, varying the GC-correction algorithms and read-quality filtering discussed below. The paper focuses largely on sensitivity; false-positive rates are not reported.

GC content appears to affect the number of reads in a complex way, a bias likely introduced at the PCR amplification step. In general, regions with extreme (low or high) GC content have low DOC; within AT-rich regions DOC increases with increasing GC%, while within GC-rich regions DOC decreases with increasing GC%. Yet the authors conclude that GC correction and read-quality filtering (Phred quality > 10 or 30) do not seem crucial to sensitivity, perhaps because they examine only high-confidence CNVs. They show that GC correction reduces the variance in DOC, though not by much, and speculate that this might reduce false positives.
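One widely used correction, sketched here assuming per-window count and GC-fraction arrays (median scaling per GC bin, in the spirit of Yoon et al. 2009), rescales each GC stratum toward the genome-wide median:

```python
import numpy as np

def gc_correct(counts, gc, n_bins=100):
    """Scale each window's count by (global median / median of
    windows with similar GC fraction)."""
    counts = np.asarray(counts, dtype=float)
    bins = np.minimum((np.asarray(gc) * n_bins).astype(int), n_bins - 1)
    overall = np.median(counts)
    corrected = counts.copy()
    for b in np.unique(bins):
        mask = bins == b
        med = np.median(counts[mask])
        if med > 0:
            corrected[mask] = counts[mask] * overall / med
    return corrected
```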

Reads that align to multiple regions (multi-reads) pose a dilemma: mapping them to all of the regions, to one region at random or to none at all are all inadequate strategies for CNV detection, especially in regions of segmental duplication, which are enriched for CNVs. Deep sequencing (> 20X) might help resolve this issue when combined with other methods; a common practical compromise is sketched below.
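One practical compromise, sketched here with an illustrative MAPQ threshold and file names, is to drop ambiguous alignments before computing DOC; aligners typically assign a mapping quality of 0 to reads placed at random among equally good locations:

```python
import pysam

MIN_MAPQ = 20  # illustrative threshold; excludes most multi-reads

with pysam.AlignmentFile("sample.bam", "rb") as bam, \
     pysam.AlignmentFile("unique.bam", "wb", template=bam) as out:
    for read in bam.fetch():
        if not read.is_unmapped and read.mapping_quality >= MIN_MAPQ:
            out.write(read)
```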

Another issue is that all four methods detect deletions better than amplifications. DOC methods detected 90% of deletions but only 25% of amplifications, because incremental gains (going from 5 to 6 copies yields only 20% more reads) are difficult to distinguish from the noise of the large DOC variance, as the arithmetic below makes explicit.
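As a back-of-envelope check (pure Python; the numbers follow from the diploid model rather than from the paper's data):

```python
# Expected DOC relative to a normal diploid region for various copy numbers.
for copies in (1, 2, 3, 5, 6):
    print(f"{copies} copies -> expected DOC ratio {copies / 2:.2f}")
# A heterozygous deletion (2 -> 1 copy) is a 50% drop, but 5 -> 6 copies
# changes the expected DOC only from 2.50 to 3.00, i.e. a 20% gain that
# is easily swamped by the variance in DOC.
```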
