Behind the paper: DNA damage is a pervasive cause of sequencing errors, directly confounding variant identification


Hi, my name's Lixin Chen. I'm a research scientist in the DNA Enzyme Division here at NEB. I'm a member of the Evans Lab. Our recent publication in the journal Science is titled "DNA damage is a pervasive cause of sequencing errors, directly confounding variant identification". The work was performed as a collaboration with Laurence Ettwiller, a staff scientist, and Pingfang Liu, a development scientist here at NEB.


Laurence Ettwiller:

This study has major implications for high-throughput sequencing technologies and the detection of somatic variants. Somatic variants are sequence changes that are found in only a few cells in our body. If somatic variants affect key genes responsible for the control of cell proliferation, they can result in cancer. As they grow, cancer cells acquire additional somatic variants, leading to populations or sub-populations of cancer cells and to tumor heterogeneity. These sub-populations of cells can be resistant to an otherwise effective therapy and are therefore responsible for relapse of the cancer. Thus, it is important to accurately identify all somatic mutations, even those specific to only a small sub-population of cells.


Pingfang Liu:

The development of high-throughput sequencing technology has revolutionized our understanding of cancer biology and the treatment of cancer. Sequencing of different tumor samples has identified many previously unknown somatic variants, many of which have been successfully applied in targeted cancer therapy. In addition, the ability to perform large-scale sequencing of different tumor samples has enhanced our view of tumor heterogeneity. We now have a much better understanding of clonal selection and expansion, drug resistance, and individual variation in response to targeted therapy. The continued success in applying next-generation sequencing to cancer therapy is largely dependent on the accurate identification of somatic variants with high sensitivity and high specificity.


Tom Evans:

Sequencing accuracy directly impacts the identification of somatic variants. Not surprisingly, sequencing chemistry is not perfect and every so often, errors occur. These errors were usually ascribed to amplification mistakes or sequencing chemistry. With modern high-accuracy polymerases and sequencing chemistry, other sources of error can become problematic. What makes this study so important is the discovery that DNA damage can be a prevalent source of sequencing error and that repairing this damage greatly improves the detection of true somatic variants.


Not all damage leads to sequencing errors. Some lead to a loss of signal while others are more insidious, causing a polymerase to read the damaged base differently than the undamaged base. Guanine bases are known to react with oxygen forming 8-oxoguanine, which is read by many polymerases as a T instead of a G. This is insidious because unlike strand breaks where no information is generated, you instead get incorrect data, which can look like a rare somatic variant.


Lixin Chen:

In this work, we identify DNA damage as a prevalent source of sequencing errors by using two independent strategies. The first strategy uses a DNA repair mix, such as PreCR or the FFPE DNA Repair Mix, to repair the damage present in the DNA sample. By sequencing the same sample with and without the DNA repair step, we can quantify the level of damage present in a DNA sample.
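To make the with-and-without-repair comparison concrete, here is a minimal sketch in Python. It is not the analysis from the publication: the function name, its inputs, and the idea of summarizing the comparison as a single fraction are all assumptions made for illustration. It simply asks what share of the errors seen in the untreated library disappears after repair.

```python
def damage_attributable_fraction(errors_untreated, bases_untreated,
                                 errors_repaired, bases_repaired):
    """Hypothetical helper: share of the untreated error rate removed by repair.

    Inputs are counts of mismatching bases and total sequenced bases for the
    same sample, sequenced without and with a DNA repair step.
    """
    if bases_untreated <= 0 or bases_repaired <= 0:
        raise ValueError("base counts must be positive")
    rate_untreated = errors_untreated / bases_untreated
    rate_repaired = errors_repaired / bases_repaired
    if rate_untreated == 0:
        return 0.0  # no errors observed, so nothing to attribute to damage
    # Clamp at zero: repair should not increase errors, but noise can.
    return max(0.0, (rate_untreated - rate_repaired) / rate_untreated)


# Made-up counts for illustration only, not data from the study.
fraction = damage_attributable_fraction(200, 1_000_000, 50, 1_000_000)
print(f"fraction of errors removed by repair: {fraction:.2f}")
```

With these invented counts, about three quarters of the errors in the untreated library would be attributed to repairable damage.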


The second strategy uses the fact that a given type of damage causes the polymerase to respond consistently in the same way. For example, as Tom mentioned, 8-oxoguanine is misread by the polymerase as a thymine instead of a guanine. Sequencing of 8-oxoguanine-containing DNA will result in an imbalance between the rate of G to T variants and the rate of C to A variants, which are the reverse complement of G to T. The difference between these two rates corresponds to damage. Based on this principle, we computed a simple metric called the GIV score to measure damage in the DNA and validated this metric using the experimental setup described by Lixin.
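The imbalance idea can be sketched in a few lines of Python. This is an illustration of the principle only and should not be taken as the published GIV formula. It assumes the observations are (reference base, read base) pairs reported in the orientation of the sequenced strand, and the function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def substitution_rate(pairs, ref, alt):
    """Fraction of positions with reference base `ref` that were read as `alt`."""
    ref_total = sum(1 for r, _ in pairs if r == ref)
    if ref_total == 0:
        raise ValueError(f"no positions with reference base {ref}")
    changed = sum(1 for r, b in pairs if r == ref and b == alt)
    return changed / ref_total


def imbalance_ratio(pairs, ref="G", alt="T"):
    """Ratio of the ref>alt rate to the rate of its reverse complement.

    For G>T the reverse complement is C>A. A ratio near 1 means the two
    rates are balanced; a ratio well above 1 suggests damage-driven errors.
    """
    forward = substitution_rate(pairs, ref, alt)
    reverse = substitution_rate(pairs, COMPLEMENT[ref], COMPLEMENT[alt])
    if reverse == 0:
        return float("inf") if forward > 0 else 1.0
    return forward / reverse


# Made-up observations: 5% G>T but only 1% C>A.
observations = (
    [("G", "G")] * 95 + [("G", "T")] * 5 +
    [("C", "C")] * 99 + [("C", "A")] * 1
)
print(imbalance_ratio(observations))
```

A true variant present in the double-stranded DNA should show up as G to T on one strand and C to A on the other at similar rates, so it leaves the ratio close to 1. Damage on a single strand shifts only one of the two rates, which is what the ratio picks up.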


Laurence Ettwiller:

The next question we wanted to address is how prevalent this damage is in sequencing facilities that handle real cancer samples. We therefore requested access to data from The Cancer Genome Atlas initiative and found that more than 80% of the data sets have a significant G to T imbalance indicative of damage. It is likely that the majority of sequencing currently done carries a large degree of undetected damage. Consequently, the rare and possibly interesting variants are confounded by incorrect data derived from this damage.


Lixin Chen:

For more details about this work, you can read our publication in the February 17, 2017 issue of Science. This work is also available in open access on bioRxiv.
