Whole genome assembly from next generation sequencing data using restriction and nicking enzymes in optical mapping and proximity-based ligation strategies

High-throughput sequencing methods have revolutionized genomic analysis by producing millions of sequence reads from an organism’s DNA at an ever-decreasing cost. However, a number of obstacles challenge our ability to generate contiguous chromosome-sized assemblies from the typically short sequence reads obtained. These include large regions of repetitive DNA, paralogous gene families and interspersed retrotransposable elements, which together can comprise up to 70% of an organism’s genome (1). A number of long-read technologies, such as PacBio RS II sequencing, successfully traverse many of these repetitive elements, but they are associated with higher costs, and even these improved sequencing methods often fall short of complete chromosome assembly.

Alternative innovative strategies are overcoming the challenge of generating long contiguous genomic assemblies from short sequence reads. Two broad methodologies are: i) proximity ligation-based sequencing and ii) optical mapping. These techniques are highly dependent on the use of restriction enzymes or nicking endonucleases to either cleave or label DNA at specific sites, with an optimal frequency for downstream analysis.

Proximity-based ligation

Proximity-based ligation coupled with massively parallel sequencing is exemplified by the Hi-C method (2), which probes the three-dimensional architecture of whole genomes by identifying higher order chromatin interactions. In the Hi-C method, cells are treated with the crosslinking reagent formaldehyde; DNA is then digested with a restriction enzyme that leaves a 5′-overhang. The overhang is filled in using a dNTP mix that includes a biotinylated nucleotide triphosphate. The resulting blunt-end fragments are ligated under dilute conditions, which favor ligation events between the crosslinked DNA fragments. The resulting DNA sample contains ligation products consisting of fragments that were originally in close spatial proximity in the nucleus, marked with biotin at the fragment junctions. A Hi-C library is created by shearing the DNA and selecting the biotin-containing fragments with streptavidin beads. The number of read pairs between intrachromosomal regions is a decreasing function of the distance between them, so read-pair frequency carries information about genomic distance that can be used when ordering sequence contigs.
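The fill-in and blunt-end ligation steps described above leave a characteristic sequence at each junction, which can be worked out from the recognition site alone. The Python sketch below is illustrative only: the function names are hypothetical, it assumes a palindromic recognition site cut to leave a 5′-overhang, and it ignores the biotin label. HindIII (A^AGCTT) and MboI (^GATC) are used purely as familiar examples.

```python
# Minimal sketch: sequence expected at a junction formed by
# digestion, overhang fill-in, and blunt-end ligation.
# Assumes a palindromic site and a cut that leaves a 5' overhang.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def ligation_junction(site, cut_offset):
    """Junction sequence after fill-in and blunt ligation.

    site       -- recognition sequence, e.g. "AAGCTT"
    cut_offset -- number of bases 5' of the top-strand cut, e.g. 1 for A^AGCTT
    """
    if reverse_complement(site) != site:
        raise ValueError("site must be palindromic")
    if cut_offset * 2 >= len(site):
        raise ValueError("cut must leave a 5' overhang")
    # After fill-in, the upstream end carries the site up to the
    # bottom-strand cut, and the downstream end carries the site
    # from the top-strand cut onward.
    filled_left = site[:len(site) - cut_offset]
    filled_right = site[cut_offset:]
    return filled_left + filled_right


print(ligation_junction("AAGCTT", 1))  # AAGCTAGCTT
print(ligation_junction("GATC", 0))    # GATCGATC
```

Searching reads for such a junction sequence is one way that valid ligation products can be recognized during analysis, although real pipelines apply additional filters.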

Inherent limitations in chromatin capture methods, such as Hi-C, are the requirement for living cells and the enrichment of undesired interchromosomal associations, such as those attributable to in vivo telomere clusters. The related “Chicago” method, developed by Dovetail Genomics, overcomes these limitations by capturing linkages from high molecular weight DNA mixed with reconstituted chromatin in vitro (3). In both the Hi-C and Chicago proximity ligation-based approaches, the choice of restriction endonuclease is critical for the linking information (distance in kb) provided by the read pairs. Frequently, parallel libraries are prepared using distinct restriction enzymes to provide linking information differing by several hundred kilobases (3). With approximately 300 restriction endonucleases available, it is easy to select enzymes that provide the desired fragment sizes.
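As a rough illustration of how enzyme choice translates into fragment size, the following Python sketch performs an in silico digest of a sequence and reports the resulting fragment lengths. The function names and the toy sequence are assumptions made for illustration. The expected spacing of 4^n bp for an n-bp site assumes random sequence with equal base frequencies, which real genomes only approximate.

```python
import re

# Minimal sketch of an in silico restriction digest.
# cut_offset is the number of bases 5' of the top-strand cut,
# e.g. 0 for MboI (^GATC) and 1 for HindIII (A^AGCTT).


def cut_positions(sequence, site, cut_offset):
    """Return 0-based cut coordinates for every occurrence of site."""
    # A lookahead is used so that overlapping occurrences are also found.
    pattern = re.compile("(?=" + re.escape(site) + ")")
    return [m.start() + cut_offset for m in pattern.finditer(sequence.upper())]


def fragment_lengths(sequence, site, cut_offset):
    """Return fragment lengths (bp) after complete digestion of a linear molecule."""
    bounds = [0] + cut_positions(sequence, site, cut_offset) + [len(sequence)]
    return [b - a for a, b in zip(bounds, bounds[1:]) if b > a]


def expected_spacing(site):
    """Mean distance (bp) between sites in random, unbiased sequence."""
    return 4 ** len(site)


toy = "AAGATCTTTTAAGCTTCCGATCGG"  # hypothetical 24-bp sequence
print(fragment_lengths(toy, "GATC", 0))    # [2, 16, 6]
print(fragment_lengths(toy, "AAGCTT", 1))  # [11, 13]
print(expected_spacing("GATC"), expected_spacing("AAGCTT"))  # 256 4096
```

Running the same functions over a draft assembly with several candidate sites gives a quick view of which enzyme is likely to yield fragments in the desired size range.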

Optical mapping

Optical mapping encompasses various techniques for fluorescence imaging of linearly extended DNA molecules to document sequence-specific patterns across large genomic regions (4). Although optical mapping techniques can also address questions concerning specific genomic loci, epigenetic modification, DNA binding protein distribution, and genomic structural variation (5), the traditional and most widespread application is in the production of ordered, high resolution genome-wide pattern maps. Large individual DNA molecules are typically immobilized on a charged surface or held in solution in an extended state using nanochannels or extension flow devices. The DNA is then digested with an appropriate restriction enzyme. Cleaved molecules retract at the cut sites to leave a gap. The DNA fragments are labeled with a fluorescent dye, then visualized by microscopy. The fluorescence intensity of each fragment correlates with the fragment size. Frequently, optical maps are produced with different restriction enzymes to build a consensus genome map. Argus, an automated optical mapping system, has been developed by OpGen Inc.

A variation of optical mapping that employs nicking endonucleases to hydrolyze just one strand of DNA has been developed by BioNano Genomics. Long DNA molecules are electrokinetically driven into nanochannels where they are held in an extended state. The DNA is then nicked, labeled at the nick site by incorporation of fluorescently labeled nucleotides, and ligated with Taq DNA ligase. This approach has been successfully applied to complex genomes including the human genome (6,7). The increasing commercial availability of nicking endonucleases will likely promote wide acceptance of this technology in the genomic community.

When paired with next generation sequencing (NGS), optical mapping offers a powerful solution to the time-consuming and costly processes of genome assembly and gap closure. Chromosome-sized optical maps provide a scaffold onto which sequence contigs can be oriented and aligned by overlaying in silico restriction digest or nick site patterns of the contigs onto the maps (8). Interfacing NGS with optical mapping facilitates de novo sequencing and assembly of large mammalian genomes in the absence of any reference genome (9). As with proximity-based ligation methods, the great diversity of restriction enzyme specificities enables optimization of the cut-site frequency. This, in turn, maximizes the alignment of sequence contigs to the optical map, and therefore the extent of genome assembly.
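The idea of overlaying a contig’s in silico digest pattern on an optical map can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the algorithm used by any particular mapping software: it slides the contig’s ordered fragment sizes along the map in both orientations and accepts a placement only if every fragment agrees within a relative tolerance. The function names, the 10% tolerance, and the fragment sizes (in kb) are all hypothetical, and only the contig’s interior fragments (those bounded by two cut sites) are assumed to be supplied.

```python
# Minimal sketch: place a contig on an optical map by comparing
# ordered fragment sizes. Real aligners also model missing or
# extra cut sites and size-dependent measurement error.


def relative_error(map_size, contig_size):
    """Relative size difference; contig_size must be positive."""
    return abs(map_size - contig_size) / contig_size


def place_contig(map_fragments, contig_fragments, tolerance=0.10):
    """Return (start_index, orientation, mean_error) or None if no placement fits."""
    n = len(contig_fragments)
    if n == 0:
        return None
    best = None
    orientations = (("+", contig_fragments), ("-", contig_fragments[::-1]))
    for orientation, frags in orientations:
        for start in range(len(map_fragments) - n + 1):
            window = map_fragments[start:start + n]
            errors = [relative_error(m, c) for m, c in zip(window, frags)]
            if max(errors) <= tolerance:
                score = sum(errors) / n
                if best is None or score < best[2]:
                    best = (start, orientation, score)
    return best


optical_map = [12.1, 30.5, 8.2, 45.0, 19.7, 22.3]  # hypothetical map, kb
contig = [19.0, 44.0, 8.5]                          # hypothetical in silico digest, kb
print(place_contig(optical_map, contig))
```

In this toy case the contig fits only in the reverse orientation, starting at the third map fragment, which is the kind of ordering and orientation information an optical map contributes to an assembly. Short contigs with few cut sites match many windows by chance, which is one reason cut-site frequency matters.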

  1. de Koning, A.P., Gu, W., Castoe, T.A., Batzer, M.A. and Pollock, D.D. (2011) Repetitive elements may comprise over two-thirds of the human genome. PLoS Genet. 7: e1002384. PMID: 22144907
  2. Lieberman-Aiden, E., van Berkum, N.L., Williams, L., Imakaev, M., Ragoczy, T., et al. (2009) Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 326: 289-293. PMID: 19815776
  3. Putnam, N.H., O'Connell, B.L., Stites, J.C., Rice, B.J., Blanchette, M., et al. (2016) Chromosome-scale shotgun assembly using an in vitro method for long-range linkage. Genome Res. 26: 342-350. PMID: 26848124
  4. Dorfman, K.D., King, S.B., Olson, D.W., Thomas, J.D. and Tree, D.R. (2013) Beyond gel electrophoresis: microfluidic separations, fluorescence burst analysis, and DNA stretching. Chem. Rev. 113: 2584-2667. PMID: 23140825
  5. Levy-Sakin, M. and Ebenstein, Y. (2013) Beyond sequencing: optical mapping of DNA in the age of nanotechnology and nanoscopy. Curr. Opin. Biotechnol. 24: 690-698. PMID: 23428595
  6. Mostovoy, Y., Levy-Sakin, M., Lam, J., Lam, E.T., Hastie, A.R., et al. (2016) A hybrid approach for de novo human genome sequence assembly and phasing. Nat. Methods 13: 587-590. PMID: 27159086
  7. Xiao, S., Li, J., Ma, F., Fang, L., Xu, S., et al. (2015) Rapid construction of genome map for large yellow croaker (Larimichthys crocea) by the whole-genome mapping in BioNano Genomics Irys system. BMC Genomics 16: 670. PMID: 26336087
  8. Nagarajan, N., Cook, C., Di Bonaventura, M., Ge, H., Richards, A., et al. (2010) Finishing genomes with limited resources: lessons from an ensemble of microbial genomes. BMC Genomics 11: 242. PMID: 20398345
  9. Dong, Y., Xie, M., Jiang, Y., Xiao, N., Du, X., et al. (2013) Sequencing and automated whole-genome optical mapping of the genome of a domestic goat (Capra hircus). Nat. Biotechnol. 31: 135-141. PMID: 23263233