March 01, 2018 (Vol. 38, No. 5)

Sequencing Goes Long

Long-Read Sequencing Technologies Are Giving Researchers Valuable New Tools for De Novo Genome Assembly

Source: Illustration by Cristina Spanò

  • (Part II of a two-part series)

    Illumina has dominated the genome sequencing market for a decade, essentially since it acquired the British company Solexa in 2006. Since then, the San Diego, CA firm has introduced a series of advances in engineering, chemistry, imaging, and software analysis to squeeze every ounce of throughput from the sequencing-by-synthesis chemistry originally developed by a pair of Cambridge University faculty, Shankar Balasubramanian and David Klenerman, in the late 1990s.

    But for all its prowess, the Illumina platform suffers from one major drawback: the individual reads are tiny, in the 150–250 base range. Such short reads can hinder efforts to align and assemble genomes from scratch (de novo), particularly complex genomes such as those of plants. Over the past five years or more, new technologies capable of delivering much longer reads, albeit at a higher price point, have been giving researchers new tools and strategies for genome analysis. The two major players in the long-read market are Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT).

    “Long reads are where it’s at,” Michael Snyder, Ph.D., director of the Center for Genomics and Personalized Medicine at Stanford University, quipped during a press briefing at the American Society of Human Genetics meeting in 2017 (ASHG 2017), prior to stopping to chat briefly with PacBio’s CSO, Jonas Korlach, Ph.D., about a wearables project for which Dr. Snyder hoped to recruit Dr. Korlach.

    Indeed, a central theme of ASHG 2017 was what long reads can do.

    Short reads are not ideal for de novo sequencing projects, because the stitching together of individual reads is naturally easier if the individual reads are longer—akin to doing a 100-piece jigsaw puzzle versus an identical 1,000-piece puzzle. Some structural variants are inevitably missed with short-read technologies; a recent study suggests “that the majority (~77%) of insertions are being missed by routine short-read calling algorithms.”1

    By contrast, long reads are the preferred option for investigating structural variations, copy number variants (CNVs), and DNA–protein interactions.2 Novel isoforms are also best captured through long reads, according to research conducted by PacBio.3

    There is, however, a caveat: Most software tools have been developed for short reads,4 so matching software to the data for subsequent analysis can be challenging.


  • Comparing the Tools

    Figure 1. A Sequel worth watching. Pictured is PacBio’s Sequel model, which is the newest iteration of the company’s technology. PacBio CSO Jonas Korlach, Ph.D., told GEN that a further doubling in throughput is “on the roadmap in 2017 and 2018,” and said that in 2019, the company will introduce a new chip that could increase throughput by a factor of eight over the next few years.

    The single-molecule nature of the PacBio platform means it has a very different data-error model from Illumina’s. “On the one hand, the raw error rate is very high (>10% compared with 0.5–1% for Thermo Fisher Scientific and 0.1–0.5% for Illumina),” says Shawn Baker, a genomics advisor and consultant at SanDiegOmics. “However, the errors are mostly random, meaning that by oversampling the data, the consensus error rate can be very, very low.”

    Meanwhile, both Thermo and Illumina platforms have more of a systematic error profile—oversampling helps, but only to a point, asserts Baker: “In practical terms, it means that PacBio is able to sequence regions of the genome that are intractable to other platforms. The major disadvantage of PacBio is its cost per Gb, currently 5–10 times higher than Illumina.”
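
    Baker’s distinction between random and systematic errors can be made concrete with a small calculation. The sketch below is illustrative only and rests on assumptions that are not taken from the article: every read miscalls a given base independently with probability p, all wrong reads agree on the same wrong base (a pessimistic simplification), and the consensus is a simple majority vote. The function names are hypothetical.

```python
from math import comb


def consensus_error(p: float, depth: int) -> float:
    """Probability that a majority vote over `depth` reads calls the wrong base.

    Simplified model (an assumption, not a vendor specification): each read is
    wrong independently with probability p, and all wrong reads agree with each
    other. Use an odd depth so that ties cannot occur.
    """
    need = depth // 2 + 1  # wrong reads required to outvote the correct ones
    return sum(
        comb(depth, k) * p**k * (1 - p) ** (depth - k)
        for k in range(need, depth + 1)
    )


def consensus_error_with_systematic(
    p_random: float, p_systematic: float, depth: int
) -> float:
    """Add a systematic component that affects every read at a site.

    A systematic error is shared by all reads, so no amount of extra depth
    can vote it away; only the random component shrinks with coverage.
    """
    return p_systematic + (1 - p_systematic) * consensus_error(p_random, depth)


if __name__ == "__main__":
    for depth in (1, 5, 15, 31):
        print(depth, consensus_error(0.15, depth))
```

    Under this toy model, a 15% per-read error rate falls away quickly as depth increases, whereas any systematic error rate sets a floor that additional coverage cannot lower.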

    10x Genomics has tools for synthetic long-read sequencing.2 It can read up to 100 Kb and works by linking short read information together for an in silico look at the bigger picture of the long read. 10x Genomics organizes genetic information based on what is known as “read clouds” to map the larger picture of the genome. This method (which is Illumina’s in-house approach) requires more coverage than a typical short-read project. This requirement of additional coverage can contribute to an upward shift in overall cost.
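
    The idea of linking short reads into “read clouds” can be sketched in a few lines. This is a toy illustration, not 10x Genomics software: it assumes each aligned short read carries a barcode identifying the long molecule it came from, and it treats reads that share a barcode and lie close together on the same chromosome as one cloud. The input format, the function name, and the 50-kb gap threshold are all hypothetical choices.

```python
from collections import defaultdict


def read_clouds(alignments, max_gap=50_000):
    """Group barcoded short reads into 'read clouds'.

    alignments: iterable of (barcode, chromosome, position) tuples
                (a hypothetical, simplified input format).
    max_gap:    largest distance in bases between neighbouring reads that are
                still assumed to come from the same long molecule.
    Returns a list of (barcode, chromosome, start, end, n_reads) tuples.
    """
    positions_by_key = defaultdict(list)
    for barcode, chrom, pos in alignments:
        positions_by_key[(barcode, chrom)].append(pos)

    clouds = []
    for (barcode, chrom), positions in sorted(positions_by_key.items()):
        positions.sort()
        start = prev = positions[0]
        count = 1
        for pos in positions[1:]:
            if pos - prev > max_gap:
                # Gap too large: close the current cloud and start a new one
                clouds.append((barcode, chrom, start, prev, count))
                start, count = pos, 0
            prev = pos
            count += 1
        clouds.append((barcode, chrom, start, prev, count))
    return clouds


if __name__ == "__main__":
    demo = [
        ("AAAC", "chr1", 100),
        ("AAAC", "chr1", 20_100),
        ("AAAC", "chr1", 900_000),
        ("GGTT", "chr1", 500),
    ]
    for cloud in read_clouds(demo):
        print(cloud)
```

    Each cloud spans far more sequence than any single short read, which is how long-range information can be inferred in silico from short-read data.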

    PacBio is improving throughput, and released its Sequel instrument (Figure 1) in 2015 to address throughput concerns. Dr. Korlach told GEN that a further doubling in throughput is “on the roadmap in 2017 and 2018,” and said that in 2019, the company will introduce a new chip that could increase throughput further by a factor of eight over the next few years.

    PacBio’s tool and ONT’s tool were directly compared in a paper in the open-access journal F1000Research. In the February 2017 study, researchers set out to characterize a transcriptome using both tools. They concluded that PacBio sequencers produced better quality data, although sequencing using ONT tools produced a higher yield.5

    Figure 2. Not all sequencing runs are created equal. The differences across brands and sequencing platforms are highlighted, with a look at reads per run, read length, run time, yield, rate, price of reagents per Gb of data, and price of run of the human genome with 30× coverage. Data were compiled by Albert Vilella Bertran, Ph.D., head of precision bioinformatics at Cambridge Epigenetix (used with permission).

    But PacBio’s and ONT’s tools should not be directly compared in this case, says Dr. Korlach, as the experiments for each were not identical: “2D ONT reads (which much of the analysis is based on) have been discontinued by ONT, so the work (described in the F1000 paper) cannot be reproduced by others in the scientific community. The PacBio work [described in the study] does not employ the Sequel system… [In] short, the PacBio performance is grossly outdated, which further biases the study.”

    “The authors also admit that there are no established protocols for ONT, and that their attempts at size selection failed,” Dr. Korlach adds. “In contrast, there are now more than 100 peer-reviewed publications featuring the Iso-Seq method with PacBio technology.”

    Baker acknowledges there aren’t too many head-on sequencing platform comparison studies in existence. This is likely because one of the major problems with comparison papers is timing. “They’re always out of date by the time they’re published,” Baker points out.

    One investigator, however—Albert Vilella Bertran, Ph.D., head of precision bioinformatics at Cambridge Epigenetix—created a live, editable chart6 comparing the various platforms in the sequencing space and the capabilities of each specific model (Figure 2).

  • DNA Sequencing in the Palm of Your Hand

    Figure 3. Sequences in real time. According to Oxford Nanopore Technologies (ONT), its nanopore-based sequencers are the only tools capable of producing very long contiguous reads in real time. Pictured here is the MinION model.

    If any technology has a shot at dislodging Illumina’s decade-long rule over the sequencing world, many commentators think it is nanopore sequencing. The best-known company in this space is ONT, although other companies, including Roche (through its acquisition of Genia Technologies) and Hitachi, are developing alternative nanopore systems.

    Unlike Illumina’s platform, which relies on optics and cameras to take snapshots of DNA, ONT’s tools rely on electric current to move DNA strands through a bacterial pore, producing a signal that allows sequence information to be inferred. Single-stranded DNA (ssDNA) is threaded through a microscopic pore, through which an ionic current is running. The current fluctuates depending on the shape and size of the base passing through the pore, revealing the true read directly from the strand, rather than “by proxy,” Gordon Sanghera, Ph.D., CEO of ONT, told GEN at ASHG 2017. In short, reads produced by ONT do not rely on the use of a template or a guide DNA strand—instead, the technology reads the native ssDNA molecule directly.

    ONT is best known for the ultra-portable handheld MinION, which is the size of a smartphone and connects via USB to a laptop. A higher-end benchtop instrument, the PromethION, contains 48 flow cells and provides much higher throughput. As Chief Technology Officer Clive Brown told GEN: “PromethION can produce more data, and more data per unit time, than Illumina’s public specifications for NovaSeq. In the case of PromethION, of course, the reads are very long, enable real-time analyses, and [allow] the direct reading of RNA and DNA modification—all from an hour or less of sample preparation.”

    ONT’s technology is capable of producing very long contiguous reads, up to four orders of magnitude longer than a short read. On November 1, 2017, James Ferguson, a genomic systems analyst at the Garvan Institute of Medical Research in Australia and cofounder at Cerebro Biosystems, announced on Twitter that his team had reclaimed a sequencing record with a 970-Kb-long read, “All while looking at tricky structural variants.”7

    A month later, Martin A. Smith, Ph.D., head of genomic technologies in the Kinghorn Centre for Clinical Genomics at the Garvan Institute in Sydney, Australia, tweeted that he and his team had sequenced4 a single DNA read more than one million bases long using ONT’s MinION nanopore sequencing device (Figure 3).

    And scientists are pushing the envelope continuously; new reports of longer and longer reads seem to show up on Twitter on a daily basis. (The current record is claimed by a group led by Matt Loose, Ph.D., at the University of Nottingham, U.K.)
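
    The “orders of magnitude” comparison between these record reads and typical short reads is simple to check. The snippet below is only a worked calculation; the read lengths are the ones quoted in this article (150–250 bases for short reads, 970 Kb and one million bases for the record nanopore reads), and the function name is hypothetical.

```python
import math


def orders_of_magnitude(long_read_bases: int, short_read_bases: int) -> float:
    """Return log10 of the length ratio between a long and a short read."""
    return math.log10(long_read_bases / short_read_bases)


if __name__ == "__main__":
    # 970-Kb nanopore read versus a 150-base short read
    print(round(orders_of_magnitude(970_000, 150), 2))
    # One-million-base read versus a 250-base short read
    print(round(orders_of_magnitude(1_000_000, 250), 2))
```

    Both ratios fall between three and four orders of magnitude, consistent with the “up to four” figure.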

    While significant structural variations can be missed by some sequencing platforms, the longer the read, the better the chances that they will be captured. Still, says Baker, “Even the ultralong-read technologies like ONT would likely have some difficulties.”

    Figure 4. Desperately seeking sequencers. Illumina’s new iSeq 100 DNA sequencer is sold for less than $20,000. Illumina says this lower price (as compared with some of Illumina’s other tools that have the capacity to generate enough data for a human genome) now puts next-generation sequencing within reach for many more laboratories.

    The capital cost of MinION is very low at just $1,000. Compare this to the next cheapest system, which until recently was approximately $50,000, and the costs just rise from there, with some sequencers priced all the way up to $1 million. In January 2018, at the J.P. Morgan Healthcare Conference, Illumina unveiled its new benchtop sequencer, the iSeq 100 (Figure 4), which occupies just one cubic foot, retails for less than $20,000, and can sequence more than 1 Gb (2 × 150-base reads) in under 20 hours.8
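
    To put a yield of roughly 1 Gb per run in context, it helps to work out how much data a human genome at 30× coverage requires. The calculation below is a back-of-the-envelope sketch: the genome size of about 3.1 Gb is an assumption supplied here rather than a figure from the article, the per-run yield is rounded down to exactly 1 Gb, and the function name is hypothetical.

```python
def runs_needed(genome_bases: int, coverage: int, yield_per_run_bases: int) -> int:
    """Number of sequencing runs needed to reach a target coverage.

    Uses integer ceiling division so a partial run counts as a whole run.
    """
    required_bases = genome_bases * coverage
    return -(-required_bases // yield_per_run_bases)


HUMAN_GENOME_BASES = 3_100_000_000  # assumption: ~3.1 Gb haploid human genome
RUN_YIELD_BASES = 1_000_000_000     # "more than 1 Gb" per run, taken as 1 Gb

if __name__ == "__main__":
    print(runs_needed(HUMAN_GENOME_BASES, 30, RUN_YIELD_BASES))  # 93
```

    Under these assumptions, a 30× human genome would take on the order of 90 such runs, which is why a low-cost benchtop instrument targets smaller projects rather than whole human genomes.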

    Despite featuring long reads, ONT’s MinION does not have the throughput to compete with Illumina for human genome sequencing just yet. But the MinION’s portability is unrivaled: it has been used for numerous wide-ranging applications, from virus tracking in Africa and South America to scientific experiments on the International Space Station. There is also a rapidly growing user community for ONT (the company held a sold-out meeting for users in Manhattan in November 2017).

    One application of the tool, described as “MinION sketching,” involves the rapid reidentification of cell lines and tissues.9 This use addresses (at least partially) the growing reproducibility crisis occurring in current scientific research. The method relies on an algorithm that compares the sample with those in reference databases, which, according to researchers, helps resolve the high error rate that is associated with nanopore sequencing.
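
    Why comparison against a reference database can tolerate a high per-read error rate is easier to see with a toy model. The sketch below is not the published MinION sketching algorithm; it is a simplified, hypothetical illustration in which each reference sample is a haploid profile of bases at known variant sites, each observed base is wrong with a fixed probability, and the evidence is accumulated as a posterior probability over the references (starting from a uniform prior).

```python
import math


def match_sample(observations, references, error_rate=0.1):
    """Return posterior probabilities that the sample matches each reference.

    observations: list of (site, observed_base) pairs from noisy reads.
    references:   dict mapping reference name -> dict of site -> expected base
                  (a hypothetical, haploid simplification).
    error_rate:   assumed per-base error probability; an erroneous call is
                  taken to be equally likely to be any of the 3 other bases.
    """
    log_likelihood = {name: 0.0 for name in references}
    for site, base in observations:
        for name, profile in references.items():
            expected = profile.get(site)
            if expected is None:
                continue  # site not typed in this reference: uninformative
            p = 1 - error_rate if base == expected else error_rate / 3
            log_likelihood[name] += math.log(p)

    # Normalise in a numerically stable way (subtract the maximum first)
    top = max(log_likelihood.values())
    weights = {n: math.exp(v - top) for n, v in log_likelihood.items()}
    total = sum(weights.values())
    return {n: w / total for n, w in weights.items()}


if __name__ == "__main__":
    refs = {
        "line_A": {"s1": "A", "s2": "C", "s3": "G", "s4": "T"},
        "line_B": {"s1": "G", "s2": "C", "s3": "A", "s4": "T"},
    }
    obs = [("s1", "A"), ("s2", "C"), ("s3", "G"),
           ("s3", "A"), ("s1", "A"), ("s4", "T")]  # one call at s3 is an error
    print(match_sample(obs, refs))
```

    Even with one miscalled base among six observations, the evidence from the remaining sites overwhelmingly favours the correct reference, because each additional informative site multiplies the odds.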

    While some reports calculate that the per-read error rate of the MinION is approximately 11% (ref. 10) or 12% (ref. 11), ONT says the accuracy of the platform is 99.98% when a method called 1D squared is used.

    “The disadvantages of this platform are that the error rate, while improving, is substantially higher than the gold standard from Illumina,” Baker tells GEN. “Also, while individual runs are inexpensive (as little as $500), the cost per Gb of data is still on the high end.”

    In addition, ONT itself recognizes that its tools often miss homopolymers, while PacBio tools (which also provide long reads) do not.

    ONT’s SmidgION, which Dr. Sanghera said will start being commercially manufactured in 2018, has been promoted as another good option for quickly identifying pathogens during fast-moving outbreaks.12 The SmidgION plugs directly into a smartphone and could open up many applications beyond those already identified. However, because SmidgION is still in final development, few if any studies of the instrument’s performance exist (even though it is based on the same technology as MinION). The company demonstrated live base-calling on a mobile phone at the Nanopore Community Meeting in December 2017.

    But according to study authors, MinION’s relatively higher error rate may actually preclude “the reliable identification of strains on the basis of their multilocus sequence type, which may limit the MinION’s usefulness in mapping outbreaks of drug-resistant pathogens.”9 The same study also reports that the accuracy of whole-genome sequencing to predict antibiotic-resistance phenotypes is comparable to phenotypic drug-susceptibility testing for some common microbes.

    Still, the prospect that field scientists can throw their sequencer in their bag and sequence just about any living thing—virtually anywhere, including in resource-restricted settings—is a major advantage. And, as Dr. Sanghera emphasizes, ONT’s tools are the only ones that produce sequences in real time.

    And new developments in the nanopore sequencing space seem to be coming in real time, too. On January 29, 2018, researchers from the University of Nottingham, led by Dr. Loose, used a MinION device to perform a de novo assembly of the human genome. The assembly was subsequently “cleaned up” using a software tool called nanopolish.13 Nanopolish improves read accuracy by accounting for the bases flanking each nucleotide and by detecting base methylation patterns.

    Mapping the sequences against Illumina data improved the accuracy of the group’s reads even further. In fact, with their data, the researchers were able to close 12 of the remaining gaps in the reference human genome. The ability to generate this large a data set on such a tiny tool—accurately—is quite a feat, according to the researchers. As Dr. Loose concluded in a GEN interview, “Being able to read epigenetic information directly could be revolutionary.”
