Feature Articles : Sep 1, 2012 (Vol. 32, No. 15)

Big Data Requires Big Solutions

  • Kate Marusina, Ph.D.

Next-generation sequencing (NGS) gave us the ability to produce maps of whole human genomes in less than a week for just a few thousand dollars. With this ability came a tsunami of data.

Because we cannot simply read the entire genome end to end, NGS generates a very large number of small reads from random locations in the genome. The reads are assembled into larger contigs, and contigs into genomes. This process generates about 100 bytes of compressed data for each base pair and 100 GB for each human genome. Worldwide, the volume of sequencing data is rapidly approaching the exabyte (10^18 bytes), with an astounding 5x year-on-year growth rate.
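To make the read-to-contig step concrete, here is a minimal sketch of greedy overlap assembly in Python. It is a toy illustration only: the function names, the minimum-overlap threshold, and the example reads are assumptions made for this sketch, and production assemblers use far more elaborate methods to cope with sequencing errors, repeats, and reads contained within other reads.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when the best overlap is shorter than `min_len`.
    """
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two sequences with the largest suffix/prefix overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # no remaining overlaps: these are the final contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)
```

Calling `greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTTA"])` merges the three overlapping reads into the single contig `ATGGCGTACCTTA`. Reads that share no sufficiently long overlap remain as separate contigs, which is one reason gaps survive in real assemblies.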

The future development of the field is in clear need of very efficient and rapidly scalable methods of dealing with data storage and analysis. And in order to use the data in a clinically meaningful way, we need methods for assessment of the quality of the resulting sequences.

CHI’s “Next-Generation Sequencing Data Analysis” conference was dedicated to evaluating progress in the pipeline of computational technologies. The technologies selected for the conference and for this article highlight diverse approaches to dealing with big data.

“Next-generation sequencing instruments can generate up to two terabytes of data per run per sequencer,” said Sanjay Joshi, CTO life sciences, Isilon storage division, EMC. “The ability of sequencing technologies to deliver data drowns out our capability to process and store the raw data and results.”

The EMC storage solution, Isilon OneFS, enables users to grow storage capacity seamlessly, in full sync and linearly with data growth. “Isilon adds storage as simply as Lego blocks,” continued Joshi. “Adding more and more identical nodes to the existing architecture provides so-called scale-out solutions for life sciences computing, the field with practically unlimited needs.”

Isilon OneFS has multiple distinctive characteristics, according to Joshi. In addition to infinite scalability, it spreads the metadata that describes each datafile intelligently across all storage nodes in the system. Therefore, each node “knows” what the other nodes are engaged in, and that eliminates any individual points of failure within the storage cluster. In essence, Isilon OneFS is a self-healing system.

Its node-based structure is ideally suited for the multiparallel computing protocols that are indispensable when assembling millions of short DNA pieces. The storage needs simultaneously supported by the same Isilon framework can be rather diverse in nature.

At Harvard Medical School (HMS), Isilon OneFS is connected to a supercomputing center to serve genomics and image-analysis needs. At the same time, it stores learning courses with multimedia applications and HMS administrative workflows.

Isilon found another application at the Laboratory of Neuroimaging at University of California, Los Angeles (LONI), where it stores what is reportedly the largest collection of neuroimaging data in the world, exceeding 430 terabytes. The LONI brain scans represent a unique collection of 2D images that can be stacked to reconstruct 3D brain images.

“Computational challenges of image processing on this scale are daunting,” continued Joshi. “But storing and retrieving the 200 GB images is what often impeded the work of researchers around the world.”

Deployment of Isilon’s storage environment enabled LONI to double the processing speed and reduce network bottlenecks. “Data security and availability are absolutely critical when dealing with potentially identifiable health information. Transfer of information over the internet would inherently be less secure than the Isilon solution within a private cloud context,” Joshi concluded.

No Need for Dedicated Hard-/Software

“Web-based platforms present an ideal infinitely scalable solution for handling big data,” countered Andreas Sundquist, Ph.D., CEO and co-founder, DNAnexus. “The data goes straight from the sequencers into the cloud over a secure protocol. Our customers can access all storage and data-visualization tools without investing in expensive hardware infrastructure.”

DNAnexus rents a segment of the Amazon cloud. Customers acquire services on demand, and the pricing mirrors the data usage. Because of this infinite elasticity, DNAnexus can support sequencing operations of virtually any size, from a single machine to a full-scale sequencing center.

“For DNA sequencing to have real application in healthcare, clinicians should be able to generate the data and receive the results even if they do not have access to a large computational center,” continued Dr. Sundquist. “The analysis will also have to be considerably simplified.”

The success of a 100% outsourcing approach is exemplified by discovery of an unstable variant of dystonin, a protein used in the cytoskeleton. This mutation causes hereditary loss of function in peripheral sensory nerves.

The mutation was identified by a small nonprofit organization, Bonei Olam, which does not have its own DNA assembly, analysis, or storage capabilities. Instead, the scientists contracted DNAnexus to provide the entire workflow from alignment of raw reads to graphical display of matches to the reference genome. In the future, DNAnexus plans to expand such clinically relevant workflows.

A collaboration with Geisinger Health System and University of California, San Francisco will enable DNAnexus to learn from clinical experts how to build innovative solutions for personalized healthcare. DNAnexus will soon be launching an instant genomic and data analysis center that enables collaboration in a unified environment with just a click of a button.

“The cloud is a powerful tool to build the absolute best security,” continued Dr. Sundquist. DNAnexus emphasizes multiple approaches to protect personal health information in the cloud environment, including physical protection of servers with round-the-clock surveillance, data encryption, and audit trails for data access.

“Just a few years ago pharmaceutical companies would not consider the cloud as even remotely possible,” continued Dr. Sundquist. “But now, enhanced data security allows for storing various sensitive data in the cloud, including employee records and financial data.”

Bedside Diagnostics

“The future of next-generation computing is not in the cloud,” argued Matthew R. Keyser, NGS applications specialist, DNASTAR. “In the near future genome sequencing will no longer be tied up in core facilities. NGS will be performed at the bedside, and analysis will be done on a laptop.”

DNASTAR provides a suite of genomic applications, including assembly and alignment algorithms that could be efficiently run on any personal computer. DNASTAR workflows are compatible with most next-gen sequencers including those from Illumina, Roche, and Ion Torrent.

“Our proprietary software algorithms are built to maximize hardware and memory usage,” continued Keyser. “DNASTAR enhanced processing power means that a 4.6 Mb E. coli genome could be assembled in less than seven minutes on any Windows or Mac desktop computer. Most of the open-source assembly and analysis software requires investment in Linux. And DNASTAR’s processing speed is achieved without the need for multiparallel computation.”

Comparison of six different assemblers performed by the Institute of Evolutionary Biology at the University of Edinburgh found that the DNASTAR SeqMan assembler generated a large proportion of novel sequences and resulted in the best alignment to the reference sequences. SeqMan capabilities are readily exploited for metagenome analysis, as exemplified by a study of the viromes of three North American bat species.

SeqMan was one of the three assemblers used in this study that identified several novel coronaviruses out of a pool of viruses. The company has just been awarded an NIH grant to further enhance a metagenomics analysis pipeline for SeqMan. “Our next challenge is automation of microbial genome assembly,” said Keyser.

“Bacterial genomes are surprisingly difficult to complete, especially for a novel organism. Open-source assemblers are not capable of resolving the repetitive areas, meaning that many gaps have to be ‘closed’ manually.

“Moreover, most of the other assemblers provide text files as the output, whereas SeqMan produces a fully editable project file, which allows the end user to edit individual sequences, edit contigs (split, merge), order contigs into scaffolds, and use specialized alignment algorithms to close gaps. I am not aware of any other software that provides as complete an interface for microbial genome assembly, gap closure, and annotation.”

Ready for Medical Grade?

“One of the big problems with big data is lack of quality standards and, therefore, lack of performance metrics such as accuracy of assemblers, accuracy of genotyping calls, detection limits of variants, etc.,” said Justin Johnson, director of bioinformatics, EdgeBio. “How do we know that the next-gen sequencing results are, indeed, accurate?”

EdgeBio leads the development of the validation protocol underwritten by the X Prize Foundation, a nonprofit organization that creates and manages global competitions to solve challenges facing humanity. The Archon Genomics X Prize presented by Express Scripts, a $10 million award, will be given to the first team to sequence the genomes of 100 centenarians in 30 days cheaply, accurately, and completely.

The genomes must be sequenced with an error rate of one in one million bases. At this level of quality, the resulting sequences are moving toward “medical-grade”, meaning that the data may be used in clinical care decisions. The purpose of the validation protocol is first to develop an answer key, and second to create an automated scoring system against the answer key.
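A quick back-of-the-envelope calculation shows what that error rate means in absolute terms. The sketch below assumes a haploid human genome of roughly 3.2 billion bases, a figure that does not come from the competition rules cited here. The error rate and the number of genomes are taken from the text above.

```python
# Assumption (not from the article): a haploid human genome is roughly 3.2 billion bases.
HAPLOID_GENOME_BASES = 3_200_000_000
# From the article: no more than one error per one million bases, across 100 genomes.
BASES_PER_ERROR = 1_000_000
GENOMES_IN_COMPETITION = 100


def max_allowed_errors(bases: int, bases_per_error: int = BASES_PER_ERROR) -> float:
    """Upper bound on errors for a sequence of `bases` length at the target error rate."""
    return bases / bases_per_error


per_genome = max_allowed_errors(HAPLOID_GENOME_BASES)
across_competition = per_genome * GENOMES_IN_COMPETITION
```

Under that assumption, a genome that meets the target could still contain about 3,200 erroneous bases, or roughly 320,000 across all 100 genomes.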

To create the answer key, EdgeBio made 5,000 fosmids (cloned portions of the genome, about 200 Mb in total) from two well-known reference samples, Yoruba Male and CEU Female. The fosmids were sequenced by three different methods to reveal the extent of the bias due to a particular sequencing platform.

“About 15% of single nucleotide polymorphisms (SNPs) can be attributed to sequencing technologies,” continued Johnson. “We evaluated the discordance between platforms and used multiple statistical algorithms to annotate true positives and true negatives.”

Next, EdgeBio developed software to compare the answer key with other sequencing results from the same two reference samples. The algorithm scores the results and produces the quality report. The company integrated the upload of test sequences, comparison, scoring, and reporting into a workflow with an intuitive interface (www.validationprotocol.org).
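The general idea of scoring a submission against an answer key can be sketched in a few lines of Python. This is not EdgeBio's algorithm, whose details are not described here. The variant representation, class, and function names below are hypothetical, and the sketch assumes that calls can be compared by exact match on chromosome, position, and alleles.

```python
from dataclasses import dataclass

# Hypothetical variant representation: (chromosome, position, reference allele, alternate allele)
Variant = tuple[str, int, str, str]


@dataclass(frozen=True)
class Score:
    true_positives: int   # calls present in both the answer key and the submission
    false_positives: int  # calls in the submission that the answer key does not contain
    false_negatives: int  # answer-key variants that the submission missed

    @property
    def precision(self) -> float:
        called = self.true_positives + self.false_positives
        return self.true_positives / called if called else 0.0

    @property
    def recall(self) -> float:
        expected = self.true_positives + self.false_negatives
        return self.true_positives / expected if expected else 0.0


def score_calls(answer_key: set[Variant], submitted: set[Variant]) -> Score:
    """Compare submitted variant calls with the answer key using exact matching."""
    return Score(
        true_positives=len(answer_key & submitted),
        false_positives=len(submitted - answer_key),
        false_negatives=len(answer_key - submitted),
    )
```

Precision reports how many of the submitted calls are supported by the answer key, while recall reports how much of the answer key the submission recovered. A real quality report would also need to handle equivalent representations of the same variant and regions where the answer key itself is uncertain.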

Even before the X Prize, EdgeBio was deeply invested in clinical sequencing and received CLIA certification in 2012.

“While the significance of whole genome is still not quite established, medical-grade sequencing of exomes, targeted gene pools, or transcriptomes may provide clinically actionable information,” said Johnson. “Development of performance metrics will speed up the incorporation of next-gen technologies into clinical diagnostics.”

“Simply annotating and aligning DNA sequences is not enough to discover their biomedical value,” said Martin Seifert, Ph.D., CEO, Genomatix. “To perform meaningful analysis of their sequencing data, the researchers need to view it in combination with existing biological knowledge.”

The Genomatix Genome Analyzer (GGA) enables visualization of NGS data in the context of multiple databases containing a comprehensive compilation of information on transcriptional regulation, DNA binding sites, epigenomic spots, and signaling networks.

“Knowledge datasets are available for 33 different organisms adding up to several terabytes of data. Cross-organism comparisons help assign meaning to genetic elements for which the function is not yet understood.”

The GGA can also be complemented by the company’s second turnkey solution, the Genomatix Mining Station (GMS), which does high-performance NGS mapping. Each GMS node houses 64 gigabytes of memory so that it can handle the terabytes of data that NGS instruments produce. The GMSs can be scaled out, just like EMC Isilon units, to match the ever-growing volume of raw data produced by NGS sequencers. The interfaces of both machines are web-based and can be accessed by multiple users without the need to transfer big data.

The company also plans to offer its tools via Illumina’s BaseSpace apps later this year to accommodate users with limited datasets, such as those running benchtop sequencers.

In collaboration with the NIH, Genomatix software helped to discover synergies between several transcriptional regulatory factors directly influencing maintenance of mammalian photoreceptors. Using chromatin precipitation data and GGA technology, complex regulatory maps of rod and cone genes were assembled. These maps can model the effect of regulatory factors on photoreceptor expression and thus support a better understanding of retinal neurodegenerative diseases.

Another key collaboration, with the Center for Prostate Disease Research, revealed an early biomarker that may predict future metastatic progression of prostate cancer. Genomatix is invested in academic research collaborations, and is an SME partner in various consortiums like BLUEPRINT (generating 100 reference epigenomes), m4 Personalized Medicine (gene network analysis for prognostics and diagnostics on a personalized basis), SYNERGY-ERASysBio+ (characterizing roles of nuclear receptors), and MedSyS (signaling in pluripotent stem cells).

“Our approach facilitates translation of genetic data into biomedical knowledge, and we’re working hard to get it into diagnostics,” concluded Dr. Seifert.

Managing All that Data

The Jackson Laboratory boasts roughly 5,000 strains of mice. It uses these models to conduct next-generation sequencing (NGS) analysis for a variety of purposes such as discovery of spontaneous mutations, strain-specific variation, and genome-wide analysis of gene expression. This year it added a Convey Computer hybrid-core computer to its lab with the goal of speeding current research and enabling scientists to undertake whole-genome studies that were previously impractical.

“Once we could afford whole-genome sequencing, we found a significant bottleneck in the time required to process the data,” says Laura Reinholdt, Ph.D., a research scientist at the Jackson Laboratory. “That’s when biologists here began to seek tools and infrastructures to more expediently manage and process the expanding volumes of NGS data.”

Data overload has become a growing problem when it comes time to analyze results of NGS experiments. The hybrid-core architecture of the Convey system aims to ease the bottleneck by pairing classic Intel processors with a coprocessor composed of field-programmable gate arrays (FPGAs). Particular algorithms—DNA sequence assembly, for example—are optimized and translated into code that’s loadable onto the FPGAs at runtime, accelerating performance-critical applications.

Dr. Reinholdt’s group has used high-throughput sequencing to improve mouse models of ALS, Down syndrome, and Alzheimer’s disease. Performing alignment on the laboratory’s 32-core servers was a slow process, she says, noting that the higher throughput of the Convey HC-2 system gives researchers more flexibility to adjust parameters, quickly perform multiple runs, and achieve better results.

“You can end up spending weeks just trying to find the right parameters. If you can do two or three alignment runs in parallel, optimization of the alignment becomes much less time consuming,” comments Dr. Reinholdt.
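The idea of trying several alignment settings side by side can be illustrated with a small Python sketch. It has nothing to do with the Convey hardware or any particular aligner: the reference, the reads, and the `aligned_count` and `sweep` functions are invented for illustration, and the toy "alignment" simply counts reads that match the reference within a given number of mismatches. In practice each worker would more likely launch a separate run of a real alignment program.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy data, invented for this sketch.
REFERENCE = "ATGGCGTACCTTAGGCTA"
READS = ["GCGTAC", "CCTTAG", "GCGAAC", "TTTTTT"]


def aligned_count(max_mismatches: int) -> int:
    """Count reads that match some window of REFERENCE with at most `max_mismatches`."""

    def best_distance(read: str) -> int:
        windows = (
            REFERENCE[i:i + len(read)]
            for i in range(len(REFERENCE) - len(read) + 1)
        )
        # Hamming distance to the closest window of the reference.
        return min(sum(x != y for x, y in zip(read, w)) for w in windows)

    return sum(best_distance(read) <= max_mismatches for read in READS)


def sweep(settings: list[int]) -> dict[int, int]:
    """Evaluate several mismatch thresholds concurrently and collect the results."""
    with ThreadPoolExecutor(max_workers=3) as pool:
        return dict(zip(settings, pool.map(aligned_count, settings)))
```

Calling `sweep([0, 1, 3])` returns the number of aligned reads for each threshold, so the settings can be compared after one concurrent pass rather than three sequential ones. Threads are used here only to keep the sketch short. CPU-bound work in pure Python would not speed up this way, which is why separate processes or dedicated hardware are used for real alignment runs.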