Chris Anderson

Researchers Turn to Cloud Computing as Genomic Sequencing Data Threatens to Overwhelm Traditional IT Systems

In early 2007, the cost of sequencing a single human genome was around $10 million, according to data compiled by the National Human Genome Research Institute, and the cost was falling along roughly the path predicted by Moore's Law. This trend suggested that the declining cost of sequencing was driven largely by increases in computing power.

But all that changed by late 2007, as a variety of new, high-throughput sequencing methods—grouped today under the catchall phrase "next-generation sequencing" (NGS)—began to replace the Sanger method. Less than four years later, the cost had plummeted to $10,000 amid wide-eyed chatter about the promise of the $1,000 genome. Now, we are nearly there. And not surprisingly, there is a new target—the $100 genome—a mere formality according to sequencing heavyweight Illumina, which in January launched NovaSeq, the instrument it says will get us there.

Today, NGS has democratized genomic research. It is routinely used by pharmaceutical and academic researchers alike, with thousands of scientists worldwide plumbing the depths of the coding regions of DNA in ways barely imaginable 10 years ago. The broad, detailed datasets from this work have also reached the clinic, where doctors can now draw on genomic information to provide more precise care to their patients.

But as the costs of sequencing have plummeted, the volume of generated sequencing data has concomitantly exploded, presenting challenges in how to store and effectively analyze the growing mountain of genomic data.

According to Bryan Spielman, EVP of strategy and corporate development at biomedical data analysis company Seven Bridges, the pace of change has rendered even significant investments in on-premises computing capacity inadequate. To convey the scale of the data being generated, Spielman notes that the genomic data of the 11,000 people currently housed in the NIH-funded Cancer Genome Atlas weighs in at more than 1.5 petabytes.
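As a rough sanity check on that figure, the numbers above imply a per-genome storage footprint in the low hundreds of gigabytes. A quick back-of-envelope calculation (assuming the article's 1.5 PB total and 11,000 genomes, and defining 1 PB as 10^15 bytes) illustrates the scale:

```python
# Back-of-envelope check on the storage figure quoted above.
# Assumptions: 1.5 PB total, 11,000 genomes, 1 PB = 1e15 bytes.
total_bytes = 1.5e15
genomes = 11_000

per_genome_gb = total_bytes / genomes / 1e9  # convert bytes -> GB
print(f"~{per_genome_gb:.0f} GB per genome")  # ≈ 136 GB
```

At roughly 136 GB per genome, even a modest sequencing program quickly outgrows the storage a typical departmental cluster can spare.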

“I was speaking with someone at a top-five pharma company, and 1.5 petabytes is 50% of the storage capacity of their own, on-premises, high-performance computing cluster,” he says. 

In an era when a major undertaking in the U.K. promises to sequence 100,000 genomes and there are both public and private projects that aim to sequence 1 million genomes, it becomes clear that new thinking and strategies for how to manage and leverage the data are needed.

Into the Cloud

As the cost of sequencing has dropped and adoption continues to grow, the move to cloud computing was almost a necessity for the most active sequencing operations. In testimony to the U.S. Congress in the summer of 2014, human genome pioneer J. Craig Venter cited two major developments that had allowed him to start his precision medicine company Human Longevity: the cost of sequencing passing an affordability threshold, and the ability to move the sequencing data it generated to the cloud.

“We are going to rely very heavily on cloud computing, not only to house this massive database, but to be able to use it internationally,” Venter testified regarding the then-fledgling company. He went on to describe how even with a dedicated, fiberoptic network the data moved so slowly between his company in La Jolla, CA, and his non-profit genomic research entity the J. Craig Venter Institute in Rockville, MD, that they would routinely ship data on hard disks via FedEx between locations. “The use of the cloud is the entire future of this field,” he concluded.
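The FedEx anecdote is easy to verify with simple arithmetic. The figures below are illustrative assumptions, not numbers from the article: suppose a lab must move 100 TB of sequencing data over a dedicated 1 Gbps link running at full, sustained throughput.

```python
# Why shipping disks can beat the network: hypothetical numbers
# chosen for illustration, not figures from the article.
data_bits = 100e12 * 8   # 100 TB expressed in bits
link_bps = 1e9           # dedicated 1 Gbps link, fully utilized

transfer_days = data_bits / link_bps / 86_400  # seconds -> days
print(f"network transfer: ~{transfer_days:.1f} days")  # ≈ 9.3 days
print("overnight courier: ~1 day, regardless of volume")
```

Even under these generous assumptions, the network takes over a week, while a box of hard disks arrives the next morning—the "never underestimate the bandwidth of a station wagon full of tapes" effect that Venter's team ran into.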

Another significant factor speeding adoption of cloud computing comes when an organization’s on-premises capability can’t keep up with the speed and data demands of NGS, says David Shaywitz, M.D., Ph.D., CMO of cloud-based genome informatics and data management company DNAnexus. “People would say to me, ‘we have an overwhelming amount of work to do, and it shuts down our cluster when we try to do it.’ When they move to the cloud, what would have been months of work before can be done in hours, so that’s obviously better,” Dr. Shaywitz says.

Further, because the barriers to entry for NGS are now much lower, and no longer require a significant IT backbone, lower sequencing costs combined with cloud computing have put genomic research within reach of individual labs. “You are putting the power of sequencing into single-researcher hands with things like [Illumina’s desktop sequencer] MiSeq,” says John Shon, VP, bioinformatics and data science at Illumina. “So even though some of the work has to happen on premises, you can have push-button analysis in the cloud.”

That’s a far cry from just a few years ago, notes Shon, whose background includes stints with Janssen (a division of Johnson & Johnson) and Roche. “There were a lot of homegrown tools back then, almost exclusively local storage, and not very much was standardized at all,” he says. “In the research setting, the data would be collected in one place; you’d have the molecular biology lab that did sample processing, you’d have a sequencing center, and the data would be sent to the bioinformatics groups. So it was not uncommon to have five or six different departments involved in that process.”

But the benefits of the cloud extend beyond more computing power and massive data storage, to providing an environment that fosters scientific collaboration on national and global scales. One example is PrecisionFDA, the FDA’s cloud-based collaborative portal, which provides tools for researchers, including reference genomes, and allows participating organizations to upload their own data and share tools and analytic methods for querying genomic data.

Launched in December 2015 as part of President Obama’s Precision Medicine Initiative, PrecisionFDA quickly grew to more than 1,500 researchers representing roughly 600 companies and organizations. According to Taha Kass-Hout, M.D., FDA chief health information officer, roughly one-third of the participants in PrecisionFDA hail from outside the U.S. “It’s amazing to see how the global community is coming together, and they are contributing data, as well as software [to PrecisionFDA],” Dr. Kass-Hout notes in a 2016 online interview outlining the program.

“The community is working toward advancing the regulatory science behind assuring the accuracy of the next-gen software for the human genome. To do that, we want to provide an environment to share some of the innovations happening in this field, as well as any reference materials they might have,” Dr. Kass-Hout explains. “We also realized there are several members in the community that need the computation platform to help them do the heavy [data-]crunching. We consider it a social experiment behind advancing regulatory science behind NGS.”

“If you are looking for the opportunity to facilitate [collaboration] between distant facilities—because science is global and there is a need for global representation—there is hardly a better way to do it than the cloud,” Dr. Shaywitz concludes.


More Robust, More Secure

It is no surprise that the major cloud-computing companies are the tech giants of the day: Amazon, Microsoft, IBM, and Google—companies that needed to make massive computing investments in order to meet very short periods of peak demand. Amazon Web Services (AWS) held nearly one-third of the market at the end of 2015, according to research by Synergy Research Group, and the top four companies all saw growth exceeding 55% for the year.

“The nice thing about cloud computing with the underlying companies like Amazon, Google, Microsoft, and others is that while they don’t have unlimited capacity, it is pretty elastic,” says Spielman. “Remember, [Amazon] was built for Christmas Day. If you are Roche or Memorial Sloan Kettering, you are really only going to be able to build so much capacity internally, and the amount of that data, while not infinite, is quite significant.”

While companies were attracted to the expanded computing capacity and flexibility the cloud promised, many were reluctant to embrace it early on due to security concerns, Dr. Shaywitz points out. In an industry that uses patient data for clinical trials, and for broader research studies, working in a secure environment is paramount. He heard those concerns when DNAnexus was helping to design PrecisionFDA.

“A well-designed platform will address these concerns about keeping data private,” he says. “When we were implementing it, there were conspiracy theories about the FDA wanting to look into people’s data. But people see they can upload data and keep it completely private. The last thing FDA wanted was to be responsible for other people’s data.”

Moving to the cloud also relieves organizations of the need to invest in significant expertise in data handling and security, efforts that can take focus away from their core competencies. “Companies like Amazon, and like us, worry about compliance, worry about security,” Dr. Shaywitz notes. “Azure and Amazon are living and breathing this every day.”

Many customers see this expertise as a major benefit. “Security of cloud-based platforms is a huge, huge issue,” adds Shon. “But people now realize that [data] is a lot safer in the cloud than on their own local systems.”  

This article was originally published in the January/February 2017 issue of Clinical OMICs.
