It’s an exciting time for next-generation sequencing. Multiple new companies, each with their own emerging platforms and new technologies, have entered the U.S. market in the past year. Although there is a lot of uncertainty in many aspects of this field, everyone can agree that the future will bring a lot more sequencing data. And, as the instruments produce more data, the computing platforms have to rise to the occasion as well.
Now, the Broad Institute and Nvidia, a Silicon Valley microprocessor giant that invented the graphics processing unit (GPU) in 1999, are teaming up. The two have announced a partnership that will bring Nvidia’s AI and acceleration tools to Terra, Broad’s widely used cloud platform for genomic analysis. The result, they say, will be faster analysis of more data.
This partnership builds on several platforms that have already transformed researchers’ ability to analyze genomic data.
The data science and data engineering group at the Broad developed the Genome Analysis Toolkit (GATK), the workhorse that is widely used across the genomics community to interpret sequence data (which typically comes off the sequencers as a FASTQ file). GATK focuses on variant discovery and genotyping in both DNA and RNA-seq data. However, using it requires a certain amount of familiarity with bioinformatics.
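For readers unfamiliar with the FASTQ format mentioned above, here is a minimal Python sketch of how such a file can be read. It assumes the common four-line record layout (header, bases, `+` separator, quality string) and Phred+33 quality encoding. The names `FastqRecord`, `read_fastq`, and `phred_scores` are illustrative only and are not part of GATK, Terra, or Parabricks.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    """One sequencing read (hypothetical helper type for this sketch)."""
    name: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-read FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:              # end of file
            return
        header = header.rstrip("\r\n")
        if not header:              # skip blank lines between records
            continue
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"Malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"Sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a quality string into per-base scores (assumes Phred+33)."""
    return [ord(char) - offset for char in quality]


if __name__ == "__main__":
    example = io.StringIO("@read1\nGATTACA\n+\nIIIIII#\n")
    for record in read_fastq(example):
        print(record.name, record.sequence, phred_scores(record.quality))
```

Real pipelines typically read compressed, multi-gigabyte FASTQ files with dedicated libraries; this sketch only shows the shape of the data that tools like GATK take as their starting point.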
More recently, the Broad developed the Terra platform, which runs on Google’s cloud, in collaboration with Verily Life Sciences. Terra is a scalable, open-source platform that not only gives researchers access to data but also runs analysis tools and supports collaboration. On top of that, it is easy to use and does not require the same bioinformatics background that GATK does. It’s a “point and click” way to analyze genomes, noted Keith Robison, PhD, genomics expert and author of the omicsomics blog.
The partnership will bring Nvidia’s Clara Parabricks to the Terra platform. According to Kimberly Powell, vice president of healthcare at Nvidia, the company has been “working on accelerated computing tools for the last three years.” This program, she noted, runs on a multi-cloud platform so that the entire Terra platform can take advantage of it.
Parabricks, a GPU-accelerated software suite for secondary analysis of sequencing data, is now available in six new Terra workflows. Users can analyze a whole genome in roughly one hour with Clara Parabricks (compared to 24 hours in a CPU-based environment). For Broad’s GATK germline workflow, running the analysis with Parabricks on GPUs can cost less than half as much.
Anthony Philippakis, MD, PhD, chief data officer of the Broad and co-director of the Eric and Wendy Schmidt Center, tells GEN that the computational needs of NGS (the compute and storage requirements) will only continue to grow. The conversation that used to center on decreasing the cost of reagents, he noted, has moved over to sequencing data. And this requires a new generation of hardware acceleration to process data more cheaply, faster, and better.
In addition, Nvidia is contributing a new deep learning model directly to the GATK toolkit.
Researchers will also use Nvidia’s BioNeMo platform to develop large language models (LLMs) that serve as foundation models for DNA and RNA, with the aim of better understanding human biology. BioNeMo is an AI application framework that simplifies training, inference, and scaling, and it includes pretrained LLMs for proteins and chemistry. It is an extension of the Nvidia NeMo Megatron framework, specialized for chemistry, proteins, and DNA/RNA sequences.
BioNeMo allows developers to effectively train and deploy biology LLMs with billions of parameters. Together, teams from both organizations will build on this work, creating new models to add to the BioNeMo collection and make them available in the Terra platform.
Nvidia’s company blog described four pretrained language models:
- ESM-1: This protein LLM, originally published by Meta AI Labs, processes amino acid sequences to generate representations that can be used to predict a wide variety of protein properties and functions. It also improves scientists’ ability to understand protein structure.
- OpenFold: The public-private consortium creating state-of-the-art protein modeling tools will make its open-source AI pipeline accessible through the BioNeMo service.
- MegaMolBART: Trained on 1.4 billion molecules, this generative chemistry model can be used for reaction prediction, molecular optimization, and de novo molecular generation.
- ProtT5: The model, developed in a collaboration led by the Technical University of Munich’s RostLab and including Nvidia, extends the capabilities of protein LLMs like ESM-1b to sequence generation.
Broad Institute researchers will also gain access to MONAI, an open-source deep learning framework for medical imaging AI, and to Nvidia RAPIDS, a GPU-accelerated data science toolkit that speeds up data preparation and can be used for genomic single-cell analysis.
It’s easy to understand why the Broad wants to access the power that Nvidia’s GPUs offer. But why is Nvidia making this move? “They want to move GPUs into healthcare,” noted Robison. And they likely have their sights set beyond genomes. Bringing this bandwidth to the Broad opens the door to analyzing genomic and transcriptomic data, GWAS, pathology, cell imaging, and clinical health records.
Powell agreed, noting that they are “only at the beginning of this research initiative.”