NVIDIA, the Silicon Valley micro-processing giant that invented the graphics processing unit (GPU) more than two decades ago, is expanding its presence in AI-based drug discovery with three partnerships announced at its virtual GPU Technology Conference (GTC), being held this week.
The company is joining with Schrödinger to further expand the speed and accuracy of its computational platform; collaborating with AstraZeneca to develop a transformer-based generative AI model for chemical structures; and teaming up with UF Health, the University of Florida (UF)’s academic health center, to apply what the partners say is the largest clinical language model developed to date for drug discovery as well as clinical medicine.
“We are digitizing biology at a pace we never have before that is going to infuse a new way of doing drug discovery. We need to have new approaches with AI because that level of data—that amount of data is too much for any human to be able to understand,” Kimberly Powell, NVIDIA’s Vice President of Healthcare, told reporters during a virtual briefing. “AI approaches in drug discovery are really going to create that next generation of computational drug discovery.”
In his keynote address at GTC 2021 (which, as last year, he delivered from his kitchen, citing the COVID-19 pandemic), Jensen Huang, founder, president, and CEO of NVIDIA, disclosed that Schrödinger recently agreed to use “hundreds of millions” of NVIDIA GPU hours, effectively tripling its throughput, under an expanded collaboration of undisclosed value with Google Cloud announced on February 25.
As some Schrödinger customers cannot use the cloud, Huang said, the company ramped up an existing partnership with NVIDIA to accelerate its drug discovery workflow. NVIDIA plans to optimize Schrödinger’s free-energy perturbation (FEP+) platform, designed to model and predict the properties of novel molecules, for the NVIDIA DGX SuperPOD system, which consists of NVIDIA DGX A100 AI supercomputers and NVIDIA InfiniBand HDR networking.
The companies said they will partner on scientific and research breakthroughs that further advance physics-based computing and machine learning for drug discovery, with the goal of enabling rapid, accurate evaluation of billions of molecules for potential development within minutes.
“We’ve essentially accelerated the ability to do the work by five times,” Powell said. “We can simulate over 1 million drug candidates in a year. To put that in perspective, if you were to do this in the lab, it would cost you well over $100 million and it would take well over five years to do it.”
“We’re delighted to be building on what we’ve done over the last five years, in really packaging up a solution for the entire industry to benefit,” Powell added.
Productivity boost
Based in New York City, Schrödinger has licensed its drug discovery computing solutions to all of the top 20 pharmaceutical companies as measured by 2019 revenue. “Their researchers are going to see a giant boost in productivity,” Powell added.
Those top 20 pharmas accounted for $31.9 million or 34% of Schrödinger’s software revenue last year. Beyond the top 20, Schrödinger says, a growing number of biopharmas have contracts exceeding $100,000 in annual contract value—from 122 in 2018, to 131 in 2019, to 153 last year.
Drug discovery revenue rose about 17% year-over-year, to $26.6 million in 2020. That includes $1 million of the $55 million Bristol-Myers Squibb agreed to pay upfront under Schrödinger’s last-announced drug discovery collaboration in November, with the company eligible for up to $2.7 billion in milestone payments and royalties.
“Biopharmaceutical companies are increasingly adopting our software at a larger scale, and we anticipate this scaling-up will drive future revenue growth,” Schrödinger stated in its Form 10-K annual report for 2020, its first year as a publicly traded company.
After raising net proceeds of $209.6 million in its initial public offering (IPO), Schrödinger finished last year with a net loss of $24.46 million after accounting for controlling interests, flat from 2019’s net loss of $24.57 million—though revenues grew 26% to $108.09 million in 2020 from a year earlier.
Santa Clara, CA-based NVIDIA finished the year ending January 31, 2021, with net income of $4.33 billion, up 55% from the 12 months ending January 26, 2020, while revenue zoomed 55% year-over-year, to $16.67 billion.
Record high stock
While its week-long annual GTC conferences enable NVIDIA to showcase the breadth of its work by timing a flurry of announcements in and outside of drug discovery, analysts appear to view the resulting blizzard of news as more help than hype, and so do investors.
Investors responded to Monday’s drug discovery announcements by sending shares up nearly 6%, from a closing price of $576 to $608.36. Shares have continued to climb since then, closing Tuesday at a record-high $627.18 before dipping yesterday to $611.08.
NVIDIA’s share price has more than doubled year-over-year, rocketing 115% from $283.95 on April 14, 2020.
That explains why of 26 analysts who rated NVIDIA shares, according to online analyst ranking website TipRanks, 22 maintain “buy” ratings on the company’s stock—a consensus that TipRanks considers a “strong buy”—while four analysts have “hold” ratings, and only David Wong of Instinet-A Nomura Company rates NVIDIA stock as a “sell.”
Wong downgraded NVIDIA from “neutral” to “sell” in February 2020, citing potential risk to NVIDIA and other giants in the semiconductor industry from COVID-19: “We think many investors and companies may have underestimated the risk of the current issues impacting electronics end market demand through 2020.”
Wong also lowered his price target on NVIDIA shares from $235 to $230—only to raise his target to $260 three months later in May 2020 after the company reported stronger than expected Q1 results. He observed that NVIDIA’s gaming segment, which accounts for more than half of total sales, demonstrated “more resilience to the global health and economic issues than we had expected.”
NVIDIA hopes to grow further in part by acquiring chip designer ARM Holdings for $40 billion—a deal now under antitrust regulatory review in the U.S. and the U.K., where ARM is based, amid protests from rivals that include Google, Microsoft, and Qualcomm that the deal would harm competition. The deal would position NVIDIA as a leader in building computing systems large enough for data centers rather than rely on CPUs from others such as Intel.
NVIDIA has incorporated an ARM processor into the new high-performance central processing unit (CPU) for large-scale neural networks it announced at GTC 2021, with plans to reach the market in 2023. The CPU is called Grace, named for computing pioneer and U.S. Navy Rear Admiral Grace Hopper (1906-1992), whose accomplishments included helping to devise the first commercial electronic computer, UNIVAC I.
Drug discovery model
NVIDIA says its partnership with AstraZeneca entails creating a generative AI model for chemical structures used in drug discovery, one envisioned to help researchers conceptualize molecules that could be potential drug candidates but do not exist in databases. Potential uses for the model include de novo molecular generation, reaction prediction, and molecular optimization for desired pharmacokinetic properties related to absorption, excretion, and toxicity.
“You can generate a molecule that has never existed in a database before, and that’s really important because we know that there is over 10⁶⁰, a completely intractable number of potential molecules out there,” Powell said. (Powell cited the highest in a range of estimates in a 2012 review article published in ACS Chemical Neuroscience; others are lower, at between 10²⁰ and 10²⁴ once combinations of known fragments are considered.)
“We are going to discover more novel molecules that are needed to treat the over 10,000 diseases that still go without treatment,” Powell predicted.
The AstraZeneca-NVIDIA model, called MegaMolBART, will be among the first projects to run on NVIDIA’s new Cambridge-1 supercomputer, which the company says will be the U.K.’s most powerful supercomputer when it goes online later this year.
In October, NVIDIA said AstraZeneca would be among users of Cambridge-1 when it announced an AI drug discovery partnership with GlaxoSmithKline that was also tied to use of the supercomputer. Cambridge-1 was originally set to come online last year, but was delayed as COVID-19 led to three separate lockdowns in the U.K. and global travel restrictions that forced the installation to be overseen remotely from the U.S.—yet deployment of the supercomputer was finished last month, at a still-rapid pace of 20 weeks from announcement.
In addition to Cambridge-1, NVIDIA’s collaboration with AstraZeneca will use NVIDIA’s Selene supercomputer, built on the NVIDIA DGX SuperPOD and ranked No. 5 on the most recent TOP500 list of global supercomputers in November.
MegaMolBART, built on NVIDIA DGX SuperPOD, is based on AstraZeneca’s MolBART transformer model and is being trained on the public-access ZINC chemical compound database, using the NVIDIA Megatron framework to enable massively scaled-out training on supercomputing infrastructure. ZINC enables researchers to pretrain a model that understands chemical structure, bypassing the need for hand-labeled data.
MegaMolBART will be specialized for tasks that include predicting how chemicals will react with each other and generating new molecular structures. Once developed, the model will be open sourced, available to researchers and developers in the NVIDIA NGC software catalog.
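How pretraining on an unlabeled compound database can avoid hand-labeled data may be easier to see in a small sketch. The Python below is an illustration only, not NVIDIA’s or AstraZeneca’s code: it assumes molecules are written as SMILES text strings, uses a simplified tokenizer, and builds training pairs in the style of a BART-type denoising objective, in which the model sees a corrupted string and must reconstruct the original. The function names, the `<mask>` token, and the masking probability are all hypothetical choices made for the example.

```python
import random
import re

# Simplified SMILES tokenizer (assumption: illustration only, not a full grammar).
# Matches bracket atoms such as [N+], two-letter halogens, then any single character.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|.")

MASK = "<mask>"  # hypothetical placeholder token


def tokenize(smiles):
    """Split a SMILES string into a list of tokens."""
    return TOKEN_PATTERN.findall(smiles)


def make_denoising_pair(smiles, mask_prob=0.15, rng=None):
    """Return (corrupted_tokens, target_tokens) for one unlabeled molecule.

    The target is simply the original string, so the "label" comes for
    free from the data itself.
    """
    rng = rng or random.Random()
    target = tokenize(smiles)
    corrupted = [MASK if rng.random() < mask_prob else tok for tok in target]
    return corrupted, target


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs match
    examples = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O", "C[N+](C)(C)C"]
    for smiles in examples:
        corrupted, target = make_denoising_pair(smiles, mask_prob=0.3, rng=rng)
        print("".join(corrupted), "->", "".join(target))
```

Every molecule in the database yields a training pair this way, which is why no human annotation is required at the pretraining stage; task-specific data would still be needed to specialize such a model afterward.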
Relationships between atoms
“Just as AI language models can learn the relationships between words in a sentence, our aim is that neural networks trained on molecular structure data will be able to learn the relationships between atoms in real-world molecules,” Ola Engkvist, PhD, associate director, Computational Chemistry, Discovery Sciences, R&D at AstraZeneca, said in a statement.
Another life-sci company using MegaMolBART with success is Insilico Medicine, a partner in the NVIDIA Inception accelerator program. On February 24, Insilico said it discovered the first preclinical candidate, a treatment for idiopathic pulmonary fibrosis (IPF) whose novel molecule and novel target were both identified through transformer-based generative AI models.
Insilico started with a set of 20 novel targets discovered by AI for fibrosis, then narrowed down the targets to specifically address IPF. The molecules were first generated with Insilico’s Chemistry42 system, using a structure-based drug design generative chemistry approach powered by NVIDIA V100 Tensor Core GPUs, before testing in human cell and animal models. The molecules were then redesigned using ligand-based drug design to optimize for additional properties, and again tested in human cells and animal models.
The effort, Insilico said, took less than 18 months and $1.8 million from target hypothesis to IPF preclinical candidate selection. The company also spent $800,000 toward candidates for other fibrotic disorders, with fewer than 80 small molecules synthesized and tested.
As with the Schrödinger partnership, the AstraZeneca collaboration draws upon Clara Discovery, a collection of frameworks, applications, and AI models enabling GPU-accelerated drug discovery, with support for research in genomics, proteomics, microscopy, virtual screening, computational chemistry, visualization, clinical imaging, and natural language processing (NLP).
Clinical “gator” aid
MegaMolBART is one of four new models within Clara Discovery that were highlighted by NVIDIA at GTC 2021. The other three:
- An ATAC-seq de-noising algorithm for rare and single-cell epigenomics, designed to help researchers understand gene expression for individual cells.
- AlphaFold1, a model designed to predict the 3D structure of a protein from its amino acid sequence.
- GatorTron™, an AI transformer NLP model that can read and understand doctors’ notes and which, according to NVIDIA, is the world’s largest clinical language model.
GatorTron was developed through UF’s $100 million partnership with NVIDIA and uses the company’s Megatron framework. It was trained on records from more than 300 million unstructured notes across 2 million patients, generated over 50 million patient encounters. GatorTron also uses a DGX SuperPOD (nicknamed “HiPerGator”) gifted to the University of Florida, his alma mater, by Chris A. Malachowsky, an NVIDIA Fellow and senior technology executive who co-founded the company with Huang and Curtis R. Priem.
In addition to clinical record-keeping, GatorTron is expected to bolster clinical R&D for new drugs by helping identify and recruit patients for clinical trials by facilitating rapid creation of patient cohorts; predicting and alerting healthcare teams about life-threatening conditions; and providing clinical decision support to doctors by studying the effect of a drug or vaccine.
“GatorTron leveraged over a decade of electronic medical records to develop a state-of-the-art model,” stated Joseph Glover, PhD, UF Provost and Senior Vice President for Academic Affairs. “A tool of this scale will enable healthcare researchers to unlock insights and reveal previously inaccessible trends from clinical notes.”
Powell said entity recognition will be a key benefit of GatorTron: “To say this is a test, this is a treatment, this is a symptom. It allows you to really extract information out of this massive untapped data source of electronic medical records, and it even improves their own patient deidentification or anonymization methods at the University of Florida.”
“Oftentimes, hospitals specialize in certain diseases or certain treatments, so they have their own ontologies, their own language. We have just witnessed a public health crisis that has introduced a whole bunch of new language that we never had in our medical records before,” Powell explained, alluding to the pandemic. “Now we’re at a point where we’ve democratized that capability. We have essentially democratized the ability for every academic medical center to be able to build their own clinical language models, and they want to do that.”
Computing, DaVinci style
Huang highlighted two other life-sci companies using NVIDIA’s technologies during his keynote address. Sequencing specialist Oxford Nanopore Technologies, the British unicorn whose founders pioneered and successfully commercialized nanopore sequencing, is integrating an NVIDIA DGX Station A100 and its Tensor Core GPU into the PromethION ultra-high throughput sequencing system, with the goals of supporting real-time analyses at scale and analyzing DNA and RNA fragments of any length.
DGX Station A100 is a 2.5 petaFLOP benchtop AI computing system containing four NVIDIA A100 80GB GPUs, fully connected via NVIDIA NVLink, to offer a total of 320GB of GPU memory. Oxford Nanopore’s PromethION instrument can generate as much as 10 terabases of sequence data per 72-hour run (equivalent to 96 human genomes at 30X coverage).
Recursion Pharmaceuticals, which went public by launching a $436 million IPO this week, has installed BioHive-1, a supercomputer based on NVIDIA DGX SuperPOD.
“Deep learning projects that took a week to run on our previous cluster can run in under a day on the new cluster,” Recursion stated in its Form S-1 registration statement, filed March 22.
BioHive-1 consists of 40 NVIDIA DGX A100 640GB nodes, which Recursion said expanded its capability to rapidly improve machine learning models for generating, analyzing, and gaining insight from its proprietary collection of highly relatable, high-dimensional biological and chemical datasets spanning multiple different data modalities, called the Recursion Data Universe.
“We actually built [BioHive-1] in just three weeks because the NVIDIA DGX SuperPOD architecture is really a datacenter architecture, already ‘specced’ out and built all over the world and used by NVIDIA and lots of supercomputing centers and other industry partners,” Powell said.
As of March 22, the Recursion Data Universe contained nearly 8 petabytes of highly relatable biological and chemical data. The core dataset is based on billions of labeled images of human cells generated across millions of unique perturbations (such as gene knockout, soluble protein factor addition, drug addition, or combinations) in Recursion’s wet labs. As of December 2020, that process was generating up to 9 million images per week, a subset of the Recursion Data Universe amounting to approximately 80 terabytes of data, across up to 1.5 million experiments per week.
“NVIDIA is a computing platform company, helping to advance the work for the Da Vincis of our time–in language understanding, drug discovery, or quantum computing,” Huang said during his kitchen keynote address. “NVIDIA is the instrument for your life’s work.”