Whether we are predisposed to particular diseases may depend to a large extent on variations in our genomes, but it has so far been difficult to determine how genetic variants that occur only rarely in the population influence the presentation of certain pathological traits. Researchers at the German Cancer Research Center (DKFZ), together with colleagues at the European Molecular Biology Laboratory (EMBL) and the Technical University of Munich (TUM), have now introduced an algorithm based on deep learning that can predict the effects of rare genetic variants. The method, deep rare variant association testing (DeepRVAT), allows persons at high risk of disease to be distinguished more precisely and facilitates the identification of genes that are involved in the development of diseases.

“DeepRVAT has the potential to significantly advance personalized medicine,” suggested physicist and data scientist Oliver Stegle, PhD, at DKFZ. “Our method functions regardless of the type of trait and can be flexibly combined with other testing methods.” Stegle is co-senior and co-corresponding author of the team’s published paper in Nature Genetics, which is titled, “Integration of variant annotations using deep set networks boosts rare variant association testing.” In their report, the team stated, “DeepRVAT leverages the flexibility of deep neural networks to integrate rare variant annotations while offering a calibrated statistical framework for gene–trait association testing.”

Every individual’s genome differs from that of other human beings in millions of individual building blocks. Many of these genetic variants are associated with particular biological traits and diseases. Such correlations are usually determined using genome-wide association studies (GWAS). “Rare variants in particular often have a significantly greater influence on the presentation of a biological trait or a disease,” said co-lead author Brian Clarke, PhD, at DKFZ. “They can therefore help to identify those genes that play a role in the development of a disease and that can then point us in the direction of new therapeutic approaches,” added co-first author Eva Holtkamp, PhD, at TUM.

However, the influence of rare variants, which occur with a frequency of only 0.1% or less in the population, is often statistically overlooked in association studies. “Rare genetic variants can have strong effects on phenotypes, yet accounting for rare variants in genetic analyses is statistically challenging …” the authors wrote. “… extending the GWAS strategy to rare variants must contend with a large number of low-frequency variants, leading to low statistical power due to sparsity and an increased multiple testing burden.”
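The two problems the authors name, sparsity and the multiple testing burden, can be made concrete with some simple arithmetic. The sketch below is only an illustration: it uses a plain Bonferroni correction (which the source does not say the authors use), the 0.1% frequency quoted above, and the cohort size and variant count reported later in this article. The figure of 20,000 genes is a round-number assumption, and the function names are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


def expected_carriers(allele_frequency, n_individuals):
    """Expected number of people carrying at least one copy of a variant.

    Assumes diploid genomes and independent inheritance of the two copies.
    """
    carrier_probability = 1.0 - (1.0 - allele_frequency) ** 2
    return n_individuals * carrier_probability


ALPHA = 0.05
N_VARIANTS = 13_000_000   # variants in the training data, per the article
N_GENES = 20_000          # illustrative assumption, not from the article
N_INDIVIDUALS = 161_000   # training cohort size, per the article

per_variant = bonferroni_threshold(ALPHA, N_VARIANTS)
per_gene = bonferroni_threshold(ALPHA, N_GENES)
carriers = expected_carriers(0.001, N_INDIVIDUALS)

print(f"per-variant threshold: {per_variant:.2e}")  # 3.85e-09
print(f"per-gene threshold:    {per_gene:.2e}")     # 2.50e-06
print(f"expected carriers at 0.1% frequency: {carriers:.0f}")
```

Testing one aggregated score per gene instead of one test per variant relaxes the threshold by several orders of magnitude, and pooling variants within a gene also increases the number of carriers contributing to each test.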

To better predict the effects of rare variants, teams led by Stegle, Clarke, and Julien Gagneur, PhD, at TUM, developed a risk assessment tool based on machine learning. They claim that DeepRVAT is the first method to use artificial intelligence (AI) in genomic association studies to decipher rare genetic variants.

The model was initially trained on the sequence data (exome sequences) of 161,000 individuals from the UK Biobank. In addition, the researchers fed in information on genetically influenced biological traits of the individual persons as well as on the genes involved in the traits. The sequences used for training comprised around 13 million variants. For each of these, detailed “annotations” are available, providing quantitative information on the possible effects that the respective variant can have on cellular processes or on the protein structure. These annotations were also a central component of the training.

After training, DeepRVAT is able to predict for each individual which genes are impaired in their function by rare variants. To do this, the algorithm uses individual variants and their annotations to calculate a numerical value that describes the extent to which a gene is impaired and its potential impact on health. “DeepRVAT is an end-to-end genotype-to-phenotype model that first accounts for nonlinear effects from rare variants on gene function (gene impairment module) to then model variation in one or multiple traits as linear functions of the estimated gene impairment scores (phenotype module),” the team explained. “The gene impairment module estimates a gene and trait-agnostic gene impairment scoring function that accounts for the combined effect of rare variants, thereby allowing the model to generalize to new traits and genes.”
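The two-module design described here can be sketched in a few lines. What follows is a toy illustration under stated assumptions, not the DeepRVAT implementation or its API: the weights are random rather than trained, the layer sizes, the max pooling, and the sigmoid output are choices made for this example, and every function and gene name is hypothetical. It shows the deep-set idea named in the paper's title: each variant's annotation vector is embedded separately, the embeddings are pooled in a way that ignores variant order, and the pooled vector is mapped to one impairment score per gene. Traits are then modeled as a linear function of those scores.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Toy dense layer: random weights, zero biases (untrained, illustration only)."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense_relu(layer, x):
    """Apply a dense layer followed by a ReLU nonlinearity."""
    weights, biases = layer
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def gene_impairment_score(variant_annotations, phi, rho):
    """Deep-set style score for one gene in one individual.

    variant_annotations: list of annotation vectors, one per rare variant.
    phi: per-variant embedding layer; rho: layer applied after pooling.
    """
    if not variant_annotations:
        return 0.0  # assumption: no rare variants means no impairment
    embedded = [dense_relu(phi, v) for v in variant_annotations]
    # Element-wise max pooling makes the result independent of variant order.
    pooled = [max(column) for column in zip(*embedded)]
    weights, biases = rho
    z = sum(w * p for w, p in zip(weights[0], pooled)) + biases[0]
    return 1.0 / (1.0 + math.exp(-z))  # squash to the range (0, 1)


def predict_trait(impairment_by_gene, effect_sizes, intercept=0.0):
    """Phenotype module: a trait as a linear function of gene impairment scores."""
    return intercept + sum(effect_sizes.get(gene, 0.0) * score
                           for gene, score in impairment_by_gene.items())


rng = random.Random(0)
phi = make_layer(3, 4, rng)   # 3 hypothetical annotations per variant
rho = make_layer(4, 1, rng)

variants = [[0.9, 0.1, 0.3], [0.2, 0.8, 0.5], [0.4, 0.4, 0.4]]
score = gene_impairment_score(variants, phi, rho)
trait = predict_trait({"GENE_A": score}, {"GENE_A": 2.0}, intercept=1.0)
```

Because the scoring function takes only variant annotations as input and knows nothing about a specific gene or trait, the same function can in principle be reused for genes and traits it was not fitted on, which is the generalization property the authors describe.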

The researchers validated DeepRVAT on genome data from the UK Biobank. For 34 tested traits, i.e., disease-relevant blood test results, the testing method found 352 associations with genes involved, far outperforming all previously existing models. The results obtained with DeepRVAT proved to be robust and replicated better in independent data than the results of alternative approaches. Another important application of DeepRVAT is the evaluation of genetic predisposition to certain diseases. The researchers combined DeepRVAT with polygenic risk scoring based on more common genetic variants. This significantly improved the accuracy of the predictions, especially for high-risk variants. “On 34 quantitative and 63 binary traits, using whole-exome-sequencing data from UK Biobank, we find that DeepRVAT yields substantial gains in gene discoveries and improved detection of individuals at high genetic risk,” they wrote.
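One simple way to picture the combination of a polygenic risk score with rare-variant information is an additive score, sketched below. This is an assumption made for illustration: the article does not describe how the authors combine the two signals, and the weights, gene names, individuals, and function names here are all hypothetical.

```python
def combined_risk_score(prs, impairment_scores, gene_weights, prs_weight=1.0):
    """Add a weighted rare-variant burden to a common-variant polygenic score.

    impairment_scores: gene -> impairment score for one individual.
    gene_weights: gene -> assumed effect of impairment in that gene on the trait.
    """
    rare_component = sum(gene_weights[gene] * score
                         for gene, score in impairment_scores.items()
                         if gene in gene_weights)
    return prs_weight * prs + rare_component


def top_risk_individuals(scores, fraction):
    """Return the IDs in the top `fraction` of the score distribution."""
    n_top = max(1, round(len(scores) * fraction))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n_top]


gene_weights = {"GENE_A": 2.0, "GENE_B": 0.5}  # hypothetical effect sizes

cohort = {
    "p1": (0.2, {"GENE_A": 0.9}),   # modest PRS, strongly impaired gene
    "p2": (1.0, {}),                # high PRS, no rare-variant signal
    "p3": (-0.5, {"GENE_B": 0.4}),
    "p4": (0.8, {"GENE_A": 0.1}),
}

risk = {pid: combined_risk_score(prs, genes, gene_weights)
        for pid, (prs, genes) in cohort.items()}
high_risk = top_risk_individuals(risk, fraction=0.25)
print(high_risk)  # ['p1']
```

In this toy cohort, the individual ranked highest has only a modest polygenic score and would be missed by a ranking on common variants alone, which illustrates why adding rare-variant information can change who is flagged as high risk.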

In addition, it turned out that DeepRVAT recognized genetic correlations for numerous diseases—including various cardiovascular diseases, types of cancer, and metabolic and neurological diseases—that had not been found with existing tests. In their paper, the team stated, “DeepRVAT represents a conceptual advance by separating trait-agnostic gene impairment scoring on the one hand from gene–trait association testing on the other hand. We have demonstrated the utility of this impairment score for rapid gene–trait association testing by considering traits that were not seen by the model during training.”

Stegle’s team wants to further test the risk assessment tool in large-scale trials as quickly as possible and bring it into application. The scientists are already in contact with the organizers of INFORM, for example. The aim of this study is to use genomic data to identify individually tailored treatments for children with cancer who suffer a relapse. DeepRVAT could help to uncover the genetic basis of certain childhood cancers. “I find the potential impact of DeepRVAT on rare disease applications exciting,” said Gagneur. “One of the major challenges in rare disease research is the lack of large-scale, systematic data. Leveraging the power of AI and the half a million exomes in the UK Biobank, we have objectively identified which genetic variants most significantly impair gene function.”

The next step is to integrate DeepRVAT into the infrastructure of the German Human Genome Phenome Archive (GHGA) in order to facilitate applications in diagnostics and basic research. Another advantage of DeepRVAT is that the method requires significantly less computing power than comparable models.

DeepRVAT is available as a software package that can either be used with the pre-trained risk assessment models or trained with researchers’ own data sets for specialized purposes. “DeepRVAT is provided as a user-friendly software package that supports both de novo training of gene impairment modules and the application of pretrained ones, each with substantial improvements in computational efficiency over existing methods,” the team stated.
