Cranking out individual genomes has certainly become faster and cheaper. The cost of analyzing the resulting flood of data to extract meaningful, functionally relevant information, however, remains daunting.
As of June 2010, there were more than 900 published genome-wide associations for 165 traits. While these studies have helped define the possible genetic basis of disease, they have identified only a small fraction of heritable genomic variation. They also raise the question, scientists have noted, of whether rare variants could account for a significant fraction of unexplained heritability.
At the combined American Society of Human Genetics (ASHG) and International Congress of Human Genetics (ICHG) conference, held October 11–15, panelists agreed that, with the exception of diagnosing and providing genetic counseling on Mendelian diseases, sequence data is “not ready for prime time.” They said it is far too early to share the information with patients and perhaps even with clinicians. At least in terms of sharing information, the panel said, the gold standard remained targeted sequencing for identification of specific genetic diseases.
But somewhere between targeted sequencing and likely uninformative whole-genome sequence data lies exome sequencing, a technique that selectively targets the protein-coding DNA sequences considered most functionally relevant. Researchers are therefore working on second-generation methods for targeted sequencing of all protein-coding regions (exomes), with the aim of reducing costs while enriching for the discovery of highly pertinent variants.
The 1 Percent
The human exome, or total exon complement, amounts to 1% of the whole human genome. At a mere 30 megabases (Mb) of DNA, it is a much smaller beast to tame, especially when trying to handle large numbers of genomes to identify novel disease genes.
Also, since most diseases map to protein-coding genes or their regulatory elements and little is known about most noncoding variants, it makes sense to focus on the interpretable regions, especially given limited resources, according to some researchers.
Exome sequencing, or targeted sequencing restricted to the protein-coding subset of human genes, may provide a powerful and cost-effective new tool for dissecting the genetic basis of diseases and traits that have proved to be intractable to conventional gene discovery.
For example, at the Baylor College of Medicine Human Genome Sequencing Center in Houston, 20 Applied Biosystems SOLiD sequencing instruments, each producing about 30 gigabases (Gb) per run, crank out some 2 terabases of sequence per month, Donna Muzny, director of operations, said in an interview with Science Magazine. That is about 666 human genomes’ worth of raw sequence at 1x coverage, but one complete human genome at 30x coverage requires 90 Gb and typically consumes three machines for one 10-day run.
But, Muzny said, a 30 Mb exome ideally requires just 900 Mb to achieve comparable coverage, though in practice about 5–6 Gb are collected. That means one machine can collect five to six complete datasets, saving a lot of time and money, especially for studies requiring several hundred samples, such as the facility’s ongoing cancer and autism work. Although sequencing a complete genome may take only one week on a single machine, one can sequence more than 20 exomes in the same time.
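The coverage arithmetic behind these figures can be checked with a short Python sketch. It is only a back-of-the-envelope illustration: the 3 Gb genome size is an assumed round number (it is what the quoted 90 Gb at 30x implies), and the function names are hypothetical rather than part of any sequencing software.

```python
# Back-of-the-envelope sequencing arithmetic using the figures quoted above.
GENOME_SIZE_BP = 3_000_000_000   # assumed round figure: ~3 Gb human genome
EXOME_SIZE_BP = 30_000_000       # 30 Mb exome, as quoted
RUN_YIELD_BP = 30_000_000_000    # ~30 Gb per instrument run, as quoted


def bases_needed(target_size_bp: int, coverage: int) -> int:
    """Raw bases required to cover a target at the given mean depth."""
    return target_size_bp * coverage


def samples_per_run(run_yield_bp: int, bases_per_sample: int) -> int:
    """Number of whole samples that fit into one instrument run."""
    return run_yield_bp // bases_per_sample


if __name__ == "__main__":
    genome_30x = bases_needed(GENOME_SIZE_BP, 30)   # 90 Gb
    exome_30x = bases_needed(EXOME_SIZE_BP, 30)     # 900 Mb
    print("Genome at 30x (Gb):", genome_30x / 1e9)
    print("Exome at 30x (Mb):", exome_30x / 1e6)
    # Runs needed for one 30x genome, matching the "three machines" figure.
    print("Runs per 30x genome:", genome_30x // RUN_YIELD_BP)
    # In practice 5-6 Gb are collected per exome, giving 5-6 exomes per run.
    print("Exomes per run at 6 Gb each:", samples_per_run(RUN_YIELD_BP, 6_000_000_000))
    print("Exomes per run at 5 Gb each:", samples_per_run(RUN_YIELD_BP, 5_000_000_000))
```

The gap between the ideal 900 Mb and the 5 to 6 Gb collected in practice reflects that real runs gather several times the theoretical minimum for each exome.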
Once the need for data collection, storage, and analysis is factored in, the price of large-scale, informative whole-genome sequencing climbs steeply on the back end. With exome sequencing, though, the bioinformatic challenges are comparatively modest.
Understanding Mendelian Diseases
Exome sequencing has rapidly become one of the main tools for studying the genetic causes of Mendelian disease because academic groups with access to only one or two next-generation sequencing (NGS) systems can use this approach to study the exomes of hundreds of patients with Mendelian diseases per year. Since November 2009, exome sequencing has led to the identification of over 30 new genes in Mendelian diseases.
Exome sequencing involves an initial enrichment of the targeted DNA regions by hybridization with probes followed by NGS. Data is analyzed to pick out the functional variation and to identify novel mutations associated with rare and common disorders.
In an article published August 2009 in Nature Genetics, a team of scientists headed by Jay Shendure, M.D., Ph.D., assistant professor of genome sciences at the University of Washington, demonstrated that targeted capture and massively parallel sequencing could be a cost-effective, reproducible, and robust strategy to identify variants causing protein-coding changes in individual human genomes.
Using this approach, they sequenced 307 megabases across the exomes of 12 individuals. Freeman-Sheldon syndrome, a rare, inherited disorder, was used as a proof-of-concept to show that candidate genes for monogenic disorders can be identified by exome sequencing of a small number of unrelated, affected individuals.
Although the underlying genetic defect behind the disease was already known, the technique zeroed in on the exact gene responsible for the disease, demonstrating that it was feasible to sort out the genetic signal from more than 300 million bases of DNA.
Using the same strategy, Michael Bamshad, M.D., a professor in the department of pediatrics and adjunct professor of genome sciences at the University of Washington, published a paper in November 2009 reporting on the gene underlying the uncharacterized Mendelian disorder Miller syndrome.
For four affected individuals in three independent kindreds, Sarah Ng, the paper’s first author, and her colleagues captured and sequenced coding regions to a mean coverage of 40x and sufficient depth to call variants at about 97% of each targeted exome.
The researchers filtered the variants against public SNP databases and eight HapMap exomes, then looked for genes carrying two previously unknown variants in each of the four individuals. This identified a single candidate gene, DHODH, which encodes an enzyme required for de novo pyrimidine biosynthesis, a pathway needed for DNA and RNA synthesis.
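The logic of this filtering step can be sketched in a few lines of Python. This is a simplified illustration and not the authors' actual pipeline: the function name, the data layout, and the gene and variant identifiers are all hypothetical toy values.

```python
from collections import Counter


def candidate_genes(variants_by_individual, known_variants, min_novel=2):
    """Return genes with at least `min_novel` novel variants in every individual.

    variants_by_individual: dict mapping individual ID -> set of (gene, variant_id)
    known_variants: set of variant IDs already seen in reference sets
                    (standing in for SNP databases and control exomes)
    """
    per_individual = []
    for calls in variants_by_individual.values():
        # Step 1: discard variants that are already catalogued.
        novel = {(gene, var) for gene, var in calls if var not in known_variants}
        # Step 2: count the remaining novel variants per gene.
        counts = Counter(gene for gene, _ in novel)
        per_individual.append({g for g, n in counts.items() if n >= min_novel})
    # Step 3: keep only genes that pass in every affected individual.
    return set.intersection(*per_individual) if per_individual else set()


if __name__ == "__main__":
    # Toy data only: GENE_A, GENE_B, and v1..v7 are made-up placeholders.
    known = {"v1", "v5"}
    affected = {
        "ind1": {("GENE_A", "v1"), ("GENE_A", "v2"), ("GENE_A", "v3"),
                 ("GENE_B", "v4"), ("GENE_B", "v6")},
        "ind2": {("GENE_A", "v2"), ("GENE_A", "v7"),
                 ("GENE_B", "v4"), ("GENE_B", "v5")},
    }
    print(candidate_genes(affected, known))
```

In the toy data, GENE_B drops out because one individual has only a single novel variant in it, which mirrors how requiring the pattern in every affected individual shrinks the candidate list.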
Sanger sequencing confirmed the presence of DHODH mutations in three additional families with Miller syndrome. No similar mutations were found in 100 unaffected individuals.
The authors said they had demonstrated that exome sequencing of a small number of affected family members or affected unrelated individuals provides a powerful, efficient, and cost-effective strategy for markedly reducing the pool of candidate genes for rare monogenic disorders and may even identify the responsible gene(s) specifically.
The approach, they noted, is likely to become a standard tool for the discovery of genes underlying rare monogenic diseases and to provide important guidance for developing an analytical framework for finding rare variants influencing risk of common disease.
Expanding Its Application
While the technique is currently most applicable to monogenic diseases, scientists are committed to applying it to more common, complex conditions like cancer and Alzheimer disease.
Recognizing the need to get sequencing data into the clinical setting, the National Heart, Lung, and Blood Institute (NHLBI) in 2008 awarded $12 million for exome sequencing technology development, through its Exome Project, to the Broad Institute, Harvard Medical School, and the University of Washington. In 2010, NHLBI advanced the project into “production mode” with another $64.5 million—$25 million each to the University of Washington and Broad Institute for sequencing, and the balance for data and sample management.
“The studies we did provide a framework both technically and analytically for how to think about these things moving forward, at least a starting point,” says Dr. Shendure, who has funding from both NHGRI and NHLBI to address that issue.
“But I think it’s certainly going to be challenging as we try to move from things that are really monogenic and simple and Mendelian to things that are more complicated.”
As various kinks in the process are worked out, exome sequencing could become a key tool for bridging the gap between whole-genome sequencing (WGS) and targeted sequencing, helping clinicians get from gene sequencing to medical practice.