What if you could predict the value of additional DNA sequencing before you did it? Or how deeply you need to sequence to achieve complete coverage in your experiment? It may now be possible, say scientists from the University of Southern California (USC). Andrew Smith, Ph.D., a computational biologist at the USC Dornsife College of Letters, Arts and Sciences, and graduate student Timothy Daley have developed an algorithm that they say could help make DNA sequencing affordable enough for clinics, and could be useful to researchers in a variety of scientific fields.
Extracting information from DNA means deciding how much to sequence: sequence too little and you may not get the answers you are looking for, but sequence too much and you will waste both time and money. That expensive gamble is a big part of what keeps DNA sequencing out of the hands of clinicians. But not for long, according to Dr. Smith.
“It seems likely that some clinical applications of DNA sequencing will become routine in the next five to ten years,” Dr. Smith said. “For example, diagnostic sequencing to understand the properties of a tumor will be much more effective if the right mathematical methods are in place.”
Dr. Smith and Daley’s algorithm is an empirical Bayesian method to characterize the molecular complexity of a DNA sample for almost any sequencing application on the basis of limited preliminary sequencing. In other words, it predicts the size and composition of an unseen population based on a small sample, which the researchers say makes it broadly applicable. For example, they believe the algorithm could be used to estimate the population of HIV-positive individuals, to determine how many exoplanets exist in our galaxy based on the ones already discovered, and to estimate the diversity of antibodies in an individual.
The mathematical underpinnings of the algorithm rely on a model of sampling from ecology known as capture-recapture. In this model, individuals are captured and tagged so that a recapture of the same individual will be known, and the number of times each individual is captured can be used to make inferences about the population as a whole.
In this way scientists can estimate, for example, the number of gorillas remaining in the wild. In DNA sequencing, the individuals are the distinct genomic molecules in a sample. However, the mathematical models used for counting gorillas don’t work at the scale of DNA sequencing.
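To make the capture-recapture idea concrete, here is a minimal sketch of one classical estimator from ecology, Chao1, which uses only the number of individuals captured exactly once and exactly twice. It is an illustration of the general model and not the algorithm from the paper; the function name and the toy data are assumptions made for this example.

```python
from collections import Counter


def chao1_estimate(captures):
    """Estimate the total number of distinct individuals in a population.

    `captures` is a list of identifiers, one entry per capture event
    (for sequencing, one entry per read, labeled by its source molecule).
    This is the classical Chao1 estimator, shown for illustration only.
    """
    # How many times each individual was captured.
    captures_per_individual = Counter(captures)
    # How many individuals were captured exactly once, twice, and so on.
    count_of_counts = Counter(captures_per_individual.values())

    observed = len(captures_per_individual)
    singletons = count_of_counts.get(1, 0)
    doubletons = count_of_counts.get(2, 0)

    if doubletons > 0:
        unseen = singletons ** 2 / (2 * doubletons)
    else:
        # Bias-corrected form, used when nothing was captured exactly twice.
        unseen = singletons * (singletons - 1) / 2

    return observed + unseen


# Toy data: five distinct individuals seen across seven capture events.
sample = ["a", "a", "b", "c", "c", "d", "e"]
estimate = chao1_estimate(sample)
```

With three individuals seen once and two seen twice, the sketch adds 2.25 unseen individuals to the 5 observed. Many singletons relative to doubletons suggest that much of the population has not been sampled yet.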
“The basic model has been known for decades, but the way it has been used makes it highly unstable in most applications. We took a different approach that depends on lots of computing power and seems to work best in large-scale applications like modern DNA sequencing,” Daley said.
Scientists faced a similar problem in the early days of the human genome sequencing project. In 1988, Michael Waterman of USC provided a mathematical solution that found widespread use. Recent advances in sequencing technology, however, require thinking differently about the mathematical properties of DNA sequencing data, explain Dr. Smith and Daley.
Their paper, titled “Predicting the molecular complexity of sequencing libraries”, was published yesterday in Nature Methods.