Organized by a bioinformatician at Drexel University and hosted by Kaggle, the contest drew 118 players who generated 873 posts over three months.
Watson may have no equal at Jeopardy, but a team from the same IBM research center that developed the computer, named for iconic CEO Thomas J. Watson, only got as far as second place in a competition to develop new algorithms for predicting the progression of HIV.
Surpassing Watson was Chris Raimondi of Baltimore, a data miner who taught himself via YouTube clips after 10 years in search engine optimization and internet marketing. Raimondi won the $500 prize from among 118 players who posted 873 entries in a three-month contest to find markers that predict a change in the severity of the infection, as measured by viral load and CD4 cell counts.
The contest stands as an example of how predictive data modeling could, and should, be applied to some of the toughest problems in bioinformatics. The competition was organized by William Dampier, a bioinformatician at Drexel University, and held by Kaggle, a San Francisco-based host of predictive data-modeling competitions.
Raimondi wrote an algorithm that predicted outcomes with 78% accuracy, compared with 70% for the previous best models. Anthony Goldbloom, Kaggle’s founder and CEO, told GEN that the competition advanced knowledge of HIV progression beyond the prior state of the art. “When you open a problem to a wide audience, people have not seen that problem before, and they therefore apply techniques that have not been applied before.
“The scientific literature tends to slightly improve incrementally, where somebody tries something, and that sets the tone for what comes next. Whereas if you open up a problem as a competition,” Goldbloom pointed out, “people don’t come to the problem with the same preconception about what ought to work and therefore try off-the-wall things that sometimes end up doing very well.”
Another reason competitions work, he said, is the presence on Kaggle of a golf-like “leaderboard” listing the top 10 players and their scores as the competition progresses: “Participants drive performances to what we call the frontier of what’s possible.”
In the HIV competition, which incorporated data from about 2,000 patients, Kaggle revealed the extent of the viral load increases or decreases for the first 1,000 patients but withheld it for the other 1,000. “We gave players the genetic markers, but we withheld the information on whether or not viral load increased or decreased. Participants would train their models on the first set, where they’ve got the answers. They then applied their models to the second set, where they don’t have the answers, and their answers were compared with actual outcomes,” Goldbloom explained.
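The protocol Goldbloom describes is a standard holdout evaluation: train on the labeled half, predict on the unlabeled half, and score against the withheld outcomes. A minimal sketch in plain Python, using made-up marker data and a deliberately trivial "model" (the data, split sizes mirror the article; everything else is hypothetical):

```python
import random

random.seed(0)

# Hypothetical data: each patient is (marker_value, outcome), where
# outcome is 1 if viral load increased and 0 if it decreased.
patients = [(random.random(), random.randint(0, 1)) for _ in range(2000)]

# Split as in the contest: outcomes revealed for the first 1,000
# patients, withheld for the second 1,000.
train, test = patients[:1000], patients[1000:]

# Toy "model": always predict the majority outcome seen in training.
majority = 1 if sum(y for _, y in train) >= len(train) / 2 else 0
predictions = [majority for _ in test]

# Kaggle's side of the protocol: compare predictions against the
# withheld actual outcomes to score the entry.
actual = [y for _, y in test]
accuracy = sum(p == a for p, a in zip(predictions, actual)) / len(actual)
print(f"held-out accuracy: {accuracy:.2f}")
```

Because players never see the second set's outcomes, a model can only score well by generalizing from the markers, which is what made the 78%-versus-70% comparison meaningful.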
Potential in Biopharma
The HIV contest is, to Goldbloom’s knowledge, the only biotechnology competition his platform has hosted, and he acknowledged that Kaggle “is not being used as extensively as we might like” in the field.
One key reason for that can be found in Kaggle’s community of almost 22,000 data scientist players: just 8% have backgrounds in bioinformatics, biostatistics, and computational biology. That 8% is Kaggle’s fifth-largest group among 14 categories of players; the most common skill base, at 15.6%, is computer science, followed by statistics at 11.6% and economics at 10%.
Researchers come to Kaggle via word of mouth, and about 40% come from North America. According to Goldbloom, researchers aren’t vetted or selected: “If we were vetting, we probably wouldn’t have allowed Raimondi in, and look what we would have missed out on.”
Goldbloom said Kaggle has encountered two hurdles in trying to host life science competitions. One is the reluctance of many biopharma companies or institutions to make public data that they deem sensitive or confidential. Kaggle tries to address this by offering to host private competitions limited to players qualified by a sponsor.
“The more vexing problem that we come across is the one of, ‘Hey, I just spent all this time and money putting this dataset together. I want to be the one to capitalize on any discoveries that are made; I want to be the one that publishes the paper that uses this data,’” Goldbloom explained. The top three winners of Kaggle-hosted competitions agree to license their models to competition sponsors in exchange for the prize money.
That pushback might explain the outcome of a spot-check by GEN with spokespeople for three research institutes that maintain bioinformatics programs: the U.S. Department of Energy’s Joint Genome Institute, Cold Spring Harbor Laboratory, and The Institute for Genome Sciences at the University of Maryland School of Medicine. None of the three were familiar with Kaggle.
Kaggle generates revenue from sponsors who pay to host competitions. “If you think about where we’re most likely to make the majority of our money, it’s probably in financial services,” Goldbloom stated. That makes sense: The industry has the money to fund sophisticated number-crunching that can produce relevant predictive models.
“That’s probably where Kaggle will be most commercially successful, but we’re very much interested in doing biotech and life sciences,” he added. “Our community isn’t going to be very satisfied just solving problems for insurance companies or hedge funds all day long. They’re going to want to do meaningful work as well.”
The most lucrative award among current competitions is the Heritage Health Prize: sponsor Heritage Provider Network will award a $3 million grand prize to the player who develops a breakthrough algorithm that uses patient data to predict which patients are most at risk of an in-patient stay over the next year, the goal being to prevent unnecessary hospitalizations. The competition ends April 3, 2013.
Kaggle also enjoyed the prestige of being selected by NASA and the Royal Astronomical Society for a competition that offered an all-expenses-paid trip to the Jet Propulsion Laboratory, valued at $3,000, to the player who could develop new algorithms for measuring the distortions in galaxy images caused by dark matter.
“The society and NASA have both come back to us and said they want to do more,” Goldbloom said. “We’re in discussions with them about running a fellowship where an astronomer will spend three months with Kaggle and a Ph.D. student. Given the success of this project, I think we’ll see much more happen in astronomy than in other scientific fields.”
Kaggle, which moved from Melbourne to San Francisco earlier this year, recently completed an $11.25 million series A financing round led by Index Ventures and Khosla Ventures. Goldbloom said the proceeds would be used to scale up operations. “We want to get to the situation where we’re hosting 10,000 competitions a year.”
Kaggle also wants to grow its researcher community and its staff, which now stands at seven employees. “We would like to be roundabout, I’d say, 20 to 25 in a year’s time,” Goldbloom noted. “That will be a mix of technical, sales, data scientists, and developers.”
To reach its 10,000-competition goal, Kaggle will need not only to expand in lucrative areas like banking but also to make itself much better known in life science circles, a challenge Goldbloom recognizes. Research institutes and universities with a strong bioinformatics bent would likely help Kaggle find new players and new problems to solve.
Kaggle will also have to avoid missteps like the one in Wikipedia’s Participation Challenge, a contest to build a predictive model of the number of edits an editor would make in the five months after the training dataset’s end date, in which Kaggle failed to re-sort the training data and export it to a new file. Its ability to learn from that mistake and expand its business will determine growth not only for the company but also for its field of predictive data modeling.