GEN Exclusives

Building Biology with Machine Learning

Why Biotechnology Should Embrace the Power of Machine Learning to Bring Inductive Reasoning to Bioengineering

  • The tech world has embraced Machine Learning (ML) for its powerful intuitive capabilities: to increase click-through rates on ads, sell more books, and help you keep in touch with mom. Despite being increasingly common as a classification tool in applications ranging from transcriptomics and metabolomics to neuronal synaptic activity, ML is still almost absent in the area of bioengineering. Why is that, and what can we do to increase ML use in bioengineering?

    Machine Learning algorithms that date back half a century, including Decision Trees, Nearest Neighbors, and Neural Nets, are now commonly used for pattern-based analysis. More recently, Deep Learning (a version of a Neural Net with more layers and more nodes) received significant attention when it beat one of the world's best human players at the ancient Chinese game of Go. Deep Learning has been enabled by access to powerful new computational hardware, in particular the graphics processing units (GPUs) originally developed for the gaming industry. These gaming GPUs allow for massively parallel computations, which is perfect for ML applications. It’s comforting to know that Call of Duty brought something of value to this world. In recent years we have seen ML flourishing in a broad range of applications where there are sufficient amounts of data to digest and classify, from self-driving cars to FC Barcelona soccer strategy to deciding if you get the bank loan.

    But think instead about a common diabetes complication, diabetic retinopathy, which results in irreversible blindness if not caught early. Today more than 400 million diabetic patients are at risk, many in underserved areas with limited access to clinical diagnosis. In a recent JAMA publication, Google Research applied Deep Learning to diagnose diabetic retinopathy from photographs of patients' retinas. An initial set of 128,000 retina images was analyzed and scored by trained ophthalmologists for signs of onset of diabetic retinopathy. The images and the scores were then processed by Google’s Deep Learning software to identify patterns in the images that correlated with the clinical scoring. The resulting algorithm was subsequently validated with a separate set of ~12,000 images that the software had not seen before.

    Not only did the Deep Learning image analysis software recognize early signs of the disease just as well as the human experts, it did so much more consistently. It’s easy to see a day in the not-too-distant future when anyone with a smartphone will be able to diagnose this disease accurately and save millions of people from going blind. It will be exciting to see how fast this and similar algorithms will transform medical image-based diagnosis in the areas of radiology, pathology, and dermatology.
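    The train-then-validate workflow described above can be sketched in a few lines. This is only an illustration under loud assumptions: the two-number "images", the labels, and the nearest-centroid classifier are all hypothetical stand-ins chosen so the example runs with the Python standard library. It is not the Deep Learning method used in the JAMA study. The point is the protocol: fit on expert-labelled examples, then measure accuracy only on examples the model has never seen.

```python
import random

def make_synthetic_examples(n, seed=1):
    """Hypothetical stand-in for expert-scored images: two features per example."""
    rng = random.Random(seed)
    examples = []
    for _ in range(n):
        label = rng.choice(["healthy", "disease"])
        centre = 0.0 if label == "healthy" else 3.0
        features = [rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)]
        examples.append((features, label))
    return examples

def split_train_validation(examples, validation_fraction, seed=0):
    """Shuffle the labelled examples and hold out a fraction for validation."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]

def fit_centroids(training):
    """Learn one mean feature vector (centroid) per label from the training set."""
    sums, counts = {}, {}
    for features, label in training:
        running = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            running[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}

def predict(centroids, features):
    """Assign the label whose centroid is closest in squared Euclidean distance."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, features))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))

def accuracy(centroids, examples):
    """Fraction of examples whose predicted label matches the expert label."""
    correct = sum(predict(centroids, features) == label for features, label in examples)
    return correct / len(examples)

examples = make_synthetic_examples(1000)
training, validation = split_train_validation(examples, validation_fraction=0.1)
model = fit_centroids(training)               # the model sees only the training set
validation_accuracy = accuracy(model, validation)  # scored on unseen examples
print(f"held-out accuracy: {validation_accuracy:.2f}")
```

    Because the validation examples play no part in fitting, the held-out accuracy is an honest estimate of how the model would do on new patients, which is why the separate ~12,000-image set matters.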

    Small molecule drug discovery is another arena where ML is rapidly gaining traction. Companies ranging from GSK and Pfizer to Atomwise, Numerate, and InSilico Medicine are compiling large datasets of ligands, targets, and associated biological functions to identify and quantify the patterns of ligand-target interactions using Deep Learning. Atomwise has an undisclosed, previously approved drug candidate that blocks Ebola infection as well as another promising lead molecule to treat multiple sclerosis. Both were identified using Deep Learning to find patterns among thousands (in the case of Ebola) and millions (in the case of multiple sclerosis) of related molecules and their physicochemical properties.

    So if we understand the powerful and intuitive nature of ML, what has limited its application in bioengineering?

    Is it just too new an idea? Probably not. As early as the 1990s, thought leaders like David Haussler at UCSC and Tim Hunkapiller at Caltech were publishing papers using hidden Markov models to capture patterns in DNA and protein datasets. These patterns have subsequently propagated into Pfam and other well-established databases used to classify enzymes from protein sequences. So it’s not a new idea.

    Is it because we lack sufficiently large datasets? Maybe. Most curated sequence datasets that include quantified biological function are tiny (in the hundreds) and nonsystematic in that variables are rarely tested in more than one context. On the other hand, GenBank and WGS today encompass ~2 × 10^12 bp of naturally existing biological sequences and are growing very rapidly. This enormous dataset is, however, inherently highly correlated because of its evolutionary origin, making it difficult to separate causality from correlation and thus limiting its use for identifying sequence-function relationships. Also, only a vanishingly small part of the data is associated with quantified biological function. Despite these limitations, the GenBank and WGS datasets are extremely informative for e.g. protein engineering, as they can readily be used to tell us where not to go. Sequences, elements, or amino acid combinations that occur in biology never, or more rarely than some statistical threshold, can be assumed to not fold and to not generate new biological functions.
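    As a rough illustration of using natural sequences to "tell us where not to go", the sketch below counts how often each amino acid appears at each position in a set of aligned homologs and flags residues in a candidate design that fall below a chosen frequency threshold. The toy sequences, the function names, and the 10% threshold are all assumptions made for the example. A real analysis would use a large curated alignment and a statistically justified cutoff.

```python
from collections import Counter

def position_frequencies(aligned_sequences):
    """Return, for each alignment column, a dict of residue -> observed frequency."""
    length = len(aligned_sequences[0])
    if any(len(seq) != length for seq in aligned_sequences):
        raise ValueError("sequences must be aligned to equal length")
    frequencies = []
    for i in range(length):
        counts = Counter(seq[i] for seq in aligned_sequences)
        total = sum(counts.values())
        frequencies.append({residue: n / total for residue, n in counts.items()})
    return frequencies

def rare_positions(candidate, frequencies, threshold):
    """List (position, residue) pairs in the candidate seen less often than threshold."""
    return [
        (i, residue)
        for i, residue in enumerate(candidate)
        if frequencies[i].get(residue, 0.0) < threshold
    ]

# Toy alignment of five hypothetical homologs (not real data).
homologs = ["MKVLA", "MKILA", "MRVLA", "MKVLG", "MKVLA"]
frequencies = position_frequencies(homologs)

# Tryptophan (W) never occurs at position 2 in the alignment, so it is flagged.
flagged = rare_positions("MKWLA", frequencies, threshold=0.1)
print(flagged)  # [(2, 'W')]
```

    A filter like this does not say which designs will work. It only prunes the ones that natural sequences suggest are unlikely to fold, which is the limited but useful role the paragraph above assigns to these databases.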

    Is it because of a differing philosophy of science? That’s part of it. Machine Learning is based on inductive reasoning, i.e. pattern recognition. The system learns from making many observations and finding patterns that can be generalized to a conclusion or hypothesis. Contrary to the inductive reasoning so abundantly and so successfully used by tech companies such as Google, biotechnology has historically been a discovery-based research field led by deductive reasoning. In deductive reasoning we start from a theory and make predictions about what the corresponding observations should be if the theory is correct. Then we look for those observations. However, biology is a gooey, redundant, complex, megadimensional mess of synergy and antagonism, with an abundance of variables that just came along for the 4-billion-year ride of evolution. It quickly becomes humanly impossible to build complex hypotheses that explain biological observations in accordance with deductive reasoning. This instead is the type of data that inductive ML thrives on.

    Is it because of the cost of making specific observations? Yes and no. The medicinal chemist assessing structure-activity relationships has to independently make and characterize each molecule in the dataset at a large cost. There is thus a significant incentive for the chemist to design and test molecules as efficiently as possible, using all available tools (including ML) to ensure success. This is in stark contrast to the molecular biologist, who can make large semi-random datasets through methods like error-prone PCR or DNA shuffling at basically no cost. These gene libraries, at sizes of 10^7 to 10^9, can be screened for e.g. binding using phage display or similar high-throughput procedures. Accordingly, the cost of finding a binder is small, diminishing the perceived need for tools such as ML. However, finding a binder is still a long way from making a protein pharmaceutical.

    Biotechnology is implicitly well set up for ML applications. Contrary to medicinal chemistry and image-based diagnosis, there is a defined number of available options at each residue, and any sequence can be made and tested for function. If we can complement our historical dependence on deductive reasoning with the inductive inference from ML, and increasingly look at biology as something to be engineered instead of a discovery-based science, ML has a bright future in bioengineering.
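    One way to see why a defined set of options at each residue suits ML is one-hot encoding, a common way to turn a protein sequence into a fixed-length numeric vector that a learning algorithm can consume. The sketch below is a minimal illustration, assuming only the 20 standard amino acids. The function names are invented for this example.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes
INDEX = {residue: i for i, residue in enumerate(AMINO_ACIDS)}

def one_hot(sequence):
    """Encode a protein sequence as a flat 0/1 vector, 20 slots per position."""
    vector = [0] * (len(sequence) * len(AMINO_ACIDS))
    for position, residue in enumerate(sequence):
        if residue not in INDEX:
            raise ValueError(f"non-standard residue {residue!r} at position {position}")
        vector[position * len(AMINO_ACIDS) + INDEX[residue]] = 1
    return vector

def design_space_size(length):
    """Number of distinct sequences of this length over the 20-letter alphabet."""
    return len(AMINO_ACIDS) ** length

encoded = one_hot("MKV")
print(len(encoded), sum(encoded))  # 60 3
print(design_space_size(3))        # 8000
```

    Every sequence of a given length maps to a vector of the same size with exactly one active slot per position, so measured variants can be fed directly to standard learning algorithms. The design space still grows as 20 to the power of the length, which is why learning from a modest number of tested sequences is attractive.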

    After all, if we can see our way to a future where ML and a smartphone can diagnose anyone for diabetes-induced blindness, why not use the same methodology, perfected over click-through ads and playing Go, to make improved antibodies, better vaccines, and novel diagnostic sensors?