Computational biology holds enormous promise as a method of answering biological and biomedical questions: used in support of or even in place of laboratory research, it can produce more accurate answers at reduced cost.

The problem is that biological datasets are also enormous, and growing rapidly. As Bonnie Berger, the Simons Professor of Mathematics, explained to the audience at the Dean’s Breakfast in December 2016, the size and scope of the data are a mixed blessing. While massive datasets have the potential to yield new insights into genetic and infectious diseases, cancer, and basic biology, researchers don’t have enough computing power to unlock their full potential. Berger believes that the only way to solve the problem is to design fundamentally better algorithms, ones for which large increases in dataset size produce only small increases in processing time.

Too big to process

When the first human genome was sequenced in 2003, the power to process genomic data was roughly on par with our ability to produce it. Genomic sequencing was expensive and time-consuming, while computers were getting faster and cheaper: the amount of processing power a dollar could buy doubled every year.

But the widespread use of next-generation sequencing (NGS) tools soon disrupted this balance by making it relatively easy to sequence a lot of genomes quickly. Other tools, such as mass spectrometry, microarrays, and cryo-electron microscopy, have had the same explosive effect on other kinds of biological data.

Since 2004, the total amount of genomic data has increased ten-fold annually, and the individual datasets required by computational studies are massive and growing. Meanwhile, processing power has kept growing at its old pace, still doubling annually. Although improved algorithms have reduced runtimes somewhat, our ability to generate massive datasets still far outstrips our ability to process them.
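Taking the two rates above at face value, the mismatch compounds quickly: after n years the data have grown by a factor of 10^n while processing power has grown by only 2^n, a shortfall of

\[
\frac{10^{n}}{2^{n}} = 5^{n},
\]

which is already a more than 3,000-fold gap after five years.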

The same pattern holds true for storing and transmitting data. BGI, one of the largest sequencing centers in the world, doesn’t even bother transmitting data over the internet; it simply ships datasets on disks through the mail.

“We can’t process our data. We can’t store our data. We can’t even transmit our data,” Berger said.

Building a data “tree of life”

At the breakfast, Berger explained how she builds algorithms that make these massive datasets more manageable, using genomic data as her prime example. Her approach is distinctive because it exploits structure peculiar to biological data in order to operate directly on compressed data.

If you represent every short genetic sequence in a genomic dataset as a point and distribute those points in a vector space according to their degree of similarity, something surprising happens: the data points form a three-dimensional “tree,” much like the evolutionary trees of life used to represent phylogenetic relationships among species.

The similarity between the two trees is no coincidence. The data points form a tree because biological data bear features that are themselves products of evolution.

Biological data are highly redundant: many data points are identical or nearly so. In genomic datasets, redundancy can arise from analyzing genomes from the same species, since the genetic material of any two individuals is mostly the same. But even studies that compare the genomes of multiple species produce redundant datasets, because many genetic sequences are conserved across species, with more shared or similar sequences expected for species that sit closer together on the evolutionary tree. Redundancy is not limited to DNA sequences; it also characterizes their downstream products: RNA, proteins, and even metabolic systems and other biological processes.

Taking advantage of this high redundancy, Berger compressed biological data in a way no one had before: she grouped the data points into clusters of a fixed radius and chose a representative for each cluster. Rather than searching every data point for a query, she could run an off-the-shelf computational biology tool against the representatives alone, and then search every sequence only in the clusters that returned a hit.
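In rough outline, that coarse-to-fine search works something like the sketch below. This is a minimal illustration, not Berger’s actual implementation: the greedy clustering, the crude distance function, and every name in it are assumptions made for the example.

```python
from dataclasses import dataclass, field

def distance(a: str, b: str) -> int:
    """Crude stand-in for a real sequence-similarity measure
    (mismatches plus the difference in length)."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))

@dataclass
class Cluster:
    representative: str
    members: list = field(default_factory=list)

def build_clusters(sequences, radius):
    """Greedy clustering: each sequence joins the first cluster whose
    representative lies within `radius`; otherwise it founds a new cluster."""
    clusters = []
    for seq in sequences:
        for c in clusters:
            if distance(seq, c.representative) <= radius:
                c.members.append(seq)
                break
        else:
            clusters.append(Cluster(representative=seq, members=[seq]))
    return clusters

def search(query, clusters, radius, threshold):
    """Coarse pass over the representatives, fine pass only inside hit clusters."""
    hits = []
    for c in clusters:
        # Coarse pass: the threshold is widened by the cluster radius so that
        # no cluster containing a true match can be skipped.
        if distance(query, c.representative) <= threshold + radius:
            # Fine pass: check every member of the surviving cluster.
            hits.extend(s for s in c.members if distance(query, s) <= threshold)
    return hits
```

Widening the coarse threshold by the cluster radius is what makes the two-stage search safe: by the triangle inequality, any sequence within the query threshold of the query must sit in a cluster whose representative passes the coarse test.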

Two features inherent to biological data keep runtimes growing far more slowly than the datasets themselves: “low metric entropy” and “low fractal dimension.” Far fewer sequences exist in nature than could be generated at random, because sequences must serve useful functions to survive selective pressure. Consequently, biological data show low metric entropy: they don’t spread evenly across a vector space, which limits the number of clusters needed to cover every data point. And because of low fractal dimension, even expanding a query to neighboring clusters does not greatly increase the number of data points to examine; biological data are so self-similar that there aren’t many neighboring clusters to search.
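Put loosely in symbols (my shorthand for the scaling argument, not a formula from the talk): if the data behave like a space of fractal dimension d, then answering a query of similarity radius q against k clusters of radius r costs roughly

\[
T_{\text{search}} \;\lesssim\; k \;+\; \left(\frac{q + 2r}{q}\right)^{\!d} \cdot |\mathrm{output}|
\]

The first term is the coarse pass over the k cluster representatives; the second is the fine pass inside the clusters that hit. Low metric entropy keeps k small, and low fractal dimension keeps the blow-up factor ((q + 2r)/q)^d modest, so the work tracks the number of clusters and the size of the answer rather than the size of the full dataset.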

Beyond genomic datasets

Berger has successfully applied the same principle behind her compressive genomics tool to a range of computational biology problems, developing a suite of programs with applications in metagenomics, searches for small molecules and proteins, and compression of the space-hungry quality scores attached to the genetic sequences in a dataset.

One such tool, CORA, provides fast genomic mapping with a high degree of sensitivity. In personal genomics, the goal is to tailor disease treatments to an individual’s specific genetic makeup. To find variant genetic sequences associated with a disease, however, the genomes of thousands of individuals must be amplified, randomly chopped into trillions of short fragments of around 100 base pairs, and then mapped back to loci on a reference genome. Ideally, the reads would be mapped by finding every match, but querying all the reads in so large a dataset would be essentially impossible, so researchers settle for “best mapping,” which matches each read to a single locus at the expense of accuracy. Because datasets of genomes from the same species exhibit a high degree of redundancy, CORA proved especially effective, with runtimes six times faster than the current industry standard, mrsFAST-Ultra, and 46 to 56 times faster than other mapping tools.
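The core idea of reusing work across redundant reads can be illustrated with a toy sketch. This is not CORA’s algorithm: it handles only exact duplicates and exact matches, and the k-mer index, function names, and example data are all invented for the illustration.

```python
from collections import defaultdict

def index_reference(reference: str, k: int) -> dict:
    """Record every position at which each k-mer occurs in the reference."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def map_reads(reads: list, reference: str, k: int) -> dict:
    """Collapse duplicate reads, map each distinct read once, and report
    every matching locus rather than a single 'best' one."""
    index = index_reference(reference, k)
    counts = defaultdict(int)
    for read in reads:
        counts[read] += 1          # identical reads are mapped only once
    return {read: {"copies": n, "loci": index.get(read, [])}
            for read, n in counts.items()}

if __name__ == "__main__":
    reference = "ACGTACGTTTGACGTACGA"
    reads = ["ACGTACGT", "ACGTACGT", "TTGACGTA"]   # note the duplicate read
    print(map_reads(reads, reference, k=8))
```

Even in this toy version, the duplicate read is aligned once and its result reused, the kind of saving that grows dramatically when thousands of highly similar genomes are sequenced together.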