Big data is transforming science

Calorimeters side A inside the ATLAS cavern. © 2011 CERN

The emergence of computing in the past few decades has changed forever the pursuit of scientific exploration and discovery. Along with traditional experiment and theory, computer simulation is now an accepted “third paradigm” for science. Its value lies in exploring areas in which solutions cannot be revealed analytically and experiments are unfeasible, such as in galaxy formation and climate modelling.

The eagerness to capitalise on technology innovations has accelerated as access to high-performance computing (HPC) clusters – servers linked up to behave as one – and sophisticated software for parallel applications has become available.

Another major shift is under way, which US computer scientist Jim Gray described in 2007 as “the fourth paradigm”. He believed that the collection, analysis, and visualisation of increasingly large amounts of data would change the very nature of science.

In fact, “big data” is now altering research in every scientific discipline, from astrophysics to zoology. Instead of running computer simulations on HPC clusters and supercomputers, scientists need a different kind of computing resource to store and process these massive data sets.

With the explosion of data collection outpacing Moore’s Law, Microsoft Research is developing and applying algorithms to massive amounts of data to help understand the genetic causes of major diseases and ageing, as well as environmental issues and other areas of science.

Genes in the cloud

One good example of combining big-data research and the cloud is genome-wide association studies. Here, large amounts of data are used to identify potential links between a person’s genome and traits such as the propensity to develop a disease or specific responsiveness to a drug. In such studies, genetic variants are collected from the genomes of large populations, with and without the traits of interest. Then algorithms are used to identify the associations.
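The core statistical step in such a study can be sketched with a toy example. The code below is illustrative only (the counts and the SNP are made up, and real studies use far more sophisticated models): it tests whether one genetic variant’s allele frequencies differ between cases and controls, using a chi-square test on a 2×2 table of allele counts.

```python
import math

def allele_chi2(case_counts, control_counts):
    """Chi-square test of association for one SNP from a 2x2 table of
    allele counts: (ref, alt) in cases vs controls. Returns (chi2, p),
    with the 1-degree-of-freedom p-value computed via erfc."""
    table = [case_counts, control_counts]
    row = [sum(r) for r in table]
    col = [table[0][j] + table[1][j] for j in range(2)]
    n = sum(row)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 df.
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# Hypothetical SNP: the alternate allele is more common among cases.
chi2, p = allele_chi2(case_counts=(40, 60), control_counts=(60, 40))
```

A genome-wide scan simply repeats this kind of test for every variant, which is why both sample size and the number of variants drive the computational cost.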

The more people in a study, the better the chances are of finding weak genetic links and overcoming potentially confounding factors, like location. But there’s a problem: computer time and memory grow rapidly – polynomially – with the number of subjects. This makes large genome studies incredibly expensive.

Recently, a team at Microsoft Research developed a new algorithm that scales linearly – one-to-one – with the number of people in a study. This is a major step forward. Microsoft has installed this new algorithm – called FaST-LMM (Factored Spectrally Transformed Linear Mixed Model) – on its cloud platform, Windows Azure.
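To see why linear scaling matters, consider the growth rates involved. Standard linear mixed models cost roughly cubic time in the number of subjects; the snippet below (an illustration of the asymptotics, not figures from the FaST-LMM work itself) compares how the two approaches grow when a study expands from 1,000 to 100,000 people.

```python
def relative_cost(n_small, n_large, exponent):
    """How many times more work a study of n_large subjects needs than
    one of n_small subjects, if cost grows as n**exponent."""
    return (n_large / n_small) ** exponent

# Growing a study 100-fold, from 1,000 to 100,000 subjects:
growth_cubic = relative_cost(1_000, 100_000, 3)   # ~O(n^3) mixed model
growth_linear = relative_cost(1_000, 100_000, 1)  # linear-scaling method
```

Under these assumptions the cubic method needs a million times more work for a 100-fold larger cohort, while the linear one needs only a hundred times more.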

“When I hear ‘big data’, I think of hundreds of thousands of individuals – and the DNA for those individuals, and the intricate algorithms we need to process all that data,” says David Heckerman, Distinguished Scientist at Microsoft Research in Los Angeles. Today, the process of sequencing an individual’s genome is relatively simple, but analysing the sequenced data is arduous and complex. That is where FaST-LMM comes in.

Heckerman’s team has applied the FaST-LMM machine-learning algorithms to various data sets provided by collaborators including the Wellcome Trust in Cambridge, UK. The Wellcome Trust 1 data set contains genetic information from about 2,000 anonymous people for each of seven major medical conditions: bipolar disorder, coronary artery disease, hypertension, inflammatory bowel disease, rheumatoid arthritis, and diabetes types 1 and 2. It also contains a shared set of data for about 1,300 healthy controls.

Most analyses look at one genetic variant, called a single nucleotide polymorphism (SNP), at a time. But Heckerman and his team are searching for combinations of SNPs that make people more or less susceptible to these seven conditions. Although this technique dramatically increases the resources needed, the hope is that it will reveal novel insights for creating treatments.
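The resource explosion comes from simple combinatorics: the number of SNP pairs grows quadratically with the number of SNPs. The sketch below uses an illustrative figure of 500,000 SNPs (typical of genotyping chips, not a number from this study) and toy SNP names to show the scale of the search.

```python
import math
from itertools import combinations

# Single-SNP scan: one test per variant.
single_tests = 500_000
# Pairwise scan: one test per unordered pair of variants.
pair_tests = math.comb(single_tests, 2)   # ~1.25e11 tests

# Toy enumeration over a handful of hypothetical SNPs to show the
# search pattern a pairwise analysis must follow.
snps = ["rs1", "rs2", "rs3", "rs4"]
pairs = list(combinations(snps, 2))       # 6 candidate pairs
```

Going from half a million single-SNP tests to roughly 125 billion pairwise tests is what pushes this kind of analysis into cloud-scale computing.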

The SNPs are being stored in the cloud instead of on conventional hardware. Researchers are also using high-performance computing methods in the cloud. The significantly cheaper cloud option opens big data opportunities to a wider range of researchers and makes it easier for them to share their data with others.

“For [the Wellcome Trust] project, we would need to do about 1,000 computer-years of work,” says Heckerman. “With Windows Azure we got that work done in about 13 days.” The analysis revealed a new set of relationships between genetic variants and coronary artery disease, which are now being followed up. Says Heckerman, “With the huge amount of data that’s coming online, we’re now able to find connections between our DNA and who we are that we could never have found before.”
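A quick back-of-envelope calculation shows what those figures imply about the degree of parallelism (ignoring overheads and assuming the work parallelises cleanly):

```python
# Compressing ~1,000 computer-years of work into ~13 days of wall-clock
# time implies tens of thousands of cores running in parallel.
computer_years = 1_000
wall_clock_days = 13
implied_parallelism = computer_years * 365 / wall_clock_days  # ~28,000
```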

Easy route to global models

If understanding the human body represents a complex problem, consider the challenges presented by the planet and its ecosystems. As head of the computational ecology and environmental science group at Microsoft Research in Cambridge, UK, Drew Purves leads an ambitious research programme with the ultimate goal of predicting the future of all life on Earth. This work is critical in bridging the gap between science and effective environmental policy.

One priority is the development of analytical “pipelines” that connect large volumes of data to models and then to predictions. Climate scientists, for example, may wish to use data to produce a model to predict how climate change will alter ecosystems. Alternatively, they may want to use a model’s predictions to create data, such as how changing ecosystems will influence further climate change.

“We know, fundamentally, how to build these models,” says Purves. “But the technical barriers at the moment are so high that it’s the domain of specialists, which means that only the world’s largest organisations can afford to support that kind of data-to-prediction pipeline.”

So Purves and his team have created a new web-browser application, called Distribution Modeller, which aims to make models and big-data analysis more accessible. A researcher can load data into the system – like historical and contemporary wheat production figures for countries around the world – and then call up global surface temperatures and rainfall figures using FetchClimate, a data collection tool also developed by Purves’s team. With the touch of a button, a researcher can create a model linking wheat production to surface temperature and rainfall. It is then possible to compare the model with what happens in the real world, and make predictions about wheat harvests under different climatic conditions.
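The kind of model Distribution Modeller builds at the touch of a button can be sketched in a few lines. The code below is not the tool itself – its data and coefficients are invented for illustration – but it shows the same data-to-prediction pattern: fit a linear model linking wheat yield to temperature and rainfall, then predict yield under a new climate scenario.

```python
def fit_linear(rows, y):
    """Ordinary least squares via the normal equations (X^T X) b = X^T y,
    solved by Gaussian elimination. rows: list of (feature, ...) tuples."""
    X = [[1.0] + list(r) for r in rows]   # prepend an intercept column
    k = len(X[0])
    A = [[sum(xi[p] * xi[q] for xi in X) for q in range(k)] for p in range(k)]
    b = [sum(X[i][p] * y[i] for i in range(len(X))) for p in range(k)]
    for col in range(k):                  # elimination with partial pivoting
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    coef = [0.0] * k                      # back substitution
    for r in reversed(range(k)):
        coef[r] = (b[r] - sum(A[r][c] * coef[c] for c in range(r + 1, k))) / A[r][r]
    return coef

# Toy observations: (mean temperature C, rainfall mm) -> wheat yield t/ha,
# generated from yield = 2 + 0.5*temp + 0.01*rain so the fit is exact.
obs = [(10, 400), (12, 500), (15, 300), (18, 600), (20, 450)]
yields = [2 + 0.5 * t + 0.01 * r for t, r in obs]
intercept, b_temp, b_rain = fit_linear(obs, yields)

# Predict yield under a warmer, drier scenario: 22 C and 350 mm of rain.
pred = intercept + b_temp * 22 + b_rain * 350
```

Comparing such predictions against real-world harvests, as the article describes, is then a matter of holding back observed data and checking the model against it.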

Big data challenges

Along with the many opportunities, data-intensive science will also bring complex challenges. Many scientists are concerned that the continuing data deluge will make it difficult to find relevant data and to understand the context of shared data. The management of data also presents difficult issues. How do international, multidisciplinary and often competitive groups of researchers address challenges related to data curation, the creation and use of metadata, ontologies and semantics, and still conform to the principles of security, privacy and data integrity? What kinds of business models will emerge to support this costly research? How will government organisations, commercial corporations and a loosely connected community of researchers cope with all these issues?

One thing is certain: scientists who overcome these challenges and embrace the opportunities for cloud computing and big data will see novel and diverse opportunities for exploration and discovery.

This article is an abridged version of ‘Big data is transforming science’, which appeared in New Scientist in May 2013.