Cloud policy, big data and the long tail of science
Microsoft has been closely involved in the incubation and activity of SIENA and its preceding projects. As the EC-funded Venus-C project and our other cloud research engagement collaborations around the world approach the end of their second year, we have taken stock of the lessons learned and chosen to share our most interesting findings with the Cloudscape community.
Cloud computing and the rise of the data tsunami are creating a revolution in scientific research. The explosion of data is transforming every field of inquiry. Biologists, astronomers, anthropologists, sociologists, political scientists and geologists are all now data scientists. They are recognizing that access to data on a huge scale, together with cloud-based statistical and machine learning tools, enables them to make discoveries they could not have made before. The vast majority of these scientists have never had access to supercomputer networks and have relied mostly on desktop resources. In terms of the distribution of resources, the number of researchers and their productivity, they are the ‘long tail of science’: there are thousands of them for every traditional supercomputer user. They are also the researchers who will create and inspire the innovations for startup companies that will leverage the economies of the cloud to build the businesses we need to drive the economy.
The cloud has the potential to democratize science by providing powerful computing and data analysis to any researcher. But there are substantial challenges that must be addressed before this revolution can be fully realized. Most of these challenges are less technical than policy-related. The key policy issues we have identified are:
- The capital expenditure (CAPEX) funding model for research infrastructure. Funding agencies often have policies that require computing to be a physical acquisition that generates a fixed capital expense. Cloud computing services are consumed as needed and so don’t fit well into science grant budgets; when they do, they are taxed with overhead charges that make them extremely expensive.
- In many countries there are now requirements that all data produced by government-funded research be made available to all interested researchers in a permanent and sustainable fashion. Like all Open Data efforts, this is desirable, but how can we find financially sustainable ways to make it possible? Are there ways to provide value-added analytical and search services that make using the data easier?
- Data sovereignty is an important topic, and we need globally harmonized data protection policies, but not all data is created equal. Scientific data has value that is global in importance, and it must be able to flow across national boundaries without constraint.
The cloud is increasingly becoming the point at which all of the world’s digital instrument data and economic and social data streams converge. Once in the cloud, this data can be mined to help us avert famine and track the sources of contagion. Data analytics on these mashed-up data sources can optimize our use of carbon and make our global economies more efficient. But we must find ways to keep this capability from being used against people. We have work to do. Can our scientific research establish ways in which our data can be better protected and put under our control?
Of course, there are other, more technical challenges to making the cloud productive for scientific use. Chief among these is the fact that the cloud requires a new generation of data exploration tools that researchers can use as easily as they use a desktop spreadsheet program. Once that challenge has been met, however, many new and creative approaches to scientific discovery become possible.