Monstrous Scientific Data Management – Problem or Opportunity?

We believe that most new science happens when the data is examined in new ways. So our focus here is on data exploration, interactive data analysis and integration of datasets. (Gray et al., 2005)

The SKA telescope is projected to go live in 2020. The Square Kilometre Array (SKA) is an array of radio receptors with a combined collecting area of roughly one square kilometre, and it is forecast to produce 300 PB of data a year! This is a mammoth monster that dwarfs even CERN's data volume (LHC Computing Grid). The big question is: how on earth are they going to handle that much data?
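To get a feel for that number, here is a back-of-envelope sketch that turns 300 PB a year into a sustained data rate. It assumes decimal petabytes and a 365-day year with the telescope producing data evenly around the clock; the function names are my own and purely illustrative.

```python
# Back-of-envelope: what does a yearly data volume mean as a sustained rate?
# Assumptions: decimal units (1 PB = 10**15 bytes) and data produced evenly
# over a 365-day year. Binary petabytes (PiB) would be about 12.6% larger.

PB = 10**15                            # bytes in one decimal petabyte
GB = 10**9                             # bytes in one decimal gigabyte
TB = 10**12                            # bytes in one decimal terabyte
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds


def sustained_rate_gb_per_s(petabytes_per_year: float) -> float:
    """Average rate in GB/s needed to accumulate the given yearly volume."""
    return petabytes_per_year * PB / SECONDS_PER_YEAR / GB


def volume_per_day_tb(petabytes_per_year: float) -> float:
    """Average daily volume in TB for the given yearly volume."""
    return petabytes_per_year * PB / 365 / TB


if __name__ == "__main__":
    print(f"{sustained_rate_gb_per_s(300):.2f} GB/s sustained")
    print(f"{volume_per_day_tb(300):.0f} TB per day")
```

Under those assumptions the forecast works out to roughly 9.5 GB every second, or a little over 800 TB every day, which is far more than a single commodity machine can hold in RAM.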

I worry that the popularity of MapReduce and distributed analysis projects like SETI@home will hold sway, and that a similar setup will be the obvious choice for the SKA project. The problem I foresee is that this limits innovation and restricts analysis to bite-sized chunks of data that fit in RAM on commodity machines. High-sensitivity machines are designed to reduce noise and increase resolution, which makes it possible to pick out ever smaller needles in the haystack. For this, bite-sized analysis is eminently suitable. But as Gray et al. point out, true innovation will stem from advanced filtering, transformation and integration of data sets. The challenge now is to view, compare and merge large independent data sets to obtain new perspectives on the data.

In the biological sciences, there is precedent for using database technology to do just this. Philippi and Köhler (2006) present a database-centered architecture for the exploration of biological data sets:

[Figure: database-centered architecture for exploring biological data sets (Philippi and Köhler, 2006)]

I am beginning to believe that this is the future of scientific data management. Functional parallel computing, like MapReduce, is ideal for optimised distributed computing on bite-sized data partitions, but scientific data analysis is being hampered by the limitations of RAM and in-memory algorithms. The big question is: who is going to host this data and manage the database services? Traditionally, these services have not come cheap. Will this prove to be the limiting factor in scientific database integration?
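To make the database-centered idea a little more concrete, here is a minimal sketch of integrating two independent data sets with a declarative join instead of an in-memory algorithm. Everything in it is an assumption of mine for illustration: the `radio` and `optical` tables, their columns, the sample rows and the crude box match on sky position (which ignores the cos(declination) correction and wrap-around at 0/360 degrees). It uses Python's built-in `sqlite3` module only because it is easy to run; it is not what the SKA or Philippi and Köhler propose.

```python
import sqlite3


def build_demo_db(path: str = ":memory:") -> sqlite3.Connection:
    """Create two small, independent catalogues (hypothetical schema and data)."""
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE radio (
            source_id INTEGER PRIMARY KEY,
            ra_deg REAL, dec_deg REAL, flux_mjy REAL
        );
        CREATE TABLE optical (
            object_id INTEGER PRIMARY KEY,
            ra_deg REAL, dec_deg REAL, magnitude REAL
        );
    """)
    conn.executemany("INSERT INTO radio VALUES (?, ?, ?, ?)", [
        (1, 10.00, -30.00, 5.2),
        (2, 10.50, -30.20, 0.8),
        (3, 200.00, 45.00, 12.1),
    ])
    conn.executemany("INSERT INTO optical VALUES (?, ?, ?, ?)", [
        (101, 10.001, -30.001, 18.4),
        (102, 200.002, 44.999, 21.0),
        (103, 150.000, 2.000, 19.5),
    ])
    conn.commit()
    return conn


def cross_match(conn: sqlite3.Connection, tolerance_deg: float = 0.01):
    """Pair radio sources with optical objects inside a small position box.

    The database engine decides how to execute the join, so the caller never
    has to load either catalogue into application memory.
    """
    query = """
        SELECT r.source_id, o.object_id, r.flux_mjy, o.magnitude
        FROM radio AS r
        JOIN optical AS o
          ON o.ra_deg  BETWEEN r.ra_deg  - :tol AND r.ra_deg  + :tol
         AND o.dec_deg BETWEEN r.dec_deg - :tol AND r.dec_deg + :tol
        ORDER BY r.source_id
    """
    return conn.execute(query, {"tol": tolerance_deg}).fetchall()


if __name__ == "__main__":
    for row in cross_match(build_demo_db()):
        print(row)
```

Pointing `build_demo_db` at a file path instead of `:memory:` keeps the tables on disk, which is the point of the exercise: the query stays the same whether the catalogues hold three rows or far more than fits in RAM, and the hosting and tuning of that service is exactly the cost I am asking about.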


Gray, J., Liu, D. T., Nieto-Santisteban, M., Szalay, A., DeWitt, D. J., & Heber, G. (2005). Scientific data management in the coming decade. ACM SIGMOD Record, 34(4), 34-41.

Philippi, S., & Köhler, J. (2006). Addressing the problems with life-science databases for traditional uses and systems biology. Nature Reviews Genetics, 7(6), 482-488.

