I am really looking forward to my project this semester. It is a great opportunity to thoroughly sink my teeth into something that interests me. So what is my project, and what do I want to achieve in the process?
Data storage and management is a massive industry. People place a lot of importance on data and it has become one of the biggest commodities in the modern economy (think of Wall Street or Google and the immense economies that they drive). ‘Data’ has almost become its own being, much like a mythical fairy or dragon!
When you use the term ‘data’, people attach qualities and human-like attributes to it, to the point that ‘data’ now holds an almost figurative presence in our society. But by itself data is meaningless. Raw data is not this ephemeral beast with a life of its own. It is not multi-hued or multi-faceted. It doesn’t hide the secrets of the universe. What is truly awesome is our ability to take data and interpret it, massage it and manipulate it in ways that provide insight, add value and create knowledge. The world of data storage and manipulation is enormous. There are some incredible technologies out there to help us, and it is these technologies that I want to explore.
Over the last few decades the world of data management has been dominated by relational databases. To me, relational databases preserve the generality of data without projecting external assumptions or qualities onto the data set. Yes, relational models are built to support business goals and logic, but this is another layer that overlays the data, which is itself still held reasonably pristine. I think this is one of the strengths of the relational model: the data keeps its integrity and, if the model is designed well, is not affected by the state of the system. However, over the years we have seen various other tools rise to challenge relational databases, and perhaps some of the most interesting developments have come only in the last few years with the rise of big data.
[need to flesh out the above paragraph, find supporting (or not) articles and check / balance my opinions against the accepted research]
We are currently experiencing one of the most exciting (IMHO) revolutions in the world of data: the rise of big data! Big data, that is, truly massive sets of data, has pushed the boundaries of current data storage and manipulation technologies. And it’s not just the size of the data sets, it is the speed at which the data is being generated. How do we go about recording millions of Google search requests a day, or thousands of transactions on the stock exchange minute by minute, or route traffic across an ever-expanding internet? Data warehousing and non-relational (NoSQL: not only SQL) databases are playing a significant role in the world of high-volume data. But these technologies are still pretty new. We have yet to see them find their way into our education systems, and relational databases are still the powerhouse of enterprise data management.
What is the most effective strategy for storing our data? This isn’t intended to be a debate on SQL vs. NoSQL, and I am not interested in definitively declaring a winner in this war. I want to know: for any given set of data, which characteristics lend themselves to a relational system, and which would be better managed in a non-relational way? Are there certain types of data that are unsuited to either system? Given time, I would like to gain more practical experience with relational databases (particularly SQL Server and MySQL), and it would be great to explore (practically) a NoSQL system. But ultimately, if I can answer the questions here even without achieving anything else, I will still be incredibly satisfied.
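To make the question a little more concrete, here is a minimal sketch of the same record stored two ways. Everything in it is an assumption for illustration: the sample reading, the table names `reading` and `measurement`, and the plain dictionary standing in for a document store are all hypothetical, and SQLite is used only because it ships with Python, not because it is one of the systems I plan to evaluate.

```python
import json
import sqlite3

# Hypothetical sample record: one reading with a variable set of measured values.
reading = {
    "site": "A1",
    "taken_at": "2013-03-01T10:00:00",
    "values": {"temp_c": 18.5, "ph": 7.2},
}

# Relational style: a fixed schema, with each measured value normalised into its own row.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reading ("
    "id INTEGER PRIMARY KEY, site TEXT NOT NULL, taken_at TEXT NOT NULL)"
)
conn.execute(
    "CREATE TABLE measurement ("
    "reading_id INTEGER NOT NULL REFERENCES reading(id), "
    "name TEXT NOT NULL, value REAL NOT NULL)"
)
cur = conn.execute(
    "INSERT INTO reading (site, taken_at) VALUES (?, ?)",
    (reading["site"], reading["taken_at"]),
)
reading_id = cur.lastrowid
conn.executemany(
    "INSERT INTO measurement (reading_id, name, value) VALUES (?, ?, ?)",
    [(reading_id, name, value) for name, value in reading["values"].items()],
)

# Reassembling the record requires a join across the two tables.
relational_rows = conn.execute(
    "SELECT r.site, m.name, m.value "
    "FROM reading r JOIN measurement m ON m.reading_id = r.id "
    "ORDER BY m.name"
).fetchall()

# Document style: the whole record is stored as one JSON value under a key.
# A plain dict stands in for a real document store here.
document_store = {}
document_store["reading:1"] = json.dumps(reading)
document = json.loads(document_store["reading:1"])

print(relational_rows)
print(document)
```

The relational version enforces structure up front and needs a join to rebuild the record, while the document version accepts whatever shape the record has and leaves any checking to the application. That difference is the kind of trade-off I want to examine against real data.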
I have a feeling that the project I will be working alongside might be immensely suited to exploring this question. I suspect that the data set contains some really interesting characteristics that put it somewhere in the grey area between relational and non-relational management systems. I think that the simple fact that SQL Server and MySQL provide dedicated spatial data types and indexes indicates that geographic data does not fall firmly into a relational model. I also think that some concepts from data warehousing will apply well to time-dependent spatial data. But which is more effective? Which provides more stability and integrity? Will either provide the flexibility that is inherently required in a research environment? What are the trade-offs, and is there a compromise to be considered?
I can’t wait!