Project 702 / Project planning

Project Direction – Derailed!

Last week Matthias sent me a great email describing his challenges with the Climate Chip database and the evolution he has taken it through. It is a fantastic email, as it really helped me to establish the current “state of play” and gave me a launchpad for this project. However, when I read over the email carefully I noticed the following comment:

“The resulting table (xyz2) performs perfectly for the described processing scenario.” Matthias Otto

Oh no! All along I have been operating under the assumption that the database didn’t function efficiently and that this was an ongoing challenge. And here, Matthias clearly states that the database works “perfectly”. This has the potential to seriously derail the direction I have been following.

The Journey So Far
I have summarised this first part of the journey in a Prezi (you can see it here). The path I have followed so far has been somewhat meandering, and it is now in serious jeopardy. However, all is not lost: I believe that by going through the journey (so far) I have unveiled different qualities and characteristics of the Climate Chip dataset, summarised below.

Investigation area: NoSQL systems
Assumption: A NoSQL system will provide the flexibility that is required.
Findings: It proved difficult to get a sensible summary of the pros and cons of NoSQL versus relational databases. Public information sources are dominated by NoSQL evangelists and relational die-hards; very biased and unconstructive, so I moved away from this. It also looks like relational databases will handle this volume of data perfectly well, so I decided to explore improvements to the current solution before investigating alternatives.

Investigation area: Geodatabases
Assumption: Leveraging spatial indexing will improve performance.
Findings: Along the way I discovered geographic information systems (GIS). A GIS is built on a specialised geodatabase that supports spatial indexing and spatial relationships. However, the Climate Chip project is not utilising spatial relationships (at this stage), so it doesn’t seem sensible to take a performance hit for a feature that is not currently of interest.

Investigation area: RDBMS
Assumption: The current database is not perfect. On the back of Matthias’s email, I will focus on the current solution and assess its appropriateness.
Findings: First I normalised the data model, then evaluated the normalised model against the interaction with the webmap. When a user clicks on a grid cell, the database is queried and returns the entire 32 years’ worth of data for that grid cell. This leads me to believe that the exact (month, year) values can be replaced with a relative time index (i.e. 384 months of data => {index0: “Jan 1980”, index1: “Feb 1980”, …, index383: “Dec 2011”}). That, in turn, led me to think about storing the data in “buckets” (e.g. Bucket0 = Grid0, which contains all 384 readings); if you could query for a bucket and extract all 384 readings in one go, then surely this would be efficient. This seems a lot like a hash table to me.
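The relative time index and bucket ideas above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual schema: it assumes the series runs Jan 1980 through Dec 2011 (384 months), and an in-memory dict stands in for the database table.

```python
from datetime import date

START = date(1980, 1, 1)  # assumed start of the series
MONTHS = 384              # Jan 1980 .. Dec 2011 inclusive (assumed span)

def to_index(year: int, month: int) -> int:
    """Map an exact (year, month) to a relative time index (0..383)."""
    return (year - START.year) * 12 + (month - START.month)

def to_label(index: int) -> str:
    """Recover the human-readable label from a relative index."""
    years, month0 = divmod(START.month - 1 + index, 12)
    return date(START.year + years, month0 + 1, 1).strftime("%b %Y")

# "Buckets": one entry per grid cell, holding all 384 readings in
# index order, so a single key lookup retrieves the whole series.
buckets = {grid_id: [0.0] * MONTHS for grid_id in range(3)}

def readings_for_cell(grid_id: int) -> list[float]:
    return buckets[grid_id]  # one hash lookup, no per-row filtering

print(to_index(1980, 1))   # 0
print(to_index(2011, 12))  # 383
print(to_label(383))       # Dec 2011
```

The point of the sketch is that the (month, year) pair never needs to be stored per row: the position inside the bucket encodes it, and the label can always be reconstructed on the way out.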

And so I have come full circle: a comparison between relational and non-relational systems. However, this time I have a more specific question:

Goal / Assumption: A hash table will provide the most efficient lookup times, and the database will “manage” the file organisation.
Question: Will a hash table negate the need to rigidly structure the order of data in the table (an ordering that is difficult to maintain across updates)? And which kind of system, relational or non-relational, is better suited to supporting key-value structures?

