Project 702 / Project planning

Defining the Problem

Matthias’s passion and excitement are infectious. I came up against this passion in full flight a few weeks ago, when Matthias described the Climate Chip project to me, ran through the evolution of the Climate Chip database, and outlined the difficulties he has faced. It was great; I am incredibly lucky to have this as a project. Matthias has done an incredible amount of work on this, but there are still problems. All along I have thought that the database was the root of these difficulties; however, reading through a recent email from Matthias, I am beginning to rethink this.

The Size of the Dataset
Matthias has described some significant challenges with the data. First and foremost is its size. Observed readings are taken across the globe in half-degree (longitude by latitude) increments. There are 259,200 grid cells across the Earth, which are filtered down to 67,420 land-based cells. For each of these 67,420 cells, monthly readings have been recorded between 1980 and 2011 (inclusive), i.e. 32 years.
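The 259,200 figure follows directly from the half-degree resolution:

\frac{360}{0.5} \times \frac{180}{0.5} = 720 \times 360 = 259{,}200 \text{ cells}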

\therefore \text{Dataset} = 67{,}420 \times 13 \times 32 \approx 28 \times 10^6 \text{ rows}

(13 rows per cell per year, presumably the 12 monthly values plus an annual aggregate; 13 × 32 = 416, which matches the per-cell record count noted below.)

Statistical Processing
The observed readings include the fields {longitude, latitude, year, month} plus a set of measured temperature readings. The temperature readings are used to calculate various heat stress indices, which involves statistical calculations with high-order coefficients.
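For a sense of what these index calculations look like, the standard (ISO 7243) outdoor WBGT is a weighted combination of three temperature measurements; note this is the textbook definition, not necessarily the exact approximation the Climate Chip pipeline uses:

\text{WBGT}_{\text{outdoor}} = 0.7\,T_{nwb} + 0.2\,T_g + 0.1\,T_a

where T_{nwb} is the natural wet bulb temperature, T_g the globe temperature, and T_a the dry bulb (air) temperature.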

Final Dataset (stored in the database)
The final dataset includes: {longitude, latitude, month, year, T(min), T(max), T(mean), DewPoint(mean), WBGT(max), WBGT(mean), UTCI(max), UTCI(mean)}
Where WBGT := Wet Bulb Globe Temperature and UTCI := Universal Thermal Climate Index.
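To make the schema concrete, here is a minimal SQL sketch of the final table; the table name and column types are my assumptions, since only the field list is given above:

CREATE TABLE finalData (
    longitude     REAL    NOT NULL,  -- grid cell location, half-degree steps
    latitude      REAL    NOT NULL,
    year          INTEGER NOT NULL,  -- 1980-2011
    month         INTEGER NOT NULL,
    t_min         REAL,
    t_max         REAL,
    t_mean        REAL,
    dewpoint_mean REAL,
    wbgt_max      REAL,
    wbgt_mean     REAL,
    utci_max      REAL,
    utci_mean     REAL
);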

Querying the Data
This is where the database has undergone the biggest evolution. At the web front end a user selects a particular grid cell, and this queries the database to return the subset of data identified by that cell’s location. Initially the database was unordered and the query required a full table scan. Clearly, this took too long (many seconds). In the next iteration, Matthias placed indexes over longitude and latitude to speed up the lookup. In addition, the database is now sorted by grid location:

CREATE TABLE orderedData AS      -- rebuild the table physically sorted,
SELECT * FROM unorderedData      -- so each grid cell's rows are contiguous
ORDER BY longitude, latitude;

With this structure, the database only needs one seek (to find the first entry for the given grid cell) and then reads the next 416 records (i.e. all of the data that relates to that cell). According to Matthias, this now works perfectly: it is blazingly quick and returns in under a second.
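To spell out that lookup, here is a sketch of the index and the per-cell query; the index name and placeholder parameters are mine, not taken from Matthias’s actual setup:

-- Composite index: the (longitude, latitude) lookup becomes a single seek
CREATE INDEX idx_grid ON orderedData (longitude, latitude);

-- Fetch all 416 rows (13 rows/year × 32 years) for one grid cell
SELECT *
FROM orderedData
WHERE longitude = :lon
  AND latitude = :lat;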

All along I have been under the impression that the database was one of the key challenges. However, Matthias now has a workable solution that he says works “perfectly”. It seems the real bottleneck is translating the observed dataset into the final dataset for storage. That leaves the question: what is the “problem” this project will aim to address? I have an initial meeting with Todd this Thursday; hopefully he can give some good guidance.
