Experiments / Progress / Project 702

DataAnalysis Tool and Database Thoughts

This week I have been working on packaging up the Climate Chip code into working classes. I have decided on 3 main classes:

  • HeatStressIndicators
  • ClimateDatabase
  • DataAnalysis

The HeatStressIndicators class contains the code for the calculation of WBGT and UTCI.
The ClimateDatabase class contains the code to connect to the climate database.
The DataAnalysis class calls the methods within the other two classes and applies these to the meteorological data, before writing the results back to the database.
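The division of responsibilities between the three classes could be sketched as follows. The class names come from the post; all method names, signatures, and bodies are illustrative assumptions, not the actual project code.

```python
# Sketch of the three-class structure described above.
# Method names and signatures are hypothetical placeholders.

class HeatStressIndicators:
    """Heat stress index calculations (WBGT and UTCI)."""

    def wbgt(self, row):
        ...  # placeholder for the WBGT calculation

    def utci(self, row):
        ...  # placeholder for the UTCI calculation


class ClimateDatabase:
    """Connection to the climate database."""

    def read_met_data(self, year):
        ...  # placeholder: fetch meteorological data for a year

    def write_results(self, results):
        ...  # placeholder: write calculated indices back


class DataAnalysis:
    """Applies the indicator methods to met data and writes results."""

    def __init__(self):
        self.db = ClimateDatabase()
        self.indicators = HeatStressIndicators()

    def run(self, year):
        met = self.db.read_met_data(year) or []
        results = [self.indicators.wbgt(row) for row in met]
        self.db.write_results(results)
```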

To test this, I had the DataAnalysis class read in the meteorological data for 2011, calculate WBGT and UTCI, and write the results to a new database. This process was run 10 times, with an average run time of 244 seconds (a little over 4 minutes). The running time was dominated by the WBGT calculation and the database write, as shown below:

Fig 1: Running Time Analysis of DataAnalysis Class

There are a number of areas to explore from here:

  • Comparison of iterative WBGT method to approximate WBGT method
  • Explore alternative methods for writing to the database. The UTCI calculation is applied to the full array of inputs at once (rather than iterating over the inputs and calling the calculation for each one), which is very fast. I wasn’t able to do the same when writing to the database, so I resorted to iterating over the output array. I will need to explore better ways to write out.
  • Move the database writing function to the ClimateDatabase class
  • Create a fourth class to read in the experimental data
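On the database-write point above: one common alternative to looping over the output array with single-row INSERTs is to hand the whole batch to the driver in one call. A minimal sketch using Python's built-in `sqlite3` module (the actual climate database engine and schema are assumptions here, and other drivers offer equivalents such as `executemany` or bulk-copy commands):

```python
import sqlite3

# Illustrative in-memory database with an assumed results schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (station TEXT, wbgt REAL, utci REAL)")

rows = [("A1", 28.4, 30.1), ("A2", 27.9, 29.5), ("A3", 29.2, 31.0)]

# Slow pattern: one INSERT statement per Python-level loop iteration.
# for row in rows:
#     conn.execute("INSERT INTO results VALUES (?, ?, ?)", row)

# Faster pattern: a single executemany call lets the driver reuse
# the prepared statement across the whole batch.
conn.executemany("INSERT INTO results VALUES (?, ?, ?)", rows)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM results").fetchone()[0]
print(count)  # 3
```

Whether this closes the gap with the vectorised UTCI calculation would need measuring, but it removes the per-row statement overhead from the Python loop.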

Some of these could yield significant improvements; however, the system is now working quite nicely, so these changes would be improvements rather than strictly necessary. More importantly, I need to begin to address the database by exploring:

  • Creation and maintenance of indexes on the table
  • Effects of table clustering on performance
  • Efficient methods for adding / deleting columns from the table
  • Efficiency of adding new rows to the table (which will require indexes to be updated and table clustering to be rerun)
  • Performance of updating existing data in the database. I imagine that there will be a threshold beyond which it is more efficient to drop the database (or large chunks of it) and rebuild it completely, versus updating the existing data
