Overnight, I dropped the climate chip database and repopulated the entire 30 years of data from the raw source files. This is the first experiment in the suite for benchmarking database performance. The results, observations and future thoughts are below.
Step 1: Extraction of observed meteorological data. I do not have access to the full 30 years of met data, so this step reverse-engineers the source files from the database: the raw observed data (timestamp, latitude, longitude, Ta) was extracted from the existing database and written to file.
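The extraction can be sketched as a single query dumped to CSV. This is a minimal sketch, not the actual script: it uses sqlite3 as a stand-in for the real DBMS driver, and the table/column names (`observations`, `ta`) are hypothetical.

```python
import csv
import sqlite3

def extract_observations(db_path, out_path):
    """Dump raw observed rows (timestamp, latitude, longitude, Ta) to a CSV file.
    Table and column names are illustrative placeholders."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT timestamp, latitude, longitude, ta FROM observations "
            "ORDER BY timestamp"
        )
        with open(out_path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["timestamp", "latitude", "longitude", "Ta"])
            writer.writerows(rows)
    finally:
        conn.close()
```

With a real MySQL connection the only change would be the driver and connection call; the DB-API cursor interface is the same.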
Step 2: A new database, climate30Years, was created with a single table, cru30Years. The DDL for cru30Years is shown below:
CREATE TABLE cru30Years (
    latitude  float NOT NULL,
    longitude float NOT NULL,
    year      int   NOT NULL,
    month     int   NOT NULL,
    tmax    float,
    tavg    float,
    tdew    float,
    wbgtMax float,
    wbgtMid float,
    wbgtAvg float,
    utciMax float,
    utciMid float,
    utciAvg float
);
As the first database experiment, there are no indexes on the data. This is raw data storage, and will act as the control for comparing indexes, normalisation, table clustering, the addition of new fields, and the updating of existing data.
Step 3: Meteorological data was extracted from the source files, the heat stress indicators were calculated, and the results were written to the database. This step was repeated three times.
The total running time for Step 3 was approximately 4.25 hours. The WBGT calculation and the write-out to the database took 99% of this time; the read-in of the experimental data and the UTCI calculation both ran in seconds.
Table Row Structure
The database write procedure inserts the new rows in a way that mimics the existing table (i.e. ordered by year and then month). If we were using Oracle as our DBMS, this would result in the rows for each year being written sequentially on disk, and therefore physically close together (though I am not convinced the DBMS would be aware of this, or able to take advantage of it for future queries). I am not sure whether MySQL writes rows sequentially on a page like this; I need to explore this.
The 2011 and 2010 WBGT running times were ~4 minutes each, but these were outliers; all the other years consistently ran in ~8 minutes. From the point of view of a repeatable, predictable process, this is an excellent outcome for the project, with reasonably well-defined upper and lower bounds for the WBGT calculation.
However, the WBGT is the greatest bottleneck in the process. Potential performance improvements to explore:
- Test the approximation method for WBGT
- Multi-thread the WBGT calculation. There are 3 separate WBGT calculations for each year, run over 3 separate arrays. It should be possible to multi-thread these and run them in parallel.
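The three independent per-year passes could be dispatched concurrently with `concurrent.futures`. This is a sketch only: the real WBGT routine is not shown here, so the worker uses a simplified WBGT approximation (the ABOM-style formula from air temperature and relative humidity) as a stand-in, and the input shape (lists of (Ta, RH) pairs) is an assumption.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def wbgt_approx(ta, rh):
    """Simplified ABOM-style WBGT approximation from air temperature (degC)
    and relative humidity (%) -- a stand-in for the full calculation."""
    e = (rh / 100.0) * 6.105 * math.exp(17.27 * ta / (237.7 + ta))  # vapour pressure, hPa
    return 0.567 * ta + 0.393 * e + 3.94

def wbgt_series(pairs):
    # One of the three per-year passes: (Ta, RH) pairs -> WBGT values.
    return [wbgt_approx(ta, rh) for ta, rh in pairs]

def wbgt_three_passes(max_pairs, mid_pairs, avg_pairs):
    """Run the three independent passes (max, mid, avg) concurrently."""
    # Caveat: CPU-bound pure-Python work is serialised by the GIL, so threads
    # only help if the real calculation releases it (e.g. NumPy internals);
    # otherwise swap in ProcessPoolExecutor.
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = [pool.submit(wbgt_series, p)
                   for p in (max_pairs, mid_pairs, avg_pairs)]
        return [f.result() for f in futures]
```

This would also cover the first bullet: benchmarking the approximation against the full WBGT calculation is just a matter of swapping the worker function.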
Database Write-out Bottleneck
For all 30 years, it took ~70 seconds to write the full 809,420 rows to the database. Again, from the perspective of a repeatable, predictable running time, this is a great result. However, the write-out still uses the simple row-by-row iterative method, so there must be a faster way. Potential improvements to explore:
- Find a method to write a full array to the database (without explicitly iterating)
- I do not think the database write-out can be multi-threaded; there may be issues with table locking.
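On the first bullet: the DB-API already offers `executemany`, which hands a whole batch of rows to the driver in one call instead of one `INSERT` per iteration. A minimal sketch, again using sqlite3 as a stand-in for the real driver, with the table mirroring `cru30Years`:

```python
import sqlite3

DDL = """CREATE TABLE IF NOT EXISTS cru30Years (
    latitude REAL NOT NULL, longitude REAL NOT NULL,
    year INTEGER NOT NULL, month INTEGER NOT NULL,
    tmax REAL, tavg REAL, tdew REAL,
    wbgtMax REAL, wbgtMid REAL, wbgtAvg REAL,
    utciMax REAL, utciMid REAL, utciAvg REAL)"""

def write_rows(conn, rows, chunk=10_000):
    """Insert rows in large chunks within a single transaction,
    instead of issuing (and committing) one INSERT per row."""
    sql = "INSERT INTO cru30Years VALUES (" + ",".join(["?"] * 13) + ")"
    cur = conn.cursor()
    for i in range(0, len(rows), chunk):
        cur.executemany(sql, rows[i:i + chunk])
    conn.commit()
```

MySQL drivers expose the same `executemany` interface (with `%s` placeholders), and for a full rebuild, `LOAD DATA INFILE` from a staging file is typically faster still; both are worth benchmarking against the iterative baseline.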
A total rebuild of the database (a simple table with no indexes, etc.) can be achieved in under 5 hours. This has been shown to be highly repeatable (and therefore predictable), so it serves as the upper bound on the running time. Potential areas for improvement have been discussed and targeted for future experiments.