Accurately establishing requirements is critical to success, but that doesn’t mean it is always an easy process. In our case it is made slightly more challenging because Matthias is on the other side of the world. (Is this the real world of I.T. with remote teams?) Initially, I misunderstood the situation: I was under the impression that there were still unresolved challenges with the database, and I planned to address this by finding a database solution that maximised query performance. However, in terms of the requests from the webmap and Hothaps, the database already performs incredibly well. That isn’t to say the database is “perfect”, though.
While the database works incredibly well for the current situation, there are ongoing challenges arising from the forced table structure. At this stage the table is not really “relational”: it is entirely denormalised, and artificially clustered on the (longitude, latitude) key (which I will discuss in a future post). The result is exceptional response times for mapping requests, but management of the data is difficult, especially for large updates or when the statistical model is changed. Because research is so changeable, this lack of flexibility is an ongoing problem. Todd, Matthias and I have discussed this, and we will use this project to investigate ways of improving the manageability of the data analysis / data storage requirements, focusing specifically on:
The database is populated by a combination of experimental observations and heat stress indicators calculated from those observations. The database contains approximately 28 million rows, which is not massive, but nor is it insignificant. Like most research, the needs are constantly changing as models are refined, new hypotheses are tested, and so on. Running large updates over the dataset is currently unreliable and may take hours or days. One of the critical challenges for PRJ702 will be to provide a standard class of data analysis tools and to benchmark these to provide a reliable upper bound on their running time.
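To illustrate the kind of benchmarking we have in mind, the sketch below times an analysis function over several repetitions and reports the worst observed time as a pragmatic upper bound. The function `recalculate_indicator` and its input are hypothetical stand-ins for the real toolset, not code from the project.

```python
import statistics
import time

def recalculate_indicator(rows):
    # Hypothetical stand-in for recalculating a heat stress indicator
    # over a batch of observation values.
    return [r * 1.08 for r in rows]

def benchmark(func, data, repeats=5):
    """Run func(data) several times; return (mean, worst-case) seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(data)
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings), max(timings)

if __name__ == "__main__":
    sample = list(range(100_000))  # small sample batch, not 28M rows
    mean_s, worst_s = benchmark(recalculate_indicator, sample)
    print(f"mean: {mean_s:.4f}s, observed upper bound: {worst_s:.4f}s")
```

In practice each module of the analysis class would be benchmarked this way against realistically sized batches, so the worst-case figure can stand in as the “reliable upper bound” for planning large updates.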
Integration of data analysis and data storage
Ideally, the process of data analysis and data storage will be as self-managed and automated as possible. PRJ702 will investigate methods to integrate the data analysis and data storage tools. Our initial hypothesis is that this can be achieved by interfacing with the database from a popular programming language via an appropriate API (e.g. connecting to MySQL from Python with the MySQLdb API). If there is time, we will aim to compare the effectiveness of this approach across relational and NoSQL data storage techniques (hopefully satisfying this itch I have to investigate NoSQL databases!).
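As a rough sketch of what that interface might look like: MySQLdb implements the Python DB-API 2.0, so the pattern below is shown with the standard library’s `sqlite3` module (which implements the same interface) purely to keep the example self-contained. The table and column names are hypothetical; a real deployment would swap in `MySQLdb.connect(...)` and `%s` placeholders.

```python
# Sketch of the proposed Python <-> database integration. sqlite3 is a
# stand-in for MySQLdb here; both follow the DB-API 2.0 interface.
# Table and column names are hypothetical.
import sqlite3  # a real deployment would use: import MySQLdb

def load_observations(conn, rows):
    """Bulk-insert observation rows using parameterised queries."""
    with conn:  # commit on success, roll back on error
        conn.executemany(
            "INSERT INTO observations (longitude, latitude, temperature) "
            "VALUES (?, ?, ?)",  # MySQLdb uses %s placeholders instead of ?
            rows,
        )

def hottest_cells(conn, limit=5):
    """Return the (longitude, latitude) cells with the highest temperatures."""
    cur = conn.execute(
        "SELECT longitude, latitude, temperature FROM observations "
        "ORDER BY temperature DESC LIMIT ?",
        (limit,),
    )
    return cur.fetchall()

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE observations "
        "(longitude REAL, latitude REAL, temperature REAL)"
    )
    load_observations(conn, [(174.8, -41.3, 18.2), (151.2, -33.9, 25.6)])
    print(hottest_cells(conn, limit=1))
```

Because the DB-API is shared across drivers, analysis code written this way stays largely storage-agnostic, which is exactly the flexibility the relational-vs-NoSQL comparison would need.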
| Milestone | Description | Date |
|---|---|---|
| 1 | Requirements established, “problem” accurately identified, 2012 data loaded into database | August 23rd, 2013 |
| 2 | Programming language chosen for data analysis and relevant APIs chosen for database interaction; database connection tested and test data loaded | September 6th, 2013 |
| 3 | Data analysis class and relevant methods designed | September 26th, 2013 |
| 4 | Data analysis class tested and experimental data loaded into database | October 11th, 2013 |
| 5 | Data analysis class benchmarked and reliable upper bound established for each module | November 1st, 2013 |