For a few weeks now I have had a good idea of what my project will be centered on. In this sense, I have been lucky, because it has given me some time to explore the area quite widely and just sit with it, let it bubble away in the back of my mind. But now we are here, at the start of Semester 2, and it is time to put some thoughts in writing and get a firm idea of what direction I intend to take.
Data Management Strategies for Effective Storage and Access of Global Climate Data
Matthias and Ryan are working on a project to map global climate data. They collect weather information from approx. 70,000 points across the earth and display the trends via a webmap (http://climatechip.org). The project is live and meets the end-needs: to produce a “series of reports and other interactive resources on the health effects of climate change” (from the website). However, from a research perspective the data is somewhat unwieldy: it is large and cumbersome, and responding to changing requirements can be a slow process.
Questions to consider:
1. How is the data currently organised / managed?
- choice of db technology
- use of normalisation vs. denormalisation
- use of indexes, keys and referential integrity
- performance – what are the problems (because there are some), and do any of the choices above have a direct influence on them?
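To make question 1 concrete, here is a minimal sketch of inspecting keys and indexes. It uses an in-memory SQLite database as a stand-in for the project's MySQL instance, and the table and column names (observations, station_id, obs_date, temp_c) are hypothetical:

```python
import sqlite3

# In-memory SQLite stands in for MySQL; schema names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE observations (
        station_id INTEGER,
        obs_date   TEXT,
        temp_c     REAL,
        PRIMARY KEY (station_id, obs_date)  -- one reading per station per day
    )
""")
conn.executemany(
    "INSERT INTO observations VALUES (?, ?, ?)",
    [(s, f"2014-01-{d:02d}", 20.0 + s) for s in range(100) for d in range(1, 29)],
)

# The composite primary key does not help date-only queries (obs_date is
# not its leftmost column), so add a secondary index for date scans.
conn.execute("CREATE INDEX idx_obs_date ON observations (obs_date)")

# EXPLAIN QUERY PLAN shows whether the index is actually used.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT AVG(temp_c) FROM observations WHERE obs_date = '2014-01-15'"
).fetchall()
print(plan)
```

MySQL's `EXPLAIN` plays the same role as SQLite's `EXPLAIN QUERY PLAN` here: it is the quickest way to check whether a slow query is doing a full table scan.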
2. Performance benchmarking (with the current technology, i.e. a relational DB)
- Establish reliable upper bounds for query performance
- Hypotheses on the effect of the considerations from (1)
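A simple way to establish those upper bounds is to time a query before and after a change. The sketch below does this with Python's `time.perf_counter` against an in-memory SQLite table (a stand-in for the real MySQL database; the `readings` schema is hypothetical):

```python
import sqlite3
import time

# SQLite in-memory stands in for MySQL; the schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (station_id INTEGER, temp_c REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [(i % 1000, (i % 50) / 2.0) for i in range(200_000)],
)

def bench(sql, repeats=5):
    """Run a query several times and report the best wall-clock time."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        conn.execute(sql).fetchall()
        timings.append(time.perf_counter() - start)
    return min(timings)

# Full-scan baseline: an upper bound before any tuning.
baseline = bench("SELECT AVG(temp_c) FROM readings WHERE station_id = 42")

conn.execute("CREATE INDEX idx_station ON readings (station_id)")
indexed = bench("SELECT AVG(temp_c) FROM readings WHERE station_id = 42")

print(f"full scan: {baseline:.4f}s, indexed: {indexed:.4f}s")
```

Taking the minimum over several repeats filters out caching and scheduler noise; the baseline number is the "reliable upper bound" that later hypotheses can be tested against.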
3. Geospatial data
This project is quite specialised, but it isn’t unique. It falls into the area of geodata – the management, analysis and sharing of geospatial data. Investigate GIS industry-standard approaches to data storage.
- What technologies are in common use?
- What is it about geospatial data that is unique / lends itself to these technologies and approaches?
- Can we implement this within MySQL (current RDBMS), or should we compare performance and usability with another solution?
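One thing that makes geospatial data special is the shape of its queries: "all points within a radius" has no natural single-column ordering, which is what spatial indexes (R-trees) and dedicated spatial datatypes exist to solve. A minimal sketch of such a query, done naively without any spatial index (station coordinates are made up for illustration):

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(a))

# Hypothetical station records: (id, lat, lon).
stations = [(1, 60.17, 24.94), (2, 59.33, 18.07), (3, -33.87, 151.21)]

def within_radius(lat, lon, radius_km):
    """Naive point-radius query: scans every row and computes a distance.
    A spatial index would prune most candidates before this step, which
    is exactly what dedicated spatial datatypes buy you."""
    return [sid for sid, slat, slon in stations
            if haversine_km(lat, lon, slat, slon) <= radius_km]

print(within_radius(60.17, 24.94, 500))  # stations near Helsinki
```

With 70,000 stations a full scan like this on every query is exactly the kind of cost that motivates comparing MySQL's spatial support against a dedicated solution.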
4. This project also borders on the world of Big Data
- are there concepts within the Big Data world that we can leverage?
- what are the improvements / limitations of adopting a big data mindset?
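One Big Data concept that might transfer is horizontal partitioning (sharding): route each record to a partition by hashing its key, so partitions can be scanned or scaled independently. A minimal in-memory sketch (the record layout is hypothetical):

```python
# Hash-partitioning sketch: records are routed to a shard by key, so a
# single-station query only touches one shard. Layout is hypothetical.
NUM_SHARDS = 4

def shard_for(station_id: int) -> int:
    """Deterministically map a station to one of NUM_SHARDS partitions."""
    return station_id % NUM_SHARDS

shards = {i: [] for i in range(NUM_SHARDS)}
for station_id in range(20):
    shards[shard_for(station_id)].append((station_id, 21.5))

# A query for one station only scans one shard, not the whole dataset.
target = 13
records = [r for r in shards[shard_for(target)] if r[0] == target]
print(records)
```

The limitation is visible even in this toy: queries that cut across the partition key (e.g. "all stations on one date" when sharded by station) must touch every shard, so the choice of key matters enormously.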
5. Flexibility & usability
I think this project is really interesting. The great thing (from my selfish point of view) is that the data management issues are relatively unexplored. In the end, data management falls way down the priority list compared to the web mapping. Therefore, it is important that any outcomes from this are focused on ease of use, flexibility and maintainability. A highly tuned system that takes a lot of effort to maintain is not going to be useful – ideally we will find a “low-effort” solution that is easy to keep running.
Why is it interesting to me?
So many things – it is right up my alley. Through this process I should be able to get hands-on experience and explore the world of data management. I am really looking forward to exploring relational and non-relational technologies and learning to recognise the qualities and characteristics of data that lend themselves to particular storage solutions. Ultimately, my feel and appreciation for data will grow, and I hope to be able to identify useful technologies for a range of data needs.
Three things I know about it
- It is a reasonably large dataset (approx. 3 million records) and it is the size that is proving the biggest challenge at the moment
- Relational databases are king, but a growing need for flexible, high-throughput solutions has seen the rise of NoSQL systems. When do you use one over the other, and can you combine them for the best of both worlds?
- GIS is a well-established field, and RDBMS providers offer varying support for spatial data types
Three things I believe about it
- That with geodata, the line between relational and non-relational datasets is blurred (just consider the difference between vector data and raster data for a start)
- That SQL Server and MySQL offer support for spatial datatypes, but that this is still possibly a limited option (see the growing popularity of PostgreSQL and PostGIS in the open-source GIS industry)
- That the changeable nature of research will be the largest challenge to address in regards to data storage.
Three things I don’t know about it
- Why the need for dedicated spatial datatypes? What do these allow that can’t be otherwise achieved?
- How will the data storage be dictated by the web-mapping front-end?
- Will there be a significant performance improvement through the use of a geodatabase, or is it simply a better fit for the problem domain?
My gut feeling is that normalisation may not yield significant improvements. I think that the data is quite “flat”, i.e. I think the data describes a single entity, not a group of entities. Therefore, I think the real benefits will be in improving indexing and partitioning of the data so that we can access it faster. This is going to be very interesting!
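If the data really is "flat", partitioning by time is one of the more promising levers. MySQL supports declarative `RANGE` partitioning; SQLite does not, so the sketch below emulates the idea with one table per year (all table and column names are hypothetical):

```python
import sqlite3

# Emulated range partitioning: one table per year stands in for MySQL's
# PARTITION BY RANGE. Schema names are hypothetical.
conn = sqlite3.connect(":memory:")
for year in (2012, 2013, 2014):
    conn.execute(f"CREATE TABLE obs_{year} (obs_date TEXT, temp_c REAL)")
    conn.executemany(
        f"INSERT INTO obs_{year} VALUES (?, ?)",
        [(f"{year}-06-{d:02d}", 15.0 + d) for d in range(1, 11)],
    )

def count_for_year(year):
    """Partition pruning by hand: a query scoped to one year only ever
    scans that year's table, no matter how big the other years grow."""
    return conn.execute(f"SELECT COUNT(*) FROM obs_{year}").fetchone()[0]

print(count_for_year(2013))
```

In real MySQL the optimiser does this pruning automatically when the `WHERE` clause constrains the partitioning column, which fits the "low-effort, easy to maintain" goal above: the schema does the work, not the application.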