For somebody who wants to become a DBA and work with relational databases every day, I am about to walk into dangerously radical anti-relational territory. Here is the pivot: the more I begin to appreciate the data needs of the Climate Project, the less and less enthusiastic I am about using a relational database for a non-relational data set. The thoughts below are my own, and are not intended to be a comprehensive overview of relational databases, but are important thoughts as I decide where to move forwards with the Climate Project.
First and most importantly, I believe relational models impose structure. A relational model is based on first-order predicate logic and can be represented by a finite set of properties and operators. I will extend on this and say that data and its relationships must be agreeable with mathematical logic in order to fit a relational model.
What does this mean for data storage and data access? Relational models store data in very rigid schema. Context is provided through the use of relationships and joins but, the data is held in a very generic manner (contrast this to data warehousing, where data is “pre-joined” and held in a contextual fashion). This provides highly flexible query structures, as long as the query is compatible with the database schema. It also provides a declarative system, where the user can express what information they want, and leave it up to the DBMS to figure out how to retrieve it. Couple this with the security and integrity features of a good RDBMS, and you have a powerful enterprise-capable database that will manage your business data storage / access needs reliably and efficiently.
Here’s the problem with the climate data set; we don’t need any of the above.
The climate data set is mostly static. Data access is mostly made up of reads and the queries are well defined (i.e. we can predict what the queries will be with a high level of certainty). There are few write operations and when we do need to write in data, there is a very good chance that we need to modify the existing schema. So what do we need?
- Quick and convenient INSERT of experimental data from file
- flexible SCHEMA
- quick KEY : VALUE lookups
- facility for AGGREGATE queries (??? I need to think more about this and whether we really do need this)
- seamless INTEGRATION of: experimental data extraction, heat stress calculations, data storage and data access (via website)
At a glance, this list looks like an advertising campaign for a NoSQL database. It would reduce the complexity of the problem a lot, if we could achieve all of this with a single tool. With Client-APIs to NoSQL databases we could use one language and one tool throughout. I wonder too if a simpler, less complex system will allow for higher-volume data processing. Currently, we receive yearly climate data reports. Could we handle monthly, daily, or hourly volumes?