Recently I was reading about the long-term bias of predictive models when used to drive decision making, and decided to blog about it very briefly. As I look at the different approaches taken by scientists and statisticians compared to business-centred analysts, I realise there is more to this story…
So just very briefly – our models should describe our data.
Empirical models: describe the data we have observed. Great for finding patterns, trends and hidden structure. Useful for prediction, provided our observed data encapsulates the behaviour and range of the data we expect in the future.
Mechanistic models: more akin to the models we are used to seeing in the physical sciences. They are designed to describe a behaviour, and it is likely (or even expected) that this behaviour is stable and will continue to hold beyond the range of observation, i.e. extrapolation is reasonable. The contrast is shown in the sketch below.
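Here is a minimal sketch of that contrast in Python. The data, parameter values and model choices are all hypothetical, invented for illustration: synthetic observations are drawn from a known exponential decay, an empirical cubic polynomial and a mechanistic exponential form are both fitted to them, and both are then asked to extrapolate well beyond the observed range.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical data: an exponential decay (the "mechanism"),
# observed with noise only over t in [0, 5].
rng = np.random.default_rng(42)
t_obs = np.linspace(0, 5, 30)
y_obs = 100 * np.exp(-0.3 * t_obs) + rng.normal(0, 2, t_obs.size)

# Empirical model: a cubic polynomial fitted to the observed data.
# It describes the data well but knows nothing about the mechanism.
poly = np.polynomial.Polynomial.fit(t_obs, y_obs, deg=3)

# Mechanistic model: we assert the exponential form up front
# and fit only its parameters.
def decay(t, a, k):
    return a * np.exp(-k * t)

params, _ = curve_fit(decay, t_obs, y_obs, p0=(100.0, 0.1))

# Both fit comparably within the observed range, but extrapolate
# very differently far beyond it.
t_new = 15.0
print("truth:      ", 100 * np.exp(-0.3 * t_new))
print("empirical:  ", poly(t_new))
print("mechanistic:", decay(t_new, *params))
```

At t = 15 the mechanistic fit stays close to the true value (about 1.1), while the polynomial, having no reason to stay bounded outside the data, wanders off badly. That is the long-term bias in miniature: an empirical model used for decision making quietly breaks the moment the decisions push it beyond the data it was fitted on.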
And of course, all models are inferential – we make certain assumptions before we model and in picking our models.
With the proliferation of impressive machine learning tools (BigML, Azure Machine Learning, Amazon Machine Learning, etc.) I am seeing an increasing number of very suspect examples in their documentation. To be fair, the documentation focuses on the workflow and on these tools' ability to implement machine learning rather than on the robustness of the models. But you would think some thought would go into the accuracy of the models.
I think these tools still need time to grow before they are truly useful for data science. Primarily, if we are to use them, we will want to perform all of our data analysis in one environment: data cleaning, transformation, exploratory analysis and so forth, right down to modelling, prediction and production. These tools will need to integrate powerful visualisation natively as they mature. And of course, we will all need to be educated in how to produce reliable, robust and ultimately profitable models if they are to be successful.