Predictive Modeling – Assumptions

Recently I was reading, and then blogged very briefly, about the long-term bias of predictive models when they are used to drive decision making. As I look at the different approaches taken by scientists and statisticians compared with business-centered analysts, I realise there is more to this story…

So just very briefly – our models should describe our data.

empirical models: describe the data we have observed. Great for finding patterns, trends, hidden structure. Useful for prediction, if our observed data encapsulates the expected future behaviour / range of future data.

mechanistic models: more akin to the models we are used to seeing in the physical sciences. They are designed to describe a behaviour, and it is likely (or even expected) that this behaviour is stable and will continue to hold beyond the range of observation, i.e. extrapolation is more likely to be justified.
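
To make the difference concrete, here is a minimal sketch using made-up, noise-free data. Everything in it is an assumption for illustration: the decay process, its rate of 0.5, the observed range of 0 to 2, and the helper names `fit_line`, `empirical` and `mechanistic`. Both models are fitted to the same observations; only the assumed functional form differs.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Synthetic, noise-free observations of a decay process on 0 <= x <= 2.
# The rate 0.5 is an arbitrary choice for illustration.
xs = [i * 0.25 for i in range(9)]
ys = [math.exp(-0.5 * x) for x in xs]

# Empirical model: a straight line through the observed points.
a, b = fit_line(xs, ys)

def empirical(x):
    return a + b * x

# Mechanistic model: assume y = y0 * exp(-k * x) and fit it on the log scale.
log_y0, neg_k = fit_line(xs, [math.log(y) for y in ys])

def mechanistic(x):
    return math.exp(log_y0 + neg_k * x)

# Compare the two inside (x = 1) and far outside (x = 10) the observed range.
for x in (1.0, 10.0):
    print(f"x={x:5.1f}  empirical={empirical(x):7.3f}  mechanistic={mechanistic(x):7.3f}")
```

Inside the observed range the two agree closely. At x = 10 the straight line predicts a negative value, which a decay process cannot produce, while the mechanistic form stays small and positive. Of course, the mechanistic model only extrapolates well here because the assumed behaviour really is the one that generated the data.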

And of course, all models are inferential – there are certain assumptions that we make before we model and in picking our models:

  • linear models – the residuals (the response once the explanatory variables are accounted for) are approximately normally distributed. This is sooooo important and yet often is not the case without transformation!
  • discrete vs. continuous explanatory variables – this matters a whole lot!
  • the type of model will be dictated by the data, not the other way around
  • independence of the observations & experimental design will influence our confidence, and therefore our success, if we base decisions on our predictions
  • ooh, and of course there should be a pattern! (obvious I know, but maybe not always considered)
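
The normality assumption for linear models in the list above is easy to check before trusting a fit. The sketch below is only an illustration under stated assumptions: the data are simulated with multiplicative noise (seed, coefficients and sample size are arbitrary), the helper names `fit_line`, `residuals` and `skewness` are hypothetical, and sample skewness is used as a crude stand-in for a proper diagnostic such as a Q-Q plot.

```python
import math
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def residuals(xs, ys):
    """Observed minus fitted values for a straight-line fit."""
    intercept, slope = fit_line(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

def skewness(values):
    """Sample skewness: roughly 0 for symmetric (e.g. normal) data."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Simulated response with multiplicative (log-normal) noise.
rng = random.Random(0)
xs = [4 * i / 399 for i in range(400)]
ys = [math.exp(1.0 + 0.5 * x + rng.gauss(0, 0.7)) for x in xs]

# Fit on the raw scale, then again after log-transforming the response.
skew_raw = skewness(residuals(xs, ys))
skew_log = skewness(residuals(xs, [math.log(y) for y in ys]))

print(f"residual skewness, raw response: {skew_raw:.2f}")
print(f"residual skewness, log response: {skew_log:.2f}")
```

On data like this the raw-scale residuals are strongly right-skewed, and the log transformation brings them much closer to symmetric. Real data will not always be rescued by a log; the point is simply to look at the residuals before relying on the model.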

With the proliferation of amazing machine learning tools (BigML, Azure Machine Learning, Amazon Machine Learning etc.) I am seeing an increasing number of very suspect examples in the documentation. For sure, the documentation is focused on the workflow and ability of these tools to implement machine learning more so than the robustness of the models. But you would think some thought would go into the accuracy of the models.

I think we still need some time for these tools to grow before they are truly useful for data science. Primarily, if we are to use them then we are going to want to be able to perform all of our data analysis in the one environment: data cleaning, transformation, exploratory analysis and so forth, right through to modeling, prediction and production. These tools will need to integrate powerful visualisation tools natively as they mature. And of course, we will all need to be educated in how to produce reliable, robust and ultimately profitable models if these tools are to be successful.

