ThriveHive has been putting out a couple of exciting infographics based on some of our work here on the Data Science team.  I thought it might be worthwhile to dig into the methods a bit and show how you don't need deep learning to get some really powerful business insights.

There has been considerable discussion about the importance of being able to interpret the output of machine learning models.  For the purposes of this article, I'll define a model's "interpretability" in the spirit of the definition from mathematical logic: the model's output can be described in plain language.  And by output I don't just mean predictions; I also mean the parameters being estimated.

As I see it, there are three major reasons interpretability in models is important:

  1. Diagnostic: There are a number of issues that can arise with models, including leakage and overfitting.  Being able to understand which features the model is drawing its information from is essential to diagnosing those problems.
  2. Ethical: As machine learning models become more widely used, there are concerns about what information the model is using to make its predictions.  For example, many word embedding models tend to have a certain gender bias due to the corpora on which they are trained.  NPR's Ted Radio Hour did an interesting dive into the ethical concerns around predictive models.
  3. Applicability: Machine learning practitioners are recognizing that a lot of problems can't be fully solved by an "accurate" prediction.  Beyond knowing the "what" of the answer, it's important to also know the "why".  This "why" question is often the one we're seeking to answer here at ThriveHive.  It's not just which companies we expect to see great results; it's why those companies see those results.

There are a number of fascinating efforts to bring more interpretability to machine learning.  In this post I'll be focusing on a class of models called linear models, but I may do a future post on techniques such as LIME and Shapley values that add interpretability to nonlinear models.

Linear models

Linear models model linear relationships between variables.  A good example would be the relationship between the temperature in Celsius vs the temperature in Fahrenheit.  There's a straightforward, linear relationship between the two:
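Written out as a formula, that relationship is Fahrenheit = (9/5) * Celsius + 32: every one-degree increase in Celsius corresponds to exactly 1.8 more degrees Fahrenheit, no matter where on the scale you start.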

Linear models essentially learn all the parameters necessary to calculate one variable (the "outcome variable") from a set of others (the "feature" variables).  This is the equation for a basic regression model with two features:
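y = B0 + B1*x1 + B2*x2 + e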

Here, y is the outcome and x1 and x2 are the features.  The epsilon (e) is what is called "error", or the difference between the outcome and what is calculated by the equation.  We'll come back to that in a minute. The model learns the B parameters from the data.  Typically this is done by a process called "least squares estimation", which I won't dive into here, but you can feel free to dig deeper.  I particularly like this explanation of it.
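To give a quick sense of what that estimation looks like in practice, here is a minimal sketch using scikit-learn and made-up data (the numbers and tooling here are illustrative assumptions, not the code behind our actual analysis):

import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: y is generated from two features plus some noise,
# so we know the "true" B parameters the model should recover.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))            # two feature columns: x1, x2
y = 3.0 + 2.0 * x[:, 0] - 1.5 * x[:, 1] + rng.normal(scale=0.5, size=200)

# Least squares finds the B parameters that minimize the squared error.
model = LinearRegression().fit(x, y)
print(model.intercept_)   # estimate of B0, should land near 3.0
print(model.coef_)        # estimates of B1 and B2, near [2.0, -1.5]

The printed intercept and coefficients are exactly the B parameters from the equation above, which is what makes the next step (interpreting them) so easy.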

The great thing about these linear relationships is that they're pretty readily interpretable.  Take for example the equation above. What happens if you add one to x1?
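Substituting x1 + 1 into the equation and comparing against the original shows the answer:

y_new = B0 + B1*(x1 + 1) + B2*x2 + e = y + B1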

Adding one to x1 increases the calculated "y" value by B1.  In other words, B1 is the change in y for a one-unit change in x1.  You could also think of B1 as the "slope" of the line on the graph of x1 vs y:

And that's exactly what it is! A linear regression fits a line to the relationship between X and Y, and the slope of that line is B1.  As a quick example, if we're converting, say, dollars to Euros, we know that the exchange rate (as of this writing) is 1 USD to 0.85 EUR, or EUR = (17/20) * USD.  Each additional dollar we add will increase our number of Euros by 0.85.  So we could say, according to this conversion model, the effect of my paying you $10 would be that you would have 8.50 more Euros.  Not a stunning revelation, but it gives you some of the intuition.

Now I hear you saying: "That's so simple, let's linear model everything!"  Well, you might have noticed I used pretty simple conversions here: temperature and currency.  That's because in real life it's very hard to find a perfect linear relationship.  Linear models find the line that best fits the relationship, allowing for some inaccuracy (i.e. the "error" we mentioned before).  But there is a built-in assumption that a linear fit is the right one, and often that is not the case.  There are a number of ways of dealing with that, including adding other features into the model.

Take for example age and its relationship to height.  It looks like this (from the CDC):

You can see that the relationship is approximately linear up until about age 15.  A single-feature linear model here probably wouldn't work, as it would try to fit a single slope to the age feature, which means predicting every 30-year-old to be around 8 feet tall.  If that were true, I might have taken up basketball instead of Data Science.  Instead, it might make sense to add an indicator to the model for whether someone is older or younger than 15.  Then you would get a slightly better fit, as the slope of the line would better approximate the overall trend.  Better still is to allow the slope itself to vary by that older-than-15 variable (i.e. include an "interaction term"), as sketched below.
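To make that concrete, here is a minimal sketch of those three specifications using the statsmodels formula interface and made-up age/height data (the variable names and numbers are illustrative assumptions, not the actual CDC data or our production code):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Made-up data roughly mimicking a growth curve:
# height (cm) rises with age until ~15, then flattens out.
rng = np.random.default_rng(1)
age = rng.uniform(2, 40, size=500)
height = np.where(age < 15, 80 + 7 * age, 80 + 7 * 15) + rng.normal(0, 5, size=500)
df = pd.DataFrame({"age": age, "height": height, "over15": (age >= 15).astype(int)})

# Model 1: age only -- one slope for everyone.
m1 = smf.ols("height ~ age", data=df).fit()
# Model 2: age plus an over-15 indicator -- shifts the line up or down after 15.
m2 = smf.ols("height ~ age + over15", data=df).fit()
# Model 3: interaction -- lets the age slope itself differ before and after 15.
m3 = smf.ols("height ~ age * over15", data=df).fit()

print(m3.params)  # the age:over15 coefficient is the change in slope after age 15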

This graph shows the predictions from each of the three models: one with just age, one with age and an indicator for being older than 15, and one which adjusts the age-height relationship (the slope) for people older than 15 (the interaction).  You can see by comparing these lines to the CDC chart above that the interaction model (green line) best approximates the actual relationship.  It's not perfect, though.  Again, the model learns linear relationships, so it can't, in its current formulation, fit the curve you see in the real data.  There are ways of having linear regressions approximate non-linear relationships, but interpretation gets complicated.

So that gives you some background on the approach we used to generate some of the statistics for the ThriveHive infographics.  In the next post, I'll walk through that approach and discuss how using these interpretable models enabled us to draw some really impactful conclusions.