In the previous blog posts we talked about interpretability in machine learning and how to set up your machine learning question.  We also discussed a bit about the application to ThriveHive's data. Now we get to the good part: Running our model and examining our results.

NOTE: The results below are based on simulated data. It is not actual customer data, but it is designed to resemble it.

Before we do, let's do a quick recap of our research question and hypotheses:

Research Question: What is the relationship between use of ThriveHive services and a customer's observed outcomes?

Hypothesis 1: Using ThriveHive services is associated with better outcomes.  

Hypothesis 2: The relationship between the use of individual services and outcomes is affected by the use of other services.

The outcome we'll be examining is the number of leads in a month.  In implementation, it will be the log transformation of the number of leads.  A quick digression on that here:


Why log?

This step isn't mandatory, but often is necessary when you have outcomes like we do in this case (i.e. number of leads).  You can imagine that there are some campaigns that bring in large numbers of leads, but most tend to be distributed around an average.  Look at the distribution below in the box plot:

For help reading box plots, I recommend this video.  The green horizontal line in the middle of the box indicates the median, or midpoint, of the distribution.  You can see it's ~150. But above the box is a mass of months with abnormally high numbers of leads. This suggests we have a "skewed" distribution, in that our distribution isn't symmetrical around the central point.  This isn't necessarily a problem for a regression, but it's definitely easier to detect and interpret linear relationships between features and outcomes when you have a more symmetrical distribution. A "log transformation" is often used to better establish these linear relationships, though there is some debate about when and whether it makes sense statistically.  

Essentially, I'm making the assumption that leads per month are "log-normally" distributed.  I typically check the "normality" of the transformed outcome (e.g. via a Box-Cox analysis), but I sometimes still go ahead if the model seems to fit the data better with the transformation than without it.
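To make the effect of the log transformation concrete, here is a minimal sketch using only the Python standard library. The lead counts are randomly generated (a log-normal distribution with a median of about 150, which is my assumption for illustration, not the data behind the box plot), and `skewness` is a small helper defined here, not a library function.

```python
import math
import random
import statistics


def skewness(values):
    """Sample skewness: the mean cubed z-score (about 0 when symmetric)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([((v - mean) / sd) ** 3 for v in values])


random.seed(0)

# Hypothetical monthly lead counts: log-normal with a median of about 150.
leads = [random.lognormvariate(math.log(150), 0.8) for _ in range(5000)]
log_leads = [math.log(v) for v in leads]

raw_skew = skewness(leads)
log_skew = skewness(log_leads)

print(f"median leads:          {statistics.median(leads):.0f}")
print(f"skewness of raw leads: {raw_skew:.2f}")
print(f"skewness of log leads: {log_skew:.2f}")
```

The raw counts have a long right tail (large positive skewness), while the logged counts are close to symmetric, which is the property the regression benefits from.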

Something worth noting here is that because I've taken the log of the outcome, I can no longer interpret the coefficients as quantifying the relationship between variables and number of leads.  It's now variables and log number of leads. That's okay, though, because by exponentiating the outcome (raising e to the value), I can (roughly) recover the real value.  It's a bit difficult to explain (more here), but I can also exponentiate the coefficients to get the multiplicative change in the outcome for a unit change in the variables: a coefficient b corresponds to a factor of e^b, or a percentage change of (e^b - 1) x 100.  So the regression equation goes from looking like this:

To this:

The exponentiated coefficients are not absolute differences but multiplicative (percentage) differences (see above link).
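In generic notation, the two forms of the regression equation can be sketched as follows. The symbols are my assumptions for illustration: y is the number of leads in a month, x_1 through x_k are the features, the betas are the fitted coefficients, and epsilon is the error term.

```latex
\begin{align*}
\log(y) &= \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + \varepsilon \\
y &= e^{\beta_0} \cdot e^{\beta_1 x_1} \cdots e^{\beta_k x_k} \cdot e^{\varepsilon}
\end{align*}
```

Read this way, a one-unit increase in x_j multiplies y by e^{\beta_j}, which corresponds to a percentage change of 100(e^{\beta_j} - 1).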

Okay, digression over, back to the good stuff!



I won't go into the code used to actually run the model.  I used Python and the statsmodels library to run the regression.  I chose statsmodels because it's one of the few libraries (that I know of) that provides the coefficients on the different features as well as some statistics about the model "fit" (i.e. how well the linear relationship estimated by the model actually explains the observed data).  Remember, a coefficient is essentially the "slope" of the linear relationship between a feature and the outcome.

I've tried to add some annotations to statsmodels' output to make it a bit more readable (apologies if it's difficult to read). I'm focusing here on a specific set of features: the marketing products, the industries and the interactions between products.  Let's go through it.

Outcome: "log_num_leads" is the name of the outcome in the model.

Measures of fit: I'll go into this a bit in the discussion, but basically this measures how closely the model we've designed fits the data we're seeing.  Check out the Wikipedia entries for R^2 and AIC.

Significance of the relationship: This measures, basically, how likely we would be to see a relationship at least this strong purely by chance if there were no real relationship.  Lower numbers mean less likely. A typical cut-off in social science is p<0.05; if the value is less than 0.05, then the relationship can be considered significant.  There are issues with using hard cut-offs like that, but it's a quick, accessible metric for these early-stage models.

Magnitude of the relationship ("coef"): As I mentioned above, this is the relationship between the outcome and the variable.  All the features that begin with "b_" are binary indicators for different products, i.e. whether a customer had a product in a given month.  So for "b_SEM" the coefficient is 1.12, which means every one-unit increase in "b_SEM" is associated with a 1.12 increase in log number of leads. Since "b_SEM" is a binary indicator, it is 1 if the customer uses SEM, 0 otherwise.  So customers using just SEM have 1.12 more log leads, on average, than customers who do not use SEM.  
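The original code isn't shown here, so the following is only a minimal sketch of how a model with this shape could be fit using statsmodels' formula API. The data are randomly generated, and the sample size and "true" effects are invented for illustration; only the names "log_num_leads", "b_SEM" and "b_Social" come from the post itself.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 2000  # hypothetical number of customer-months

# Simulated binary product indicators (1 = customer had the product that month).
df = pd.DataFrame({
    "b_SEM": rng.integers(0, 2, n),
    "b_Social": rng.integers(0, 2, n),
})

# Invented effects on the log scale, plus random noise.
df["log_num_leads"] = (
    4.0
    + 1.1 * df["b_SEM"]
    + 0.3 * df["b_Social"]
    - 0.15 * df["b_SEM"] * df["b_Social"]
    + rng.normal(0, 0.5, n)
)

# "b_SEM * b_Social" expands to both main effects plus the
# "b_SEM:b_Social" interaction term.
model = smf.ols("log_num_leads ~ b_SEM * b_Social", data=df).fit()

print(model.params)    # the "coef" column
print(model.pvalues)   # significance of each relationship
print(f"R^2: {model.rsquared:.3f}, AIC: {model.aic:.1f}")  # measures of fit
```

Calling `model.summary()` prints the full table that the annotations above refer to.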

In the digression above I mentioned I can exponentiate these coefficients to get the multiplicative (percentage) change in outcome associated with a unit change in the variable.  Here are the coefficients above, but exponentiated:

So you can see here, customers that use just SEM get about 3.06 times as many leads on average as customers that don't use SEM (e^1.12 is roughly 3.06, i.e. about 206% more).  Pretty powerful results. But note I also said "just SEM".  That's because this model has interaction terms, which account for the fact that products likely work together to drive results.  An interaction term is essentially one variable multiplied by another. So say, for example, a customer has SEM and Social Advertising products.  Then to compare those with SEM and Social to those with neither, we'd need to add together the coefficients for "b_SEM", "b_Social" and "b_SEM:b_Social" and exponentiate the sum (equivalently, multiply the three exponentiated coefficients).  So, from this chart, you can see that for customers who use SEM and Social:

That's about 3.53 times as many leads as those that use neither SEM nor Social Advertising.  
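The arithmetic behind these comparisons can be sketched in a few lines. Only the "b_SEM" coefficient of 1.12 comes from the output discussed above; the "b_Social" and interaction values below are made up purely to show the mechanics, and `pct_change` is a helper defined here.

```python
import math


def pct_change(coef):
    """Percentage change in the outcome implied by a log-scale coefficient."""
    return (math.exp(coef) - 1) * 100


# Coefficient for b_SEM as reported in the model output.
coef_sem = 1.12
ratio_sem = math.exp(coef_sem)
print(f"SEM only: {ratio_sem:.2f}x the leads ({pct_change(coef_sem):.0f}% more)")

# Hypothetical coefficients for the other two terms (NOT from the real model).
coefs = {"b_SEM": 1.12, "b_Social": 0.30, "b_SEM:b_Social": -0.15}

# On the log scale the terms add; on the original scale they multiply.
combined_ratio = math.exp(sum(coefs.values()))
product_of_ratios = math.prod(math.exp(c) for c in coefs.values())

print(f"SEM + Social vs. neither: {combined_ratio:.2f}x the leads")
```

Exponentiating the sum and multiplying the exponentiated coefficients give the same answer, which is why either route works when reading the chart.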

You'll see for products like SEO, there's actually a negative relationship between usage and number of leads.  That's likely because SEO changes take time to implement and for search engines (e.g. Google) to crawl and rank.  Typically, we see noticeable effects from SEO three to six months after product activation. The way the model is currently formulated, each observation (i.e. each customer-month) is treated as independent, so all SEO months are treated the same, even though the first SEO month would be very different from the sixth SEO month.  Currently, the model does not learn relationships within a customer.  That's a topic for another day, but if you're interested, look into ARIMA models or an error correction method for observations coming from the same "group" (e.g. customer).
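As one possible illustration of correcting for observations that come from the same customer (not necessarily the approach used for the final model), statsmodels can report cluster-robust standard errors. Everything in this sketch is simulated, and the names "customer_id" and "b_SEO" as well as all the numbers are assumptions for the example.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n_customers, n_months = 200, 12

customer_id = np.repeat(np.arange(n_customers), n_months)
# Customer-level SEO indicator and customer-specific baseline (both invented).
has_seo = np.repeat(rng.integers(0, 2, n_customers), n_months)
baseline = np.repeat(rng.normal(0, 0.5, n_customers), n_months)
noise = rng.normal(0, 0.3, n_customers * n_months)

df = pd.DataFrame({
    "customer_id": customer_id,
    "b_SEO": has_seo,
    "log_num_leads": 4.0 + 0.2 * has_seo + baseline + noise,
})

formula = "log_num_leads ~ b_SEO"

# Treats every customer-month as independent.
plain = smf.ols(formula, data=df).fit()

# Allows errors to be correlated within each customer.
clustered = smf.ols(formula, data=df).fit(
    cov_type="cluster",
    cov_kwds={"groups": df["customer_id"].to_numpy()},
)

print(f"plain SE:     {plain.bse['b_SEO']:.3f}")
print(f"clustered SE: {clustered.bse['b_SEO']:.3f}")
```

The coefficient estimate is identical in both fits; what changes is the standard error, which is typically larger once the model stops pretending that months from the same customer are independent.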


What regression can (and can't) do for you

So you saw above that with a fairly simple set of steps we were able to get some business-relevant results.  Though the model we ended up using for the infographics we put out was somewhat more complicated (e.g. correcting for within-customer effects, including seasonal effects), a fairly simple model with interpretable design can help zero in on some really interesting results.

A couple of caveats here: There's a lot going on in the relationship between marketing products and outcomes that should be accounted for at least in some way before running off and bragging to customers.  As I mentioned, there are within-customer effects, seasonal effects and various quality measures of a campaign that should be factored in. And even if you do account for those, there are going to be a fair number of unobserved variables that require you to closely investigate the "fit" of your model.  

I won't go deep into the concept of fit here, but basically examining the fit means looking at how well the model explains the data and whether the gap between the two stays reasonably constant.  For example, if you see that your model predicts outcomes better for certain products and worse for others, that might mean there's something about products that you're not accounting for in the model.
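A rough way to run that kind of check is to compare the size of the residuals (observed minus predicted log leads) across products. The sketch below uses only the standard library and entirely invented residuals, with noise levels chosen so that one product is fit noticeably worse than the others.

```python
import math
import random
from collections import defaultdict

random.seed(1)

# Hypothetical residual noise per product; SEO is deliberately noisier.
noise_by_product = {"SEM": 0.3, "Social": 0.3, "SEO": 0.9}

# Simulated (product, residual) pairs, 500 customer-months per product.
rows = [
    (product, random.gauss(0, sd))
    for product, sd in noise_by_product.items()
    for _ in range(500)
]

# Collect squared residuals by product.
squared = defaultdict(list)
for product, resid in rows:
    squared[product].append(resid ** 2)

# Root-mean-squared error of the residuals within each product.
rmse = {p: math.sqrt(sum(v) / len(v)) for p, v in squared.items()}

for product, value in sorted(rmse.items()):
    print(f"{product:>6}: RMSE of residuals = {value:.2f}")
```

If one product's error is much larger than the rest, as SEO's is in this made-up example, that is a hint the model is missing something specific to that product.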

Generally, though, this method gives you a quick look at the relationships in your data and, with some additional tweaking, can give you some really powerful business insights.  Being able to tell our customers that using SEM products is associated with three times as many leads per month is a really meaningful statement. And it's a message that comes more naturally out of models with interpretability.