Interpreting machine learning models: Picking low-hanging business insights

Part 2: Asking questions

In the previous blog post we examined the importance of interpretability in machine learning models and how to interpret the parameters that are estimated by a class of machine learning models called "regression models".  Now that we have the background set up, let's go into the application to ThriveHive's data.

There are a lot of really interesting explorations of how to set up your machine learning project and what types of questions machine learning can answer.  One of the most important first steps, in my experience, is to define the question you're trying to answer:

Research Question: What is the relationship between use of ThriveHive services and a customer's observed outcomes?

You'll see already in the question I'm falling into one of the "question types" listed in the link above.  This is a "comparative" question, where I'll be comparing customers that use services to those that do not.  The next step is to define a hypothesis. A scientific hypothesis is a concise, testable statement that, like the research question above, will govern the structure of the analysis.

Hypothesis 1: Using ThriveHive services is associated with better outcomes.  

Hypothesis 2: The relationship between the use of individual services and outcomes is affected by the use of other services.

Here I have two hypotheses, both centering around the idea that our company's services create value for our customers (spoiler alert: they do).  The first focuses on the relationship between services and outcomes; the second focuses on how that relationship changes in the presence of other services (see "interaction effects" in the previous post).

You'll see here that I didn't specify what outcomes I'd be measuring.  My rationale for this is that we need to look at the data to see what operationalization of the outcome makes sense.  However, you don't necessarily need to even look at the data to generate these hypotheses; often they come from translating business stakeholders' questions into analytic language.  So let's translate this vague concept of "outcome" into something we can actually measure.

 

Sales funnel outcomes

At ThriveHive, we have extensive information about our customers, the campaigns they run and the results of those campaigns.  A variety of services means a variety of outcomes.  For example, a Facebook campaign has impressions, while an email campaign has opens.  Since the question we posed above requires that we test different services against one another, we need some outcome metric that is available across services.

Stepping back, let's think about what a marketing campaign outcome is.  Typically, a good way to look at the marketing process is to think about the sales funnel:

Source: https://www.prontomarketing.com/blog/what-is-a-sales-funnel-and-why-is-it-important/

Marketing's job is to attract visitors, bring them to your website, Facebook page, etc. and encourage them to express interest through something like a form fill or phone call.  At the moment they express interest, that visitor becomes a "lead". The next step is highly dependent on the business, but typically the goal is to turn those leads into sales.

Our data is pretty rich, but we don't have much insight into actual sales.  The farthest down the funnel we measure is the leads stage.  Because this data is available and important to the marketing process, we chose "number of leads" as the outcome for our model.

 

Unit of observation

A major consideration in devising an analysis, and one I feel often does not get the attention it deserves, is choosing the right "unit of observation": that is, what a single observation looks like in our data.  People familiar with spreadsheets might think of this as a "row".  Again, this should be governed by our question and hypotheses and informed by our data.

We have daily data available for our campaigns, but my sense is that this level of granularity may be asking too much of our model.  Further, by estimating the relationship of services to leads per day, we might get some strange results like "using our services is associated with 0.5 more leads per day".  I'm not sure how many customers would get particularly excited by half leads. But what if we were to say "15 more leads per month"? That's a bit more understandable (and exciting!).

Further, a lot of the sales and marketing materials for the company talk in monthly costs and returns.  So it just makes general sense to have our model speak this language as well. So we're looking at a monthly "unit of observation".
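To make the monthly "unit of observation" concrete, here is a minimal sketch of rolling daily records up into one row per customer per month.  The record layout, the customer IDs and the function name monthly_leads are all assumptions made up for illustration; they are not ThriveHive's actual data or pipeline.

```python
from collections import defaultdict
from datetime import date


def monthly_leads(daily_rows):
    """Sum daily lead counts into one total per (customer, year, month)."""
    totals = defaultdict(int)
    for customer_id, day, leads in daily_rows:
        totals[(customer_id, day.year, day.month)] += leads
    return dict(totals)


# Hypothetical daily records: (customer_id, date, leads recorded that day).
daily = [
    ("cust_a", date(2018, 1, 3), 1),
    ("cust_a", date(2018, 1, 17), 0),
    ("cust_a", date(2018, 2, 2), 2),
    ("cust_b", date(2018, 1, 9), 1),
]

monthly = monthly_leads(daily)
for key in sorted(monthly):
    print(key, monthly[key])
```

Each key in the result corresponds to one "row" in the monthly dataset, which is the level at which the model's estimates would then be read (leads per month rather than leads per day).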

 

Generating features

The general equation of a regression model (with two variables) is the following:

y = β0 + β1·x1 + β2·x2 + ε

Now, we've specified an outcome: number of leads.  That's our y in this case.  Next we need to think about what data we need for the x's of our model (which will likely be more than two).  Since we're interested in the relationship between leads and services, we'll definitely need to include the services we offer: Search Engine Marketing, display advertising, Search Engine Optimization, social advertising and email advertising.  These are just a subset of our services, but they are our most popular and the ones most closely tied to lead generation.

One big question is how best to incorporate these services as variables in our model.  We tested a few different ways, but for this post I'll be focusing on the services as indicator variables.  They're either zero or one: zero if the month did not have the service active, one otherwise.
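As a rough sketch of what that indicator coding could look like, the snippet below turns the set of services active in a given month into five zero/one features.  The short service keys and the uses_ prefix are hypothetical names chosen for this example, not the names used in our actual model.

```python
# Hypothetical short keys for the five services discussed in this post.
SERVICES = ["sem", "display", "seo", "social", "email"]


def service_indicators(active_services):
    """Return a 0/1 indicator per service: 1 if active that month, else 0."""
    active = set(active_services)
    unknown = active - set(SERVICES)
    if unknown:
        raise ValueError(f"unknown services: {sorted(unknown)}")
    return {f"uses_{s}": int(s in active) for s in SERVICES}


# A customer-month with Search Engine Marketing and email active.
row = service_indicators(["sem", "email"])
print(row)
```

With this coding, the parameter estimated for each indicator can be read as the difference in leads between months with the service active and months without it, holding the other features fixed.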

It's also definitely the case that a business's characteristics (e.g. industry, size) are associated with the number of leads it generates.  For example, a restaurant likely has a large number of people making reservations, while a car dealership will have fewer, higher-value customers.  In modelling lead generation, we definitely need to include business characteristics like industry, size of business, location and how long they've been a customer with us.

So to review: we're developing a model in which the outcome variable (dependent variable) is the number of leads in a given month, the features of interest (independent variables) are indicators for the use of five different ThriveHive services, and the additional features (covariates) are business characteristics.
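Putting that review into code, here is one way a single customer-month could be laid out as an outcome plus a feature dictionary.  Everything named here (the industry list, the column names, the build_row helper and the example values) is a made-up assumption for illustration.  Location is left out to keep the sketch short, and the first industry is dropped as the reference category, a common convention when the model includes an intercept.

```python
# Hypothetical service keys and industry categories, for illustration only.
SERVICES = ["sem", "display", "seo", "social", "email"]
INDUSTRIES = ["restaurant", "auto_dealer", "other"]


def build_row(leads, active_services, industry, employees, months_as_customer):
    """Return (outcome, features) for one customer-month."""
    if industry not in INDUSTRIES:
        raise ValueError(f"unknown industry: {industry}")
    # Features of interest: one 0/1 indicator per service.
    features = {f"uses_{s}": int(s in active_services) for s in SERVICES}
    # Covariates: one-hot industry, with the first category as the baseline.
    for ind in INDUSTRIES[1:]:
        features[f"industry_{ind}"] = int(industry == ind)
    features["employees"] = employees
    features["months_as_customer"] = months_as_customer
    return leads, features


# Example values are invented; they are not real customer data.
y, x = build_row(12, {"sem", "seo"}, "auto_dealer", 8, 14)
print(y, x)
```

Here y plays the role of the outcome in the regression equation and each entry of x is one of the x's.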

Since this post is pretty dense as it is, I'll leave off here until next time.  In the next post I'll get into the results, some next steps and some tips for applying these methods to answer your own analytics questions.  Until then!