Part 1: Setup

Methods for segmenting businesses have typically relied on standardized industry classification systems (e.g., NAICS). However, the limitation of a top-down industry classification system is that it does not account for businesses whose products straddle different industries. This is particularly problematic when working with small businesses, as they often do not fit neatly into a classification system designed around large companies.

For example, a fitness trainer who also sells nutritional supplements would fall under both the personal services and retail industries. Which NAICS code would apply here? A gym chain might use one NAICS code, while a vitamin store might use another. At this level, choosing a single NAICS code to describe our trainer/supplement seller would be somewhat arbitrary.

But where to start?

One source of industry information is tax filings. However, for small businesses, that information is either unavailable or difficult to access. Furthermore, as mentioned above, the industry codes may not describe the product a business is actually offering. So we need something both descriptive and widely available.

Probably the most descriptive solution would be to talk to one of our subject matter experts here at ThriveHive and have them pick the categories that best describe a business's products. But with several thousand businesses on our rolls, that's not a scalable approach. In other words, that information isn't widely available.

But there is a data source that meets both of these criteria: Website text data.

A problem arises here, though, in characterizing this text. A simple approach would be to search for keywords like "fitness" or "computers". But it's difficult to come up with a comprehensive list of such words and, even if we did, how would we determine which words are more "typical" of one industry versus another? For example, the word "online" may be used by more than just technology companies; a company offering fitness classes may let you sign up for classes "online".
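
To make that brittleness concrete, here's a minimal sketch of what the keyword approach might look like (the keyword lists and sample text are hypothetical, purely for illustration):

```python
# Naive keyword matching (hypothetical keyword lists and sample text).
INDUSTRY_KEYWORDS = {
    "fitness": {"fitness", "gym", "trainer"},
    "technology": {"computers", "software", "online"},
}

def match_industries(text):
    """Return every industry whose keyword set overlaps the text."""
    words = set(text.lower().split())
    return [industry for industry, keywords in INDUSTRY_KEYWORDS.items()
            if words & keywords]

# "online" triggers "technology" even for a fitness business:
print(match_industries("Sign up for fitness classes online"))
# ['fitness', 'technology']
```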

A tale of two learnings

In Machine Learning, two of the major types of learning are supervised and unsupervised. In supervised learning, a model is trained to predict a target value based on a set of inputs. For example, many of the image classification models (e.g., "is this picture a cat?") you might have read about are supervised learning approaches. A set of images tagged with their class (e.g., cat/not cat) is used to train a model, which is then evaluated on its ability to correctly classify the images. The key feature is that the inputs have a "correct" value connected to them.
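
As a toy illustration (using scikit-learn; the example texts and labels below are made up), a supervised text classifier might look like this:

```python
# Toy supervised learning: inputs paired with known "correct" labels train a model.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["personal training and group fitness classes",     # hypothetical inputs
         "laptop repair and custom computer builds",
         "yoga and strength training sessions",
         "software consulting and web development"]
labels = ["fitness", "technology", "fitness", "technology"]  # the "correct" values

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
model = LogisticRegression().fit(X, labels)  # learn the input -> label mapping

# Evaluate on new input:
print(model.predict(vectorizer.transform(["group fitness training"])))
```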

In unsupervised learning, there is no "correct" value. These approaches aim to identify patterns or groups in sets of inputs. One family of unsupervised models that has been applied to text data is topic models. Topic models take sets of text data and extract clusters of words that are characteristic of certain topics in the text. For example, in a 2012 article, Dr. David Blei described a topic modeling approach to identify topics in a series of science articles. The model assigned a percentage for each topic to each article. So an article about analyzing brain images might fall partially under a topic characterized by neurology-related terms and partially under a topic characterized by computer-related terms.

This topic model approach allows us to characterize businesses' website text without imposing the strict structure of a set of keywords. We tried two types of topic models: Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF).

Briefly, LDA treats the observed text as the result of sampling from bags of words, where each word has a certain probability of being selected. Each bag is associated with a different topic; every bag contains the same vocabulary, but with different probabilities for each word. You can imagine the "neurology" bag described above as having many slips of paper with the word "brain" on them, but only a few slips with the word "computation". The model then estimates the probability that the observed text came from each bag (topic). We therefore get a probability for each topic and can characterize the "topic mix" of a text.
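
In code, a minimal LDA sketch might look like the following (using scikit-learn's LatentDirichletAllocation; the documents and parameter choices are illustrative, not what we ran in production):

```python
# Minimal LDA sketch with scikit-learn (toy documents, illustrative settings).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["brain imaging and neurology research",        # hypothetical texts
        "computation and image analysis software",
        "brain scans analyzed with computation"]

counts = CountVectorizer().fit_transform(docs)         # word-count matrix (LDA expects counts)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)                 # per-document topic probabilities

print(doc_topics)  # each row sums to ~1: the "topic mix" of that document
```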

NMF, on the other hand, is a matrix factorization approach, meaning it attempts to find two matrices that, when multiplied together, approximate another matrix. In the case of NMF for topic models, the two matrices being multiplied together are the topic weightings of each document and the topic weightings of each word, and their product approximates the observed document-word data. A simple way to think of it is taking the input data and breaking it down into two components which, when multiplied together, roughly reconstruct the input data. One of these components is the allocation of each document to each topic.
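
Here's a similar sketch for NMF (again with scikit-learn and toy documents; the parameters are illustrative). The rows of W are the document-to-topic allocations mentioned above:

```python
# Minimal NMF sketch with scikit-learn (toy documents, illustrative settings).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

docs = ["brain imaging and neurology research",        # hypothetical texts
        "computation and image analysis software",
        "brain scans analyzed with computation"]

tfidf = TfidfVectorizer().fit_transform(docs)          # observed document-word matrix
nmf = NMF(n_components=2, random_state=0)
W = nmf.fit_transform(tfidf)                           # document-topic weights
H = nmf.components_                                    # topic-word weights

# W @ H roughly reconstructs the document-word matrix.
print(W)
```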

Now that I've set up the motivation and approach, in the next post I'll go through the application, the results, and how we're using this approach to improve our understanding of the ThriveHive customer base and better personalize recommendations.

Read the implementation details here!