In the previous post, we outlined one of the major challenges facing analytics here at ThriveHive: segmenting a business population that may not fall neatly into existing classification systems. We came up with the idea of "product mix", where each business can be thought of as a combination of the products it offers. The best source of product descriptions that is both detailed and widely available is businesses' website text. Using a set of methods from the Data Science toolkit known as "topic models", we aimed to extract a set of common products and describe each business as a combination of these.
The ubiquitous website scrape
Any good Data Scientist has gone through at least one web-scraping tutorial. If you're interested in an in-depth guide I can recommend a couple:
From websites to vectors
Natural language processing is, essentially, using computers to extract information from human language. Right now we have a large collection of text, but we need to turn that text into information that can then be used by our topic modelling algorithm. One of the ways of doing that is counting which words appear in which websites. In that way, a website becomes a series of numbers (a vector), in which each number is a count of the number of times a specific word appears in the website text.
So if the website text was:
"We sell red shoes and blue shoes!"
The word counts would be:
We: 1, sell: 1, red: 1, shoes: 2, and: 1, blue: 1
Thus the vector would be: 1, 1, 1, 2, 1, 1
With several thousand websites, we obviously have a large number of words, so we keep only the words likely to be informative (i.e. we remove words like "and" and "or"). What we're left with is a vocabulary, and each website can be transformed into a vector of numbers of the same length. This is a simplified overview, but if you'd like additional details, take a look at some word vectorization strategies here.
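The counting and filtering steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the pipeline we actually ran: the stop-word list, the second example website, and the function names are all made up for the example.

```python
import re
from collections import Counter

# Hypothetical stop-word list, for illustration only
STOP_WORDS = {"we", "and", "or", "the", "a"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(documents):
    """Collect every non-stop word across all documents, in sorted order."""
    vocab = set()
    for doc in documents:
        vocab.update(t for t in tokenize(doc) if t not in STOP_WORDS)
    return sorted(vocab)


def vectorize(text, vocabulary):
    """Count how often each vocabulary word appears in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


# The first text is the example above; the second is invented
websites = [
    "We sell red shoes and blue shoes!",
    "We teach yoga and sell yoga mats.",
]
vocabulary = build_vocabulary(websites)
vectors = [vectorize(site, vocabulary) for site in websites]

print(vocabulary)
for vec in vectors:
    print(vec)
```

Because every website is counted against the same shared vocabulary, all of the vectors have the same length, which is what the topic model needs as input.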
From vectors to topics
Now that we've extracted structured information from the unstructured text, we can use a topic model algorithm to identify the topics present in it. In the last blog post I gave a brief overview of Latent Dirichlet Allocation (LDA) and Non-Negative Matrix Factorization (NMF). LDA is a probabilistic model: it assigns each document a probability of belonging to each topic based on the words it contains. NMF is a matrix factorization method: it finds non-negative document-topic and topic-word weights whose product approximates the input data. The outcome is similar, though: a set of words that fall under each topic and a set of topic weights for each document.
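To make the factorization idea concrete, here is a small sketch using scikit-learn's NMF on a toy word-count matrix. The vocabulary, the counts, and the choice of two topics are all invented for illustration, and the settings are assumptions rather than the configuration used in our analysis.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy document-term count matrix (rows: websites, columns: words).
# Both the vocabulary and the counts are made up for this sketch.
vocabulary = ["exercise", "health", "yoga", "menu", "pizza", "delivery"]
X = np.array([
    [3, 2, 0, 0, 0, 0],   # fitness club
    [2, 1, 4, 0, 0, 0],   # yoga school
    [0, 0, 0, 3, 4, 1],   # pizzeria
    [0, 0, 0, 2, 1, 3],   # delivery restaurant
], dtype=float)

# Factor X into W (document-topic weights) and H (topic-word weights)
# so that W @ H approximates X, with every entry non-negative.
model = NMF(n_components=2, init="nndsvd", random_state=0, max_iter=500)
W = model.fit_transform(X)   # shape (4 documents, 2 topics)
H = model.components_        # shape (2 topics, 6 words)

# Show the three most heavily weighted words in each topic
for topic_idx, weights in enumerate(H):
    top_words = [vocabulary[i] for i in np.argsort(weights)[::-1][:3]]
    print(f"Topic {topic_idx}: {', '.join(top_words)}")
```

Each row of `H` is a topic expressed as word weights, and each row of `W` says how strongly a website loads on each topic. LDA would be used in much the same way but yields probabilities rather than raw weights.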
Similar businesses are likely to use similar words to describe themselves, so their word vectors will look alike. For example, a fitness club and a yoga school might both use "exercise" and "health". Since we have a lot of fitness-related businesses in the ThriveHive ecosystem, we're likely to see a topic emerge that contains a lot of these fitness-related terms. In the chart below, you can see some of the topics that emerged, along with how strongly a particular word is associated with each topic.
Circles are sized according to "relevance" to each topic. For more detail, see the components section in the scikit-learn documentation.
Topic 2 we could describe as our "fitness" topic. Topic 3 might be described as our restaurant/food service topic. Below, I show a couple of examples of companies and how much the words used in their websites fall into these different topics:
The restaurant, as would be expected, falls almost completely into the food service topic. The yoga school falls into the exercise topic, but also into the education topic. This gives us the "product mix" of the yoga school: some percent exercise, some percent education. There's also a bit of allocation to the "food service" topic, which requires some additional investigation.
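One simple way to turn a business's topic weights into a "some percent this, some percent that" mix is to rescale them so they sum to one. The sketch below assumes that approach; the topic labels and weights are invented, and `product_mix` is a hypothetical helper rather than part of any library.

```python
def product_mix(topic_weights, topic_names):
    """Rescale raw topic weights into shares that sum to 1."""
    total = sum(topic_weights)
    if total == 0:
        # A business with no weight on any topic has no defined mix
        return {name: 0.0 for name in topic_names}
    return {name: w / total for name, w in zip(topic_names, topic_weights)}


# Hypothetical topic labels and made-up weights for a yoga school
topic_names = ["fitness", "food service", "education"]
yoga_school_weights = [1.2, 0.1, 0.7]

mix = product_mix(yoga_school_weights, topic_names)
for name, share in mix.items():
    print(f"{name}: {share:.0%}")
```

If the topic model already returns probabilities per document, as LDA does, this rescaling step is unnecessary.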
In the results above, I presented a couple of examples suggesting that the topic model worked reasonably well. But that was a pretty qualitative assessment. If we're going to apply machine learning in this way, we'd expect to have some metrics to measure the quality of our output. However, with unsupervised learning techniques like these, where we don't have a ground truth against which to test our output, it's difficult to quantify something like "accuracy".
There are, however, a number of metrics that can be applied to identify how "stable" these topics are. Our models seemed to perform pretty well, but our main concern was whether these topics were meaningful. That is, are we getting anything out of this whole process that improves our understanding of our customers over just using other classification systems like NAICS?
To measure this, we ran some simple models to predict customers' performance on a variety of measures (e.g. website visitors), using a top-down classification in one set of models and our topic mix in another. We found that the topic mix explained 20% more of the variation in these outcomes than the top-down classification. However, when we included both the top-down classification and the topic mix, we achieved another 10% boost in explanatory power. The suggestion is that we should use both in our exploration of our customers' behavior and performance.
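The comparison described above can be sketched as follows, assuming the "simple models" are ordinary least squares regressions and that "variation explained" means R². Both are assumptions made for illustration, and the data here is synthetic, so the printed numbers say nothing about our actual results.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins: a 3-category top-down classification (one-hot
# encoded) and a 4-topic mix whose rows sum to 1. All values are made up.
industry = rng.integers(0, 3, size=n)
industry_dummies = np.eye(3)[industry]
topic_mix = rng.dirichlet(np.ones(4), size=n)

# Synthetic outcome that depends on both feature sets plus noise
y = (
    industry_dummies @ np.array([1.0, 2.0, 3.0])
    + topic_mix @ np.array([0.0, 2.0, 4.0, 6.0])
    + rng.normal(0, 0.5, size=n)
)


def r_squared(features, target):
    """Fit least squares with an intercept and return in-sample R^2."""
    X = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    residuals = target - X @ coef
    ss_res = residuals @ residuals
    ss_tot = ((target - target.mean()) ** 2).sum()
    return 1 - ss_res / ss_tot


r2_industry = r_squared(industry_dummies, y)
r2_topics = r_squared(topic_mix, y)
r2_both = r_squared(np.column_stack([industry_dummies, topic_mix]), y)

print(f"classification only: {r2_industry:.2f}")
print(f"topic mix only:      {r2_topics:.2f}")
print(f"both combined:       {r2_both:.2f}")
```

In-sample R² can never decrease when features are added, so a fairer version of this comparison would score each model on held-out data.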
I feel this effort went a long way toward giving us a better understanding of our customers. At ThriveHive, we're dealing with a really interesting population: businesses that are flexible enough to adapt to the needs of their customers, even if that means going outside what is "typical" of their industry. With this unstructured categorization, we are better able to target our marketing guidance so that it is relevant to a business with a particular product mix.
But our business population continues to change. And, like our customers, we also need to adapt. So stay tuned: there's a lot more to come!