It's pretty clear that Google searches are a key way people find businesses. 35% of referral traffic comes from search, versus 26% from social, a difference that appears to be growing. Much of this traffic comes from local search: according to Google, there's been an explosion in the number of "near me" searches. In response, Google created a mobile-ready "info pane" for businesses that facilitates searchers' local exploration.

There's a lot of interesting information here. In fact, this panel is so informative that it's beginning to alleviate the need for searchers to click through to a website, since all the information is right there. In this post I'll be focusing on the Q&A section, where people are able to ask questions about a business, which are then answered by the business owners or local guides.

Now, I won't make the case here for how important this section is to a business; that's a point better made by others. But despite its importance, there's very little guidance written about it. A business owner would likely be interested in what kinds of questions customers have for a business like theirs. Armed with that knowledge, they may be able to proactively ask and answer questions in this section or make answers clearer in their descriptions and promotions.

So here's our research question: What are commonly asked questions for businesses in different categories?

Here we will discuss the methodology we applied to answer this question. To see the results of this analysis, check out the interactive tool.

Categorizing businesses

Through a manual review of ~16k businesses, we pulled 64k questions along with some additional information available on the info pane. One of those pieces of information was category. You can see it listed in the info pane:

This is the "primary category" according to Google. We discussed business categories and their hierarchies in another blog post. With some sources, there are hierarchies available, e.g. "Restaurant" is a category that includes "Chinese Restaurant". With Google, there are primary and secondary categories, but a comprehensive list of these doesn't exist (to our knowledge).

Commonly asked questions

Now, how do we identify commonly asked questions? We could likely think of a few for certain categories just based on what we know about these types of businesses. For example, restaurants likely get a lot of questions about reservations and hours. Indeed, from analysis of our data, we found that 13% of questions asked of restaurants used the words "open" or "hour", while 7% of questions had the word "reservation".
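A keyword share like the ones above can be computed in a few lines. This is only a minimal sketch: the sample questions are made up, the function name `share_with_keywords` is hypothetical, and matching on word prefixes (so "hour" also matches "hours") is an assumption about how the counting might be done.

```python
import re

def share_with_keywords(questions, keywords):
    """Fraction of questions containing a word that starts with any keyword."""
    if not questions:
        return 0.0
    # \b anchors the match at the start of a word, so "hour" also matches "hours"
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")",
        re.IGNORECASE,
    )
    hits = sum(1 for question in questions if pattern.search(question))
    return hits / len(questions)

# Made-up sample questions, not the real dataset
sample = [
    "What are your hours on Sunday?",
    "Are you open on holidays?",
    "Do you take reservations?",
    "Is there parking nearby?",
]

print(share_with_keywords(sample, ["open", "hour"]))  # 0.5
print(share_with_keywords(sample, ["reservation"]))   # 0.25
```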

But what about other types of businesses? ThriveHive recently partnered with the Associated Bodywork and Massage Professionals, which has brought a lot of Massage Therapists into our ecosystem. What kinds of questions are common for them? Just thinking of questions is likely not going to scale; it makes more sense to let the data tell us which questions are common.

Extracting features from text data

The data, in this case, are the text of questions asked of businesses. This poses an interesting challenge: how do we extract information from text? This is the main focus of the fields of Natural Language Processing and Information Retrieval, both of which have contributed a wide array of tools. The one we used is called term frequency-inverse document frequency (TF-IDF) vectorization. In a previous blog post, I walked through turning a document into a term frequency vector, but I left off the "inverse document frequency" part. In an upcoming blog post, I'll dig into the inverse document frequency method. For now, let's just do a quick overview.

Term frequency (TF) is what it sounds like: how often a term appears in a document. In this case, a "document" is an individual question. Document frequency (DF) is the total number of documents (questions) a term appears in. Inverse document frequency (IDF), then, is in its simplest form one divided by the document frequency; common implementations use a log-scaled variant of this ratio.

IDF aims to weight each term by its "relevance". The logic is that common terms are less relevant to the meaning of the document and rare terms are more relevant. If a term appears frequently across all questions (e.g. "what" and "how"), it's likely not relevant to any particular question. By multiplying the TF by the IDF, terms that are frequent across the whole set of documents are down-weighted and rare terms are up-weighted.
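To make the weighting concrete, here is a minimal sketch that uses the simple "one divided by the document frequency" definition from above. The sample questions and the function names `tokenize` and `tf_idf` are made up for illustration; libraries typically use a log-scaled IDF instead, so their numbers would differ.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase a question and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def tf_idf(documents):
    """Return one {term: weight} dict per document, with IDF = 1 / document frequency."""
    tokenized = [tokenize(doc) for doc in documents]
    # Document frequency: the number of documents each term appears in
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        term_freq = Counter(tokens)  # how often each term appears in this document
        weights.append({term: tf / doc_freq[term] for term, tf in term_freq.items()})
    return weights

# Made-up sample questions
questions = [
    "What time do you open?",
    "What time do you close?",
    "Do you take reservations?",
]

weights = tf_idf(questions)
for term, weight in sorted(weights[0].items()):
    print(f"{term}: {weight:.2f}")
```

In the first question, "open" appears in only one document and keeps a weight of 1.0, while "do" and "you" appear in all three questions and drop to about 0.33.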

With this method we've gone from unstructured text to a structured dataset: each row of our data is a document or "question" and each column holds the IDF-weighted term frequency for one term. So how do we group this information into sets of similar questions?
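The row-and-column layout can be sketched as follows. The input weights are hard-coded, made-up values and `to_matrix` is a hypothetical helper; real pipelines would normally store this as a sparse matrix, since most entries are zero.

```python
def to_matrix(weighted_docs):
    """Turn a list of {term: weight} dicts into (vocabulary, rows)."""
    # One column per distinct term, in alphabetical order
    vocabulary = sorted({term for doc in weighted_docs for term in doc})
    # One row per document; terms missing from a document get a weight of 0.0
    rows = [[doc.get(term, 0.0) for term in vocabulary] for doc in weighted_docs]
    return vocabulary, rows

# Made-up TF-IDF weights for two questions
weighted_docs = [
    {"open": 1.0, "time": 0.5, "what": 0.5},
    {"close": 1.0, "time": 0.5, "what": 0.5},
]

vocabulary, rows = to_matrix(weighted_docs)
print(vocabulary)  # ['close', 'open', 'time', 'what']
for row in rows:
    print(row)
# [0.0, 1.0, 0.5, 0.5]
# [1.0, 0.0, 0.5, 0.5]
```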

Identifying "frequent question groups"

Methods for identifying common patterns within data are a major part of machine learning and loosely referred to as "unsupervised learning". Click the link for lots more information on this very fascinating area of study. We'll be focusing on a particular method called K-Means clustering, which aims to identify groups of data points that are similar to one another. In brief, the "K" in K-means refers to the number of groups the algorithm will divide the data into. The algorithm chooses K points in the feature space, each of which represents the center point of a group. Each data point is then assigned to the group whose center it is closest to. The algorithm adjusts each center point to be the center of its newly formed group and repeats until it has found the center points that best split the data.

I think this is best illustrated in a visual:

Source: https://medium.com/@dilekamadushan/introduction-to-k-means-clustering-7c0ebc997e00

You can see here that there are two groups, red and blue (K, in this case, is two). With each step, the center point (i.e. the bold point) moves to the center of the points assigned to its group. By the end, you have two nicely defined groups.
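The assign-then-update loop can be sketched in plain Python. This is a bare-bones illustration, not the implementation we run in production: it starts from the first K points (an assumption made to keep the example deterministic, where real implementations usually pick starting centers at random), and the 2-D points are made up.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    """Coordinate-wise average of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(points, k, max_iterations=100):
    # Simplification: use the first k points as the starting centers
    centers = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point joins the group with the closest center
        new_labels = [
            min(range(k), key=lambda i: squared_distance(p, centers[i]))
            for p in points
        ]
        if new_labels == labels:
            break  # no point changed groups, so the centers are stable
        labels = new_labels
        # Update step: move each center to the middle of its group
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:  # keep the old center if a group ends up empty
                centers[i] = mean_point(members)
    return labels, centers

# Made-up points forming two obvious groups
points = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
labels, centers = k_means(points, k=2)
print(labels)  # [0, 1, 0, 1, 0, 1]
print(centers)
```

Here the loop settles after two passes, with one center near (0.33, 0.33) and the other near (10.33, 10.33).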

So, applying this to our use case, the algorithm will identify K groups of questions that have similar TF-IDF vectors, that is, questions that are similarly worded, which is how we're thinking of "frequently asked questions". We add another layer onto this by re-running the K-means algorithm on groups that are less coherent, meaning groups whose questions are less similarly worded than others. Some amount of this is based on manual quality review, but the result is a pipeline that can parse through large amounts of data and extract clear sets of similarly worded questions.
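One way to flag "less coherent" groups for a second pass is sketched below. The coherence measure (mean cosine similarity between each question vector and its group's centroid), the 0.8 threshold, and the names `coherence` and `groups_to_split` are all assumptions for illustration; the vectors are made up, and this is not necessarily the measure our pipeline uses.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def coherence(vectors):
    """Mean cosine similarity between each vector and the group's centroid."""
    centroid = [sum(column) / len(vectors) for column in zip(*vectors)]
    return sum(cosine_similarity(v, centroid) for v in vectors) / len(vectors)

def groups_to_split(groups, threshold=0.8):
    """Names of groups whose coherence falls below the (hypothetical) threshold."""
    return [name for name, vectors in groups.items() if coherence(vectors) < threshold]

# Made-up TF-IDF vectors: one tight group and one scattered group
groups = {
    "hours": [[1.0, 0.0, 0.0], [1.0, 0.1, 0.0]],
    "mixed": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
}

print(groups_to_split(groups))  # ['mixed']
```

Groups returned here would be the candidates for another round of K-means, with manual review still deciding where the threshold should sit.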

Next steps

Our goal is to make this analysis as performant and autonomous as possible.  As we receive new Q&A data from Google My Business, we want to be able to update our existing FAQs and add new questions to them.  What that means for Data Science is developing a pipeline to ingest new Q&A data and either join a question to an existing group or create a new group.  This also means adapting coherence thresholds and even grouping methodologies to ensure results are understandable without extensive review.
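The "join an existing group or create a new one" step might look something like the sketch below. It is an illustration of the idea only, since this pipeline is still being developed: the similarity measure, the 0.5 threshold, and the function name `assign_or_create` are hypothetical, and the vectors are made up.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def assign_or_create(vector, centroids, threshold=0.5):
    """Return the index of the group the vector joins.

    If no existing centroid is similar enough, the vector is appended to
    `centroids` as the center of a new group.
    """
    best_index, best_score = None, -1.0
    for index, centroid in enumerate(centroids):
        score = cosine_similarity(vector, centroid)
        if score > best_score:
            best_index, best_score = index, score
    if best_index is not None and best_score >= threshold:
        return best_index
    centroids.append(list(vector))
    return len(centroids) - 1

# Made-up centroids for two existing question groups
centroids = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

print(assign_or_create([0.9, 0.1, 0.0], centroids))  # 0: joins the first group
print(assign_or_create([0.0, 0.0, 1.0], centroids))  # 2: starts a new group
print(len(centroids))                                # 3
```

A fuller version would also nudge the matched centroid toward the new question, which this sketch leaves out.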

Additionally, the field of Natural Language Processing has recently been undergoing something of a renaissance of new methodologies. These new developments try to extract meaning from text based on neural network technologies, rather than the traditional techniques outlined here. One major weakness of traditional techniques is that they don't include the sequence information inherent in language. The state-of-the-art technologies create representations of text that include both individual word meanings and the sequence in which the words appear. We aim to leverage these technologies to extract additional groups of frequently asked questions.

Lots of exciting things coming down the pike.  Stay tuned!