Notes from Industry
On the importance of understanding and disregarding technical considerations in applied machine learning, or, When to use your putter to get out of a sand trap.
The typical response to an errant golf ball landing in a sand trap is to swear, sigh, and then trudge over to the sand trap with a sand wedge. It’s not a tough decision. Every golf bag has a sand wedge, and unlike all of the other decisions about which club to use – should I use a 5 iron? a hybrid? ironwood? a 6 iron? – the sand trap has a club designed, and named, specifically for it. So you might be surprised to learn that there are serious golf professionals who advise that, whenever possible, you should use a putter in a sand trap. Why? If there is a better tool, why recommend a putter? Because the best tool is not the one that most perfectly fits the job, it’s the tool that you’re best at using for the job.
If you’ve worked on the technical side of software, you’ve probably experienced an older engineer giddily telling you about some tool they wrote in awk 15 years ago that perfectly solves some problem you have. Knowing the right tool for the right job is why management consultants get paid so much to come up with plans. Management consultants aren’t valuable because they’re the only people smart enough to solve hard problems; they’re valuable because they’ve seen a lot of similar problems, and they are smart enough to have learned from them. Your company might get acquired once, but the Big Three have literally centuries of combined M&A experience. The old engineer’s (the awktegenarian’s) advantage is not that he’s fluent in command line tools, or that he’s smarter than a junior engineer; it’s that he has seen this problem before.
But is using awk a good solution? The answer depends less on the performance of the awktegenarian’s program and more on current organizational needs and capacities. If he leaves, or somebody else needs to maintain or modify the program, or the program needs to be explained to an executive, using awk (an ancient command line language) is worse than a spreadsheet, even if the spreadsheet is slower, cannot handle as much data, and requires that you manually edit the data for twenty minutes each time before you can use it.
At Salesforce my team develops and deploys models for a variety of NLP tasks, but our bread-and-butter work is text classification models for emails. Fitting a model is really two fits: you fit your model to your data, and you fit your model to your organization. In this post I’m going to walk through an example text classification problem to illustrate some of the insights we’ve learned in choosing the right model for the problem, from a technical data science perspective. And then I’m going to show how you throw that all out the window to fit your organizational constraints.
From a technical perspective, what model we choose ultimately depends on how well our training data captures the diversity of the data it’ll score in production. Salesforce has a massive diversity of customer organizations. There are organizations with two people and organizations with tens of thousands. They sell units, packages, licenses, and engagements. Some use a single, simple sales process; others have dozens of different pipelines for a diversity of products. Some companies are more likely than other companies to allow us to train models on their data. So when we create datasets we always know a priori that our training data is both incomplete and biased, which is the primary thing we think about in choosing a model. Regardless of the model we choose for the task at hand, it needs to be deployed, scaled up, and maintained for the lifetime of the product at a reasonable cost. Salespeople need to be able to demo it to customers, and product managers need to be able to demo it to executives. These organizational needs typically outweigh technical ones.
Task and Corpus:
The example task for this blog post is to classify a sentence as a question/not a question. I chose this because it is a simple, well-defined task and we mostly agree about what a question is. Most interesting text classification tasks will involve a great deal of effort haggling about the definition, such as whether "Let’s get lunch sometime" is or is not a scheduling request.
For a corpus, I’m using a dataset prepared by Jack Kingsman. This dataset is particularly useful for our purposes because it contains a lot of questions (33.81% of sentences), the questions are easy to label (none are missing the question mark), and many of the questions (71.94% of questions) are ungrammatical, rhetorical questions, which gives this problem its substance. (Since I’m using this corpus for research purposes, the specific contents of the dataset do not matter beyond there being numerous questions with peculiar grammar.) There are three basic types of questions. First are wh-questions that use the Five Ws (19.55%), for example "What really happened when the wizards and warlocks revealed what they had?" Next are yes-no questions that yield a yes or no response (8.51%), like "Was it really from Yemen?" or "WILL YOU ANSWER?" Finally we have the ungrammatical-rhetorical questions (71.94%), like "Date posted (early)?" or "Coord w/ foreign actors for payment/money disperse?"
Since all questions in this dataset contain a question mark, I label each sentence as a question/not a question by the simple presence of a terminal question mark. I then split the dataset 80/20 for training and testing. Since the primary consideration in determining which modeling approach best fits the data boils down to "how complicated is your problem, and how completely does your training data capture it?", I manipulate the dataset to create different but analogous contexts in which to see how well the training data captures the diversity of the test data. In the first dataset, the test set is untouched. In the second dataset, I remove all of the question marks from the test data. In the third dataset, I remove the question marks and mask the Five W’s with an unknown token ("XXXXX") in the test data. In the fourth dataset, I test only on yes-no questions (and non-questions), train without any yes-no questions, and also remove the question marks from the test data. In the fifth dataset, I take the fourth dataset, but remove the question marks from the training set as well. The table below describes the datasets and their differences.

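To make the five variants concrete, here is a minimal sketch of how they could be constructed. The helper names, the use of pandas and scikit-learn, and the assumption that yes-no questions are flagged in a separate column are all mine, not details from the original experiments.

```python
import re

import pandas as pd
from sklearn.model_selection import train_test_split

# The Five W's named in the post; "how" is sometimes added but omitted here.
FIVE_W_RE = re.compile(r"\b(who|what|when|where|why)\b", re.IGNORECASE)

def label(sentence: str) -> int:
    """A sentence counts as a question iff it ends with a question mark."""
    return int(sentence.rstrip().endswith("?"))

def strip_question_marks(text: str) -> str:
    return text.replace("?", "")

def mask_five_ws(text: str) -> str:
    return FIVE_W_RE.sub("XXXXX", text)

def build_datasets(sentences, is_yes_no):
    """Return the five train/test variants keyed by dataset number.

    `sentences` is a list of raw sentences and `is_yes_no` a parallel list of
    booleans marking yes-no questions (an assumption about how the corpus is
    stored, not a detail from the post).
    """
    df = pd.DataFrame({"text": sentences, "yes_no": is_yes_no})
    df["label"] = df["text"].map(label)            # label before any masking
    train, test = train_test_split(df, test_size=0.2, random_state=0)

    # Dataset 2: question marks removed from the test set (Dataset 1 is untouched).
    d2_test = test.assign(text=test["text"].map(strip_question_marks))
    # Dataset 3: question marks removed and the Five W's masked in the test set.
    d3_test = d2_test.assign(text=d2_test["text"].map(mask_five_ws))
    # Dataset 4: train without yes-no questions; test only on yes-no questions
    # and non-questions, with question marks removed.
    d4_train = train[~train["yes_no"]]
    d4_test = test[test["yes_no"] | (test["label"] == 0)]
    d4_test = d4_test.assign(text=d4_test["text"].map(strip_question_marks))
    # Dataset 5: same as Dataset 4, but question marks are removed from the
    # training set as well.
    d5_train = d4_train.assign(text=d4_train["text"].map(strip_question_marks))

    return {
        1: (train, test),
        2: (train, d2_test),
        3: (train, d3_test),
        4: (d4_train, d4_test),
        5: (d5_train, d4_test),
    }
```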
Bottom line up front:
My final recommendation for fitting a model to your data is pretty simple. Use Regular Expressions if your classes are tightly coupled to a small set of features. If separating your classes requires more complicated rules and your training data captures the diversity of your test data, then use traditional Machine Learning, and, as a last resort, use deep learning. Of course, "capturing the diversity" of the data isn’t a precise or quantifiable metric, so I’ll walk through the different datasets and show where a Random Forest is close enough (Datasets 1, 2, 3, and 5) and where it isn’t (Dataset 4). This is summed up in the following flowchart:

It’s hard (read: impossible) to know, a priori, the answers to these questions without exploring the data. In a sense, you use Regular Expressions to find out whether your classes can be divided with a small set of features (and therefore whether you should use Regular Expressions), and you use machine learning to find out whether your training data captures the diversity of the data in the wild (and therefore whether you should use machine learning). The challenge here is not technical, but emotional. Once you create a model, especially one you worked hard on, it is difficult to delete it. But getting precious about your models prevents you from doing the right thing. Be willing to create models, and to do so carefully and thoughtfully, and then to throw them out. You must be willing to murder your darlings.
In a way I’m short-circuiting the need to explore the data, because we already know a lot about questions without exploring this corpus: wh-question words form a closed class, and the syntax of questions rarely appears outside of questions. This is a bit like starting a project by finding out that the data is surprisingly well behaved after exploring it for a month, but it makes for a better blog post because we can go straight to the meat of the problem.
Start with Regular Expressions
If you can divide the classes with a small set of features, RegEx is the best way forward. If every question contains a terminal question mark, and terminal question marks appear nowhere else in the text, then using RegEx will be perfect. I could have just looked for the presence of a terminal question mark, but that would be an unrealistically weak strawman. Remember, we already know a lot about questions, and if we didn’t we would have explored the data until we did: we know that people don’t always use question marks (e.g., in casual conversation, chat, grammatical errors, etc.) and that questions tend to use words from the Five W’s. So I made two RegEx classifiers: one that just looks for terminal question marks and the Five W’s, which I call the Brittle RegEx Classifier, and one that does the same but additionally looks for the sentence-starting patterns of yes-no questions, which I call the Robust RegEx Classifier. The additional patterns beyond the presence of a terminal question mark improve recall at the expense of precision. For example, the sentence "We know what this means." is a positive match for the Five W’s pattern but is not a question, and "Is this about the virus or something else" will produce a positive match for the Robust RegEx Classifier but not the Brittle one. The two RegEx classifiers illustrate the value of understanding your data.
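Here is roughly what the two classifiers could look like. The exact patterns, especially the list of yes-no auxiliaries, are my guesses at what is described above, not the regexes I actually used.

```python
import re

FIVE_W = re.compile(r"\b(who|what|when|where|why)\b", re.IGNORECASE)
TERMINAL_QMARK = re.compile(r"\?\s*$")
# Sentence-initial auxiliaries that typically open yes-no questions;
# the word list here is an illustrative assumption.
YES_NO_START = re.compile(
    r"^\s*(is|are|was|were|am|do|does|did|will|would|can|could|"
    r"should|shall|has|have|had)\b",
    re.IGNORECASE,
)

def brittle_is_question(sentence: str) -> bool:
    """Terminal question mark, or a Five-W word anywhere in the sentence."""
    return bool(TERMINAL_QMARK.search(sentence) or FIVE_W.search(sentence))

def robust_is_question(sentence: str) -> bool:
    """Brittle rules plus sentence-initial yes-no question patterns."""
    return brittle_is_question(sentence) or bool(YES_NO_START.match(sentence))

print(brittle_is_question("We know what this means."))                   # True (false positive)
print(robust_is_question("Is this about the virus or something else"))   # True
print(brittle_is_question("Is this about the virus or something else"))  # False
```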


On Dataset 1, where the test data still has question marks, both RegEx classifiers perform extremely well. On Dataset 2, where test cases are missing question marks, the recall and precision for both classifiers drops, but not to zero, thanks to the additional features beyond a terminal question mark. On Dataset 3, where test cases are missing question marks and question words are masked, the two models diverge in performance. The Robust RegEx Classifier’s recall takes a further drop, since only the yes-no question pattern will find questions now, and yes-no questions are a minority of questions, but the Brittle RegEx Classifier’s performance drops all the way to zero. On Datasets 4 and 5 the Robust RegEx Classifier has excellent performance again.
How could we improve the Robust RegEx Classifier’s performance on Datasets 1, 4, and 5, where it did well? Since the recall is nearly perfect, we want to improve precision, which is to say, find a way to reduce false positives. The first approach to doing so is to add exceptions to our pattern. I advise against this.
As you add exceptions, you’ll start to notice that some exceptions require further exceptions, and so forth. If you keep going down this road, you’re going to accumulate crippling technical debt. One of my favorite papers is a dissertation primarily about memory allocation for a program that does automatic hyphenation. The algorithm hyphenates via rules, exceptions, and exceptions-to-exceptions, about five layers deep. This kind of nested if-statement algorithm can be very powerful, and very fast, but it is notoriously difficult to debug or understand. Moreover, it’s easy to add rules for new cases and inadvertently make your performance worse without understanding why. RegExes have the added bonus of being significantly easier to write than they are to read, so as you create more exceptions, and more nuanced and detailed RegExes, your system becomes harder to understand, harder to debug, and harder to adapt to new cases.
The RegEx classifiers’ performance plummeted on Datasets 2 and 3, and for good reason: the features picked out by our RegEx no longer correlated with the classes we wish to separate. So what is there to do about this? We would need to look at the data and create more patterns that we think capture the difference, but once the patterns become unmanageable (a subjective assessment), it’s time to label data and let the data determine how to map features to classes. Even the best Regular Expressions sometimes need to be deleted. This brings us to the next approach: machine learning.
Develop an ML Model
Once you start down either the rules-and-exceptions-and-exceptions-to-your-exceptions path or the hand-tuned-features-and-class-separation path, it is time for a machine-learned approach. If you find yourself trying to implement three-level lexicographical ordering rules, you have made a mistake much earlier that led you to this point in your life. You need to have labeled data for a machine-learned approach, and if you started making sufficiently complicated RegEx rules, you probably had a decent set of sample data anyway. Let’s see how a Random Forest model performed on our five datasets.
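Before the results, here’s a minimal sketch of the kind of pipeline I mean. The bag-of-n-grams featurization, the token pattern that keeps punctuation as tokens, and the hyperparameters are illustrative assumptions, not the original settings.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

def fit_random_forest(train_texts, train_labels):
    """Fit a bag-of-n-grams Random Forest on raw sentences."""
    vectorizer = CountVectorizer(
        ngram_range=(1, 2),
        token_pattern=r"[\w']+|[?.!,]",   # keep '?' and '.' as their own tokens
        lowercase=True,
    )
    model = make_pipeline(
        vectorizer,
        RandomForestClassifier(n_estimators=200, random_state=0),
    )
    model.fit(train_texts, train_labels)
    return model

def top_features(model, k=10):
    """Most important n-grams. scikit-learn's importances are impurity-based
    and sum to one; the out-of-bag flavor described below is another option."""
    forest = model.named_steps["randomforestclassifier"]
    names = model.named_steps["countvectorizer"].get_feature_names_out()
    return sorted(zip(forest.feature_importances_, names), reverse=True)[:k]
```

For each of the five variants from the earlier sketch, this would be something like `model = fit_random_forest(train["text"], train["label"])`, scored against the corresponding test split.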

The intuition for the ML model’s performance is much simpler than that for the RegEx classifiers: the model performs better when the training and test data are more similar. Datasets 1, 2, and 3 all actually have the same training data, and deviate only in their test data, which sequentially moves further and further from the training set.
Removing question marks from the test set (Dataset 2) causes an appreciable dip in performance because the primary feature the model focuses on is the question mark. In fact, the question mark and the period are the two most important features, and by an order of magnitude. I chose a Random Forest model for two reasons: it performs well in a variety of contexts, and it provides feature importances by measuring the out-of-bag error for each feature. Feature importances are, like many things, "kind of like a probability" but not quite: each feature’s importance is a number between zero and one, and they all add up to one. For a feature to be important, it must be commonly chosen to branch the little decision trees in our random forest, which means that it both powerfully separates the classes and occurs relatively frequently. If your most powerful features don’t occur frequently (or at all) in your test set, performance drops. Below is a table of feature importances for a Random Forest model trained on each of the five datasets.

All models need the training data to resemble the data in the wild in some important way, and in this case "important" means with respect to how the data are featurized. There are two ways to visualize this relationship. The first is to look at how deep down the list of less important features you have to go until you have features that appear in the majority of your test cases.

This graph is easier to read, but it disguises the fact that as you dig deeper into the list, each feature is less important: it may appear in more of your test cases, but it is not very helpful for separating your classes. Our features are subject to Zipf’s Law, which means that a term’s frequency is roughly inversely proportional to its rank in a frequency table. Discriminative features are very likely to be uncommon. The question identification task sidesteps some of this, since the Five W’s and question marks are a closed set and appear relatively frequently, but nonetheless the lower ranked features are significantly less frequent. To try to capture this phenomenon, I’ve made another, harder to read plot, below:

The x-axis is the cumulative feature importance, and the y-axis is the percentage of test cases that contain any of the n-grams. For example, in Dataset 1 the first two features ('?' and '.') account for a little over 40% of the importance, and appear in almost every single test case. The ML model does well on Dataset 1 because the most important features frequently appear in test cases. Dataset 4, on the other hand, illustrates nearly the opposite: the most important features do not show up in nearly as many of the test cases, and that percentage starts to climb only as we get into increasingly less useful features.
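For reference, the two quantities on that plot could be computed roughly like this, again assuming the CountVectorizer + RandomForestClassifier pipeline sketched earlier:

```python
import numpy as np

def importance_coverage_curve(model, test_texts):
    """x: cumulative feature importance, most important feature first;
    y: share of test cases containing at least one of those features."""
    vec = model.named_steps["countvectorizer"]
    forest = model.named_steps["randomforestclassifier"]
    presence = (vec.transform(test_texts) > 0).toarray()      # test case x feature
    order = np.argsort(forest.feature_importances_)[::-1]     # most important first
    cum_importance = np.cumsum(forest.feature_importances_[order])
    covered = np.zeros(presence.shape[0], dtype=bool)
    coverage = []
    for j in order:
        covered |= presence[:, j]
        coverage.append(covered.mean())
    return cum_importance, np.array(coverage)
```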
Is this anything you could use, a priori, to determine that your training and test data are not similar enough? Probably not. You have to develop the model carefully and earnestly to determine that it won’t do, and then you have to murder your darling model. Every time, you have to get into the data. Hopefully this section helps build your intuition for understanding the interaction between what the model is learning from the featurization of the training data, and how readily it can apply that understanding to the test data.
One way to close the gap is to remove information from your training data to prevent overfitting on certain features that appear infrequently (or never) in your scoring data. Dataset 5 is identical to Dataset 4, except without any question marks in the training data. Performance improves dramatically and question marks disappear from the important features list. If you cannot modify your data, try to find more data sources, ideally ones that come directly from, or at least represent different dimensions of, the data you’ll actually be scoring. Reach out to other people on your team or nearby teams and see if they can think up any examples that might be tricky for your model to handle. People on other teams love coming up with examples that show that they are smarter than your models. I found a surprising depth of counterexamples my coworkers could think up when I made our models available in a Slack bot (more about that later). Before you dust off TensorFlow, try to find more training data. If you’ve exhausted all of those options, then it’s time for deep learning.
Deep Learning as Last Resort
You’ve concluded that your training data doesn’t capture the data that you’ll see in the wild. There are a variety of real, excusable reasons why you’ve ended up in this position. Perhaps mistakes were made gathering the data. Perhaps your MSA only covers some of your users, so you get a biased sample from opt-in data. Perhaps you purchased an analogous dataset, but it is outdated or unworkably idiosyncratic. Whatever the reason you believe your training data do not cover the data in the wild, if you cannot augment or fix your datasets, it’s time for deep learning.
Deep learning is very much all the rage, and I am far from an expert on it. Deep learning seems to handle a variety of tasks robustly, provided that you have enough data to allow it to optimize a bazillion parameters. I tried a simple Bidirectional LSTM model, and it does a pretty good job.
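For the curious, a model along these lines can be put together in a few lines of Keras. The vocabulary size, sequence length, and layer widths below are illustrative guesses, not the exact model I trained.

```python
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE, MAX_LEN = 20_000, 100   # illustrative settings

def build_bilstm(train_texts):
    """A small bidirectional LSTM classifier over raw sentences (TF 2.x)."""
    vectorizer = layers.TextVectorization(
        max_tokens=VOCAB_SIZE, output_sequence_length=MAX_LEN
    )
    vectorizer.adapt(train_texts)   # train_texts: raw sentences from the training split

    model = tf.keras.Sequential([
        vectorizer,
        layers.Embedding(VOCAB_SIZE, 64, mask_zero=True),
        layers.Bidirectional(layers.LSTM(64)),
        layers.Dense(1, activation="sigmoid"),   # P(question)
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# model = build_bilstm(train["text"].to_numpy())
# model.fit(train["text"].to_numpy(), train["label"].to_numpy(), epochs=5, validation_split=0.1)
```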

More importantly, it performs robustly across all five datasets. So why don’t I recommend that you just use deep learning every time and call it a day?
If you’ll recall, this blog post is about how to choose the right model. Part of that means choosing the model that fits your data. The other part is choosing a model that fits your organization. Regular Expressions have a lot of advantages in an organizational setting. They don’t require GPUs to train and they can be deployed as part of whatever code you’re already deploying. If your executives or managers insist that a specific example works, you can make sure it works. And if we’re being brutally honest, Regular Expressions hardly need labeled data at all, only a handful of positive examples. Deep learning models need GPUs, tons of labeled data, and ample time to explore an enormous hyperparameter space. If you train a deep learning model, but your company has never deployed one before, or you yourself are going to be the one to deploy it, this might cost your project months or years of time. You might be better off with a garden-variety ML model with a worse F1 score, but which can be deployed and maintained by other people in a timely manner.
If I’m going to tell you that the right model is one that you can deploy with the resources you have, trained with the data you can get, in the amount of time that your boss has patience for, why did I go through all of this detailed technical analysis? Because if you try to fit a model to the organization without first understanding how to fit it to the data, you will not succeed. I’ll share some examples from when I designed the signature parser model for Salesforce.
I had three organizational constraints that I needed my model to fit. The first was that the team that deployed the model wrote jobs for Kafka Streams in Kotlin, so I needed to package my model as a library deployable as a JAR. This meant that the two machine learning frameworks my team had previously used – MLlib and TensorFlow – were out. I ended up using SMILE (written by Haifeng Li) because it provided the minimum machine learning functionality in Scala, and I implemented the missing tools for my data science workflow myself. The second constraint was that a signature parser needs to be trained on Personally Identifiable Information (PII), but we do not use the PII in customer data at Salesforce. Since a signature is PII, I ended up fabricating training and test data to have statistical resemblance to our customers’ data, but without actually using their data. I wrote about data fabrication here. Finally, I made the title scoring model internally available as a Slack bot, and discovered that the product managers had a couple of titles that they just needed to see it score correctly. Theoretically, the correct way to solve this is to go through the laborious process of labeling more data and retraining the model, but that’s time consuming and there were a dozen other projects that needed my attention more badly, so I created the Otherwise Model. I put a simple RegEx in front of the model, which handled the special cases they needed; otherwise my ML model did the work.
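The Otherwise Model is nothing fancier than a guard clause in front of the real model. Here’s a minimal sketch; the patterns and labels are placeholders standing in for the handful of titles the product managers cared about.

```python
import re

# Placeholder special cases, not the real ones: (pattern, label to return).
SPECIAL_CASES = [
    (re.compile(r"\bchief\s+\w+\s+officer\b", re.IGNORECASE), "EXECUTIVE"),
]

def otherwise_model(title, ml_model):
    """Check the hand-written special cases first; otherwise defer to the ML model."""
    for pattern, label in SPECIAL_CASES:
        if pattern.search(title):
            return label
    return ml_model.predict([title])[0]
```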
PS If you’re wondering, here is a graph comparing the F1 scores of each model against each dataset. Notice that nowhere here does it mention that spinning up new infrastructure and components to support TensorFlow at large scale is non-trivial.
