Last October I took part in a text mining competition hosted by CrowdAnalytix. In total 173 people signed up to the challenge. This is a write-up of the approach that I took to finish in 1st place on the leaderboard.
What makes my approach stand out? It uses machine learning to let the computer detect specific text formats much like a human would.
In order to make this article accessible to non-data-scientists too, I did my best to keep the technical jargon to a minimum. Let me know in the comments if this turned out well or not.
For an online retailer, offering correct information to potential customers is of utmost importance. Personally, I have plenty of times decided not to purchase a product, or to purchase it elsewhere, simply because the store displayed too little or unclear information about the product I was interested in.
One of the major things to get right for an online store is the Stock Keeping Unit (SKU). Simply put, it is a unique identifier for a product in a certain quantity and packaging. Basically, if you as an online store are displaying a product with SKU #AZY598, your website visitors can be confident that it is the same one as #AZY598 on the website of the product manufacturer.
The challenge for large distributors is that they receive the product data from hundreds of different manufacturers and distributors. Worst of all, each manufacturer has its own format for its SKUs. Some use strictly uppercase characters with dashes in between, some use a combination of digits, and so on. On top of that, many suppliers do not provide the SKU separately, but rather mention it somewhere in the product title or description.
All in all, there is an endless number of SKU formats, and frequently the SKU is hidden in several other fields. So not surprisingly, it is a serious challenge for a distributor to simply obtain a list of all SKUs for all the products of all suppliers. It is usually a tedious manual task that has to be done by people: costly and slow. Hence the problem statement: “Find a way to automate the extraction of SKUs from product titles and descriptions, and do it as precisely as possible.”
The training data made available to participants contained about 54,000 rows covering a wide variety of products from different brands and suppliers. For every product, a title, a description and the correct SKU were provided. Here are a few examples that will give you an idea of the product variety:
chairs, toothpaste, door knobs, kitchen sinks, headphones, umbrellas, lighting kits, ink cartridges, bearings, computer hardware and software, power tools, cables, ….
A test data set of 23,000 products was made available as well. This time, the correct SKUs were hidden. Basically the participants’ job was to write code to automatically detect these. CrowdAnalytix of course knew the correct SKUs and used these to score all incoming submissions from participants.
The tables below give an impression of what the data sets looked like (displayed items are not from the original data set and are made up by me; yes, I’m also proud of uploading these tables as PNG files).
My initial approach, which is probably what most participants used, was to look through the title and description fields for each product and subject both fields to several different regular expressions. In this way I would try to extract all text that seemed to look like an SKU. That is, I wrote some code to specifically look for words containing numbers, dashes, forward slashes, periods, capitals, and so on. I would then extract these types of words for every product and put them in a list.
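To give an idea, here is a minimal sketch of this kind of regex extraction. The pattern below is a made-up example for illustration, not one of the expressions I actually used:

```python
import re

# Illustrative pattern (not the actual set used in the competition):
# runs of uppercase letters and digits, optionally joined by dashes,
# slashes or periods.
SKU_LIKE = re.compile(r"\b[A-Z0-9]+(?:[-/.][A-Z0-9]+)*\b")

def extract_candidates(text):
    """Return words that superficially look like an SKU."""
    # Plain numbers and short acronyms are noise, so require at least
    # one digit, at least one letter, and a minimum length.
    return [m for m in SKU_LIKE.findall(text)
            if len(m) >= 5
            and any(c.isdigit() for c in m)
            and any(c.isalpha() for c in m)]
```

Note that on a title like “AMD Ryzen 7 1800X Processor YD180XBCAEWOF” even this toy pattern already returns two candidates, which is exactly the selection problem I describe below.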
Once that was done, I’d walk through all the matches I found per product and choose the best one as my ‘prediction’ for the SKU. On my local holdout set I got an accuracy of about 80%. At the time that would have put me somewhere around 10th position on the public leaderboard. I wasn’t really happy with that and started thinking of a better approach to the problem. I realized the current approach was less than ideal for the following reasons:
Problem A: Sometimes not capturing all potential SKU’s
I would only extract the text from the titles and descriptions that matched my manually coded regular expressions (basically hoping these would be SKUs). I wrote lots of regular expressions to do the capturing in different ways, and yet I would encounter products where manual inspection showed an SKU was clearly present, but my code was simply unable to capture it. In those cases I couldn’t make any prediction at all!
The solution to this would be to write even more regular expressions to capture a wider variety of SKUs. But that would be a lot of manual, repetitive work, and that’s something I really avoid doing at all cost :). On top of that, since all of these regular expressions would be hard-coded, they might not perform as well on future products or new suppliers.
Problem B: When multiple matches are found for a product, then which one to choose?
For some products, the regexes would return several matches. Then I’d have to choose one of them as my prediction for the correct SKU. So, which one to choose?
I wrote a manual decision tree based on arbitrary rules I created myself. As a quick example, I would prefer longer SKUs over shorter ones, and if two candidates had the same length, I’d choose the one with the most dashes or periods in it.
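A toy version of such a tie-breaking rule could look like this (the real decision tree was more involved):

```python
def pick_best(candidates):
    """Hand-written preference: longest candidate first, ties broken
    by the number of dashes and periods. Illustrative only."""
    if not candidates:
        return None
    return max(candidates,
               key=lambda w: (len(w), w.count("-") + w.count(".")))
```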
But my arbitrary choices were not necessarily the best ones. Additionally, they would likely need adjustment if new supplier and/or product data were added in the future. What if I could train a model and let the computer determine the optimal rules by itself?
Introducing machine learning
I then came up with a way that would solve both of the above problems. I was basically going to train a model that will automatically learn what a ‘correct’ SKU looks like. After training I’d then feed the model a list of words and it would be able to output a score for ‘how much’ every word looks like an SKU. I figured that in most cases, the word with the highest score would be the SKU we’re looking for.
Seriously, just capture every word!
So first of all, in order to solve problem A, instead of extracting only the words that looked like an SKU from the product title and description, I simply captured every single word in them. I used whitespace as the word delimiter, and that formed the base of it. I also accounted for things such as periods and exclamation marks at the end of a sentence, and words surrounded by brackets.
After doing this, I ended up with a long list of words for every product. Naturally each list included lots of meaningless words, but most importantly it would pretty much always contain the SKU we’re looking for.
Here’s what that looked like for the first row from the train data:
AMD, Ryzen, 7, 1800X, Processor, YD180XBCAEWOF, Max, Turbo, Frequency, 4.00, Ghz, 3.6, Ghz, Clock, Speed, 8, Cores/16, Threads, UNLOCKED, Cache, 4, MB/16, L2/L3, Socket, Type, AM4, Extended, Frequency, Range, XFR
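A tokenizer along these lines could be sketched as follows (a simplified version, not my exact code):

```python
def tokenize(text):
    """Split on whitespace and strip punctuation glued to word edges,
    keeping internal periods and dashes intact (they may be part of an SKU)."""
    words = []
    for raw in text.split():
        w = raw.strip("()[]{}<>,;:!?\"'")
        w = w.rstrip(".")  # drop sentence-final periods, keep '4.00' intact
        if w:
            words.append(w)
    return words
```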
Train a model so it learns what an SKU looks like
Time to tackle problem B. How would I let my code automatically pick the SKU from that word list? I decided to calculate simple statistics for every extracted word, such as the length of the word, the number of dashes it contained, the percentage of capitals, whether the first and last characters are a digit or a letter, and so on. In total I generated about 20 different statistics or ‘features’ for every extracted word.
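For illustration, a feature function covering a handful of those statistics might look like this (the names and exact feature set here are assumptions, not my original twenty):

```python
def word_features(w):
    """Per-word statistics describing what the word 'looks like'."""
    n = len(w)
    return {
        "length": n,
        "n_dashes": w.count("-"),
        "pct_digits": sum(c.isdigit() for c in w) / n,
        "pct_upper": sum(c.isupper() for c in w) / n,
        "pct_lower": sum(c.islower() for c in w) / n,
        "first_is_digit": w[0].isdigit(),
        "last_is_digit": w[-1].isdigit(),
    }
```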
I also assigned a binary variable to each and every word. Whenever a word equaled the correct SKU, I’d mark it with a 1. All the others became a 0.
I then fed the table with all the word statistics into a classifier and used the binary variable as the dependent variable. This way the model could learn what an SKU looks like.
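As a sketch of that training step, with a random forest standing in for the classifier (I’m not reproducing my actual model choice here) and a tiny made-up feature table:

```python
from sklearn.ensemble import RandomForestClassifier

# Tiny made-up table: [length, pct_digits, pct_upper] per word,
# label 1 when the word was the product's correct SKU.
X = [
    [13, 3 / 13, 10 / 13],  # "YD180XBCAEWOF" -> the SKU
    [3, 0.0, 1.0],          # "AMD"
    [5, 0.0, 0.2],          # "Ryzen"
    [10, 0.4, 0.6],         # another SKU-like word
    [9, 0.0, 1 / 9],        # "Processor"
    [4, 0.75, 0.0],         # "4.00"
]
y = [1, 0, 0, 1, 0, 0]

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Column 1 of predict_proba is the 'how SKU-like is this word' score.
scores = clf.predict_proba(X)[:, 1]
```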
After training the model, the most important features seemed to be the percent of digits in the word, length of the word, percent uppercase letters in the word, percent lowercase letters in the word, and percent of strange symbols in the word.
I could now throw a list of random words at the model, and for every word it would tell me the probability of that word being an SKU. Mind you, the rules this model learned are very similar to the ‘rules’ a human uses to detect an SKU.
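The prediction step then boils down to scoring every word and keeping the top one. A generic sketch, where score_fn stands in for the trained model’s probability output:

```python
def predict_sku(words, score_fn):
    """Return the word the model considers most SKU-like.
    score_fn maps a word to its 'is an SKU' probability."""
    if not words:
        return None
    return max(words, key=score_fn)
```

In practice, score_fn would compute the word’s features and feed them to the classifier’s predict_proba.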
I checked the prediction score on some local data, and the model could tell with a 99.49% accuracy whether any randomly extracted word is the ‘correct’ SKU or not. That may seem impressive but please note that I was also throwing standard and useless words like ‘the’ and ‘screen’ at the model.
The new method seemed to work very well on my local holdout set, so I then ran the trained model on the official test set and submitted my predictions to the leaderboard. I took the first spot straight away, beating the 2nd best submission with a comfortable margin.
In the end I did some small further optimizations to improve the score slightly more until the submission deadline was reached. I ended up with a public leaderboard score of 87.755%. As I expected, it translated very well to unseen data and resulted in a private leaderboard score of 87.198%. Good enough to finish in the top spot.
When I started this text mining project I had absolutely no idea that I was going to end up using a machine learning approach. I started out just extracting SKUs from the data with regular expressions. This was certainly the straightforward approach, but quite labor-intensive, since lots of different regular expressions would have to be manually coded.
I then changed my approach and threw machine learning into the mix. Basically I first trained a classification model so it learns what an SKU looks like. Once trained, the model could tell me the probability of a particular word being an SKU. The internal structure of the model actually gets quite close to how a human would do the same task. Although it doesn’t quite reach the accuracy of a human, it works at least a million times faster. All in all a very valuable outcome that can lead to serious cost savings for a business!
In this particular text mining project, the machine learning approach provided the following advantages over the more straightforward regular-expressions approach:
- Slightly higher accuracy (at least in this project)
- Can be easily retrained when new products and suppliers are introduced.
- The classifier outputs a probability for each prediction, so the client can inspect how sure the model is of each one.
- A probability cutoff can be set by the client in order to output a ‘None’ prediction for predictions with high uncertainty.
- Virtually no risk of failing to extract some SKUs.
- No need to define arbitrary decisions.
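To make the threshold idea concrete, a cutoff could be applied roughly like this (an illustrative sketch, not the competition code):

```python
def predict_with_threshold(words, score_fn, cutoff=0.5):
    """Return the best candidate only when the model is confident
    enough; otherwise return None so a human can review the product."""
    if not words:
        return None
    best = max(words, key=score_fn)
    return best if score_fn(best) >= cutoff else None
```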
If you have any questions, please drop me a message below in the comments. I’ll do my best to answer as well as possible.