
Lessons learned from the Hunt for Prohibited Content on Kaggle

Previously we looked at detecting counterfeit webshops and feature engineering. Now we show some of our progress and share the insights (and mistakes) from competing in a related Kaggle challenge.

Vowpal Wabbit (close to) for the win

Kaggle hosted a contest together with Avito.ru. The task was to automatically detect illicit content in the advertisements on their site.

Many competitors used Vowpal Wabbit for this challenge, some aided by the benchmark from Foxtrot, others by starting out the challenge with it. The highest-ranking model with VW as a base was yr‘s implementation; this #4 spot used the benchmark provided by Avito as part of the pipeline.

Our team (Jules van Ligtenberg, Phil Culliton and me, Triskelion) ended up in 8th place with an average precision of ~0.985. A team of Russian moderators had an average precision of ~0.988 when labeling the dataset. Our team did not speak Russian, just English, Dutch and MurmurHash.

It is truly amazing that so many international teams that have no knowledge of Russian language made it to the top. Ivan Guz – Competition Admin

Insights

What did work

  • Ensembling Vowpal Wabbit models. Simply averaging the ranks of different submission files could raise the score. Combining a squared, logistic and hinge loss model this way gave a score of ~0.982, while the individual models each scored around ~0.977.
  • Using an illicit score. This changes the problem from classification to regression. Instead of training models on labels of [illicit, non-illicit], we used the provided “closing hours” and “is proved” variables to create an “illicit score” (sketched after this list). The worst offenders under this score are ads that are “blocked” by a moderator, “proved” by an experienced moderator, and “closed” within minutes of being published on the site.
  • All loss functions gave good results. Initially we gravitated towards logistic loss and hinge loss; later we added squared loss and quantile loss. For example, averaging the ranked outputs of a logistic and a hinge loss model, with all parameters and data the same, gave a ~0.003 increase in score. We will study these “hybrid” loss ensembles further.
  • Neural networks. A feature added to VW with the motive of winning some Kaggle competitions (Thank you! One louder, indeed.). It gave a nice and welcome non-linear boost of ~0.001 at the end with --nn 10.
  • Reducing overfitting. By averaging all the best performing ensemble models we created a model with a lower score on the public leaderboard, but with less variance between the public and private leaderboards. We did not hit the point of diminishing returns on leaderboard improvements: during the competition, 9th place was our highest position on the public leaderboard.

    Ben Hamner explains overfitting to the Kaggle leaderboard and provides some insights.

  • 2-grams on ALL the features. I have no idea why this works; it is unconventional and makes little sense on paper. Perhaps it works like a lesser form of quadratic features. Perhaps it works because there is text crudely mixed in between the features. Perhaps VW really is THAT good at ignoring irrelevant features. Or perhaps it works because spreadsheet and dataset creators put semantically related columns close to one another.
  • Having access to a fast 32GB RAM machine. One of the team members was able to quickly train and inspect Vowpal Wabbit models with a huge bit size (2^30). Fewer collisions usually (but not always) work better for learning.
  • Encoding integers both as categorical and as numerical variables. For example: year_2009 and year:2009 (see the sketch below).
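
For concreteness, here is a minimal sketch of the illicit score and the double integer encoding. The column names (is_blocked, is_proved, close_hours), helper names and weighting are assumptions for illustration only, not our exact pipeline.

# Hedged sketch: build a VW regression line with an "illicit score" label
# and an integer encoded both as a categorical and a numerical feature.
# Column names, helpers and weights are assumptions, not our exact preprocessing.

def illicit_score(is_blocked, is_proved, close_hours):
  """Crude regression target: the worst offenders are blocked by a moderator,
  proved by an experienced moderator, and closed within the hour."""
  score = 0.0
  if is_blocked:
    score += 1.0
  if is_proved:
    score += 1.0
  if close_hours is not None and close_hours < 1.0:
    score += 1.0
  return score

def to_vw_line(ad_id, is_blocked, is_proved, close_hours, year, price):
  label = illicit_score(is_blocked, is_proved, close_hours)
  features = [
    "year_%d" % year,   # categorical encoding
    "year:%d" % year,   # numerical encoding
    "price:%s" % price,
  ]
  return "%s '%s |f %s" % (label, ad_id, " ".join(features))

print(to_vw_line("10000074", True, True, 0.2, 2005, 205000.0))
# 3.0 '10000074 |f year_2005 year:2005 price:205000.0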

What did not (quite) work

  • Hyperparameter tuning. We did not set up a pipeline with cross-validation and model evaluation according to the competition’s metric. Parameters were tweaked with modesty, based on slightly worried hunches.
  • TF-IDF. We suspected that TF-IDF would improve the score, but fitting a TF-IDF transform on both the train and test set and replacing all datasets with properly namespaced features proved too cumbersome and complex.
  • Quick character encoding handling. It took me far too long to barely get this working, and then I started over, scrapping the benchmark code completely, never improving on it. Turning Cyrillic characters into Latin characters did help, but it is a dirty workaround (a transliteration sketch follows this list).
  • Proper dataset inspection. All the column headers and variables were in a language our team did not speak. All feedback on model performance was leaderboard-driven. I initially missed ~2.5 million lines in the train set (more on that later).
  • Bagging SVDs. Though these models beat Avito’s own benchmark of ~0.925, at ~0.952 they did not contribute to the final ensemble.
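
For reference, a minimal transliteration sketch. We did not use this exact code; the third-party unidecode package is one assumed way to turn Cyrillic into rough Latin equivalents (the dataset example further below shows our own, cruder scheme with underscores).

# Hedged sketch: transliterate Cyrillic into rough Latin equivalents.
# Assumes the `unidecode` package; not the workaround we actually shipped.
from unidecode import unidecode

text = u"Автомобиль в идеальном состоянии"  # "A car in perfect condition"
print(unidecode(text))                      # roughly: Avtomobil' v ideal'nom sostoianii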

Carter swamp rabbit

President Carter confessed to having limited experience with Vowpal Wabbits, preferring to stick with R.

What could have worked

  • Nearest neighbours. Alexander D’yakonov combined nearest neighbours (120 neighbours, weights based on distance) with a basic Vowpal Wabbit model to rank #5.
  • Factorization machines. Michael Jahrer and Mikhail Trofimov used factorization machines to score over 0.98.
  • SVC. The winners, Giulio and Barisumog, report using SVC successfully.
  • Random Forests. With its track record as one of the most powerful algorithms in machine learning, RF working here is probably a given. Our best exploratory model (useful for spotting good features, etc.) used sklearn’s Random Forests too, albeit with a more moderate score of ~0.805.
  • TF-IDF. Nearly everyone in the top 10 had TF-IDF-vectorized their datasets (a minimal sketch follows this list).
  • Using Avito’s provided benchmark. It contained both domain knowledge and a few very specific tricks in preprocessing the data.
  • Training models for each category. Trading increased complexity for increased predictive power.
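
For what it is worth, the TF-IDF vectorization mentioned above is easy to sketch with scikit-learn. This is not any top-10 team’s actual code, and the toy texts and labels below are made up; the idea is simply to fit the vocabulary on all available text and feed the sparse matrix to a linear model.

# Hedged sketch: TF-IDF vectorization of ad text with scikit-learn.
# Toy data and labels; not a reconstruction of any team's pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train_texts = ["prodam renault logan 2005", "kupi iphone deshevo srochno"]
train_labels = [0, 1]                      # 0 = fine, 1 = illicit (toy labels)
test_texts = ["prodam vaz 2107 nedorogo"]

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
vectorizer.fit(train_texts + test_texts)   # fit the vocabulary on all text
X_train = vectorizer.transform(train_texts)
X_test = vectorizer.transform(test_texts)

clf = LogisticRegression()
clf.fit(X_train, train_labels)
print(clf.predict_proba(X_test)[:, 1])     # probability of being illicit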

Ease of implementation

I very much agree with FastML’s article on this competition vs. the industry. In industry it is enough to hit the (often moving) target, and profitable to hit the bullseye. On Kaggle one is splitting arrows.

Vowpal Wabbit vs. the industry

Solutions based on Vowpal Wabbit would work well enough for Avito, or for any big moderator labeled dataset for that matter.

Though even with Vowpal Wabbit and basic techniques, caution is required:

  • Using an ensemble of 10 different Vowpal Wabbit models means running 10 instances of Vowpal Wabbit if you want a real-time prediction.
  • Train a specific model for every category, and a site with 1000+ categories quickly becomes unmanageable.
  • TF-IDF combined with retraining on new data adds quite a preprocessing step and increases complexity.

Highly tuned single Vowpal Wabbit models approach 0.98. Averaging the outputs from two moderately inspired Vowpal Wabbit models gets one comfortably in the top 10% range and near the top 10 of the leaderboard.

Bag-of-features

The dataset had a column (attributes) which contained a JSON object. We really wanted to create tidy features from these, but to rely on Google Translate for feature engineering was too time-consuming. We threw everything the script could parse into one bag of “features”, mixing numerical, categorical and text features.

1 '10000074 |f category_x_transport emails_cnt:0.0 emails_cnt_x_0 avtomobil_ v ideal_nom sostoanii exclamationmark 2005 goda dekabr_ vse detali rodnye dva hozaina nikakih vlojenij ne trebuet komplektazia polnaa kondizioner gur perednie steklo pod_emniki 2 poduski frontal_nye vse rabotaet otlicno signalizazia s obratnoj svaz_u muzyka mr3 lubye proverki za vas scet exclamationmark exclamationmark exclamationmark renault logan 2005 price:205000.0 price_x_205000 phones_cnt:0.0 phones_cnt_x_0 urls_cnt:0.0 urls_cnt_x_0 ob_em_dvigatela:1.6 ob_em_dvigatela_x_1_6 model__x_logan marka_x_renault tip_dvigatela_x_benzinovyj korobka_peredac_x_mehaniceskaa probeg_x_180_000_189_999 sostoanie_x_ne_bityj rul__x_levyj tip_kuzova_x_sedan zvet_x_seryj privod_x_perednij god_vypuska:2005.0 god_vypuska_x_2005 subcategory_x_avtomobili_s_probegom

First line from Vowpal Wabbit’s test set
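
A hedged sketch of how such a line can be produced: flatten the row and whatever the attributes JSON contains into one |f namespace, writing numbers as name:value features and everything else as name_x_value tokens. The helper names and cleaning rules below are assumptions, not our exact script.

# Hedged sketch: throw a row (including its JSON attributes) into one bag of features.
# Helper names and cleaning rules are assumptions, not our exact preprocessing.
import json
import re

def clean(token):
  """Lowercase and replace characters that would break the VW format."""
  return re.sub(r"[\s:|]+", "_", str(token).lower().strip())

def bag_of_features(row):
  feats = []
  attributes = json.loads(row.pop("attrs", "{}") or "{}")
  for key, value in list(row.items()) + list(attributes.items()):
    try:
      feats.append("%s:%f" % (clean(key), float(value)))    # numerical feature
      feats.append("%s_x_%s" % (clean(key), clean(value)))  # plus a categorical copy
    except (TypeError, ValueError):
      feats.append("%s_x_%s" % (clean(key), clean(value)))  # categorical / text feature
  return "|f " + " ".join(feats)

row = {"category": "transport", "price": "205000", "attrs": '{"god_vypuska": 2005}'}
print(bag_of_features(row))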

Using this data-agnostic approach and little to no feature engineering, one can use Vowpal Wabbit to get good scores. If you have a good moderator-labeled dataset, but no good solution yet, contact me or leave a message: our team would love to keep working on such datasets.

In short, we did not treat the data or Vowpal Wabbit with much respect at all. We threw millions of men at the Wabbit and it left only a cave surrounded by bones.

Killer Rabbit Attack

How I forgot ~2.5 million rows and almost got away with it.

It took me a long time to join the competition, because I couldn’t get the benchmark running. Normally a lot of inspiration and momentum comes from running or recreating the benchmark. I quickly became team UnicodeEncodeError.

I’ve worked with European languages, which do have their fair share of diacritics and other arcane symbols, but Windows + The Python Benchmark + Russian text equalled zero for me.

When I did finally submit my first VW predictions I got a score of around ~0.971. By (incorrectly) answering a question by yr on the forums, I finally found out that the dataset, when read on Windows in text mode, produced around 1.5 million lines, while reading it with Pandas or on other platforms gave the full size. Note to self: keep writing files in “wb” mode, and start reading files in “rb” mode.
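
The underlying issue, as far as I can tell (an assumption, not something we verified in detail): on Windows, a file opened in text mode stops reading at a stray Ctrl-Z (0x1A) byte, so a TSV with binary junk in a text column can silently lose millions of rows. A minimal check:

# Hedged sketch: compare line counts in text mode vs. binary mode.
# On Windows, text mode ("r") can stop at a stray Ctrl-Z (0x1A) byte;
# binary mode ("rb") reads the whole file. The file name is assumed.

def count_lines(path, mode):
  with open(path, mode) as f:
    return sum(1 for _ in f)

#print(count_lines("avito_train.tsv", "r"))   # text mode: may stop early on Windows
#print(count_lines("avito_train.tsv", "rb"))  # binary mode: full line count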

Wanting to learn (without making mistakes)

According to Sergey Yurgenson there are at least three types of Kagglers:

  • Those who want to learn,
  • those who want to win money,
  • those who want to increase their reputation.

Up to this point I was clearly in the want-to-learn camp. I had nothing to lose by competing and making rookie mistakes. But now I am starting to feel bad when I make a fool of myself with such basic mistakes.

I’d still gladly find these things out. If that has to happen publicly on the forums, though a bit shameful, so be it. If I had teamed up earlier (or used more than one OS) I probably would have found this out sooner.

I realize that in Kaggle competitions one may be disrespectful of the context (domain knowledge) of the data to a degree, but one should always respect the syntax. Data inspection (measuring data quality) should be an essential part of the pipeline.

So how about those counterfeit webshops?

The problem is that I want a good result, but I also have to create my own dataset for it. I cannot shake the prior belief that machine learning can combat online illicit and scam content, so I am afraid I will fall prey to a subtle form of overfitting, of the kind John Langford describes:

  • Choose the best of accuracy, error rate, (A)ROC, F1, percent improvement on the previous best, percent improvement of error rate, etc. for your method. For bonus points, use ambiguous graphs.
  • Choose to report results on some subset of datasets where your algorithm performs well.
  • Alter the problem so that your performance improves.
  • After a dataset has been released, algorithms can be made to perform well on the dataset using a process of feedback design, indicating better performance than we might expect in the future. Some conferences have canonical datasets that have been used for a decade.

John Langford (2005) – Clever Methods of Overfitting

I realized that what I will be making is something that solves a very specific problem: figuring out how I gathered and labeled my dataset. To do this correctly I would need a way to realistically reproduce a new test set, one that is created a week after I built my model, preferably by real-life users of the model.

In short, only a model in production can prove its worth. To get a glimpse of web-scale anti-spam measures read this inside story from the trenches by a 7-year Google engineer.

In the beginning … there was the regex. Gmail does support regex filtering but only as a last resort. It’s easy to make mistakes, like the time we accidentally blackholed email for an unfortunate Italian woman named “Oli*via Gra*dina”. Plus this technique does not internationalise, and randomising text to miss the blacklists is easy.

I’ll be on the lookout for more public datasets in this space, so I can compare my approaches with others’. If your site creates a lot of data and faces a similar problem with spam and illicit content, contact me or leave a message; I’d love to chat with you.


Ensembling code

from glob import glob
from collections import defaultdict

def kaggle_rank_avg(glob_files, loc_out):
  """
    Rank-average multiple submission files for
    Kaggle's "Hunt for Prohibited Content".
    glob_files: glob pattern matching the submission files
    loc_out: location of the new, combined submission file
  """
  # Each submission is a single column of IDs, most illicit first.
  # An ID's score is the sum of its line numbers (ranks) over all files;
  # sorting by that sum gives the rank-averaged submission.
  ranks = defaultdict(float)
  with open(loc_out, "wb") as outfile:
    print(glob_files)
    for i, glob_file in enumerate(glob(glob_files)):
      print(glob_file)
      with open(glob_file, "rb") as infile:
        for e, line in enumerate(infile):
          if i == 0 and e == 0:
            outfile.write(line)       # copy the header ("ID") once
          if e > 0:
            ranks[line.strip()] += e  # add this file's rank for the ID
    for k in sorted(ranks, key=ranks.get):
      outfile.write(k + b"\n")        # lowest summed rank (most illicit) first

#kaggle_rank_avg("d:\\avito\\*.csv", "d:\\avito.ensemble.csv")

Images were from wikimedia commons (authors Soerfm, Mousse and Sven Manguard). The intro image is from a commercial from Avito.ru and the photo of president Carter refusing refuge to a Vowpal Wabbit was given to me by a man in a trench-coat inside a poorly lit parking lot.

9 thoughts on “Lessons learned from the Hunt for Prohibited Content on Kaggle”

    1. Hi! Thanks for the tip. Our team was trying LibFM for the Criteo Ad Click Prediction contest, but if we can use VW for this, then all the better. “lrqdropout” looks particularly interesting, I don’t think that is in LibFM, at least not as an option.

    2. That is cool. It looks like VW largely makes libFM obsolete? Especially since you can feed raw text through VW versus (I think) the need to encode it first prior to running libFM.

  1. Hi,
    thanks for a great blog. Could you elaborate a bit on how you combine multiple output files? I am not sure I understand what the ranks mean and how you combine the ranks from multiple files.

    1. The competition evaluation metric was Average Precision @ K. The format of the submission was a single column of IDs, with the most illicit content ranked highest (higher up in the submission file).

      ID
      11199
      74931
      11100
      11101

      Combined with:

      ID
      74931
      11100
      11101
      11199

      Makes something like:

      ID
      74931
      11100
      11199
      11101

      IDs are ranked according to their probability of being illicit content.
