
Detecting Counterfeit Webshops. Part 1: Feature engineering

The number of fake webshops is rising. From 2010 to 2012 the Dutch authority on internet scams received 81,000 complaints. Spammers have moved from running their own webshops to hacking websites or registering expired domain names. This makes classification more difficult.

In this series we will experiment with machine learning to automatically classify the trustworthiness of a webshop (and by extension, any malicious website). The focus will be on fake webshops hosted on the Dutch TLD (.nl) or catering to Dutch users.

Update: new Google research: “The underground market fueling for-profit abuse”.

Understanding the problem

Identifying counterfeit webshops has transformed from a spam problem into a crime, economic, health and security problem.

Consumer safety at a webshop relies on:

  • The technical safety of the webshop itself (some webshops are vulnerable to hacking, yet handle your credit card or banking information)
  • The trustworthiness of the webshop owner (does he or she deliver as promised?)
  • The safety of the payment process.

Fake online pharmacies

11% of Dutch adults order medicine and pills online (abortion pills, Cialis, Viagra, Xanax). The chance of these pills being fake is over 50%. Fake pills pose medical risks (a Dutch government organization found residues of XTC in confiscated scam medicine).

Producers and webshops of counterfeit medicine are becoming more professional. Often the average internet visitor cannot distinguish fake from real: the payment process is smooth and the delivered packages look real and trustworthy.

When the Ministry of Health, Welfare and Sport put up a fake pharmacy on medi-plaza.nl, 18,000 people placed an order within a few months (customers received a warning and an educational flyer).

Analysing a current scam

Scammers target users on marktplaats.nl (the Dutch eBay). They hack the accounts of webshop owners or convincingly copy (fork) these accounts. They then place copied advertisements or listings that redirect to a scam webshop on a domain they control. The scam webshop may copy the contact information and graphics of the victim webshop.

Then they wait for orders to come in, but never deliver. Money is deposited in the bank accounts of lackeys: these (often underage) lackeys receive a moderate payment for providing access to their accounts and have no clue what is really going on, until the trail leads to them and they are arrested.

Prevention guidelines

There are a few expert guidelines to manually spot fake webshops:

  • G1) Search Google for reviews and complaints in combination with the name of the webshop.
  • G2) Search the official registers to spot if the webshop is registered and if the data is correct and up-to-date.
  • G3) Check the product pricing and see if it conforms to the standard market price. Be more careful with overly cheap pricing.
  • G4) Check the terms of service and delivery policy: Does the webshop demand that you pay more than 50% up front? Are the return policies fair? Do you get your money back in case of voiding the purchase? Do you know when the product is delivered?
  • G5) Are there spelling errors on the site?
  • G6) Is there a phone number or address to contact the site owner?
  • G7) Check if the product is legal: Can you buy the medicine without prescription? Is it legal to own that BB-gun?

We will try to capture these guidelines into features.

(Business) value of an effective classifier

  • Brand owners fighting trademark infringement.
  • Search engines cleaning their index and protecting users and webmasters.
  • Webshops cleaning their own sites from spam reviews and detecting anomalous pricing.
  • Trading platforms to detect scammers and scam websites.
  • Browser vendors to protect their users.
  • Direct consumer awareness in the form of a browser plug-in.
  • Better research into the domain and coordinated action to solve it.

Guidelines-based Features

G1) “Query Google”-type features

As automatically querying Google is against their Terms of Service, and the Custom Search Engine API does not suffice (due to differing results and a limit of 100 queries a day), using Google to generate features becomes hard.

In an ideal world we would generate features for the normalized number of search results for the webshop name in relation to terms like: “scam”, “not trust”, “negative review”. We would also like an estimate of Google PageRank, inbound links and domain authority.

Google features

None for now.

Complaint features

An alternative would be to create a list of review and complaint sites. Though this requires more manual work with an API or scraper, it offers the benefit of a fine-grained search on these sites and the possibility to calculate average votes.

A related problem is that of fake reviews. Long-term scamming webshops may insert fake positive reviews. These are harder for an outsider to spot (one would usually need more insider information, like IP addresses).

An example of a complaint site with data on webshops is opgelicht.nl. Complaints range from violations of the law on personal privacy (compare CAN-SPAM act) to full scam websites using stolen credentials.

A basic binary feature would be:

complaint_site_opgelicht_nl_warns_against:1

A basic count feature for another complaint site would be:

complaint_site_opgeletopinternet_nl_warns_against:7

where 7 is the number of pages in a complaint thread. To realize this feature I wrote a web crawler with Mechanize and BeautifulSoup, crawling and storing around 1,000 (often already offline) problem URLs.
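
As a sketch of how the crawled complaint data could be turned into such features (not the actual crawler code; the complaint_hits dictionary stands in for its output):

def complaint_features(shop_domain, complaint_hits):
    # complaint_hits: {shop_domain: {complaint_site: complaint_thread_pages}}
    features = []
    for site, pages in complaint_hits.get(shop_domain, {}).items():
        site_token = site.replace(".", "_")
        # Value is 1 for a plain warning, or the number of complaint-thread pages.
        features.append("complaint_site_%s_warns_against:%d" % (site_token, pages))
    return features

# Example with made-up data:
hits = {"scamshop-example.nl": {"opgelicht.nl": 1, "opgeletopinternet.nl": 7}}
print(complaint_features("scamshop-example.nl", hits))
# ['complaint_site_opgelicht_nl_warns_against:1',
#  'complaint_site_opgeletopinternet_nl_warns_against:7']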

Data from mijnpolitie.nl (the central point for reporting internet scams to the Dutch justice system) could serve well as features, but this data is not (yet) publicly accessible for automated querying.

Review Features

A webshop being listed on a review site (even with zero reviews) should be a decent signal for its legitimacy. Most scam webshops do not last long enough to bother with reviews and review sites.

I gathered the URLs of around 10,000 unique webshops together with their rating on different sites, number of raters, and review source URL. I stored these URLs in CSV (and the raw HTML in gzipped files).
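
As a sketch of how such a review CSV could be turned into features, assuming columns named url, rating and num_raters (the actual layout of the gathered data may differ):

import csv

def review_features(shop_url, csv_path="webshop_reviews.csv"):
    features = ["listed_on_review_site:0"]
    with open(csv_path) as f:
        for row in csv.DictReader(f):
            if row["url"] == shop_url:
                features = ["listed_on_review_site:1",
                            "review_rating:%s" % row["rating"],
                            "review_num_raters:%s" % row["num_raters"]]
                break
    return features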

Honeypot Features

One trick to collect a lot of web spam URLs is to crawl the links from spam e-mails. Spam e-mails often link to spam websites (though sometimes through up to 14 redirects). Another trick is to set up (unpatched) WordPress and Joomla installs to study the trends in webshop hacks (where entire sites are replaced with a functional webshop by spammers). One can also follow spam comments back to their owners' spam sites.
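
A minimal sketch for resolving a spam URL's redirect chain, using the requests library (my choice here, not mentioned in the article), and recording the number of hops and the final landing page:

import requests

def redirect_chain(url, max_redirects=20):
    session = requests.Session()
    session.max_redirects = max_redirects
    try:
        r = session.get(url, timeout=10, allow_redirects=True)
    except requests.RequestException:
        return {"num_redirects": -1, "final_url": None}
    # r.history holds one response per intermediate redirect.
    return {"num_redirects": len(r.history), "final_url": r.url}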

G2) Search the official registers

Dutch webshops are required to be registered with the Dutch Chamber of Commerce (Kamer van Koophandel). After registering they receive a unique KVK and BTW (VAT) number. These are often listed on the websites and can be manually checked (at the cost of 4 eurocents per lookup), which makes this feature too costly for tens of thousands of sites.
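
What can be checked for free is whether a KVK or BTW number is listed on the site at all. A rough sketch, assuming the common formats (KVK: 8 digits, BTW: NL + 9 digits + B + 2 digits):

import re

KVK_RE = re.compile(r"\b(?:kvk|k\.v\.k\.?)\D{0,10}(\d{8})\b", re.IGNORECASE)
BTW_RE = re.compile(r"\bNL\s?\d{9}\s?B\s?\d{2}\b", re.IGNORECASE)

def register_features(visible_text):
    return ["lists_kvk_number:%d" % (1 if KVK_RE.search(visible_text) else 0),
            "lists_btw_number:%d" % (1 if BTW_RE.search(visible_text) else 0)]

print(register_features("KvK: 12345678 - BTW: NL001234567B01"))
# ['lists_kvk_number:1', 'lists_btw_number:1']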

Bankruptcies are a matter of public record. This may provide features like “operates_web_shop_while_bankrupt”, which should be very indicative of fishy behaviour.

Manufacturers often publish lists of official retailers and dealers, though all too often in the shape of some beautiful (but terribly inaccessible) JavaScript shopfinder. I am thinking of mailing the brands of often-counterfeited products (fashion, handbags, sunglasses) to acquire a machine-readable list of their official retailers.

G3) Check Pricing

Another feature that is hard to compute at web scale: for this one needs to build a price watch, so one can spot anomalies like $200 designer couches. Outside the scope of this classifier for now.

G4) Check the terms of service

Natural language processing on the terms of service would be hard, and there is no guarantee that it would be a signal (as a scam webshop could easily fake its terms of service). Presence in the official registers should already mean that the terms of service are in order.

G5) Spelling errors on the site

Glaring spelling errors could be caught with a reasonable false-positive rate. A spell-check on the visible text gives the number of errors.
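
A minimal sketch, assuming the pyenchant package and a Dutch ("nl_NL") dictionary are installed; any spell-checker with a word-level check would do:

import re
import enchant

def spelling_error_features(visible_text, lang="nl_NL"):
    d = enchant.Dict(lang)
    words = re.findall(r"[a-zA-Z]{3,}", visible_text)
    errors = sum(1 for w in words if not d.check(w))
    return ["spelling_errors:%d" % errors,
            "spelling_error_ratio:%.3f" % (float(errors) / max(len(words), 1))]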

G6) Contact information on site

A parser can detect contact information like addresses, phone numbers and e-mail addresses. There is no guarantee that this information is real, and checking legitimacy would be a hard task of cross-referencing multiple data sources.

Phone numbers and e-mail addresses can be checked for membership in scammer blacklists.
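
A rough sketch of both checks; the e-mail and (Dutch) phone patterns are loose assumptions, not validated parsers, and the blacklists are placeholders:

import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"(?:\+31|0031|0)\s?[1-9](?:[\s-]?\d){8}")

def contact_features(visible_text, phone_blacklist=(), email_blacklist=()):
    emails = set(EMAIL_RE.findall(visible_text))
    phones = set(PHONE_RE.findall(visible_text))
    return ["has_email:%d" % (1 if emails else 0),
            "has_phone:%d" % (1 if phones else 0),
            "email_on_blacklist:%d" % (1 if emails & set(email_blacklist) else 0),
            "phone_on_blacklist:%d" % (1 if phones & set(phone_blacklist) else 0)]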

Intelligence Features

Using blacklists and alert lists one can often, to a reasonable degree, find even more domains owned by the same scammers. Sometimes a Reverse IP Domain Check unveils a server full of other scam webshops. Or a WHOIS search unveils a known blackhat SEO registrant with multiple unlisted domains. Or exact-match searches unveil a network of copy sites. These features are labour-intensive to generate manually, though a large part of the scam problem is often caused by a few scammers, making these features potentially valuable.

Users of the model could also provide feedback and report spam and scams. For this I created a special-purpose complaint and review list of webshops. I plan to crawl a large list of webshop URLs in the future to add to this intelligence: adherence to the Google Webmaster Guidelines, visibility of contact information and more.

This to-do list is currently nearing 100 Dutch sites/networks selected for further analysis.

Technical features

The paper “Identifying suspicious URLs: An Application of Large-Scale Online Learning” explores multiple online machine learning approaches for detecting malicious web sites (those involved in criminal scams).

These features can work where blacklists or G1 signals fail (for example, when those are not up to date).

Lexical URL features

  • Tokenize the scheme, hostname, path, TLD and parameters (bag of words)
  • Length of hostname and path
  • Number of dots in the URL (or, better, the number of subdomains)
  • Number of ?, -, /, = and _ in URL
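
A minimal sketch of these lexical features using only the standard library (the feature names are illustrative):

import re
from urllib.parse import urlparse

def lexical_url_features(url):
    parsed = urlparse(url)
    hostname = parsed.hostname or ""
    tokens = [t for t in re.split(r"[\W_]+", url.lower()) if t]
    features = ["url_token_%s:1" % t for t in tokens]  # bag of words
    features += ["hostname_length:%d" % len(hostname),
                 "path_length:%d" % len(parsed.path),
                 "num_dots:%d" % url.count("."),
                 "num_subdomains:%d" % max(hostname.count(".") - 1, 0),
                 "num_question_marks:%d" % url.count("?"),
                 "num_hyphens:%d" % url.count("-"),
                 "num_slashes:%d" % url.count("/"),
                 "num_equals:%d" % url.count("="),
                 "num_underscores:%d" % url.count("_")]
    return features

print(lexical_url_features("http://cheap-shop.example.nl/order.php?id=1&ref=spam"))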

Host-Based features

Malicious websites may be hosted in less reputable hosting centers, use disreputable registrars and often use relatively new domains.

For blacklists the paper mentions SpamAssassin, Botnet blacklists, and Phishtank.

The following features are suggested by the paper:

  • IP in blacklist?
  • Whois properties, like: date of registration, date of update and expiration, registrant name, registrar name, registrant location, whether WhoisGuard is in place. Example of a suspicious WHOIS record from a scam webshop:

        Registrant name: WHOISGUARD PROTECTED
        Registrant city: PANAMA

  • Domain name properties: TTL value of the DNS records; presence of “client”, an IP address, or “server” in the hostname; is there a PTR record?
  • Geographic properties: Which continent, country, city does the IP belong to? What is the connection speed?
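
A minimal sketch of a few of these host-based features using only the standard library; WHOIS, TTL and geolocation lookups are left out, and the IP blacklist is a placeholder:

import socket

def host_features(hostname, ip_blacklist=()):
    features = []
    try:
        ip = socket.gethostbyname(hostname)
        features.append("resolves:1")
        features.append("ip_on_blacklist:%d" % (1 if ip in ip_blacklist else 0))
        # Is there a PTR (reverse DNS) record for the IP?
        try:
            socket.gethostbyaddr(ip)
            features.append("has_ptr_record:1")
        except (socket.herror, socket.gaierror):
            features.append("has_ptr_record:0")
    except socket.gaierror:
        features.append("resolves:0")
    return features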

HTML-Source features

  • External domain names inside (inline) JavaScript
  • Presence of the eval() function in JavaScript
  • Presence of social links
  • Tokenize visible text
  • Hashed token similarity to a set of already labeled malicious sites
  • Number of iframes
  • Full-size iframe?
  • LDA on visible text and meta contents
  • Presence of generators (WordPress), payment methods, shopping carts and plug-ins
  • Number of redirects
  • Cookie set?
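
A minimal sketch of a few of these features with BeautifulSoup; the feature names are illustrative and not the crawler's actual output keys:

from urllib.parse import urlparse
from bs4 import BeautifulSoup

def html_source_features(html, page_url):
    soup = BeautifulSoup(html, "html.parser")
    page_domain = urlparse(page_url).hostname

    scripts = " ".join(s.get_text() for s in soup.find_all("script"))
    external_js_domains = {
        urlparse(s["src"]).hostname
        for s in soup.find_all("script", src=True)
        if urlparse(s["src"]).hostname not in (None, page_domain)
    }
    generator = soup.find("meta", attrs={"name": "generator"})

    return ["js_contains_eval:%d" % (1 if "eval(" in scripts else 0),
            "num_external_js_domains:%d" % len(external_js_domains),
            "num_iframes:%d" % len(soup.find_all("iframe")),
            "generator_%s:1" % (generator["content"].split()[0].lower()
                                if generator and generator.get("content") else "none")]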

What’s next?

Share the progress on GitHub. Improve the crawler. Generate features from samples and create a training set. Build a Vowpal Wabbit model (see the Vowpal Wabbit malicious URL example).
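
For reference, features written as name:value (like the complaint features above) map directly onto Vowpal Wabbit's plain-text input format. A minimal sketch, with made-up labels, features and file name:

def to_vw_line(label, features):
    # label: +1 for scam, -1 for legitimate (logistic loss convention)
    return "%d | %s" % (label, " ".join(features))

with open("train.vw", "w") as f:
    f.write(to_vw_line(1, ["complaint_site_opgelicht_nl_warns_against:1",
                           "num_iframes:3"]) + "\n")
    f.write(to_vw_line(-1, ["listed_on_review_site:1",
                            "review_num_raters:120"]) + "\n")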

Experiment with a URL-to-crawl-to-features API, versus a URL-to-database-check-to-features API, versus an implementation of the Vowpal Wabbit model in JavaScript, with a large part of the feature generation done through client-side JavaScript too.

This series will be on hold until I finish the Avito “Hunt for Prohibited Content” challenge on Kaggle. It should give me some practical insight into classifying illicit content in advertisements.

Further reading and notes

Example Crawler output

>>> import json
>>> url = "http://mlwave.com/human-ensemble-learning/"
>>> soup = create_soup(url)
>>> features = create_page_features(soup, url)

>>> print json.dumps(features, sort_keys=True, indent=4)

loading from cache
{
    "link": {
        "canonical": "http://mlwave.com/human-ensemble-learning/"
    }, 
    "links_external": [
        "http://arxiv.org/pdf/0911.0460.pdf", 
        "http://beatingthebenchmark.blogspot.com/", 
        "http://blog.kaggle.com/", 
        ...
        "https://twitter.com/mlwave", 
        "https://www.mturk.com/mturk/welcome", 
        "https://www.youtube.com/watch?v=sRktKszFmSk"
    ], 
    "links_internal": [
        "/#content", 
        "/#search-container", 
        "/",
        ...
        "/predicting-repeat-buyers-vowpal-wabbit/", 
        "/winning-2-kaggle-in-class-competitions-on-spam/", 
        "/wp-content/uploads/2014/07/writing-1-distort-11.png"
    ], 
    "links_javascript": [], 
    "meta": {
        "author": "", 
        "contact": "", 
        "copyright": "", 
        "description": "", 
        "generator": "WordPress 3.9.1", 
        "googlebot": "", 
        "keywords": "", 
        "language": "", 
        "robots": ""
    }, 
    "pagetitle": "Human Ensemble Learning | MLWave",
    "root_base": "http://mlwave.com",
    "text_meta": "meta charset utf 8 ... 3 9 1 name generator", 
    "text_visible": "human ensemble learning mlwave ... creative commons 3 0 attribution"
}

The intro image is from a promotional poster for the 1962 movie “The Brain That Wouldn’t Die”.
