
Online Learning Perceptron

Let’s take a look at the perceptron: the simplest artificial neuron. This article goes from a concept devised in 1943 to a Kaggle competition in 2015. It shows that a single artificial neuron can get 0.95 AUC on an NLP sentiment analysis task (predicting if a movie review is positive or negative).

In logic there are no morals. Everyone is at liberty to build up his own logic, i.e., his own form of language, as he wishes. – Rudolf Carnap (1934) “Logical Syntax of Language”

McCulloch-Pitts Neuron

The birth of artificial neural nets started with the 1943 paper "A Logical Calculus of the Ideas Immanent in Nervous Activity". Two researchers, McCulloch, a neurologist, and Pitts, a logician, joined forces to sketch out the first artificial neurons.

McCulloch reasoned that nervous activity had an 'all-or-nothing' character: a neuron fires (output "1") once its activation threshold is reached, or it doesn't fire (output "0").
Pitts saw the potential to capture propositional logic using such neurological principles.

[Image: schematics of neurons, looping neurons and nets from the McCulloch-Pitts paper]

An example

A bird will try to eat brown round objects, such as seeds. We can represent this behaviour as:

Object Brown? Round? Eat?
Seed 1 1 1
Leaf 1 0 0
Golf Ball 0 1 0
Key 0 0 0

One could model the above table with a McCulloch-Pitts neuron: set the activation threshold to 1.5 and the neuron fires only when both the "brown" AND "round" properties are satisfied, because only then is the sum of the inputs 2 (1 + 1), which exceeds the threshold of 1.5.
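As a minimal sketch (my own illustration, not code from the 1943 paper), the bird's behaviour could be modelled in Python like this:

# A McCulloch-Pitts neuron: fire ("1") when the sum of the inputs exceeds the threshold.
def mcculloch_pitts_neuron(inputs, threshold=1.5):
    return int(sum(inputs) > threshold)

# Inputs are (brown, round); only the seed makes the neuron fire.
for name, inputs in [("Seed", (1, 1)), ("Leaf", (1, 0)), ("Golf Ball", (0, 1)), ("Key", (0, 0))]:
    print(name, mcculloch_pitts_neuron(inputs))   # Seed 1, Leaf 0, Golf Ball 0, Key 0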

[Image: Pitts neuron]

As a condition for the use of logical inferences and the performance of logical operations, something must already be given to our faculty of representation, certain extralogical concrete objects that are intuitively present as immediate experience prior to all thought. – Hilbert (1926)

Perceptron

Invented in 1957 by cognitive psychologist Frank Rosenblatt, the perceptron algorithm was the first artificial neural net implemented in hardware.

In 1960 researchers at Cornell Aeronautical Laboratory, with funding from the US Office of Naval Research, randomly hooked 400 photocells to a perceptron and the “Mark 1 perceptron” was born. It was capable of basic image recognition.

Weights

A perceptron works by assigning weights to incoming connections. With the McCulloch-Pitts neuron we took the sum of the values from the incoming connections and checked whether it was above or below a certain threshold. With the perceptron we instead take the dot product: we multiply each incoming value with its weight and take the sum: (value1 * weight1) + (value2 * weight2), etc.
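In code, the activation of a perceptron could look like this (a quick sketch; the numbers are made up for illustration):

# Perceptron activation: dot product of values and weights, compared to a threshold.
values = [1, 1]          # e.g. brown = 1, round = 1
weights = [0.8, 0.9]     # one learned weight per incoming connection
threshold = 1.5

activation = sum(v * w for v, w in zip(values, weights))   # (1 * 0.8) + (1 * 0.9) = 1.7
prediction = 1 if activation > threshold else 0            # 1.7 > 1.5, so the neuron fires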

[Image: the Mark 1 Perceptron]

Learning

A perceptron is a supervised classifier. It learns by first making a prediction: is the dot product over or below the threshold? If it is over the threshold, it predicts a "1"; if it is below the threshold, it predicts a "0".

Then the perceptron looks at the label of the sample. If the prediction was correct, then the error is “0”, and it leaves the weights alone. If the prediction was wrong, the error is either “-1” or “1” and the perceptron will update the weights like:

weights[feature_index] += learning_rate * error * feature_value

The code above was corrected: thank you, leakymirror.

Example

Here a perceptron learns to model a NAND function without any errors:

"""
Creative Commons A-SA, source: http://en.wikipedia.org/wiki/Perceptron
"""
threshold = 0.5
learning_rate = 0.1
weights = [0, 0, 0]
training_set = [((1, 0, 0), 1), ((1, 0, 1), 1), ((1, 1, 0), 1), ((1, 1, 1), 0)]
 
def dot_product(values, weights):
  return sum(v * w for v, w in zip(values, weights))
 
while True:
  print('-' * 60)
  error_count = 0
  for input_vector, desired_output in training_set:
    print(weights)
    result = dot_product(input_vector, weights) > threshold
    error = desired_output - result
    if error != 0:
      error_count += 1
      for index, value in enumerate(input_vector):
        weights[index] += learning_rate * error * value
  if error_count == 0:
    break

And there we have it: the perceptron, a simple single-node neural net that classifies linearly.

The embryo of an electronic computer that the Navy expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence. – The New York Times reporting on the Perceptron (1958)

Online Learning Perceptron in Python

We are going to implement the above perceptron algorithm in Python. We use only standard libraries, so the script will run on PyPy (for a 3-4x speedup), and we take massive inspiration from tinrtgu's online logistic regression script first seen on the Kaggle forums: "Beat the benchmark with less than 200MB of memory". You can find our script at the MLWave Github repo.

Online learning

The perceptron is capable of online learning (learning from samples one at a time). This is useful for larger datasets, since you do not need to hold the entire dataset in memory.

Now to steal some features we've seen in our favorite online machine learning tool, Vowpal Wabbit:

Hashing trick

The vectorizing hashing trick originated with Vowpal Wabbit (John Langford). This trick sets the number of incoming connections to the perceptron to a fixed size. We hash all the raw features to a number lower than that fixed size and use the result as the feature index.

Let’s take a sample like:

This movie sucks

It may be vectorized with hashing as:

sample = "This movie sucks"
fixed_size = 1024
features = [(hash(f)%fixed_size,1) for f in sample.split()]
print(features)

This prints the sparse (only the non-zero elements) format:

# list of tuples in form (feature_index,feature_value)
>> [(712, 1), (169, 1), (364, 1)]

Progressive Validation Loss

Learning from samples one at a time also gives us progressive training loss. When the model encounters a sample it first makes a prediction without looking at the target.

Then we compare this prediction with the target label and calculate an error rate. A low error rate tells us we are close to a good model.
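A minimal sketch of this predict-then-learn loop (the helpers stream_samples, predict and update_weights are placeholders for illustration, not the actual functions in the script):

# Progressive validation: predict first, only then learn from the label.
errors = 0
for nr_samples, (features, label) in enumerate(stream_samples(), start=1):
    prediction = predict(features)               # no peeking at the label
    if prediction != label:
        errors += 1
    update_weights(features, label, prediction)  # learn only after predicting
print("progressive error rate:", errors / float(nr_samples))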

[Image: HAL 9000's camera eye]

“No 9000 computer has ever made a mistake or distorted information. We are all, by any practical definition of the words, foolproof and incapable of error.” — HAL 9000 (1968)

Normalization

Vowpal Wabbit has online feature value normalization. We use a simple log(feature value + 1) for now and politely bow at the master’s magnificence.
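In code, the transform is simply (a sketch):

from math import log

def normalize(feature_value):
    # log(1 + x) leaves 0 at 0, maps 1 to ~0.69 and dampens large values
    # (a word counted 100 times becomes ~4.6 instead of 100).
    return log(1. + feature_value)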

Multiple passes

With Vowpal Wabbit you can run multiple passes over the dataset with a decaying learning rate.

We can also run through the dataset multiple times, for as long as the error rate keeps dropping. If the training data is linearly separable, the perceptron convergence theorem guarantees that a perceptron will reach 0 errors on the train set in a finite number of updates.
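A sketch of such a multi-pass loop; the decay factor of 0.95 and the train_one_pass helper are assumptions for illustration:

learning_rate = 0.1
previous_errors = float("inf")
for pass_nr in range(1, 100):
    errors = train_one_pass(learning_rate)      # one full pass over the train set
    print("pass", pass_nr, "errors", errors)
    if errors == 0 or errors >= previous_errors:
        break                                   # converged, or no longer improving
    previous_errors = errors
    learning_rate *= 0.95                       # decay the learning rate each pass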

“By the end, if you make it that far, you’ll be hoping that the rabbit completes its mission.” – User Review for the movie “The Nasty Rabbit” (1964)

Kaggle Results

When Bag of Words Meets Bags of Popcorn

This is a knowledge and tutorial contest on Kaggle. It uses a movie review dataset similar to the one in the paper “Learning Word Vectors for Sentiment Analysis“.

The task is to predict, for 25k test samples, whether a review has an accompanying rating below 5 (negative) or of 7 or higher (positive). We have a labeled train set of 25k samples to build our model.

We write a small function to run through the TSV datasets and yield the label, identifier, and hashed features with their feature values.
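It could look roughly like this (a sketch: the column layout and hash size are assumptions, not the exact code from the repo):

def get_data_tsv(path, fixed_size=2**20, training=True):
    # Yields (label, identifier, features) one sample at a time.
    with open(path) as f:
        next(f)                                          # skip the header row
        for line in f:
            parts = line.strip().split("\t")
            identifier = parts[0].strip('"')
            label = int(parts[1]) if training else 1     # dummy "1" labels for the test set
            review = parts[-1].strip('"')
            features = [(hash(word) % fixed_size, 1) for word in review.split()]
            yield label, identifier, features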

Top benchmark

The current best benchmark on the forums is a logistic regression script from Abhishek. It does an in-memory TF-IDF fit_transform on both the train and the test set, and gets 0.95 AUC on the leaderboard. We skip that transform and stick to our online hashing vectorizer.

We do generate 2-grams (such as “not good”) from the features, much like the script from Abhishek does. We simply hash these 2-grams and add them to the sample vectors.

We also quickly clean the text a little by lowercasing and removing anything not alphanumeric.
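A sketch of the cleaning and 2-gram hashing (joining 2-grams with an underscore is my choice for illustration):

import re

def clean(text):
    # Lowercase and replace anything that is not alphanumeric with a space.
    return re.sub("[^a-z0-9 ]", " ", text.lower())

def vectorize(text, fixed_size=2**20):
    tokens = clean(text).split()
    # 1-grams plus 2-grams like "not_good", hashed into the same feature space.
    grams = tokens + ["_".join(pair) for pair in zip(tokens, tokens[1:])]
    return [(hash(gram) % fixed_size, 1) for gram in grams]

print(vectorize("This movie is NOT good!"))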

Predictions

To optimize for AUC (the competition metric) we do not submit hard "1" or "0" predictions; that would give us an AUC of around 0.88. Instead we submit the dot product, normalized between 0 and 1.
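One way to do this is min-max scaling over all the test set dot products (a sketch; the actual script may normalize differently):

def min_max_normalize(dot_products):
    # Rescale raw dot products to the [0, 1] range before writing the submission.
    lo, hi = min(dot_products), max(dot_products)
    return [(d - lo) / (hi - lo) for d in dot_products]

print(min_max_normalize([-3.2, 0.4, 7.1]))   # [0.0, 0.349..., 1.0]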

Our script matches the solution from Abhishek with a score of 0.952, but it takes just a few MB of memory, does no TF-IDF transform, and converges to 0 errors on the train set in just over 2 minutes.

Without cleaning and without 2-grams the script runs in less than 60 seconds and produces a score of 0.93.

Pass Errors    Average     Nr. Samples     Since Start
1    5672      0.77312     25000           0:00:04.353000
2    3110      0.8756      25000           0:00:08.464000
3    2280      0.9088      25000           0:00:12.482000
4    1659      0.93364     25000           0:00:16.485000
...
19   138       0.99448     25000           0:01:15.814000
20   110       0.9956      25000           0:01:19.922000
21   83        0.99668     25000           0:01:23.843000
...
29   40        0.9984      25000           0:01:54.922000
30   36        0.99856     25000           0:01:58.802000
31   17        0.99932     25000           0:02:02.607000
32   2         0.99992     25000           0:02:06.521000
33   0         1.0         25000           0:02:10.382000
Zero errors found during training, halting

Testing
Errors    Average    Nr. Samples     Since Start
12431     0.50276    25000           0:00:03.938000
Done testing in 0:00:03.981000

“A recurrent neural works, made his idea to the mentally crucifix ‘Cato Green Ronnulae Schill'” — Ilya Sutskever’s text generating RNN

ADCG SS14 Challenge 02 – Spam Mails Detection

I have already written about the spam mail detection challenge before. I gave a lot of credit to the co-competitors who wrote their own classifiers. I am proud to say that I have now done this competition with code I wrote myself, too.

The top result for this competition was my own Vowpal Wabbit solution at 0.998 accuracy. I wrote a small parser to yield the label, id, and features from my old VW datasets.

This online learning perceptron script is able to exactly match the Vowpal Wabbit solution (0.99836). Perhaps we have a light-weight Perceptron script that is closer to a Vowpal Wabbit with hinge loss than we dare to think.

A simple question: “What exactly is information and data communication?” may be answered using two entirely different theories. The conventional Shannon information theory regards all information as “quantities”, which are transmitted using bits and bytes in meaningless bit streams. This outdated theory is still being taught at our universities, even though no biological creature communicates in this way. A biologically correct Autosophy information theory, in contrast, regards all information as “addresses”, which are communicated to create new knowledge in a receiver. — Klaus Holtz (1987)

Did not work

All challenges where the data is not pre-normalized between 0 and 1 gave poor results. This perceptron script clearly likes sparse one-hot-encoded data, such as text. Also, regression and multi-class classification are not implemented yet.

Code

You can find the script used for the movie review challenge at the MLWave Github Repo. Feedback, like adaptive normalized invariance implementations, is always welcome ;).

Conclusion

“For me there is no absolute knowledge: everything goes only by probability” – Gödel (1961)

A very basic perceptron with 0-1 threshold activation can do well in NLP contests. We did not change the 1957 algorithm that much at all. We could do away with the hashing trick and receive similar or slightly better scores.

This script cannot output a probability. One could use a bounded sigmoid on the dot product to get a form of online normalization of the output, or, better, study tinrtgu's script.
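A sketch of such a bounded sigmoid (it squashes the score into (0, 1), but it is not a calibrated probability):

from math import exp

def bounded_sigmoid(dot_product):
    # Clip the raw dot product to avoid overflow in exp(), then squash into (0, 1).
    clipped = max(min(dot_product, 35.), -35.)
    return 1. / (1. + exp(-clipped))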

Since a single-node, single-layer perceptron cannot model non-linear functions, to get better NLP results we will have to go deeper and look at neural network structures like feedforward nets, recurrent nets, self-organizing maps, multi-layer perceptrons, word2vec and extreme learning machines (fast feedforward nets without backprop).

Update: Further Reading

Artificial Neurons and Single-Layer Neural Networks – How Machine Learning Algorithms Work, Part 1. Thank you, Sebastian Raschka.

The Man Who Tried to Redeem the World with Logic – Walter Pitts rose from the streets to MIT, but couldn't escape himself. Thank you, Stephen Oates.

34 thoughts on “Online Learning Perceptron”

  1. Hi, thank you for this post!

    I would like to know why the results are good if the testing set has about 50% errors. 12400+ reviews are misclassified.

    Thanks!

    1. Hi Luis,

      I set dummy (fake) labels for the test set. All labels for the test set are “1”. The 50% error rate then tells us that we have predicted a “0” about 50% of the time. With a good classifier this tells us a little about the distribution of the test set.

      When I create a 5k holdout set from the train set, and use the remaining 20k to train and the holdout set to test, then it will give accuracy of around 0.88. Because then it has the real target labels and can calculate the real error rate.

      Vowpal Wabbit used to do much the same: when testing you use a test set with all "1" labels. Then the average loss is still reported. You can use this number to see if your model has changed (predicts more or fewer 1s).

    2. The error you see is misleading: all the labels of testData are initialized with 1. Since around 50% of them are not 1, you see an error of 50%.

      Better: split the train data into 20K + 5K. Train on the 20K and test on the 5K. You would see an accuracy of around 86%.

  2. I found one possible typo, which was breaking my code 🙂

    weights[feature_index] = learning_rate * error * feature_value

    should be expressed as

    weights[feature_index] += learning_rate * error * feature_value

  3. Hi,

    first of all: awesome post!

    I have two questions regarding the perceptron:

    1) Why do you want to do feature normalisation? The values of your features are always one; so what is there to normalize?

    2) In your implementation you use the log transformed feature value ( log(1.+value) ) in the learning rule for updating the weights. Shouldn’t you then also use that log transformed value as input for the dot product that computes the prediction instead of the regular feature value?

    Thanks!

    1. 1) I want to do feature normalization so I can use datasets with feature values in ranges like 0-100. It was not needed for these sparse text competitions, but on other competitions this algo failed (and feature normalization made it fail less hard). We do a form of count vectorization, not binary vectorization. If one word is found twice inside a text, the weights are updated with log(2) + log(2) ≈ 0.60. Without taking the log this would be 1 + 1 = 2.

      2) Maybe? :) I don't know. Perhaps that's the correct way to do it. For now this works (use the log for the feature value transform, use the dot product for predictions and progressive loss). I could have done away with the log(feature value + 1) entirely, and it would not have made a difference for these competitions. So if you think it's cleaner that way, feel free to take the authentic update rule.

      1. Yeah, you have a point. The updating rule is indeed doing a sort of feature counting. I guess then it makes sense to normalize it.

        Cool, thanks!

  4. Thanks for the great explanation! I am new to this field and had one question about hashing trick in general.

    Suppose our string is: “This movie is great. Actors are great.”
    Then the feature set generated is : [(712, 1), (169, 1), (394, 1), (605, 1), (20, 1), (149, 1), (605, 1)]

    How will duplicates ((605,1) for the two occurrences of "great.") be handled? Are we going to lose the fact that great was used twice in the statement?

    1. In our update rule we iterate through all our non-zero features. So if a word appears twice, it will update the weights twice for that word.

      You could opt for a pure binary vectorizer. That way words are only counted once. A simple list(set(features)) will make every feature unique.

      You could also write a pure count vectorizer. Features with the same feature_index would have their values added: (605,2).

      We do a form of count vectorization: [(605,log(1+1)),(605,log(1+1))].
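      For illustration, a rough sketch of the three options (not the exact script code):

      from collections import Counter
      from math import log

      features = [(712, 1), (169, 1), (605, 1), (605, 1)]     # "great" hashed to 605, twice

      binary = list(set(features))                            # each word counted once
      counts = list(Counter(i for i, v in features).items())  # (605, 2) for the duplicate
      ours   = [(i, log(1. + v)) for i, v in features]        # 605 is updated twice with log(2)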

  5. It seems like the feature value that you are using for a word is log(2) * (number of times the word appeared) in a review. Could you please clarify the reason behind multiplying by log(2)? Thanks in advance.

  6. Unlike a CountVectorizer, you're simply hashing every word of the sentence but not adding counts, right?

    1. We hash every word, so if a word appears twice, it is hashed twice, and then the weights for that word are updated twice. So we do add the counts.

  7. Hi,
    for a college course we have to implement some solutions for the same Kaggle competition.
    I’m currently programming the Perceptron solution.

    When using the labeledTrainData as training data, and also as test data (all 25000 reviews), the results show 100% agreement, implying (sort of) that the weights were OK.

    However, when trying the actual testData, the results were awful. About 66%.

    Throwing in a couple of couts (yes... C++ is mandatory...), the dotProduct showed results below 0, and some above 1.
    Could this be the problem? Any hint or idea where I may be committing a mistake?

    Thanks in advance
    PS: Pardon my English.

  8. Don't mind my previous comment. It was a silly mistake.
    I see you use up to 2-grams. How would you predict improvement using k-grams? Apart from trial & error.

    Another question I have is about the use of log(1.+value) while training. Value is always 1. So, in the way this is being used, wouldn't it be the same as using learning_rate = 0.1 * log(2)? Or is it intended for modifications of the feature set, e.g. for multiple occurrences of one word: (6059, 2)?

    Thanks in advance, excellent post. It made me understand the perceptron better.

    PS: Pardon my English.

    1. How would you predict improvement using k-grams? Apart from trial & error.

      In ML it does not matter what I predict, it matters what the machine predicts. If the machine likes 2-grams with 2-skips over 3-grams, who am I to disagree? :)
      As a guideline: 2-grams often help with bag-of-words. 3-grams can too. But the curse of dimensionality starts to set in with a lot of (higher) k-grams. You can look at skip-grams ("Alice and Bob." = alice_bob:1) and check out word2vec. Try it!

      Another question I have is about the use of log(1.+value) while training. Value is always 1. So, in the way this is being used, wouldn't it be the same as using learning_rate = 0.1 * log(2)?

      The log transform is done to make the update a bit smaller (a crude way to account for the fact that we update a word multiple times if it appears multiple times). You could either ignore the log transform, or do it properly: log(1 + nr_times_word_appears_in_document).

  9. Why do we get an average of around 75% in the first pass?
    Since no learning happens in the first pass, we should see something around 50% and improvement from there after subsequent passes.

    Am I missing something? Can someone please help me understand this?

    1. We start learning (updating the weights) with the first sample we encounter (online learning). After a first pass (25k samples) the weights should be updated to have at least some predictive capability.

  10. Hi Triskelion, I am implementing a multiclass version of this. The idea is to use a different weight vector for each class. Do you have any other ideas? Please suggest.

  11. Hi,

    I’m trying to apply this to a similar classification problem and had a question: To turn the submission file into just binary 1 (positive) and 0 (negative), should I convert 0.5-1 into 1s and 0-0.49 into 0s?

    Thanks
