Let’s take a look at the perceptron: the simplest artificial neuron. This article goes from a concept devised in 1943 to a Kaggle competition in 2015. It shows that a single artificial neuron can get 0.95 AUC on an NLP sentiment analysis task (predicting if a movie review is positive or negative).
In logic there are no morals. Everyone is at liberty to build up his own logic, i.e., his own form of language, as he wishes. – Rudolf Carnap (1934) “Logical Syntax of Language”
McCulloch-Pitts Neuron
Artificial neural nets were born with the 1943 paper “A Logical Calculus of the Ideas Immanent in Nervous Activity”. Two researchers, McCulloch, a neurophysiologist, and Pitts, a logician, joined forces to sketch out the first artificial neurons.
McCulloch reasoned that nervous activity was ‘all-or-nothing’: a neuron either fires (outputs “1”) once its activation threshold is reached, or it doesn’t fire (outputs “0”).
Pitts saw and understood the potential to capture propositional logic using such neurological principles.
An example
A bird will try to eat brown round objects, such as seeds. We can represent this behaviour as:
Object | Brown? | Round? | Eat? |
---|---|---|---|
Seed | 1 | 1 | 1 |
Leaf | 1 | 0 | 0 |
Golf Ball | 0 | 1 | 0 |
Key | 0 | 0 | 0 |
One could model the above table with a McCulloch-Pitts Neuron. Set the activation threshold to 1.5 and the neuron fires only when both the “brown” AND “round” properties are satisfied: the sum of the inputs is then 2 (1 + 1), which exceeds the threshold of 1.5.
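A minimal sketch of such a neuron in Python, using the 1.5 threshold from above (the function name mcp_neuron is ours, just for illustration):

def mcp_neuron(inputs, threshold=1.5):
    # All-or-nothing: fire ("1") only when the summed inputs exceed the threshold.
    return 1 if sum(inputs) > threshold else 0

# (brown, round) inputs from the table above
for name, inputs in [("Seed", (1, 1)), ("Leaf", (1, 0)), ("Golf Ball", (0, 1)), ("Key", (0, 0))]:
    print(name, mcp_neuron(inputs))   # only "Seed" makes the neuron fire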
As a condition for the use of logical inferences and the performance of logical operations, something must already be given to our faculty of representation, certain extralogical concrete objects that are intuitively present as immediate experience prior to all thought. – Hilbert (1926)
Perceptron
Invented in 1957 by cognitive psychologist Frank Rosenblatt, the perceptron algorithm was the first artificial neural net implemented in hardware.
In 1960 researchers at Cornell Aeronautical Laboratory, with funding from the US Office of Naval Research, randomly hooked 400 photocells to a perceptron and the “Mark 1 perceptron” was born. It was capable of basic image recognition.
Weights
A Perceptron works by assigning weights to its incoming connections. With the McCulloch-Pitts Neuron we took the sum of the values on the incoming connections and checked whether it was above or below a certain threshold. With the Perceptron we instead take the dot product: we multiply each incoming value by its weight and take the sum: (value1 * weight1) + (value2 * weight2), etc.
Learning
A perceptron is a supervised classifier. It learns by first making a prediction: is the dot product above or below the threshold? If it is above the threshold it predicts a “1”, if it is below the threshold it predicts a “0”.
Then the perceptron looks at the label of the sample. If the prediction was correct, the error is “0” and the weights are left alone. If the prediction was wrong, the error is either “-1” or “1” and the perceptron updates the weights like so:
weights[feature_index] += learning_rate * error * feature_value
The code above was corrected: thank you, leakymirror.
Example
Here a perceptron learns to model a NAND function without any errors:
""" Creative Commons A-SA, source: http://en.wikipedia.org/wiki/Perceptron """ threshold = 0.5 learning_rate = 0.1 weights = [0, 0, 0] training_set = [((1, 0, 0), 1), ((1, 0, 1), 1), ((1, 1, 0), 1), ((1, 1, 1), 0)] def dot_product(values, weights): return sum(v * w for v, w in zip(values, weights)) while True: print('-' * 60) error_count = 0 for input_vector, desired_output in training_set: print(weights) result = dot_product(input_vector, weights) > threshold error = desired_output - result if error != 0: error_count += 1 for index, value in enumerate(input_vector): weights[index] += learning_rate * error * value if error_count == 0: break
And there we have it, a simple linear classifying single node neural net: the perceptron.
The embryo of an electronic computer that the Navy expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence. – The New York Times reporting on the Perceptron (1958)
Online Learning Perceptron in Python
We are going to implement the above Perceptron algorithm in Python. We use only standard libraries, so the script also runs on PyPy (for a 3-4x speed-up). We take massive inspiration from tinrtgu’s online logistic regression script first seen on the Kaggle forums: “Beat the benchmark with less than 200MB of memory”. You can find our script in the MLWave GitHub repo.
Online learning
The perceptron is capable of online learning (learning from samples one at a time). This is useful for larger datasets, since you do not need to hold the entire dataset in memory.
Now to steal some features we’ve seen in our favorite online machine learning tool, Vowpal Wabbit:
Hashing trick
The vectorizing hashing trick originated with Vowpal Wabbit (John Langford). It fixes the number of incoming connections to the perceptron at a set size: we hash every raw feature to an index lower than that size.
Let’s take a sample like:
This movie sucks
It may be vectorized with hashing as:
sample = "This movie sucks" fixed_size = 1024 features = [(hash(f)%fixed_size,1) for f in sample.split()] print features
To get the sparse (all non-zero elements) format:
# list of tuples in the form (feature_index, feature_value)
>> [(712, 1), (169, 1), (364, 1)]
Progressive Validation Loss
Learning from samples one at a time also gives us progressive training loss. When the model encounters a sample it first makes a prediction without looking at the target.
Then we compare this prediction with the target label and calculate an error rate. A low error rate tells us we are close to a good model.
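A minimal sketch of this predict-then-train loop over sparse hashed samples (the toy samples and names like sparse_dot are illustrative, not the actual script):

# Toy setup: two labeled reviews, hashed into a fixed-size feature space.
fixed_size = 1024
threshold, learning_rate = 0.5, 0.1
weights = [0.0] * fixed_size
samples = [([(hash(w) % fixed_size, 1) for w in text.split()], label)
           for text, label in [("great movie", 1), ("this movie sucks", 0)]]

def sparse_dot(features, weights):
    return sum(value * weights[index] for index, value in features)

errors = 0
for n, (features, label) in enumerate(samples, start=1):
    prediction = 1 if sparse_dot(features, weights) > threshold else 0  # predict first...
    errors += int(prediction != label)       # ...so this error counts as progressive loss
    error = label - prediction               # only now do we look at the label
    if error != 0:
        for index, value in features:
            weights[index] += learning_rate * error * value
print("progressive error rate:", errors / float(n))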
“No 9000 computer has ever made a mistake or distorted information. We are all, by any practical definition of the words, foolproof and incapable of error.” — HAL 9000 (1968)
Normalization
Vowpal Wabbit has online feature value normalization. We use a simple log(feature value + 1) for now and politely bow to the master’s magnificence.
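In code this is just a one-line transform of the raw feature value before it enters the weight update (a sketch using Python’s natural log):

import math

def normalize(value):
    # Dampen large raw feature values: 1 -> ~0.69, 9 -> ~2.3, 99 -> ~4.6
    return math.log(1. + value)

For the binary text features in this post (value 1) every update is simply scaled by log(2) ≈ 0.69.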
Multiple passes
With Vowpal Wabbit you can run multiple passes over the dataset with a decaying learning rate.
We can also run through the dataset multiple times, for as long as the error rate keeps dropping. If the training data is linearly separable, the perceptron is guaranteed to eventually converge to 0 errors on the train set.
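A sketch of such an outer loop, reusing sparse_dot, samples and weights from the sketch above (the decay factor of 0.95 and the cap of 50 passes are arbitrary choices, not what Vowpal Wabbit or our script uses):

def run_one_pass(samples, weights, threshold, learning_rate):
    # One full pass over the training data; returns the number of mistakes made.
    errors = 0
    for features, label in samples:
        prediction = 1 if sparse_dot(features, weights) > threshold else 0
        error = label - prediction
        if error != 0:
            errors += 1
            for index, value in features:
                weights[index] += learning_rate * error * value
    return errors

learning_rate = 0.1
for epoch in range(1, 51):
    errors = run_one_pass(samples, weights, 0.5, learning_rate)
    if errors == 0:
        break                      # converged: zero errors on the train set
    learning_rate *= 0.95          # decay the learning rate between passes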
“By the end, if you make it that far, you’ll be hoping that the rabbit completes its mission.” – User Review for the movie “The Nasty Rabbit” (1964)
Kaggle Results
When bag of words meets bags of popcorn
This is a knowledge and tutorial contest on Kaggle. It uses a movie review dataset similar to the one in the paper “Learning Word Vectors for Sentiment Analysis“.
The task is to predict if a review has an accompanying rating less than 5 (negative), or over 7 (positive) for 25k test samples. We have a labeled train dataset of 25k samples to build our model.
We write a small function to run through the .TSV datasets and yield the label, identifier, hashed features and feature values.
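A sketch of such a generator (the column order id, sentiment, review, the header row, and the dummy “1” label for the unlabeled test set are assumptions about the competition’s TSV layout, not a copy of the actual script):

import csv

def tsv_samples(path, fixed_size=2**20, has_label=True):
    # Yield (label, identifier, [(feature_index, feature_value), ...]) per review.
    with open(path) as f:
        reader = csv.reader(f, delimiter="\t")
        next(reader)                                   # skip the header row
        for row in reader:
            identifier = row[0]
            label = int(row[1]) if has_label else 1    # dummy "1" labels for the test set
            review = row[-1]
            features = [(hash(word) % fixed_size, 1) for word in review.split()]
            yield label, identifier, features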
Top benchmark
The current best benchmark on the forums is a logistic regression script from Abhishek. It does an in-memory tfidf-fit_transform on both the train and the test set to get 0.95 AUC on the leaderboard. We skip that transform and stick to our online hashing vectorizer.
We do generate 2-grams (such as “not good”) from the features, much like the script from Abhishek does. We simply hash these 2-grams and add them to the sample vectors.
We also quickly clean the text a little by lowercasing and removing anything not alphanumeric.
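Roughly, the cleaning and 2-gram hashing look like this (a sketch; the exact regex and the “_” joining character are our own choices, not necessarily the script’s):

import re

def to_features(text, fixed_size=2**20):
    # Lowercase and replace everything that is not alphanumeric with a space.
    words = re.sub(r"[^a-z0-9 ]", " ", text.lower()).split()
    # Unigrams plus 2-grams such as "not_good", all hashed into the same space.
    grams = words + ["_".join(pair) for pair in zip(words, words[1:])]
    return [(hash(g) % fixed_size, 1) for g in grams]

print(to_features("This movie is NOT good!"))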
Predictions
To optimize for AUC (the competition metric) we do not submit only “1” or “0” predictions: that would give us an AUC of around 0.88. Instead we submit the dot product, normalized between 0 and 1.
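Since AUC only cares about the ranking of the scores, any monotone rescaling of the dot products preserves it. A sketch of one simple choice, min-max rescaling, reusing sparse_dot, weights and the tsv_samples generator from the sketches above (the actual script’s normalization may differ):

def minmax(scores):
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo or 1.) for s in scores]   # guard against a constant score list

test_samples = list(tsv_samples("testData.tsv", fixed_size=len(weights), has_label=False))
dot_products = [sparse_dot(features, weights) for _, _, features in test_samples]
submission_scores = minmax(dot_products)    # bounded in [0, 1], ranking preserved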
Our script matches the solution from Abhishek with a score of 0.952, but it takes just a few MB of memory, needs no TF-IDF transform, and converges to 0 errors on the train set in just over 2 minutes.
Without cleaning and without 2grams the script runs in less than 60 seconds and produces a score of 0.93.
Pass  Errors  Average  Nr. Samples  Since Start
1     5672    0.77312  25000        0:00:04.353000
2     3110    0.8756   25000        0:00:08.464000
3     2280    0.9088   25000        0:00:12.482000
4     1659    0.93364  25000        0:00:16.485000
...
19    138     0.99448  25000        0:01:15.814000
20    110     0.9956   25000        0:01:19.922000
21    83      0.99668  25000        0:01:23.843000
...
29    40      0.9984   25000        0:01:54.922000
30    36      0.99856  25000        0:01:58.802000
31    17      0.99932  25000        0:02:02.607000
32    2       0.99992  25000        0:02:06.521000
33    0       1.0      25000        0:02:10.382000
Zero errors found during training, halting

Testing
Errors  Average  Nr. Samples  Since Start
12431   0.50276  25000        0:00:03.938000
Done testing in 0:00:03.981000
“A recurrent neural works, made his idea to the mentally crucifix ‘Cato Green Ronnulae Schill'” — Ilya Sutskever’s text generating RNN
ADCG SS14 Challenge 02 – Spam Mails Detection
I already wrote about the spam mail detection challenge before. I gave a lot of credit to the co-competitors who had to write their own classifiers. I am proud to say I have now done this competition with code I wrote myself too.
The top result for this competition was my own Vowpal Wabbit solution at 0.998 accuracy. I wrote a small parser to yield the label, id, and features from my old VW datasets.
This online learning perceptron script is able to exactly match the Vowpal Wabbit solution (0.99836). Perhaps we have a light-weight Perceptron script that is closer to a Vowpal Wabbit with hinge loss than we dare to think.
A simple question: “What exactly is information and data communication?” may be answered using two entirely different theories. The conventional Shannon information theory regards all information as “quantities”, which are transmitted using bits and bytes in meaningless bit streams. This outdated theory is still being taught at our universities, even though no biological creature communicates in this way. A biologically correct Autosophy information theory, in contrast, regards all information as “addresses”, which are communicated to create new knowledge in a receiver. — Klaus Holtz (1987)
Did not work
All challenges where the data was not pre-normalized between 0 and 1 gave poor results. This Perceptron script clearly likes sparse one-hot-encoded data, such as text. Also, regression and multi-class classification are not implemented yet.
Code
You can find the script used for the movie review challenge at the MLWave GitHub repo. Feedback, like adaptive normalized invariance implementations, is always welcome ;).
Conclusion
“For me there is no absolute knowledge: everything goes only by probability” – Gödel (1961)
A very basic perceptron with 0-1 threshold activation can do well in NLP contests. We did not change the 1957 algorithm that much at all. We could do away with the hashing trick and receive similar or slightly better scores.
This script cannot output a probability. One could apply a bounded sigmoid to the dot product to get a form of online normalization of the output, or better, study tinrtgu’s script.
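A sketch of that idea: clip the dot product and squash it through a sigmoid (the clipping bound of 20 is an arbitrary choice to avoid overflow in exp; without training against log loss the result is still just a bounded score, not a calibrated probability):

import math

def sigmoid_score(dot):
    # Clip the raw dot product, then squash it into (0, 1).
    dot = max(min(dot, 20.), -20.)
    return 1. / (1. + math.exp(-dot))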
Since a single-node, single-layer Perceptron cannot model non-linear functions, to get better NLP results we will have to go deeper and look at neural network structures like feedforward nets, recurrent nets, self-organizing maps, Multi-layer Perceptrons, word2vec and extreme learning machines (fast ffnets without backprop).
Update: Further Reading
Artificial Neurons and Single-Layer Neural Networks – How Machine Learning Algorithms Work Part 1 Thank you Sebastian Raschka
The Man Who Tried to Redeem the World with Logic – Walter Pitts rose from the streets to MIT, but couldn’t escape himself. Thank you Stephen Oates
Super interesting post. Thank you!
Hi, thank you for this post!
I would like to know why the results are good if the testing set shows about 50% errors. 12400+ reviews are misclassified.
Thanks!
Hi Luis,
I set dummy (fake) labels for the test set. All labels for the test set are “1”. The 50% error rate then tells us that we have predicted a “0” about 50% of the time. With a good classifier this tells us a little about the distribution of the test set.
When I create a 5k holdout set from the train set, and use the remaining 20k to train and the holdout set to test, then it will give accuracy of around 0.88. Because then it has the real target labels and can calculate the real error rate.
Vowpal Wabbit used to do much the same: When testing you use a test set with all “1” labels. Then the average loss is still being reported. You can use this number to see if your model has changed (predicts more 1s or less 1s).
Got it, I didn’t notice the test set for this Kaggle competition had all 1s as labels :-).
The error you see is misleading: all the labels of testData are initialized to 1. Since around 50% of them are not actually 1, you see an error of 50%.
Better: split the train data into 20K + 5K. Train on the 20K and test with the 5K. You would see an accuracy of around 86%.
I found one possible typo, which was breaking my code 🙂
weights[feature_index] = learning_rate * error * feature_value
should be expressed as
weights[feature_index] += learning_rate * error * feature_value
Whoops. Thank you! Updated.
Hi,
first of all: awesome post!
I have two questions regarding the perceptron:
1) Why do you want to do feature normalisation? The values of your features are always one; so what is there to normalize?
2) In your implementation you use the log transformed feature value ( log(1.+value) ) in the learning rule for updating the weights. Shouldn’t you then also use that log transformed value as input for the dot product that computes the prediction instead of the regular feature value?
Thanks!
1) I want to do feature normalization so I can use datasets with feature values in the ranges 0-100. It was not needed for these sparse text competitions, but on other competitions this algo failed (and feature normalization made it fail less hard). We do a form of count vectorization, not binary vectorization. If one word is found twice inside a text, the weights are updated with log( 2 ) + log ( 2 ) = ~0.60. Without taking the log this would be 1 + 1 = 2.
2) Maybe? :). I don’t know. Perhaps that’s the correct way to do it. For now this works (use the log for the feature value transform, use the dot product for predictions and progressive loss). I could have done away with the log(feature value + 1) entirely, and it would not have made a difference for these competitions. So if you think it’s cleaner that way, feel free to take the authentic update rule.
Yeah, you have a point. The updating rule is indeed doing a sort of feature counting. I guess it then makes sense to normalize it.
Cool, thanks!
Thanks for the great explanation! I am new to this field and had one question about hashing trick in general.
Suppose our string is: “This movie is great. Actors are great.”
Then the feature set generated is : [(712, 1), (169, 1), (394, 1), (605, 1), (20, 1), (149, 1), (605, 1)]
How will duplicates ((605,1) for the two “great.”) be handled? Are we going to lose the fact that “great” was used twice in the statement?
In our update rule we iterate through all our non-zero features. So if a word appears twice, it will update the weights twice for that word.
You could opt for a pure binary vectorizer. That way words are only counted once. A simple list(set(features)) will make every feature unique.
You could also write a pure count vectorizer. Features with the same feature_index would have their values added: (605,2).
We do a form of count vectorization: [(605,log(1+1)),(605,log(1+1))].
cool, thanks!
Not sure if you saw this recent amazing article on Walter Pitts’ life (h/t Scott Aaronson).
http://nautil.us/issue/21/information/the-man-who-tried-to-redeem-the-world-with-logic
It seems like the feature you are using for a word is log(2) * (number of times the word appeared in a review). Could you please clarify the reason behind multiplying with the log(2)? Thanks in advance.
You can ignore log(2) and it will work.
Unlike a CountVectorizer, you’re simply hashing every word of the sentence but not adding counts, right?
We hash every word, so if a word appears twice, it is hashed twice, and then the weights for that word are updated twice. So we do add the counts.
Hi,
for a college course we have to implement some solutions for the same Kaggle competition.
I’m currently programming the Perceptron solution.
When using the labeledTrainData as training data, and also as the test data (all 25000 reviews), the results show a 100% match, implying (sort of) that the weights were OK.
However, when trying the actual testData, the results were awful. About 66%.
Throwing in a couple of couts (yes, C++ is mandatory…), the dot product showed results below 0, and some above 1.
Might this be the problem? Any hint or idea where I might be making a mistake?
Thanks in advance
Ps: pardon my english.
Don’t mind my previous comment. It was a silly mistake.
I see you use up to 2-grams. How would you predict improvement using k-grams, apart from trial and error?
Another question I have is the use of log(1.+value) while training. The value is always 1. So, in the way this is being used, wouldn’t it be the same as using learning_rate = 0.1 * log(2)? Or is it intended for modifications in the feature set, e.g. multiple occurrences of one word, (6059, 2)?
Thanks in advance, excellent post. It made me understand the Perceptron better.
PS: Pardon my english.
In ML it does not matter what I predict, it matters what the machine predicts. If the machine likes 2-grams with 2-skips over 3-grams, who am I to disagree? :).
As a guideline: 2-grams often help with bag-of-words. 3-grams can too. But the curse of dimensionality starts to set in with a lot of (higher) k-grams. You can look at skip-grams (“Alice and Bob.” = alice_bob:1) and check out word2vec. Try it!
Log transform is done to make the update a bit smaller (crude way to account for the fact that we update a word multiple times if it appears multiple times). You could either ignore the log transform, or do it properly: log(1+nr_times_word_appears_in_document).
Where are all the great new posts? 🙂
Thanks! I decided to finish my Kaggle ensemble guide after this comment! 🙂
Why do we get an average of around 75% in the first pass?
Since no learning happens in the first pass, we should see something around 50% and improvement from there after subsequent passes.
Am I missing something? Can someone please help me understand this?
We start learning (updating the weights) with the very first sample we encounter (online learning), so over the course of the first pass (25k samples) the weights already pick up some predictive capability.
That answers my question. thanks a lot Triskelion.
Hi Triskelion, I am implementing a multiclass version of this. The idea is to use a different weight vector for each class. Do you have any other ideas? Please suggest.
Heya,
I am not familiar enough with multiclass yet, so I cannot give solid advice. I myself would look at the One-vs-All method.
Here Andrew Ng explains it: https://class.coursera.org/ml-005/lecture/38
I updated/refactored this Perceptron script for you, maybe that can be of help:
https://github.com/MLWave/online-learning-perceptron/blob/master/class_perceptron.py
Good luck and let us know how it goes!
Hi,
I’m trying to apply this to a similar classification problem and had a question: To turn the submission file into just binary 1 (positive) and 0 (negative), should I convert 0.5-1 into 1s and 0-0.49 into 0s?
Thanks
Yes, I’d do that at first. Later you may check out other thresholds that may lead to a higher accuracy.
Also, the perceptron starts at 13 errors and goes into 40-50 errors for me. Isn’t that weird?
Yes, that seems a bit weird, though I am not familiar with your data and size.
I’m using it to classify tweets in French by sentiment. It converges at 14 errors.