Let’s take a look at the perceptron: the simplest artificial neuron. This article goes from a concept devised in 1943 to a Kaggle competition in 2015, and shows that a single artificial neuron can reach 0.95 AUC on an NLP sentiment analysis task (predicting whether a movie review is positive or negative).
In logic there are no morals. Everyone is at liberty to build up his own logic, i.e., his own form of language, as he wishes. – Rudolf Carnap (1934) “Logical Syntax of Language”
The birth of artificial neural nets started with the 1943 paper “A Logical Calculus of the Ideas Immanent in Nervous Activity”. Two researchers, McCulloch, a neurologist, and Pitts, a logician, joined forces to sketch out the first artificial neurons.
McCulloch reasoned that nervous activity had an ‘all-or-nothing’ character: a neuron fires (outputs “1”) once its activation threshold is reached, or it doesn’t fire (outputs “0”).
Pitts saw and understood the potential to capture propositional logic using such neurological principles.
A bird will try to eat brown round objects, such as seeds. We can represent this behaviour as:
One could model this behaviour with a McCulloch-Pitts neuron. If you set the activation threshold to 1.5, the neuron fires only when both the “brown” AND “round” properties are satisfied: only then is the sum of the inputs 2 (1 + 1), which exceeds the activation threshold of 1.5.
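A minimal sketch of such a neuron in Python (the function name is mine; the inputs and threshold follow the bird example above):

```python
def mcculloch_pitts_fire(inputs, threshold):
    """Fire (return 1) only when the sum of binary inputs exceeds the threshold."""
    return 1 if sum(inputs) > threshold else 0

# Bird example: inputs are (brown, round), threshold 1.5
print(mcculloch_pitts_fire((1, 1), 1.5))  # brown AND round -> fires: 1
print(mcculloch_pitts_fire((1, 0), 1.5))  # brown only -> does not fire: 0
```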
As a condition for the use of logical inferences and the performance of logical operations, something must already be given to our faculty of representation, certain extralogical concrete objects that are intuitively present as immediate experience prior to all thought. – Hilbert (1926)
Invented in 1957 by cognitive psychologist Frank Rosenblatt, the perceptron algorithm was the first artificial neural net implemented in hardware.
In 1960 researchers at Cornell Aeronautical Laboratory, with funding from the US Office of Naval Research, randomly hooked 400 photocells to a perceptron and the “Mark 1 perceptron” was born. It was capable of basic image recognition.
A perceptron works by assigning weights to incoming connections. With the McCulloch-Pitts neuron we took the sum of the values from the incoming connections, then checked whether it was above or below a certain threshold. With the perceptron we instead take the dot product: we multiply each incoming value by a weight and take the sum: (value1 * weight1) + (value2 * weight2) etc.
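In code the difference is a single function; the values and weights below are made up for illustration:

```python
def dot_product(values, weights):
    # (value1 * weight1) + (value2 * weight2) + ...
    return sum(v * w for v, w in zip(values, weights))

# two binary input features, illustrative integer weights
print(dot_product([1, 1], [2, 3]))  # 5
print(dot_product([1, 0], [2, 3]))  # 2
```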
A perceptron is a supervised classifier. It learns by first making a prediction: is the dot product above or below the threshold? If it is above the threshold it predicts a “1”, if it is below the threshold it predicts a “0”.
Then the perceptron looks at the label of the sample. If the prediction was correct, the error is “0” and it leaves the weights alone. If the prediction was wrong, the error is either “-1” or “1” and the perceptron updates the weights like:
weights[feature_index] += learning_rate * error * feature_value
Above code corrected: Thank you leakymirror.
Here a perceptron learns to model a NAND function without any errors:
""" Creative Commons A-SA, source: http://en.wikipedia.org/wiki/Perceptron """ threshold = 0.5 learning_rate = 0.1 weights = [0, 0, 0] training_set = [((1, 0, 0), 1), ((1, 0, 1), 1), ((1, 1, 0), 1), ((1, 1, 1), 0)] def dot_product(values, weights): return sum(v * w for v, w in zip(values, weights)) while True: print('-' * 60) error_count = 0 for input_vector, desired_output in training_set: print(weights) result = dot_product(input_vector, weights) > threshold error = desired_output - result if error != 0: error_count += 1 for index, value in enumerate(input_vector): weights[index] += learning_rate * error * value if error_count == 0: break
And there we have it, a simple linear classifying single node neural net: the perceptron.
The embryo of an electronic computer that the Navy expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence. – The New York Times reporting on the Perceptron (1958)
Online Learning Perceptron in Python
We are going to implement the above perceptron algorithm in Python. We use only standard libraries, so the script also runs on PyPy (a 3-4x speedup), taking massive inspiration from tinrtgu’s online logistic regression script first seen on the Kaggle forums: “Beat the benchmark with less than 200MB of memory”. You can find our script in the MLWave Github repo.
The perceptron is capable of online learning (learning from samples one at a time). This is useful for larger datasets, since you do not need the entire dataset in memory.
Now to steal some features we’ve seen in our favorite online machine learning tool, Vowpal Wabbit:
The vectorizing hashing trick originated with Vowpal Wabbit (John Langford). This trick fixes the number of incoming connections to the perceptron: we hash every raw feature to a number lower than that fixed size.
Let’s take a sample like:
This movie sucks
it may be vectorized with hashing as:
sample = "This movie sucks" fixed_size = 1024 features = [(hash(f)%fixed_size,1) for f in sample.split()] print features
To get the sparse (all non-zero elements) format:
```
# list of tuples in the form (feature_index, feature_value)
>> [(712, 1), (169, 1), (364, 1)]
```
Progressive Validation Loss
Learning from samples one at a time also gives us a progressive training loss. When the model encounters a sample, it first makes a prediction without looking at the target.
Then we compare this prediction with the target label and calculate an error rate. A low error rate tells us we are close to a good model.
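A sketch of how this predict-then-learn error rate can be tracked while streaming samples (`predict` stands in for any model; the dummy below always predicts 1):

```python
def progressive_error_rate(samples, predict):
    """Count errors made BEFORE seeing each label, then return the error rate."""
    errors = 0
    for n, (features, label) in enumerate(samples, start=1):
        if predict(features) != label:  # predict before looking at the label
            errors += 1
        # ... the model update with (features, label) would happen here ...
    return errors / n

stream = [([1, 0], 1), ([0, 1], 0), ([1, 1], 1)]
print(progressive_error_rate(stream, predict=lambda f: 1))  # 1 of 3 wrong
```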
“No 9000 computer has ever made a mistake or distorted information. We are all, by any practical definition of the words, foolproof and incapable of error.” — HAL 9000 (1968)
Vowpal Wabbit also has online feature value normalization. We use a simple log(feature_value + 1) for now and politely bow to the master’s magnificence.
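That normalization is one line (the +1 keeps a zero count at exactly zero):

```python
import math

def normalize_value(feature_value):
    # simple online normalization: log(feature_value + 1)
    return math.log(feature_value + 1)

print(normalize_value(0))  # 0.0
```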
With Vowpal Wabbit you can run multiple passes over the dataset with a decaying learning rate.
We can also run through the dataset multiple times, for as long as the error rate keeps dropping. If the training data is linearly separable, the perceptron convergence theorem guarantees the perceptron reaches 0 errors on the train set in a finite number of updates.
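A sketch of a multi-pass loop with a decaying learning rate. The 1/pass schedule here is one common choice for illustration, not the exact schedule Vowpal Wabbit uses:

```python
def train(training_set, threshold=0.5, max_passes=100):
    """Multi-pass perceptron training with a decaying learning rate."""
    weights = [0.0] * len(training_set[0][0])
    for pass_nr in range(1, max_passes + 1):
        learning_rate = 1.0 / pass_nr  # decay: smaller steps on later passes
        errors = 0
        for values, label in training_set:
            prediction = sum(v * w for v, w in zip(values, weights)) > threshold
            error = label - prediction
            if error != 0:
                errors += 1
                for i, v in enumerate(values):
                    weights[i] += learning_rate * error * v
        if errors == 0:
            break  # separable data: zero train errors reached
    return weights, pass_nr

# NAND with a bias input, same training set as above
nand = [((1, 0, 0), 1), ((1, 0, 1), 1), ((1, 1, 0), 1), ((1, 1, 1), 0)]
weights, passes = train(nand)
print(weights, "after", passes, "passes")
```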
“By the end, if you make it that far, you’ll be hoping that the rabbit completes its mission.” – User Review for the movie “The Nasty Rabbit” (1964)
When bag of words meets bags of popcorn
The task is to predict if a review has an accompanying rating less than 5 (negative), or over 7 (positive) for 25k test samples. We have a labeled train dataset of 25k samples to build our model.
We write a small function to run through the .TSV datasets and yield the label, identifier, hashed features and feature values.
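A sketch of such a generator. The column layout (id, sentiment, review) is an assumption based on the competition’s labeled train file; adapt the indices to your own .tsv:

```python
import csv

def get_data_tsv(loc_dataset, fixed_size=2 ** 20):
    """Stream a .tsv dataset, yielding (label, identifier, hashed features).
       Assumes columns: id, sentiment, review; adjust as needed."""
    with open(loc_dataset) as f:
        reader = csv.reader(f, delimiter="\t")
        next(reader)  # skip the header row
        for row in reader:
            identifier, label, review = row[0], int(row[1]), row[2]
            # hashing trick: map every token into a fixed-size index space
            features = [(hash(token) % fixed_size, 1) for token in review.split()]
            yield label, identifier, features
```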
The current best benchmark on the forums is a logistic regression script from Abhishek. It does an in-memory tfidf-fit_transform on both the train and the test set to get 0.95 AUC on the leaderboard. We skip that transform and stick to our online hashing vectorizer.
We do generate 2-grams (such as “not good”) from the features, much like the script from Abhishek does. We simply hash these 2-grams and add them to the sample vectors.
We also quickly clean the text a little by lowercasing and removing anything not alphanumeric.
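A sketch of both steps together, cleaning plus 2-gram hashing (the function names are mine, not from the actual script):

```python
import re

def clean(text):
    # lowercase and strip anything that is not alphanumeric or whitespace
    return re.sub(r"[^a-z0-9\s]", "", text.lower())

def hashed_features(text, fixed_size=2 ** 20):
    tokens = clean(text).split()
    # 2-grams like "not good" carry signal that single words miss
    grams = tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return [(hash(g) % fixed_size, 1) for g in grams]

print(hashed_features("Not good, at all!")[:2])
```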
To optimize for AUC (the competition metric) we do not submit only “1” or “0” predictions; that would give an AUC of around 0.88. Instead we submit the dot product, normalized between 0 and 1.
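One reading of “normalized between 0 and 1” is a min-max scaling over all test-set dot products, so the raw scores rank like probabilities; the actual script may do this differently:

```python
def normalize_scores(dot_products):
    """Min-max scale raw dot products into [0, 1], preserving their ranking."""
    lo, hi = min(dot_products), max(dot_products)
    return [(d - lo) / (hi - lo) for d in dot_products]

print(normalize_scores([-2.0, 0.0, 6.0]))  # [0.0, 0.25, 1.0]
```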
Our script matches the solution from Abhishek with a score of 0.952, but it takes just a few MB of memory, skips the TF-IDF transform, and converges to 0 errors on the train set in just over 2 minutes.
Without cleaning and without 2-grams the script runs in under 60 seconds and produces a score of 0.93.
```
Pass  Errors  Average  Nr. Samples  Since Start
1     5672    0.77312  25000        0:00:04.353000
2     3110    0.8756   25000        0:00:08.464000
3     2280    0.9088   25000        0:00:12.482000
4     1659    0.93364  25000        0:00:16.485000
...
19    138     0.99448  25000        0:01:15.814000
20    110     0.9956   25000        0:01:19.922000
21    83      0.99668  25000        0:01:23.843000
...
29    40      0.9984   25000        0:01:54.922000
30    36      0.99856  25000        0:01:58.802000
31    17      0.99932  25000        0:02:02.607000
32    2       0.99992  25000        0:02:06.521000
33    0       1.0      25000        0:02:10.382000
Zero errors found during training, halting

Testing
Errors  Average  Nr. Samples  Since Start
12431   0.50276  25000        0:00:03.938000
Done testing in 0:00:03.981000
```
“A recurrent neural works, made his idea to the mentally crucifix ‘Cato Green Ronnulae Schill'” — Ilya Sutskever’s text generating RNN
ADCG SS14 Challenge 02 – Spam Mails Detection
I already wrote about the spam mail detection challenge before. There I gave a lot of credit to the co-competitors who had to write their own classifiers. I am proud to say I have now done this competition with code I wrote myself too.
The top result for this competition was my own Vowpal Wabbit solution at 0.998 accuracy. I wrote a small parser to yield the label, id, and features from my old VW datasets.
This online learning perceptron script exactly matches the Vowpal Wabbit solution (0.99836). Perhaps we have a lightweight perceptron script that is closer to Vowpal Wabbit with hinge loss than we dared to think.
A simple question: “What exactly is information and data communication?” may be answered using two entirely different theories. The conventional Shannon information theory regards all information as “quantities”, which are transmitted using bits and bytes in meaningless bit streams. This outdated theory is still being taught at our universities, even though no biological creature communicates in this way. A biologically correct Autosophy information theory, in contrast, regards all information as “addresses”, which are communicated to create new knowledge in a receiver. — Klaus Holtz (1987)
Did not work
All challenges where the data is not pre-normalized between 0 and 1 gave poor results. This perceptron script clearly prefers sparse one-hot-encoded data, such as text. Regression and multi-class classification are also not implemented yet.
You can find the script used for the movie review challenge at the MLWave Github repo. Feedback, like adaptive normalized invariance implementations, is always welcome ;).
“For me there is no absolute knowledge: everything goes only by probability” – Gödel (1961)
A very basic perceptron with a 0-1 threshold activation can do well in NLP contests. We did not change the 1957 algorithm much at all. We could do away with the hashing trick and receive similar or slightly better scores.
This script cannot output a probability. One could use a bounded sigmoid on the dot product to get a form of online normalization of the output, or better, study tinrtgu’s script.
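A sketch of squashing the dot product through a sigmoid; this yields a bounded score in (0, 1), not a calibrated probability, since the scale of raw dot products is arbitrary:

```python
import math

def sigmoid_score(dot_product):
    # maps any real-valued dot product into the open interval (0, 1)
    return 1.0 / (1.0 + math.exp(-dot_product))

print(sigmoid_score(0.0))  # 0.5, the threshold point
```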
Since a single-node, single-layer perceptron cannot model non-linear functions, to get better NLP results we will have to go deeper and look at neural network structures like feedforward nets, recurrent nets, self-organizing maps, multi-layer perceptrons, word2vec and extreme learning machines (fast feedforward nets without backprop).
Update: Further Reading
Artificial Neurons and Single-Layer Neural Networks – How Machine Learning Algorithms Work, Part 1. Thank you, Sebastian Raschka.