matryoshka doll

k-Nearest Neighbors and Clustering on Compressed Binary Files

Normalized Compression Distance (Cilibrasi & Vitányi) yields a similarity measure between binary files. This measure can be used for nearest-neighbor search, clustering, and classification. We are going to try some of these methods and review the results.

This is a draft version.

Normalized Compression Distance

In the post about normalized compression distance on chess games we already touched on the subject of NCD. The paper “Clustering by Compression [pdf]” contains a formal introduction.

An intuitive view

We have three e-mails: two of them are spam and one is legitimate. We compress these e-mails with a compressor.

The more you can compress an e-mail (shorten its file length), the less random it is: compressors excel at repetitive and predictable data and do a poor job on random noise. When there is structure in the data, and the compressor finds this structure, it uses that structure to compress the file better.

If one spam word appears a lot, it is stored in a dictionary (or compression table), and the compressor can substitute that word with the (shorter) key from the dictionary. In many compressors, the words (or structures) that appear most often get the shortest dictionary keys. One can see how this creates a kind of signature: a histogram of structures.
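A quick way to see this with Python's built-in bz2 module (the repeated phrase and sizes are just an illustration):

import bz2
import os

repetitive = b"the quick brown fox " * 500   # highly repetitive, 10,000 bytes
noise = os.urandom(len(repetitive))          # random noise of the same length

print(len(bz2.compress(repetitive)))  # a tiny fraction of the original size
print(len(bz2.compress(noise)))       # roughly the original size, or slightly larger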

compression process

Say those two spam e-mails both contain the spam word “baconeater” 10 times. If you concatenate those two e-mails into one file and compress it, the resulting file will be smaller than if you concatenate a legit e-mail and a spam e-mail. The compressor can use the repetitive data from both files to keep the file size small.

The similarity measure from Normalized Compression Distance is a general and robust measure that aims to encompass all similarity measures like Levenshtein distance, edit distance and entropy distance.
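Formally, with C(x) the compressed size of x and xy the concatenation of x and y, NCD(x, y) = (C(xy) − min(C(x), C(y))) / max(C(x), C(y)): values near 0 mean very similar, values near 1 mean unrelated. A minimal sketch in Python using the built-in bz2 module (the e-mail texts are made up for illustration):

import bz2

def ncd(x, y):
    # Normalized Compression Distance: ~0 for near-identical data, ~1 for unrelated data.
    cx, cy = len(bz2.compress(x)), len(bz2.compress(y))
    cxy = len(bz2.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

spam_1 = b"baconeater " * 10 + b"click here for a free sample"
spam_2 = b"baconeater " * 10 + b"limited time offer"
legit  = b"the meeting has been moved to thursday afternoon"

print(ncd(spam_1, spam_2))  # typically lower: the two spam mails share structure
print(ncd(spam_1, legit))   # typically higher: little shared structure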

A Deeper View

The book “Data compression explained” by Matt Mahoney is one of the best introductions to compression. Some quotes of interest:

  • Data has a universal but uncomputable probability distribution.
  • There is no algorithm that tests for randomness or tells you whether a string can be compressed any further.
  • The counting argument says that most strings are not compressible. So it is a rather remarkable fact that most strings that we care about, for example English text, images, software, sensor readings, and DNA, are in fact compressible. These strings generally have short descriptions, whether they are described in English or as a program in C or x86 machine code.
  • The digits of π are not really random. The digits are only unknown until you compute them. An intelligent compressor might recognize the digits of π and encode it as a description meaning “the first million digits of pi” … there are very small programs that can output π, some as small as 52 bytes.
  • Compression is an artificial intelligence problem. Prediction is intuitively related to understanding. If you understand a sequence, then you can predict it. If you understand a language, then you could predict what word might appear next in a paragraph in that language. If you understand an image, then you could predict what might lie under portions of the image that are covered up. Alternatively, random data has no meaning and is not predictable. This intuition suggests that prediction or compression could be used to test or measure understanding.

There is a difference between lossless and lossy compression. With lossless compression you do not lose any information; a lossless compression format for images is .PNG. With lossy compression you may lose some information (visible artifacts in the case of .JPG, or nearly inaudible ones in the case of .MP3).

lossy vs. original signal

We could lossily compress a corpus of millions of documents by creating a 32-bit binary signature for every document: take 32 anchor words (most popular or randomly sampled) and, for every document, set a position in a binary string to 1 if the document contains that word and to 0 if it does not. The example below uses only 4 anchors.

anchors = ["Alice", "Bob", "cat", "mouse"]

doc_1, sig_1 = "'Curiouser!' cried Alice.", "1000"

doc_2, sig_2 = "Alice passed the cat to Bob.", "1110"

doc_3, sig_3 = "It's a cat and mouse game.", "0011"

To “decompress” a signature, we emit the words flagged in it, repeated up to the average document size. Decompressing "1110" (original: "Alice passed the cat to Bob.") could become "Alice Bob cat Alice Bob cat".

If two documents are similar, then their signatures added together are easier to compress. If two documents both contain the words “Alice”, “Bob” and “cat”, their combined signature would be “11101110”, which could be compressed as 2*“1110”. If two documents contain different words, their combined signature could be “11100011”, which has less repeating structure for the compressor to use.
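A hypothetical helper for building such signatures (plain case-sensitive substring matching, so only a sketch):

def signature(doc, anchors):
    # One bit per anchor word: 1 if the word occurs in the document, else 0.
    return "".join("1" if word in doc else "0" for word in anchors)

anchors = ["Alice", "Bob", "cat", "mouse"]
print(signature("Alice passed the cat to Bob.", anchors))  # 1110
print(signature("It's a cat and mouse game.", anchors))    # 0011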

In essence, any representation of structure and statistics is a (lossy) compressed version of the raw data. A Bayesian filter or an n-gram histogram can be seen as lossy compression! Decompressing a Bayesian spam filter trained on a large corpus generates the archetypal spam e-mail: the e-mail containing the typical structure of a spam e-mail.

Likewise, if one were to compress the structures of a musician playing the piano (velocity, note, tempo, etc.), one could decompress this into an archetypal musical piece. Using genetic algorithms or Markov chains, one can use NCD as a fitness function to generate computer music. This will be the topic of a future post.

Compressors

NCD has been used to create an evolutionary tree based on mtDNA sequences. This research benefits from using the best (benchmarked) compressors currently known: The PAQ Data Compression Programs.

Clustered DNA sequencing

PAQ and PPM use a technique known as arithmetic coding. They offer excellent compression ratios at the cost of long compression and decompression times.

Enter Snappy

Google has built one of the fastest compressors for on-the-fly compression and decompression, called Snappy. It aims for very high speed and reasonable compression.

We will use Snappy and bzip2. Bzip2 has given consistent results, and Snappy allows us to compute the similarity between files of many megabytes in a split second.
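As a sketch, the compressor behind the size measurements can be swapped out; this assumes the python-snappy package for the Snappy side:

import bz2

try:
    import snappy                # python-snappy: very fast, modest compression
    compress = snappy.compress
except ImportError:
    compress = bz2.compress      # slower, but stronger and in the standard library

def compressed_size(data):
    return len(compress(data))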

As data is stored at Google in Snappy format, further study of a Normalized Snappy Distance could prove beneficial. Link-graph compression to estimate entropy, and search/classification on Snappy-compressed database indexes, will be the topics of future posts.

Running a Successful Experiment

We create a data set of 28 files, totaling 335 MB. 23 files are .mp3 files, taken from two releases made with different encoders; we expect the compressors to find this metadata inside the .mp3 files (LAME encoders leave clear ASCII markers). 5 files are .mp4 videos downloaded from a single YouTube account; we expect the compressors to find the structural differences between .mp3 and .mp4 files.

We ask for the 3 nearest neighbors of every file:

...

[1.000] data\files\01 - Purple Haze.mp3
[0.883] data\files\05 - Fire.mp3
[0.868] data\files\02 - Hey Joe.mp3
[0.861] data\files\07 - Little Wing.mp3

....

[1.000] data\files\01 Kanye West - On Sight.mp3
[0.860] data\files\07 Kanye West - Blood On The Leaves.mp3
[0.847] data\files\02 Kanye West - Black Skinhead.mp3
[0.838] data\files\05 Kanye West - Hold My Liquor.mp3

....

[1.000] data\files\13. Bullet Chess Game Online.mp4
[0.705] data\files\14. Bullet Chess Game Online.mp4
[0.696] data\files\15. Bullet Chess Game Online.mp4
[0.687] data\files\17. Bullet Chess Game Online.mp4
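A sketch of how such a neighbor listing could be produced; the bzip2-based ncd from the earlier sketch is repeated here, and reporting similarity as 1 − NCD is an assumption about how the scores above were computed:

import bz2
from pathlib import Path

def ncd(x, y):
    cx, cy = len(bz2.compress(x)), len(bz2.compress(y))
    return (len(bz2.compress(x + y)) - min(cx, cy)) / max(cx, cy)

def print_nearest_neighbors(paths, k=3):
    blobs = {p: Path(p).read_bytes() for p in paths}
    for p in paths:
        # Score every file by similarity (1 - NCD) and keep the top k plus the file itself.
        scored = sorted(((1 - ncd(blobs[p], blobs[q]), q) for q in paths), reverse=True)
        for sim, q in scored[:k + 1]:
            print("[%.3f] %s" % (sim, q))
        print()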

We then use these similarities to create hierarchical clusters:

0 data\files\01 - Purple Haze.mp3
0 data\files\05 - Fire.mp3
0 data\files\02 - Hey Joe.mp3
0 data\files\07 - Little Wing.mp3
0 data\files\06 - Highway Chile.mp3
0 data\files\03 - Crosstown Traffic.mp3
0 data\files\04 - The Wind Cries Mary.mp3
0 data\files\08 - All Along The Watchtower.mp3
0 data\files\09 - Manic Depression.mp3
0 data\files\10 - Foxy Lady.mp3
0 data\files\11 - Spanish Castle Magic.mp3
0 data\files\12 - Burning Of The Midnight Lamp.mp3
0 data\files\13 - Castles Made Of Sand.mp3

1 data\files\01 Kanye West - On Sight.mp3
1 data\files\02 Kanye West - Black Skinhead.mp3
1 data\files\03 Kanye West - I Am A God.mp3
1 data\files\04 Kanye West - New Slaves.mp3
1 data\files\05 Kanye West - Hold My Liquor.mp3
1 data\files\06 Kanye West - I'm In It.mp3
1 data\files\07 Kanye West - Blood On The Leaves.mp3
1 data\files\08 Kanye West - Guilt Trip.mp3
1 data\files\09 Kanye West - Send It Up.mp3
1 data\files\10 Kanye West - Bound 2.mp3

2 data\files\13. Bullet Chess Game Online.mp4
2 data\files\14. Bullet Chess Game Online.mp4
2 data\files\15. Bullet Chess Game Online.mp4
2 data\files\16. Bullet Chess Game Online.mp4
2 data\files\17. Bullet Chess Game Online.mp4

This is a perfect fit, with a script running time of 6 seconds.
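A sketch of the clustering step, assuming SciPy's average-linkage hierarchical clustering on the pairwise NCD matrix (the post does not say which linkage or library was actually used):

import bz2
import numpy as np
from pathlib import Path
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def ncd(x, y):
    cx, cy = len(bz2.compress(x)), len(bz2.compress(y))
    return (len(bz2.compress(x + y)) - min(cx, cy)) / max(cx, cy)

def cluster_files(paths, n_clusters=3):
    blobs = [Path(p).read_bytes() for p in paths]
    n = len(blobs)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = ncd(blobs[i], blobs[j])
    # Condense the symmetric distance matrix and cut the dendrogram into n_clusters.
    labels = fcluster(linkage(squareform(dist), method="average"),
                      t=n_clusters, criterion="maxclust")
    for label, path in sorted(zip(labels, paths)):
        print(label - 1, path)   # 0-based labels, as in the listing above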

Todo:

  • k-Nearest Neighbors
  • Clustering
