Some experiments and research into applying Normalized Compression Distance (NCD) to chess games. The goal is to cluster the games of different grand masters. Using Bzip as the compressor we get an accuracy of 71% on 21 files from 7 different grand masters.
NCD? A short introduction
NCD, in layman terms, works on the principle that two files which share data patterns compress better when you add them together, than two files which don’t share data patterns, since compressors are more efficient when dealing with repeating data patterns.
- http://complearn.org/ncd.html The website accompanying the paper
- Clustering by compression [pdf] The original paper
A simple example
File A contents:
File B contents:
File C contents:
Using a very basic compressor called a Run-Length Encoder (RLE) we can compress these file contents.
File A contents compressed:
File B contents compressed:
File C contents compressed:
To get a similarity measure between files, you add two files together and look at the length of their compression.
File A contents + File B contents:
File A + File B compressed:
Which gives a compressed length of 6.
File B contents + File C contents:
File B + File C compressed:
Which gives a compressed length of 15.
As the files are already of equal length, we don’t need normalization to spot that File A and File B are more alike, than File B and File C.
NCD is an approximation. We would like to use the Kolmogorov Complexity (KC) for a string, but KC is incomputable. Better compressors are valuable to increase NCD’s accuracy and AI in general, so much so that Marcus Hutter made a price for it 50’000€ Prize for Compressing Human Knowledge.
Uses of NCD
NCD gives a similarity measure between two files or data streams which can be used to cluster, classify, compare, or even fed to autonomous agents.
An experiment clustering chess games
We take thousands of games of 7 grand masters and put their games in a file with the name of the grandmaster. Then we split these 7 files into 3 files. Our experiment is look if we can cluster these files back into 7 grand masters clusters with each their 3 files.
Clusters 21 files containing game data from 7 Grandmasters. All games played by a Grandmaster with the white pieces are stored inside a .txt file. This .txt file is split up in 3 smaller .txt files, which are randomized and stripped of meta-data (like the dates, locations, tournament or player names).
The challenge is to create 7 clusters containing the game data files belonging to the same Grandmaster. Opponent moves not stripped from the data. This makes the data less indicative of the Grandmaster’s unique style (data patterns) we hope to find. This adds to the challenge.
Expert knowledge or chess domain knowledge isn’t required for good results. An International Master confronted with this challenge claimed the challenge is impossible to do by hand and would take hours to even analyze.
A string with chess games moves.
1.Sf3 Sf6 2.c4 b6 3.g3 Lb7 4.Lg2 c5 5.0-0 g6 6.b3 Lg7 7.Lb2 0-0 8.Sc3 Sa6 9.d4 d5 10.dxc5 Sxc5 11.Sxd5 Sxd5 12.Lxg7 Kxg7 13.cxd5 Dxd5 14.Dxd5 Lxd5 15.Tfd1 ½-½ 1.d4 e6 2.e4 d5 3.Sd2 c5 4.exd5 exd5 5.Lb5+ Ld7 6.De2+ Le7 7.dxc5 Sf6 8.Sb3 0-0 9.Le3 Te8 10.Sf3 a6 11.Lxd7 Sbxd7 12.0-0 Sxc5 13.Sfd4 ½-½ 1.e4 c5 2.Sf3 Sc6 ...
Using the chess games of the following 7 Grandmasters, inspired by some of Fischer’s favorite opponents:
- (A) Gligoric
- (B) Petrosian
- (C) Reshevsky
- (D) Spassky
- (E) Tal
- (F) Botvinnik
- (G) Larsen
this script gives us the following clusters, together with their fitness/similarity score:
Using compressor: bz2 Clusters: (0.9187782850134746, 'data/chessgames\\C1_.txt', 'data/chessgames\\C2_.txt', 'data/chessgames\\F1_.txt') (0.9213097232083918, 'data/chessgames\\D2_.txt', 'data/chessgames\\E1_.txt', 'data/chessgames\\E2_.txt') (0.9255006608357625, 'data/chessgames\\B1_.txt', 'data/chessgames\\B2_.txt', 'data/chessgames\\B3_.txt') (0.9275354033562815, 'data/chessgames\\G1_.txt', 'data/chessgames\\G2_.txt', 'data/chessgames\\G3_.txt') (0.9286958400620074, 'data/chessgames\\A1_.txt', 'data/chessgames\\A2_.txt', 'data/chessgames\\A3_.txt') (0.931262816636599, 'data/chessgames\\C3_.txt', 'data/chessgames\\F2_.txt', 'data/chessgames\\F3_.txt') (0.9343742509153218, 'data/chessgames\\D1_.txt', 'data/chessgames\\D3_.txt', 'data/chessgames\\E3_.txt') Script execution time: 0:00:37
Worst possible result is 7 correct guesses. This would mean all clusters contain files from different Grandmasters.
Best possible result is 21 correct guesses. This would mean all clusters contain the files from the same Grandmaster.
The script scores 17 correct guesses for a score of 10/14 or 71%.
Bz2 outperforms other compressors with regards to speed and accuracy. Clustering 15 files or less often results in perfectly correct clusters. Clustering 24 files or more starts to give poor results. Currently looking at Google’s Snappy and PAQ8.
Python code and the data sets for this experiment are available at Github.