Lossless Compression of English Short Messages
This lossless compressor compresses English text far more tightly than general-purpose compressors. Its typical compression ratio is 15% (number of output bits divided by the number of input bits, so lower is better).
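As a quick illustration of the ratio defined above, the following minimal Python sketch computes it for a made-up case. The function name `compression_ratio`, the 100-byte message and the 120-bit output size are hypothetical values chosen for the example, not measurements from the compressor.

```python
def compression_ratio(original: bytes, compressed_bits: int) -> float:
    """Return output bits divided by input bits (lower is better)."""
    if not original:
        raise ValueError("original message must not be empty")
    return compressed_bits / (8 * len(original))

# Hypothetical example: a 100-byte message (800 bits) compressed to 120 bits.
message = b"x" * 100
ratio = compression_ratio(message, 120)
print(f"{ratio:.0%}")
```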
The compression relies on the probability of the next word, as computed by the GPT-2 language model released by OpenAI. The model used here is a neural network of 345 million parameters based on the Transformer architecture (the largest GPT-2 model, with 1.5 billion parameters, brings only marginal improvement when compressing short messages). An arithmetic coder generates the bit stream. For this demo, each compressed character holds 15 data bits by using the CJK and Hangul Syllables Unicode ranges.
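To illustrate how next-symbol probabilities, an arithmetic coder and the 15-bit character packing fit together, here is a minimal Python sketch. It is not the gpt2tc implementation: `toy_model` is a hypothetical three-symbol stand-in for GPT-2, exact fractions replace the fixed-precision arithmetic a practical coder would use, and every function name is invented for illustration. The character mapping is also an assumption. The CJK Unified Ideographs block (U+4E00 to U+9FFF) and the Hangul Syllables (U+AC00 to U+D7A3) together provide 20,992 + 11,172 = 32,164 code points, short of the 32,768 needed for 15 bits, so the sketch borrows the remainder from CJK Extension A; the demo's actual mapping may differ.

```python
from fractions import Fraction
import math

EOS = "$"  # hypothetical end-of-message symbol

def toy_model(context):
    """Hypothetical stand-in for GPT-2: (symbol, probability) pairs given the context."""
    if context and context[-1] == "a":
        return [("a", Fraction(1, 8)), ("b", Fraction(3, 4)), (EOS, Fraction(1, 8))]
    return [("a", Fraction(3, 4)), ("b", Fraction(1, 8)), (EOS, Fraction(1, 8))]

def encode(symbols, model):
    """Arithmetic-code symbols (which must end with EOS) into a string of '0'/'1'."""
    symbols = list(symbols)
    if not symbols or symbols[-1] != EOS:
        raise ValueError("message must end with EOS")
    low, width = Fraction(0), Fraction(1)
    for i, target in enumerate(symbols):
        cum = Fraction(0)
        for sym, p in model(symbols[:i]):
            if sym == target:
                # Keep only the slice of the interval assigned to this symbol.
                low, width = low + width * cum, width * p
                break
            cum += p
        else:
            raise ValueError(f"symbol {target!r} not in model")
    # Emit the shortest binary fraction n / 2**k lying inside [low, low + width).
    k = 1
    while True:
        n = math.ceil(low * 2**k)
        if Fraction(n, 2**k) < low + width:
            return format(n, "b").zfill(k)
        k += 1

def decode(bits, model, max_symbols=1000):
    """Replay the interval narrowing to recover the symbols, stopping at EOS."""
    value = Fraction(int(bits, 2), 2 ** len(bits))
    low, width = Fraction(0), Fraction(1)
    out = []
    while len(out) < max_symbols:
        cum = Fraction(0)
        for sym, p in model(out):
            start = low + width * cum
            if start <= value < start + width * p:
                low, width = start, width * p
                out.append(sym)
                break
            cum += p
        else:
            raise ValueError("model probabilities do not cover the value")
        if out[-1] == EOS:
            return out
    raise ValueError("no EOS found")

# Assumed mapping of 15-bit values to code points: (first code point, size).
RANGES = [
    (0x4E00, 20992),  # CJK Unified Ideographs
    (0xAC00, 11172),  # Hangul Syllables
    (0x3400, 6592),   # CJK Extension A (assumption, fills the remaining values)
]

def bits_to_text(bits):
    """Pack a bit string into characters, 15 bits each, zero-padding the last one."""
    bits += "0" * (-len(bits) % 15)
    chars = []
    for i in range(0, len(bits), 15):
        value = int(bits[i:i + 15], 2)
        for start, size in RANGES:
            if value < size:
                chars.append(chr(start + value))
                break
            value -= size
    return "".join(chars)

def text_to_bits(text):
    """Inverse of bits_to_text: 15 bits per character."""
    out = []
    for ch in text:
        offset = 0
        for start, size in RANGES:
            if start <= ord(ch) < start + size:
                out.append(format(offset + ord(ch) - start, "015b"))
                break
            offset += size
        else:
            raise ValueError(f"unexpected character {ch!r}")
    return "".join(out)

message = list("abab") + [EOS]
bits = encode(message, toy_model)
text = bits_to_text(bits)
assert decode(text_to_bits(text), toy_model) == message
```

Because the decoder stops at the end-of-message symbol, the zero bits that pad the last character do not change the decoded result. A symbol the model considers likely narrows the interval only slightly, which is why a more accurate predictive model yields a shorter bit stream.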
It is implemented using the LibNC library and runs on a standard PC. The Linux standalone command line version (gpt2tc) can be downloaded. The compression ratios on several text compression benchmarks are listed separately.
A similar model can be used to complete text messages.