Useless but mildly interesting language model using compressors built into Python.

## Usage

You can “train” it using some training data:

```
import ziplm

data = open(my_favorite_text_file).read().lower()
alphabet = "qwertyuiopasdfghjklzxcvbnm,.;1234567890 "
model = ziplm.ZipModel(alphabet, training=data)
"".join(model.sample_sequence(10)) # sample 10 characters from the alphabet
```

You can also run it without any training data, and just forward sample to see what kinds of patterns gzip likes:

```
alphabet = "qwertyuiopasdfghjklzxcvbnm "
model = ziplm.ZipModel(alphabet)
"".join(model.sample_sequence(100)) # I get 'oodegeeup ewjlrm chuzjqvrm chukkzfcjmwndxnlrm o qewmxweqswdqrlfiuyxlygxkxinsvrwrfjr ywmzwc mxhqerxzy'
```

You can also get the probability for a sequence:

```
alphabet = "qwertyuiopasdfghjklzxcvbnm "
model = ziplm.ZipModel(alphabet)
model.sequence_logprob("this is my favorite string") # I get -83.8
```

You can also try using `bz2` and `lzma` as language models by passing them as the `compressor` argument to the model:
```
import lzma
model = ziplm.ZipModel(alphabet, compressor=lzma)
"".join(model.sample_sequence(100)) # I get 'pvmucalppovpriitgztwbwgksgphhnhdf huykoppunlrh izjbokivpcx eagjqiyfmaibjconivxobcmipdccwqoyiwxzisgzx'
```

## Why does this “work”?

This works because of two facts:

- A language model is nothing but a distribution on the next token given the previous tokens, $p(x \mid c)$.
- There is a general equivalence between *probability distributions* and *codes*.

The second point is what makes this interesting. Information theory tells us that we can derive codes from probability distributions. That is, if I have some datapoints $x$ distributed according to $p(x)$, then there exists a code that describes datapoint $x$ using about $-\log_2 p(x)$ bits, and no code can do better on average.

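As a toy illustration of the distributions-to-codes direction (the numbers here are hand-picked for the example, not from ziplm):

```python
import math

# A toy distribution over four symbols.
p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

# The ideal code assigns each symbol a codeword of length -log2 p(x) bits.
code_lengths = {x: -math.log2(px) for x, px in p.items()}
# "a" gets 1 bit, "b" gets 2 bits, "c" and "d" get 3 bits each;
# a Huffman code built from these probabilities achieves exactly these lengths.
```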
Since I can convert probability distributions to codes in this way, I can also convert codes to probability distributions. If I have a code (like gzip) that describes my datapoint $x$ with length $\ell(x)$ bits, then I can read it as the probability distribution $p(x) \propto 2^{-\ell(x)}$: datapoints that compress well are "probable," and datapoints that compress poorly are "improbable."

The ZipLM model works by converting code lengths to probabilities in this way. If I have a vocabulary of size $V$, then the probability of a candidate next token $x$ in context $c$ is

$$p(x \mid c) = \frac{2^{-\ell(cx)}}{\sum_{i=1}^{V} 2^{-\ell(cx_i)}} \propto 2^{-\ell(cx)},$$

where $\ell(cx)$ is the compressed length of the context followed by the token, and the proportionality reflects the fact that we have to sum over the compressed lengths of all $V$ candidate continuations to normalize the distribution.
