
Empirical Frequency Natural Language Processing


efnlp

This is the jankiest possible take at computing n-gram or conditional empirical frequency ((C)EF) distributions for successor tokens given a fixed-width prefix. Yes, that's just a Markov model. In a way this is purposefully naive; I'm just interested in how far such a naive technique goes.
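
For concreteness, a minimal sketch of the counting step might look like the following (illustrative only, with made-up names like `cef_counts`; this is not the package's actual API):

```python
# Illustrative sketch: count successor tokens for every fixed-width prefix
# in a token sequence, i.e., the conditional empirical frequencies ((C)EF).
from collections import Counter, defaultdict

def cef_counts(tokens, block=3):
    """Map each length-`block` prefix (as a tuple) to a Counter of successor tokens."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - block):
        prefix = tuple(tokens[i : i + block])
        counts[prefix][tokens[i + block]] += 1
    return counts

# e.g., with character tokens:
#   cef_counts(list("to be or not to be"))[tuple("to ")] == Counter({'b': 2})
```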

There's python and c++ code (in efnlp and cpp respectively) for analyzing text and creating sampling datastructures. There's also go code (in go) for a gRPC server to read sampling datastructures and serve batch or streaming generated text.

Motivation (even though we just playin)

The basic idea is to explore computing these values and using them in generative sampling tasks. We "hypothesize" that the asymptotic (in data and in model-class universality) result of computational (L)LMs is a high-fidelity representation of the (C)EF in the training data. This is a bit tautological for a consistent statistical model estimated on sufficient data; see paper/*.pdf for more. However, I have had repeated personal experiences with really cool, complicated models basically reducing to the information in counts and averages, much like (C)EFs.

Of course, a (C)EF "model" should have serious generalizability challenges when confronted with valid sequences not appearing in the training data. Particularly if we wish to generate text (via token sequences), we may generate sequences purely by chance that have no (or limited) precedent and thus have limited predictive capacity. (A Markov model can always reduce to token occurrence frequencies, though randomly choosing tokens is far from language modeling.) This is a primary goal of statistical models: to find (or use) structure to "fill in the gaps" of a particular corpus of training data. However, as a training corpus grows to encompass more and more of the possibilities of discourse, the likelihood we have not "seen" a valid sequence should also decrease, and a model should also be getting progressively more aligned with the (C)EFs.

Also, we would expect the required storage for a purely (C)EF "model" to grow quite large. This is, in principle, another primary goal of statistical models: to "compress" statistical information without diminishing validation/generalization performance. Yet the SOTA statistical models are themselves extraordinarily large; samples are not reasonably computable on a single machine, involving many billions of parameters and linear algebra heavy enough to require specialized hardware.

Obviously if simple (C)EF models were of comparable quality to modern LLMs they would be making all the noise instead. This isn't about a better approach, just a personal investigation.

Example

You can run the python code from the CLI as

$ python -m efnlp -c data/tinywillspeare.txt -m -s -b 10 -g 100000 -o sample-results.txt
[2023-01-28T21:37:56.223686 | 0.000020s] Reading text corpus
[2023-01-28T21:37:56.225040 | 0.001374s] Forming (character) language
[2023-01-28T21:37:56.273636 | 0.049969s] Encoding corpus in the language constructed
[2023-01-28T21:37:56.345648 | 0.122002s] Corpus is 1,115,393 tokens long
[2023-01-28T21:37:56.345711 | 0.122036s] Parsing prefix/successor tokens
[2023-01-28T21:38:24.269634 | 28.045963s] Parsed prefixes and successors in corpus in 27923.90ms
[2023-01-28T21:38:27.680794 | 31.457128s] Found 858,920 prefixes (77.0% of possible)
[2023-01-28T21:38:34.836321 | 38.612656s] Found 937,254 patterns (84.0% of possible)
[2023-01-28T21:38:37.554198 | 41.330531s] Memory (roughly) required: 31.9 MB (about 4,177,347 dbl, 8,354,695 fl)
[2023-01-28T21:38:37.554237 | 41.330562s] Sampling and decoding 100000 tokens
[2023-01-28T21:38:38.493836 | 42.270165s] Generation frequency: 9.4 us/tok
[2023-01-28T21:38:38.493869 | 42.270194s] Writing sampled results to sample-results.txt

(for which you can see generated text in results). The compiled c++ is similar,

cpp$ ./efnlp++ -c data/tinywillspeare.txt -m -s -b 10 -g 100000 -o sample-results.txt

or the go

go$ go run *.go -parse \
	-input ../data/tinywillspeare.txt \
	-language ../cpp/language.proto.bin \
	-block 10 \
	-generate 10000 \
	-print=false

(note the use of a language datastructure in proto format).

Our "model" here (in python) is, more or less, 8M doubles worth of "parameters" and "trains" (estimates) in a single process on an (old) macbook in under a minute (for 10-token sequence statistics). Sampling is basically constant time, relying on hashmaps; the example above takes about 0.1ms per character sampled (with zero code optimizations). The (C)EF "model" are a significant inflation of the data size: 1.1MB of data turns into 62.4MB of statistics. But honestly the results aren't that bad. It's junk of course, but on the surface comparable to generative text from a 10M parameter transformer style model applied to the same dataset that trained for 15 minutes on a GPU (tweet here, code here).

The full CLI arg list can be reviewed from

$ python -m efnlp -h

More comprehensive results are detailed in the following tables (see paper/*.pdf for more discussion):

| B | # prefixes | unique pref. | # patterns | unique patt. | avg succ./pref. | memory |
|---:|---:|---:|---:|---:|---:|---:|
| 1 | 65 | 0.0% | 1,403 | 0.1% | 21.6 | 3kB |
| 2 | 1,403 | 0.1% | 11,556 | 1.0% | 8.2 | 36kB |
| 3 | 11,556 | 1.0% | 50,712 | 4.5% | 4.4 | 221kB |
| 4 | 50,712 | 4.5% | 141,021 | 12.6% | 2.8 | 876kB |
| 5 | 141,021 | 12.6% | 283,313 | 25.4% | 2.0 | 2.5MB |
| 7 | 447,352 | 40.1% | 609,659 | 54.7% | 1.4 | 10.1MB |
| 10 | 858,920 | 77.0% | 937,254 | 84.0% | 1.1 | 31.9MB |
| 12 | 991,391 | 88.9% | 1,027,857 | 92.2% | 1.0 | 50.4MB |
| 15 | 1,069,423 | 95.9% | 1,081,060 | 96.9% | 1.0 | 80.6MB |
| 20 | 1,103,358 | 98.9% | 1,106,345 | 99.2% | 1.0 | 133MB |

What follows are comparative parsing/generation speeds (for sampling 1M tokens):

| B | py parse | py t/τ | go parse | go t/τ | c++ parse | c++ t/τ |
|---:|---:|---:|---:|---:|---:|---:|
| 1 | 1.0s | 1.4μs | 49ms | 0.1μs | 131ms | 0.1μs |
| 2 | 2.0s | 1.7μs | 109ms | 0.1μs | 248ms | 0.2μs |
| 3 | 3.3s | 2.1μs | 222ms | 0.2μs | 419ms | 0.3μs |
| 4 | 4.3s | 2.6μs | 361ms | 0.3μs | 612ms | 0.4μs |
| 5 | 6.4s | 3.2μs | 585ms | 0.5μs | 1.1s | 0.5μs |
| 7 | 12.0s | 4.9μs | 1.2s | 0.7μs | 2.0s | 0.7μs |
| 10 | 28.0s | 7.0μs | 2.6s | 0.9μs | 1.9s | 0.8μs |
| 12 | 37.3s | 8.3μs | 4.1s | 1.1μs | 2.5s | 1.0μs |
| 15 | 54.3s | 9.7μs | 5.2s | 1.2μs | 3.2s | 1.0μs |
| 20 | 129.0s | 12.7μs | 8.4s | 1.6μs | 4.4s | 1.3μs |

Note that the repeated "parse" (parse time) and "t/τ" (time per token) columns are for python, go, and c++ respectively. With each we can generate text at O(μs)/token. However, the compiled codes (go and c++) are faster (roughly by an order of magnitude) and seem to scale a bit better with longer sequences. With compiled code we can parse out the (C)EFs in Shakespeare in seconds; python is still "reasonable," though it takes a couple of minutes to parse 20-token sequences.

Usage

python

See efnlp/__main__.py for an example of running the no-doubt-dumb implementations in efnlp/__init__.py.

c++

See cpp/*, with stuff that should be ready for cmake to run through a build. Embarrassingly, pretty much everything is just in main.cpp.

go

See go/*, with stuff that should be ready to build. There's a very brief README.md.
