
Efficiently computing & storing token n-grams from large corpora


Tokengrams

Tokengrams allows you to efficiently compute $n$-gram statistics for pre-tokenized text corpora used to train large language models. It does this not by explicitly pre-computing the $n$-gram counts for fixed $n$, but by creating a suffix array index which allows you to efficiently compute the count of an $n$-gram on the fly for any $n$.
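
As a toy illustration of the idea (a Python sketch of the general technique, not the library's Rust internals): once the corpus's suffixes are sorted, every occurrence of a query $n$-gram occupies one contiguous block of the suffix array, so its count falls out of two binary searches.

from bisect import bisect_left, bisect_right

# Toy corpus of token ids.
tokens = [1, 2, 3, 1, 2, 4, 1, 2, 3]

# Suffix array: start positions, sorted by the suffix that begins there.
suffix_array = sorted(range(len(tokens)), key=lambda i: tokens[i:])

def count(query):
    # Prefixes of the sorted suffixes, truncated to the query length, are
    # themselves in sorted order, so two bisects find the block of suffixes
    # that start with `query`.
    prefixes = [tokens[i:i + len(query)] for i in suffix_array]
    return bisect_right(prefixes, query) - bisect_left(prefixes, query)

print(count([1, 2]))     # 3
print(count([1, 2, 3]))  # 2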

Our code also allows you to turn your suffix array index into an efficient $n$-gram language model, which can be used to generate text or compute the perplexity of a given text.
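
Concretely, an unsmoothed $n$-gram model estimates each next-token probability as a ratio of two counts that the index can answer on the fly, and perplexity follows from the average log-probability (the library also offers smoothed variants; this is just the standard maximum-likelihood form):

$$P(w_t \mid w_{t-n+1}, \dots, w_{t-1}) = \frac{\operatorname{count}(w_{t-n+1}, \dots, w_t)}{\operatorname{count}(w_{t-n+1}, \dots, w_{t-1})}, \qquad \mathrm{PPL}(w_{1:T}) = \exp\!\left(-\frac{1}{T}\sum_{t=1}^{T} \log P(w_t \mid w_{t-n+1}, \dots, w_{t-1})\right)$$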

The backend is written in Rust, and the Python bindings are generated using PyO3.

Installation

pip install tokengrams

Usage

Preparing data

Use a dataset of u16 or u32 tokens, or prepare one from a HuggingFace dataset.

# Get pre-tokenized dataset
from huggingface_hub import HfApi, hf_hub_download

hf_hub_download(
  repo_id="EleutherAI/pile-standard-pythia-preshuffled", 
  repo_type="dataset", 
  filename="document-00000-of-00020.bin", 
  local_dir="."
)
# Tokenize HF dataset
from tokengrams import tokenize_hf_dataset
from datasets import load_dataset
from transformers import AutoTokenizer

tokenize_hf_dataset(
    dataset=load_dataset("EleutherAI/lambada_openai", "en"),
    tokenizer=AutoTokenizer.from_pretrained("EleutherAI/pythia-160m"),
    output_path="lambada.bin",
    text_key="text",
    append_eod=True,
    workers=1,
)
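
A quick sanity check of the output (assuming the file is a flat array of u16 token ids, which is the expected format for Pythia's roughly 50k-token vocabulary):

# Assumption: lambada.bin is a flat binary file of u16 token ids.
import numpy as np

tokens = np.fromfile("lambada.bin", dtype=np.uint16)
print(tokens.shape, tokens[:10])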

Building an index

from tokengrams import MemmapIndex

# Create a new index from an on-disk corpus of u16 tokens and save it to a .idx file. 
# Set verbose=True to show a progress bar during the index sort.
index = MemmapIndex.build(
    "document-00000-of-00020.bin",
    "document-00000-of-00020.idx",
    vocab=2**16,
    verbose=True
)

# True for any valid index.
print(index.is_sorted())
  
# Get the count of "hello world" in the corpus.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m")
print(index.count(tokenizer.encode("hello world")))

# You can now load the index from disk later using __init__
index = MemmapIndex(
    "document-00000-of-00020.bin",
    "document-00000-of-00020.idx",
    vocab=2**16
)

Using an index

# Count how often each token in the corpus succeeds "hello world".
print(index.count_next(tokenizer.encode("hello world")))

# Parallelize over queries
print(index.batch_count_next(
    [tokenizer.encode("hello world"), tokenizer.encode("hello universe")]
))

# Autoregressively sample 10 tokens using 5-gram language statistics. The initial
# context is drawn from the query, and lower-order n-gram statistics are used
# until the sequence contains at least 5 tokens.
print(index.sample(tokenizer.encode("hello world"), n=5, k=10))

# Parallelize over sequence generations
print(index.batch_sample(tokenizer.encode("hello world"), n=5, k=10, num_samples=20))

# Query whether the corpus contains "hello world"
print(index.contains(tokenizer.encode("hello world")))

# Get all n-grams beginning with "hello world" in the corpus
print(index.positions(tokenizer.encode("hello world")))
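
The raw counts from count_next can be normalized into an empirical next-token distribution; a minimal sketch, assuming count_next returns one count per vocabulary id (as the first comment above suggests):

import numpy as np

# Assumption: count_next returns a list of counts indexed by token id.
counts = np.array(index.count_next(tokenizer.encode("hello world")), dtype=np.float64)
total = counts.sum()
probs = counts / total if total > 0 else counts

# Most likely continuation of "hello world" under the empirical distribution.
print(tokenizer.decode([int(probs.argmax())]), float(probs.max()))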

Scaling

Corpora small enough to fit in memory can use an InMemoryIndex:

from tokengrams import InMemoryIndex

tokens = [0, 1, 2, 3, 4]
index = InMemoryIndex(tokens, vocab=5)

Larger corpora must use a MemmapIndex.

Many systems struggle to memory-map extremely large tables (e.g. 40 billion tokens), causing unexpected bus errors. To prevent this, split the corpus into shards, then use a ShardedMemmapIndex to sort and query the table shard by shard:

from tokengrams import ShardedMemmapIndex
from huggingface_hub import HfApi, hf_hub_download

files = [
    file for file in HfApi().list_repo_files("EleutherAI/pile-standard-pythia-preshuffled", repo_type="dataset")
    if file.endswith('.bin')
]

index_paths = []
for file in files:
    hf_hub_download("EleutherAI/pile-standard-pythia-preshuffled", repo_type="dataset", filename=file, local_dir=".")
    index_paths.append((file, f'{file.removesuffix(".bin")}.idx'))

index = ShardedMemmapIndex.build(index_paths, vocab=2**16, verbose=True)
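
Once built, the sharded index can be queried from Python in the same way as the single-file indexes above; a minimal sketch, assuming ShardedMemmapIndex exposes the same query methods as MemmapIndex:

# Assumption: ShardedMemmapIndex supports the same query methods as MemmapIndex.
print(index.count_next(tokenizer.encode("hello world")))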

Tokens

Tokengrams builds indices from on-disk corpora of either u16 or u32 tokens, supporting a maximum vocabulary size of $2^{32}$. In practice, however, vocabulary size is limited by the length of the largest vector of word-sized elements that the machine can allocate in memory.

Corpora with vocabulary sizes smaller than $2^{16}$ must use u16 tokens.
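
If you already have token ids in Python, a minimal sketch of writing them in the expected on-disk format (assuming the index reads a flat binary array of fixed-width integers in native byte order, as the pre-tokenized .bin files above suggest):

import numpy as np

# Toy example; use np.uint32 instead for vocabularies of 2**16 tokens or more.
token_ids = [17, 4, 2, 31000, 7]
np.array(token_ids, dtype=np.uint16).tofile("corpus.bin")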

Performance

Index build times for in-memory corpora scale inversely with the number of available CPU threads, whereas an index that reads from or writes to a file is likely to be I/O bound.

The time complexities of count_next(query) and sample_unsmoothed(query) are O(n log n), where n is approximately the number of completions for the query. The time complexity of sample_smoothed(query) is O(m n log n), where m is the n-gram order.

[Figure: Sample build times for an I/O-bound index]
[Figure: Sample count_next times for an I/O-bound index]

Development

cargo build
cargo test

Develop Python bindings:

pip install maturin
maturin develop
pytest

Support

The best way to get support is to open an issue on this repo or post in #interp-across-time in the EleutherAI Discord server. If you've used the library and have had a positive (or negative) experience, we'd love to hear from you!

