# Tokengrams
This library allows you to efficiently compute $n$-gram statistics for pre-tokenized text corpora used to train large language models. It does this not by explicitly pre-computing the $n$-gram counts for fixed $n$, but by creating a suffix array index which allows you to efficiently compute the count of an $n$-gram on the fly for any $n$.
Our code also allows you to turn your suffix array index into an efficient $n$-gram language model, which can be used to generate text or compute the perplexity of a given text.
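Conceptually, the language model is just conditional counts: the probability of the next token is proportional to the count of the context extended by that token. A minimal sketch of that idea, with a brute-force scan standing in for the suffix-array lookup (names are illustrative, not Tokengrams' API):

```python
from collections import Counter

def next_token_probs(tokens, context):
    # P(t | context) is proportional to count(context + [t]).
    # Tokengrams derives these counts from the suffix array; this sketch
    # scans the corpus instead, which gives the same numbers on small inputs.
    n = len(context)
    counts = Counter(
        tokens[i + n]
        for i in range(len(tokens) - n)
        if tokens[i:i + n] == context
    )
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

corpus = [1, 2, 3, 1, 2, 1]
print(next_token_probs(corpus, [1, 2]))  # {3: 0.5, 1: 0.5}
```

Sampling from this distribution generates text, and averaging the negative log probabilities of a sequence's tokens gives its perplexity.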
The backend is written in Rust, and the Python bindings are generated using PyO3.
## Installation
Currently you need to build and install from source using maturin. We plan to release wheels on PyPI soon.

```shell
pip install maturin
maturin develop
```
## Usage
```python
from tokengrams import MemmapIndex
from transformers import AutoTokenizer

# Create a new index from an on-disk corpus called `document.bin` and save it
# to `pile.idx`
index = MemmapIndex.build(
    "/mnt/ssd-1/pile_preshuffled/standard/document.bin",
    "/mnt/ssd-1/nora/pile.idx",
)

# Get the count of "hello world" in the corpus
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m")
print(index.count(tokenizer.encode("hello world")))

# You can now load the index from disk later using __init__
index = MemmapIndex(
    "/mnt/ssd-1/pile_preshuffled/standard/document.bin",
    "/mnt/ssd-1/nora/pile.idx",
)
```
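The example above reads an existing Pile corpus; to index your own tokens you need a file in a compatible flat layout. The sketch below uses only the standard library and assumes one unsigned 16-bit token ID per position (enough for vocabularies under 65,536, such as Pythia's); this layout is an assumption about the on-disk format, not documented API, so check how your corpus was actually serialized:

```python
from array import array

# Write a toy corpus as a flat sequence of unsigned 16-bit token IDs
# ("H" type code). Byte order is the platform's native order
# (little-endian on x86).
tokens = array("H", [101, 102, 103, 101])
with open("toy_corpus.bin", "wb") as f:
    tokens.tofile(f)

# Read it back to confirm the round trip.
loaded = array("H")
with open("toy_corpus.bin", "rb") as f:
    loaded.fromfile(f, len(tokens))
print(list(loaded))  # [101, 102, 103, 101]
```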
Hashes for tokengrams-0.2.0-cp310-cp310-manylinux_2_31_x86_64.whl:

| Algorithm | Hash digest |
|---|---|
| SHA256 | 471702659984e5761d1f01e52872aec67ca9e6fecfecc2a13a42f5bf2680e89e |
| MD5 | 09dd5e8738a13945a06b6c62c488376c |
| BLAKE2b-256 | 70118987664dcb4b18141a4c11f62f9bf919298066ce88d6a925229bdf41fe94 |