Partition Cover Approach to Tokenization
Project description
A partition cover approach to tokenization
In this work, we formulate tokenization as an optimization objective, show that it is NP-hard via a simple reduction from vertex cover, and propose a polynomial-time greedy algorithm GreedTok. Our formulation naturally relaxes to the well-studied weighted maximum coverage problem which has a simple $(1 - 1/e)$-approximation greedy algorithm.
GreedTok
-
If using python wrapper
a. Using pip (use the lightweight source code w/o data/notebooks):
wget "https://github.com/PreferredAI/pcatt/archive/refs/tags/v0.13.tar.gz" unzip pcatt-0.13.zip -d pcatt cd pcatt pip install -r requirements.txt pip install .b. Or compile manually e.g. (have to specify links)
c++ -O3 -Wall -shared -std=c++20 \ -fPIC $(python3 -m pybind11 --includes) \ -I$CONDA_PREFIX/include/ \ -I$CONDA_PREFIX/include/tbb \ -I$CONDA_PREFIX/include/oneapi \ -L$CONDA_PREFIX/lib/ \ -l tbb \ ./pcatt/greedy_builder.cpp \ -o ./pcatt/greedy_builder$(python3-config --extension-suffix)c. import and use! Examples in eval_tokenizer_example.ipynb
-
If using C++ files directly
a. Install dependencies for C++ code, we use oneTBB to parallelize the code, simplest way is to use Conda or pip:
conda install tbb-develb. Compile greedy_cache.py e.g.:
c++ -O3 -std=c++20 \ -I$CONDA_PREFIX/include/ \ -I$CONDA_PREFIX/include/tbb \ -I$CONDA_PREFIX/include/oneapi \ -L$CONDA_PREFIX/lib/ \ -l tbb \ pcatt/greedy_cache.cpp \ -o pcatt/greedy.exec. Prepare inputs (refer to cpp_inputs for examples):
- counts: a file with '\n' delimited integers
- words: a file with ' ' (space) delimited words
d. Run compiled program (currently looks for domain inputs in fixed path under cpp_inputs/*)
./greedy.exe <domain> <k>e. Now we obtained our ranked token sequence (refer to cpp_outputs for examples):- merges: the number of covers at each step, delimited by '\n'
- tokens: byte sequences in hex-format, delimited by '\n'
Evaluations in eval_notebook.ipynb
Citation
TBD
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
File details
Details for the file greedtok-0.13.tar.gz.
File metadata
- Download URL: greedtok-0.13.tar.gz
- Upload date:
- Size: 10.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.13.1
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
eb3a1b8fa34530563da85f5fef39fafb513a53f88df99d531398c15391e038d5
|
|
| MD5 |
b13b0bdbecbcf8e8f930eb116fa7b6fd
|
|
| BLAKE2b-256 |
96c531250566e1c0af17f671dd65038f3b7c45b9fb66c331f56a244ec73caef7
|