Partition Cover Approach to Tokenization
In this work, we formulate tokenization as an optimization objective, show that it is NP-hard via a simple reduction from vertex cover, and propose a polynomial-time greedy algorithm GreedTok. Our formulation naturally relaxes to the well-studied weighted maximum coverage problem which has a simple $(1 - 1/e)$-approximation greedy algorithm.
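To make the relaxation concrete, here is a minimal sketch of the textbook greedy $(1 - 1/e)$-approximation for weighted maximum coverage: repeatedly pick the set with the largest total weight of still-uncovered elements. This is an illustrative toy, not the GreedTok implementation; the set/weight names are made up for the example.

```python
def greedy_max_coverage(sets, weights, k):
    """Greedy (1 - 1/e)-approximation for weighted maximum coverage.

    sets: dict mapping a set name to a set of elements.
    weights: dict mapping each element to its weight.
    k: number of sets to pick.
    """
    covered = set()
    chosen = []
    for _ in range(k):
        best, best_gain = None, 0
        for name, elems in sets.items():
            # Marginal gain: weight of elements this set would newly cover.
            gain = sum(weights[e] for e in elems - covered)
            if gain > best_gain:
                best, best_gain = name, gain
        if best is None:  # nothing left to gain
            break
        chosen.append(best)
        covered |= sets[best]
    return chosen, sum(weights[e] for e in covered)
```

In the tokenization formulation, the "sets" correspond to candidate tokens and the covered weight corresponds to the compression gained by adding each token to the vocabulary.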
| Dataset | $k$ | #uniq words | #candidates | Time | Gain from lazy updates |
|---|---|---|---|---|---|
| UN | 5K | 105,505 | 884,630 | ~6 seconds | x23 |
| arXiv | 5K | 881,233 | 7,625,530 | ~63 seconds | x26 |
| Wiki | 10K | 8,769,943 | 93,243,449 | ~11 minutes | x68 |
| PubMed | 10K | 6,527,614 | 97,870,366 | ~11 minutes | x133 |
| Wiki-chinese | 10K | 7,035,544 | 69,728,860 | ~8.5 minutes | x90 |
| Wiki-japanese | 10K | 2,737,555 | 60,410,961 | ~8.5 minutes | x74 |
| Wiki-korean | 10K | 5,459,833 | 130,927,124 | ~18 minutes | x86 |
The table above shows the time to solve (i.e., obtain a $k$-sized token set) from word counts. Since most of the compute is front-loaded, solving for a larger $k$ adds little extra cost. For detailed logs, compare cpp_logs/{$data}/{$data}.log against cpp_logs/{$data}/{$data}_fast.log.
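The "Gain from lazy updates" column reflects the standard lazy-greedy trick: because marginal gains are non-increasing as coverage grows, a stale gain stored in a max-heap is an upper bound, so most candidates never need re-evaluation. A minimal sketch of that idea on the toy coverage problem (again, not the GreedTok C++ implementation):

```python
import heapq

def lazy_greedy_max_coverage(sets, weights, k):
    """Lazy-greedy weighted max coverage using a max-heap of stale gains."""
    covered = set()
    chosen = []
    # Heap of (-gain, name); stored gains only shrink over time,
    # so they are valid upper bounds on the true marginal gain.
    heap = [(-sum(weights[e] for e in elems), name)
            for name, elems in sets.items()]
    heapq.heapify(heap)
    while heap and len(chosen) < k:
        _, name = heapq.heappop(heap)
        fresh = sum(weights[e] for e in sets[name] - covered)
        if heap and fresh < -heap[0][0]:
            # Stale bound beaten by another candidate: re-insert and retry.
            heapq.heappush(heap, (-fresh, name))
            continue
        if fresh == 0:
            break
        chosen.append(name)
        covered |= sets[name]
    return chosen, sum(weights[e] for e in covered)
```

The output matches the eager greedy selection, but each round typically re-evaluates only a handful of candidates instead of all of them, which is where the x23 to x133 speedups in the table come from.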
Huggingface AutoTokenizer interface
Install the v0.15 release (for transformers >= 4) on Linux-based systems:

```shell
wget "https://github.com/PreferredAI/pcatt/archive/refs/tags/v0.15.zip"
unzip v0.15.zip -d pcatt
cd pcatt
pip install -r requirements.txt
pip install .
```
For "training", either:

```python
from pcatt.hf.greedtok import GreedTok

greedtok = GreedTok().train_new_from_iterator(word_iterator, 100, max_token_length=5, min_word_count=1)
```

or

```python
from pcatt.hf.greedtok import GreedTok

greedtok = GreedTok().train_new_from_counts(word_count_dict, 100, max_token_length=5, min_word_count=1)
```
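A quick sketch of preparing the counts input, assuming `train_new_from_counts` accepts a plain mapping from word to count (the corpus here is a made-up toy):

```python
from collections import Counter

# Hypothetical toy corpus; in practice these counts come from your dataset.
corpus = ["the cat sat on the mat", "the cat ate"]

# Build a {word: count} mapping by whitespace-splitting each line.
word_count_dict = Counter(w for line in corpus for w in line.split())

# Then (assuming pcatt is installed):
# from pcatt.hf.greedtok import GreedTok
# greedtok = GreedTok().train_new_from_counts(word_count_dict, 100,
#                                             max_token_length=5, min_word_count=1)
```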
To use, either:

```python
from pcatt.hf.greedtok import GreedTok

greedtok = GreedTok.from_pretrained(greedtok_file_directory)
```

or

```python
from transformers import AutoTokenizer

import pcatt.hf  # registers GreedTok with AutoTokenizer

greedtok = AutoTokenizer.from_pretrained("greedtok_file_directory")
```
Refer to eval_hf.ipynb for examples and tips. Note that the code in pcatt.hf is licensed under Apache 2.0 (following Huggingface Tokenizers).
Evaluations are in eval_notebook.ipynb.
Citation
```bibtex
@inproceedings{lim2025partitioncoverapproachtokenization,
  title={A Partition Cover Approach to Tokenization},
  author={Jia Peng Lim and Shawn Tan and Davin Choo and Hady W. Lauw},
  year={2025},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
}
```
File details
Details for the file greedtok-0.15-cp314-cp314-manylinux_2_34_x86_64.whl.
File metadata
- Download URL: greedtok-0.15-cp314-cp314-manylinux_2_34_x86_64.whl
- Upload date:
- Size: 7.3 MB
- Tags: CPython 3.14, manylinux: glibc 2.34+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `c6c7c9aaf4d2e836ccb3a8d5d63a2008a8027a75bdf1e3471d12175bf43776fd` |
| MD5 | `891d592806fdcc5f174e47054c346712` |
| BLAKE2b-256 | `f9f98ccfdffad1a00978289ae9c8a715ad7fe5f9763e9779a38dfeee33ec8111` |