
A Partition Cover Approach to Tokenization


NeurIPS 2025 | arXiv


In this work, we formulate tokenization as an optimization objective, show that the resulting problem is NP-hard via a simple reduction from vertex cover, and propose GreedTok, a polynomial-time greedy algorithm. Our formulation naturally relaxes to the well-studied weighted maximum coverage problem, which admits a simple greedy $(1 - 1/e)$-approximation algorithm.
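To make the relaxation concrete, here is a minimal sketch of the classic greedy $(1 - 1/e)$-approximation for weighted maximum coverage: pick $k$ sets, each time taking the set whose still-uncovered elements carry the most weight. This is the textbook algorithm, not the GreedTok implementation; the set/weight inputs are illustrative.

```python
def greedy_max_coverage(sets, weights, k):
    """Greedy (1 - 1/e)-approximation for weighted maximum coverage.

    sets:    dict mapping a set's name to its set of elements
    weights: dict mapping each element to its weight
    k:       number of sets to pick
    """
    covered = set()
    chosen = []
    for _ in range(k):
        best, best_gain = None, 0
        for name, elems in sets.items():
            # Marginal gain: total weight of elements not yet covered.
            gain = sum(weights[e] for e in elems - covered)
            if gain > best_gain:
                best, best_gain = name, gain
        if best is None:  # no set adds positive weight; stop early
            break
        chosen.append(best)
        covered |= sets[best]
    return chosen, covered
```

In the tokenization setting, one can think of each candidate token as a "set" covering the character-pair positions it spans, weighted by word counts.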

| Dataset | $k$ | #uniq words | #candidates | Time | Gain from lazy updates |
|---|---|---|---|---|---|
| UN | 5K | 105,505 | 884,630 | ~6 seconds | x23 |
| arXiv | 5K | 881,233 | 7,625,530 | ~63 seconds | x26 |
| Wiki | 10K | 8,769,943 | 93,243,449 | ~11 minutes | x68 |
| PubMed | 10K | 6,527,614 | 97,870,366 | ~11 minutes | x133 |
| Wiki-chinese | 10K | 7,035,544 | 69,728,860 | ~8.5 minutes | x90 |
| Wiki-japanese | 10K | 2,737,555 | 60,410,961 | ~8.5 minutes | x74 |
| Wiki-korean | 10K | 5,459,833 | 130,927,124 | ~18 minutes | x86 |

The table above shows the time to solve (i.e., obtain a $k$-sized token set) from word counts. Since most of the compute is front-loaded, solving for a larger $k$ is trivial. For detailed logs, compare cpp_logs/{$data}/{$data}.log against cpp_logs/{$data}/{$data}_fast.log.
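The "lazy updates" speedups in the table exploit submodularity: a candidate's marginal gain can only shrink as coverage grows, so a previously computed gain is a valid upper bound. A lazy greedy loop therefore keeps stale gains in a max-heap and re-evaluates only the current top, instead of rescoring every candidate each round. A minimal sketch of that pattern (not the C++ implementation in the repo):

```python
import heapq

def lazy_greedy(sets, weights, k):
    """Lazy greedy for weighted maximum coverage using a max-heap of
    stale marginal gains (valid upper bounds by submodularity)."""
    covered = set()
    chosen = []
    # heapq is a min-heap, so store negated gains.
    heap = [(-sum(weights[e] for e in elems), name)
            for name, elems in sets.items()]
    heapq.heapify(heap)
    while heap and len(chosen) < k:
        _, name = heapq.heappop(heap)
        # Refresh this candidate's gain against the current cover.
        gain = sum(weights[e] for e in sets[name] - covered)
        if not heap or gain >= -heap[0][0]:
            # Fresh gain still beats every (upper-bounded) rival: take it.
            if gain == 0:
                break
            chosen.append(name)
            covered |= sets[name]
        else:
            # Otherwise push it back with its refreshed gain.
            heapq.heappush(heap, (-gain, name))
    return chosen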

Huggingface AutoTokenizer interface

Install version v0.15 (for transformers >= 4); on Linux-based systems:

wget "https://github.com/PreferredAI/pcatt/archive/refs/tags/v0.15.zip"
unzip v0.15.zip -d pcatt
cd pcatt
pip install -r requirements.txt
pip install .

To train, use either:

from pcatt.hf.greedtok import GreedTok
greedtok = GreedTok().train_new_from_iterator(word_iterator, 100, max_token_length=5, min_word_count=1)

or

from pcatt.hf.greedtok import GreedTok
greedtok = GreedTok().train_new_from_counts(word_count_dict, 100, max_token_length=5, min_word_count=1)
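For illustration, here is one way to build a word-count mapping of the kind `train_new_from_counts` might expect (assuming it accepts a plain `{word: count}` dict; the corpus below is a toy example):

```python
from collections import Counter

# Hypothetical toy corpus; in practice these counts would come from
# a large text collection.
corpus = "the cat sat on the mat the cat".split()
word_count_dict = dict(Counter(corpus))
# e.g. {"the": 3, "cat": 2, "sat": 1, "on": 1, "mat": 1}
```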

To load a trained tokenizer, use either:

from pcatt.hf.greedtok import GreedTok
greedtok = GreedTok.from_pretrained(greedtok_file_directory)

or

from transformers import AutoTokenizer
import pcatt.hf
greedtok = AutoTokenizer.from_pretrained("greedtok_file_directory")

Refer to eval_hf.ipynb for examples and tips. Note that the code in pcatt.hf is licensed under Apache 2.0 (following Huggingface Tokenizers).

Evaluations are in eval_notebook.ipynb.

Citation

@inproceedings{lim2025partitioncoverapproachtokenization,
      title={A Partition Cover Approach to Tokenization},
      author={Jia Peng Lim and Shawn Tan and Davin Choo and Hady W. Lauw},
      year={2025},
      booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
}
