Partition Cover Approach to Tokenization
In this work, we formulate tokenization as an optimization objective, show that it is NP-hard via a simple reduction from vertex cover, and propose a polynomial-time greedy algorithm GreedTok. Our formulation naturally relaxes to the well-studied weighted maximum coverage problem which has a simple $(1 - 1/e)$-approximation greedy algorithm.
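To make the relaxation concrete, here is a minimal sketch of the textbook greedy $(1 - 1/e)$-approximation for weighted maximum coverage: repeatedly pick the set with the largest total weight of still-uncovered elements. This is an illustrative toy, not the GreedTok implementation; the set/weight names are made up for the example.

```python
def greedy_max_coverage(sets, weights, k):
    """Greedy (1 - 1/e)-approximation for weighted maximum coverage.

    sets: dict mapping a set name to a set of elements.
    weights: dict mapping each element to its weight.
    k: number of sets to pick.
    """
    covered = set()
    chosen = []
    for _ in range(k):
        best, best_gain = None, 0
        for name, elems in sets.items():
            # Marginal gain: weight of elements this set would newly cover.
            gain = sum(weights[e] for e in elems - covered)
            if gain > best_gain:
                best, best_gain = name, gain
        if best is None:  # nothing left to gain
            break
        chosen.append(best)
        covered |= sets[best]
    return chosen, sum(weights[e] for e in covered)
```

In the tokenization formulation, the "sets" correspond to candidate tokens and the covered weight corresponds to the compression gained by adding each token to the vocabulary.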
| Dataset | $k$ | #uniq words | #candidates | Time | Gain from lazy updates |
|---|---|---|---|---|---|
| UN | 5K | 105,505 | 884,630 | ~6 seconds | x23 |
| arXiv | 5K | 881,233 | 7,625,530 | ~63 seconds | x26 |
| Wiki | 10K | 8,769,943 | 93,243,449 | ~11 minutes | x68 |
| PubMed | 10K | 6,527,614 | 97,870,366 | ~11 minutes | x133 |
| Wiki-chinese | 10K | 7,035,544 | 69,728,860 | ~8.5 minutes | x90 |
| Wiki-japanese | 10K | 2,737,555 | 60,410,961 | ~8.5 minutes | x74 |
| Wiki-korean | 10K | 5,459,833 | 130,927,124 | ~18 minutes | x86 |
The table above shows the time to solve (i.e., obtain a $k$-sized token set) from word counts. Since most of the compute is front-loaded, solving for a larger $k$ adds little extra cost. For detailed logs, compare cpp_logs/{$data}/{$data}.log against cpp_logs/{$data}/{$data}_fast.log.
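The "Gain from lazy updates" column reflects the standard lazy-greedy trick: because marginal gains are non-increasing as coverage grows, a stale gain stored in a max-heap is an upper bound, so most candidates never need re-evaluation. A minimal sketch of that idea on the toy coverage problem (again, not the GreedTok C++ implementation):

```python
import heapq

def lazy_greedy_max_coverage(sets, weights, k):
    """Lazy-greedy weighted max coverage using a max-heap of stale gains."""
    covered = set()
    chosen = []
    # Heap of (-gain, name); stored gains only shrink over time,
    # so they are valid upper bounds on the true marginal gain.
    heap = [(-sum(weights[e] for e in elems), name)
            for name, elems in sets.items()]
    heapq.heapify(heap)
    while heap and len(chosen) < k:
        _, name = heapq.heappop(heap)
        fresh = sum(weights[e] for e in sets[name] - covered)
        if heap and fresh < -heap[0][0]:
            # Stale bound beaten by another candidate: re-insert and retry.
            heapq.heappush(heap, (-fresh, name))
            continue
        if fresh == 0:
            break
        chosen.append(name)
        covered |= sets[name]
    return chosen, sum(weights[e] for e in covered)
```

The output matches the eager greedy selection, but each round typically re-evaluates only a handful of candidates instead of all of them, which is where the x23 to x133 speedups in the table come from.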
Huggingface AutoTokenizer interface
Install the v0.15 release (for transformers >= 4) on Linux-based systems:

```shell
wget "https://github.com/PreferredAI/pcatt/archive/refs/tags/v0.15.zip"
unzip v0.15.zip -d pcatt
cd pcatt
pip install -r requirements.txt
pip install .
```
For "training", either:

```python
from pcatt.hf.greedtok import GreedTok

greedtok = GreedTok().train_new_from_iterator(word_iterator, 100, max_token_length=5, min_word_count=1)
```

or

```python
from pcatt.hf.greedtok import GreedTok

greedtok = GreedTok().train_new_from_counts(word_count_dict, 100, max_token_length=5, min_word_count=1)
```
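A quick sketch of preparing the counts input, assuming `train_new_from_counts` accepts a plain mapping from word to count (the corpus here is a made-up toy):

```python
from collections import Counter

# Hypothetical toy corpus; in practice these counts come from your dataset.
corpus = ["the cat sat on the mat", "the cat ate"]

# Build a {word: count} mapping by whitespace-splitting each line.
word_count_dict = Counter(w for line in corpus for w in line.split())

# Then (assuming pcatt is installed):
# from pcatt.hf.greedtok import GreedTok
# greedtok = GreedTok().train_new_from_counts(word_count_dict, 100,
#                                             max_token_length=5, min_word_count=1)
```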
To use, either:

```python
from pcatt.hf.greedtok import GreedTok

greedtok = GreedTok.from_pretrained(greedtok_file_directory)
```

or

```python
from transformers import AutoTokenizer

import pcatt.hf  # registers GreedTok with AutoTokenizer

greedtok = AutoTokenizer.from_pretrained("greedtok_file_directory")
```
Refer to eval_hf.ipynb for examples and tips. Note that the code in pcatt.hf is licensed under Apache 2.0 (following Huggingface Tokenizers).
Evaluations are in eval_notebook.ipynb.
Citation
```bibtex
@inproceedings{lim2025partitioncoverapproachtokenization,
  title={A Partition Cover Approach to Tokenization},
  author={Jia Peng Lim and Shawn Tan and Davin Choo and Hady W. Lauw},
  year={2025},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
}
```
File details
Details for the file greedtok-0.15-cp314-cp314-manylinux_2_34_x86_64.whl.
File metadata
- Download URL: greedtok-0.15-cp314-cp314-manylinux_2_34_x86_64.whl
- Upload date:
- Size: 7.3 MB
- Tags: CPython 3.14, manylinux: glibc 2.34+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `c6c7c9aaf4d2e836ccb3a8d5d63a2008a8027a75bdf1e3471d12175bf43776fd` |
| MD5 | `891d592806fdcc5f174e47054c346712` |
| BLAKE2b-256 | `f9f98ccfdffad1a00978289ae9c8a715ad7fe5f9763e9779a38dfeee33ec8111` |