BPE modification that removes sparsely used intermediate tokens during vocabularisation.
BPE and PickyBPE
Python package for object-oriented BPE vocabularisation, with an extension for PickyBPE. Used as the preferred BPE vocabulariser in TkTkT. Adapted from Pavel Chizhov's PickyBPE trainer.
Original README
This repository contains prototype code for the paper "BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training", which was presented at EMNLP 2024.
[ACL Anthology] [arXiv] [BibTeX]
Training
To train a tokenizer, use the train.py script. For example, the following command trains a Picky BPE tokenizer with a vocabulary size of 8192 and an IoS threshold of 0.9:
$ python scripts/train.py --input_file train.txt --model_file model.json --vocab_size 8192 --threshold 0.9
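To make the role of the IoS threshold more concrete, here is a minimal sketch of the check as we understand it from the paper: after a merge, a constituent token is a removal candidate when the share of its occurrences that fall inside the merged pair reaches the threshold. The function names `ios` and `tokens_to_remove` and the toy counts are hypothetical and not part of this package's API; the sketch also ignores the special case of a token merged with itself.

```python
from collections import Counter


def ios(pair_freq: int, token_freq: int) -> float:
    """Intersection over Self: share of a token's occurrences inside the pair."""
    return pair_freq / token_freq if token_freq else 0.0


def tokens_to_remove(left, right, pair_freq, token_freqs, threshold):
    """Return the constituents of a merge whose IoS reaches the threshold."""
    removed = []
    # dict.fromkeys de-duplicates while keeping order (left may equal right).
    for tok in dict.fromkeys((left, right)):
        if ios(pair_freq, token_freqs[tok]) >= threshold:
            removed.append(tok)
    return removed


# Toy counts, made up for illustration only.
token_freqs = Counter({"fo": 100, "r": 5000})
pair_freq = 95  # occurrences of "fo" immediately followed by "r"

removed = tokens_to_remove("fo", "r", pair_freq, token_freqs, threshold=0.9)
print(removed)  # "fo" is almost always followed by "r"; "r" is used widely elsewhere
```

With a threshold of 0.9, only tokens that appear almost exclusively as part of the new merge are dropped, which is why higher thresholds make the trainer more conservative.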
The complete list of options is:
Args:
--input_file Path to the training corpus
--model_file Path to save the model
--vocab_size Desired vocabulary size
--threshold Desired IoS threshold
--coverage Relative symbol coverage for the initial vocabulary (default: 0.9999)
--pad_id PAD token id (default: 0)
--unk_id UNK token id (default: 1)
--bos_id BOS token id (default: 2)
--eos_id EOS token id (default: 3)
--logging_step Number of merges between log messages (default: 200)
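The --coverage option can be illustrated with a short sketch. It assumes that "relative symbol coverage" means keeping the most frequent characters until their cumulative share of all character occurrences reaches the given fraction, similar to character coverage in other tokenizer trainers. The helper `initial_alphabet` is hypothetical and may differ from what train.py actually does.

```python
from collections import Counter


def initial_alphabet(text: str, coverage: float = 0.9999) -> list[str]:
    """Keep the most frequent characters until `coverage` of the text is covered."""
    counts = Counter(text)
    total = sum(counts.values())
    kept, covered = [], 0
    for ch, n in counts.most_common():
        # Stop once the characters kept so far already reach the target share.
        if total == 0 or covered / total >= coverage:
            break
        kept.append(ch)
        covered += n
    return kept


# Toy corpus: "a" makes up 80% of the characters, "b" and "z" 10% each.
alphabet = initial_alphabet("aaaaaaaabz", coverage=0.85)
print(alphabet)
```

Under this assumption, characters left out of the initial alphabet would be the rare ones, which a tokenizer typically maps to the UNK token.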
Tokenization
To apply the trained Picky BPE model, use the segment.py script. For example:
$ python scripts/segment.py --bpe_model model.json --input_file train.txt --output_file train.tok.txt
The complete list of options is:
Args:
--model_file Path to the trained model
--input_file Path to the raw corpus
--output_file Path to save the tokenized corpus
--return_type Whether to output tokens ("str") or ids ("int") (default: "str")
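The two return types differ only in whether each token is written as its string or as its vocabulary id. The sketch below illustrates that difference with a toy vocabulary. The special-token strings, the vocabulary contents, and the helper `to_ids` are assumptions for illustration; only the id defaults (PAD 0, UNK 1, BOS 2, EOS 3) come from the training options listed earlier.

```python
UNK_ID = 1  # default value of --unk_id in the training options


def to_ids(tokens, token_to_id, unk_id=UNK_ID):
    """Map each token to its id, falling back to the UNK id for unknown tokens."""
    return [token_to_id.get(tok, unk_id) for tok in tokens]


# Hypothetical vocabulary; the real one is stored in the trained model file.
toy_vocab = {"<pad>": 0, "<unk>": 1, "<s>": 2, "</s>": 3, "pick": 4, "y": 5}
tokens = ["pick", "y", "bpe"]

str_line = " ".join(tokens)                               # like return type "str"
int_line = " ".join(map(str, to_ids(tokens, toy_vocab)))  # like return type "int"
print(str_line)
print(int_line)
```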
Referencing
To cite PickyBPE:
@inproceedings{chizhov-etal-2024-bpe,
    title = "{BPE} Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training",
    author = "Chizhov, Pavel and
      Arnett, Catherine and
      Korotkova, Elizaveta and
      Yamshchikov, Ivan P.",
    editor = "Al-Onaizan, Yaser and
      Bansal, Mohit and
      Chen, Yun-Nung",
    booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.emnlp-main.925",
    pages = "16587--16604",
    abstract = "Language models can greatly benefit from efficient tokenization. However, they still mostly utilize the classical Byte-Pair Encoding (BPE) algorithm, a simple and reliable method. BPE has been shown to cause such issues as under-trained tokens and sub-optimal compression that may affect the downstream performance. We introduce PickyBPE, a modified BPE algorithm that carries out vocabulary refinement during tokenizer training by removing merges that leave intermediate {``}junk{''} tokens. Our method improves vocabulary efficiency, eliminates under-trained tokens, and does not compromise text compression. Our experiments show that this method either improves downstream performance or does not harm it.",
}