BPE modification that removes sparsely used intermediate tokens during vocabularisation.

BPE and PickyBPE

Python package for object-oriented BPE vocabularisation, with an extension for PickyBPE. It is the preferred BPE vocabulariser in TkTkT and is adapted from Pavel Chizhov's PickyBPE trainer.


Original README

This repository contains prototype code for the paper "BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training", which was presented at EMNLP 2024.

Training

To train a tokenizer, use the train.py script. For example, the following command trains a PickyBPE tokenizer with a vocabulary size of 8192 and an IoS threshold of 0.9:

$ python scripts/train.py --input_file train.txt --model_file model.json --vocab_size 8192 --threshold 0.9
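To give a feel for what the --threshold value controls, here is a minimal sketch of an IoS check. It assumes that IoS (Intersection over Self) for a token is the fraction of that token's occurrences that fall inside the pair being merged, and that a token is dropped when this fraction reaches the threshold (inclusive comparison assumed). The function names `ios` and `tokens_to_remove` and the toy counts are hypothetical and are not taken from this package.

```python
def ios(pair_freq: int, token_freq: int) -> float:
    """Share of a token's occurrences that lie inside the merged pair (assumed definition)."""
    if token_freq <= 0:
        raise ValueError("token_freq must be positive")
    return pair_freq / token_freq


def tokens_to_remove(left, right, pair_freq, token_freqs, threshold):
    """Return the parts of a merged pair whose IoS reaches the threshold.

    Hypothetical helper: `token_freqs` maps each token to its corpus frequency
    just before the merge of (left, right) is applied.
    """
    removed = []
    for part in (left, right):
        if part in removed:
            continue  # left and right may be the same token
        if ios(pair_freq, token_freqs[part]) >= threshold:
            removed.append(part)
    return removed


# Toy counts: "ommend" almost only ever appears right after "rec".
freqs = {"rec": 1000, "ommend": 100}
candidates = tokens_to_remove("rec", "ommend", pair_freq=95,
                              token_freqs=freqs, threshold=0.9)
print(candidates)
```

With these toy counts, "ommend" has an IoS of 0.95 and is flagged, while "rec" (IoS 0.095) is kept because it is still useful outside the merged pair.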

The complete list of options is:

Args:
  --input_file     Path to the training corpus
  --model_file     Path to save the model
  --vocab_size     Desired vocabulary size
  --threshold      Desired IoS threshold
  --coverage       Relative symbol coverage for the initial vocabulary (default: 0.9999)
  --pad_id         PAD token id (default: 0)
  --unk_id         UNK token id (default: 1)
  --bos_id         BOS token id (default: 2)
  --eos_id         EOS token id (default: 3)
  --logging_step   Frequency of merges logging (default: 200)
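The --coverage option can be pictured with the following sketch. It assumes that "relative symbol coverage" means keeping the most frequent symbols until they account for the requested share of all symbol occurrences, with the remaining rare symbols left for the UNK token. The helper `initial_alphabet` is hypothetical and ignores details such as whitespace handling; the actual trainer may differ.

```python
from collections import Counter


def initial_alphabet(text: str, coverage: float = 0.9999) -> list[str]:
    """Keep the most frequent symbols until `coverage` of all occurrences is reached."""
    counts = Counter(text)
    total = sum(counts.values())
    kept, covered = [], 0
    for symbol, count in counts.most_common():
        if covered / total >= coverage:
            break  # enough of the corpus is covered; rarer symbols are left out
        kept.append(symbol)
        covered += count
    return kept


# Toy corpus: 98 x "a", one "b", one "c".
corpus = "a" * 98 + "b" + "c"
print(initial_alphabet(corpus, coverage=0.9))
print(initial_alphabet(corpus, coverage=1.0))
```

A coverage below 1.0 keeps one-off symbols (stray emoji, rare scripts) out of the initial vocabulary, so they do not take up vocabulary slots.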

Tokenization

To apply the trained Picky BPE model, use the segment.py script. For example:

$ python scripts/segment.py --bpe_model model.json --input_file train.txt --output_file train.tok.txt

The complete list of options is:

Args:
  --model_file    Path to the trained model
  --input_file    Path to the raw corpus
  --output_file   Path to save the tokenized corpus
  --return_type   Whether to output tokens ("str") or ids ("int") (default: "str")
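The effect of --return_type on downstream code can be sketched as below. This assumes that the output file holds one segmented line per input line, with tokens or ids separated by whitespace; the output format is not documented above, so treat the helper `parse_tokenized_line` as a hypothetical illustration only.

```python
def parse_tokenized_line(line: str, return_type: str = "str"):
    """Split one line of segmenter output into tokens ("str") or ids ("int").

    Assumes whitespace-separated items, which is a guess about the file format.
    """
    pieces = line.split()
    if return_type == "str":
        return pieces
    if return_type == "int":
        return [int(piece) for piece in pieces]
    raise ValueError(f"unknown return_type: {return_type!r}")


print(parse_tokenized_line("12 7 431\n", return_type="int"))
print(parse_tokenized_line("he llo wor ld\n"))
```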

Referencing

To cite PickyBPE:

@inproceedings{chizhov-etal-2024-bpe,
    title = "{BPE} Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training",
    author = "Chizhov, Pavel  and
      Arnett, Catherine  and
      Korotkova, Elizaveta  and
      Yamshchikov, Ivan P.",
    editor = "Al-Onaizan, Yaser  and
      Bansal, Mohit  and
      Chen, Yun-Nung",
    booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.emnlp-main.925",
    pages = "16587--16604",
    abstract = "Language models can greatly benefit from efficient tokenization. However, they still mostly utilize the classical Byte-Pair Encoding (BPE) algorithm, a simple and reliable method. BPE has been shown to cause such issues as under-trained tokens and sub-optimal compression that may affect the downstream performance. We introduce PickyBPE, a modified BPE algorithm that carries out vocabulary refinement during tokenizer training by removing merges that leave intermediate {``}junk{''} tokens. Our method improves vocabulary efficiency, eliminates under-trained tokens, and does not compromise text compression. Our experiments show that this method either improves downstream performance or does not harm it.",
}
