Implementation of BPE-knockout, a morphologically informed post-processing step for BPE tokenisers.

Repo hosting all the code used for the NAACL 2024 paper "BPE-knockout: Pruning Pre-existing BPE Tokenisers with Backwards-compatible Morphological Semi-supervision".

Below are the instructions for reproducing and extending the intrinsic evaluations. Extrinsic evaluations are done with RobBERT's framework. The pre-trained model checkpoints are available on the HuggingFace Hub.

HuggingFace compatibility

If you are used to working with the HuggingFace suite for language modelling and tokenisation, this is your lucky day! You can incorporate BPE-knockout anywhere you're already using a BPE tokeniser loaded from HuggingFace, with only 2 extra imports and 2 more lines of code. For example, if you're using roberta-base's English tokeniser, you would run:

# Load HuggingFace object
from transformers import AutoTokenizer
hf_bpe_tokeniser = AutoTokenizer.from_pretrained("roberta-base")

# Construct TkTkT object
from tktkt.models.bpe.knockout import BPEKnockout
tktkt_bpek_tokeniser = BPEKnockout.fromHuggingFace(hf_bpe_tokeniser, "English")

# Convert back to HuggingFace
from tktkt.interfaces.huggingface import TktktToHuggingFace
hf_bpek_tokeniser = TktktToHuggingFace(tktkt_bpek_tokeniser, specials_from=hf_bpe_tokeniser)

The resulting object is indeed a HuggingFace tokeniser, but internally it works using BPE-knockout.
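To see what knockout changes in practice, one option is to compare the segmentations of the original and the converted tokeniser word by word. The sketch below is not part of BPE-knockout or TkTkT: `compare_tokenisations` is a hypothetical helper that only assumes both objects expose the `tokenize(text)` method that HuggingFace tokenisers provide.

```python
from typing import Iterable, Protocol


class HasTokenize(Protocol):
    """Anything with a HuggingFace-style tokenize(text) -> list of token strings."""
    def tokenize(self, text: str) -> list[str]: ...


def compare_tokenisations(
    before: HasTokenize,
    after: HasTokenize,
    words: Iterable[str],
) -> dict[str, tuple[list[str], list[str]]]:
    """Return only the words whose segmentation differs between the two tokenisers."""
    differences = {}
    for word in words:
        old_tokens = before.tokenize(word)
        new_tokens = after.tokenize(word)
        if old_tokens != new_tokens:
            # Keep both segmentations side by side for inspection.
            differences[word] = (old_tokens, new_tokens)
    return differences
```

With the objects from the example above, calling `compare_tokenisations(hf_bpe_tokeniser, hf_bpek_tokeniser, words)` on a list of words of your choice would return the ones that BPE-knockout segments differently.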

Installing

Minimal package

If you are only interested in using the BPE-knockout package (including our English, German and Dutch BPE tokenisers and the respective morphological data loaders, but not including corpus word counts) and not in running the experiments from the paper, you likely just want to run:

pip install "bpe_knockout[github] @ git+https://github.com/bauwenst/BPE-knockout.git"

As the example above shows, user-friendly wrappers for BPE-knockout are provided by the TkTkT package, which may interest you more than the core algorithm and configuration code provided here. Installing either package automatically installs the other.

Full experiments, editable code

If you want to run the experiments from the paper or need the word-count files, clone the whole repository and install it in editable mode, so that Python uses the code in your cloned folder instead of copying it to your global or virtual site-packages directory. In that case, run:

git clone https://github.com/bauwenst/BPE-knockout.git
cd BPE-knockout
pip install -e ".[github]"

Warning:

  • If you're using conda or venv, don't forget to activate your environment before running any calls to pip install.
  • If you have an editable installation of my other packages TkTkT and/or Fiject and would like to keep it, do not include the [github] suffix.
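After either installation route, a quick way to confirm that both packages ended up in the active environment is to query the installed distribution metadata. This is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper, and the distribution name `tktkt` is an assumption based on the import name used earlier.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# "bpe_knockout" is the name used in the pip command; "tktkt" is assumed.
for name in ("bpe_knockout", "tktkt"):
    found = installed_version(name)
    print(f"{name}: {found if found is not None else 'not installed'}")
```

If a package is reported as not installed, the most likely cause is that pip ran outside the environment you intended to use.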

Running experiments

Given that you have an editable install, follow these steps to reproduce the paper results:

  1. Extract the .rar archive under data/compressed/.
  2. Run py tst/main.py or python tst/main.py in a terminal.

Using your own data

It is possible to use other datasets (even other languages) than the ones used for the paper. Here is how you would do that:

  1. Make sure you have the following files:
    1. A word-count tab-separated file from a sufficiently large corpus;
    2. A file with morphological decompositions (not necessarily of the same words);
    3. Optional: if you don't want to generate a new BPE tokeniser from your word counts, the file(s) that specify your existing BPE tokeniser.
  2. If your morphological decompositions are not in CELEX format, you still need to write your own parser for the morphology file. Do this in src/bpe_knockout/datahandlers/morphology.py by creating a subclass of the abstract LemmaMorphology class.
  3. In src/bpe_knockout/project/config.py, create a new function that creates a ProjectConfig object declaring the paths to all the relevant files, as well as the name of the relevant LemmaMorphology subclass. Use the setup() functions as examples.
  4. In tst/main.py, import this new config.
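Before wiring a new word-count file into a config, it can help to check that it parses as expected. The sketch below assumes one `word<TAB>count` pair per line, with the word in the first column; that column order is an assumption, not something this README specifies. `load_word_counts` is a hypothetical helper and not part of the package.

```python
import csv
from collections import Counter
from pathlib import Path


def load_word_counts(path: Path) -> Counter:
    """Read `word<TAB>count` lines into a Counter, summing repeated words."""
    counts = Counter()
    with open(path, encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps quotation marks inside words as literal characters.
        for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
            if len(row) != 2:
                continue  # skip blank or malformed lines
            word, count = row
            counts[word] += int(count)
    return counts
```

Calling `load_word_counts(Path("my_counts.tsv")).most_common(10)` on your own file (the file name here is a placeholder) gives a quick look at the most frequent words and surfaces encoding or column-order problems early.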

Data licenses

All data is included in the repo, because it is freely obtainable elsewhere and is not subject to a license.

Citation

If you use BPE-knockout in your own work, cite the paper using e.g.:

@inproceedings{bauwens-delobelle-2024-bpe,
    title = "{BPE}-knockout: Pruning Pre-existing {BPE} Tokenisers with Backwards-compatible Morphological Semi-supervision",
    author = "Bauwens, Thomas  and  Delobelle, Pieter",
    booktitle = "Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
    month = jun,
    year = "2024",
    address = "Mexico City, Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.naacl-long.324",
    pages = "5810--5832"
}
