Implementation of BPE-knockout, a morphologically informed post-processing step for BPE tokenisers.

BPE-knockout

Repo hosting all the code used for the BPE-knockout paper. Below are the instructions for reproducing and extending the intrinsic evaluations. Extrinsic evaluations are done with RobBERT's framework.

Data

All data is included in the repo, because it is freely obtainable elsewhere and not subject to any license.

Running

  1. Extract the .rar archive under data/compressed/.
  2. Run py main.py or python main.py in a terminal.

Using your own data

You can use datasets other than the ones used for the paper, including datasets in other languages. To do so:

  1. Make sure you have the following files:
    1. A word count file from a sufficiently large corpus;
    2. A file with morphological decompositions (not necessarily of the same words);
    3. Optional: if you don't want to generate a new BPE tokeniser from your word counts, the file(s) that specify your existing BPE tokeniser.
  2. If your morphological decompositions are not in CELEX format, write your own parser for the morphology file. Do this in src/datahandlers/morphology.py by creating a subclass of the abstract LemmaMorphology class.
  3. In src/auxiliary/config.py, add a new function that builds a ProjectConfig object declaring the paths to all the relevant files, as well as the name of the relevant LemmaMorphology subclass. Use the setup() functions as examples.
  4. In main.py, specify this new config.
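
As a rough illustration of step 2, the sketch below parses a made-up tab-separated morphology file. Everything in it is an assumption for illustration only: LemmaMorphologyStandIn is a hypothetical stand-in for the repo's abstract LemmaMorphology class (its real abstract methods are defined in src/datahandlers/morphology.py and may well differ), and the file layout (lemma, a tab, then morphemes joined by "+") is invented. Check the actual base class and the existing CELEX parser before writing your own.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Iterable, Iterator, List


class LemmaMorphologyStandIn(ABC):
    """HYPOTHETICAL stand-in for the repo's abstract LemmaMorphology class.
    The real class may require different methods; this only shows the pattern."""

    @abstractmethod
    def lemma(self) -> str:
        """Return the surface form of the word."""

    @abstractmethod
    def morphemes(self) -> List[str]:
        """Return the morphological decomposition of the word."""


@dataclass
class TsvMorphology(LemmaMorphologyStandIn):
    """Parser for one line of an ASSUMED format: lemma<TAB>morph1+morph2+..."""
    raw_lemma: str
    raw_split: str

    def lemma(self) -> str:
        return self.raw_lemma.strip()

    def morphemes(self) -> List[str]:
        # Drop empty pieces so "un++happy" does not yield an empty morpheme.
        return [m for m in self.raw_split.strip().split("+") if m]


def parse_lines(lines: Iterable[str]) -> Iterator[TsvMorphology]:
    """Yield one TsvMorphology per data line, skipping blanks and '#' comments."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        lemma, split = line.split("\t", maxsplit=1)
        yield TsvMorphology(lemma, split)


# Small in-memory sample instead of a real file, so the sketch runs as-is.
sample = ["# lemma\tmorphemes", "unhappiness\tun+happy+ness", "cats\tcat+s"]
parsed = list(parse_lines(sample))
for entry in parsed:
    print(entry.lemma(), entry.morphemes())
```

In the actual project, the subclass would inherit from LemmaMorphology itself, and its name would then be passed to the ProjectConfig created in step 3.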
