Implementation of BPE-knockout, a morphologically informed post-processing step for BPE tokenisers.
BPE-knockout
Repo hosting all the code used for the BPE-knockout paper. Below are the instructions for reproducing and extending the intrinsic evaluations. Extrinsic evaluations are done with RobBERT's framework.
Data
All data is included in the repo because it is freely available elsewhere and carries no license restrictions.
- Morphological decompositions were derived from WebCelex at the Max Planck Institute.
- Language modelling data is derived from OSCAR on HuggingFace.
Running
- Unzip the `.rar` file under `data/compressed/`.
- Run `py main.py` or `python main.py` in a terminal.
Using your own data
You can use datasets other than the ones used for the paper, including datasets in other languages. Here is how you would do that:
- Make sure you have the following files:
  - A word count file from a sufficiently large corpus;
  - A file with morphological decompositions (not necessarily of the same words);
  - Optional: if you don't want to generate a new BPE tokeniser from your word counts, the file(s) that specify your existing BPE tokeniser.
- If your morphological decompositions are not in CELEX format, you still need to write your own parser for the morphology file. Do this in `src/datahandlers/morphology.py` by creating a subclass of the abstract `LemmaMorphology` class.
- In `src/auxiliary/config.py`, create a new function that creates a `ProjectConfig` object declaring the paths to all the relevant files, as well as the name of the relevant `LemmaMorphology` subclass. Use the `setup()` functions as examples.
- In `main.py`, specify this new config.
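
The parser step above can be illustrated with a minimal sketch. It does not reproduce the repo's actual interface: the `LemmaMorphology` class below is a stand-in defined only so the example runs on its own, and its method names (`lemma`, `morphemes`), the `TsvMorphology` subclass, the `parse_lines` helper, and the tab-separated input format are all assumptions made for illustration. Check the abstract class in `src/datahandlers/morphology.py` for the methods you actually need to implement.

```python
from abc import ABC, abstractmethod
from typing import Iterable, Iterator


class LemmaMorphology(ABC):
    """Stand-in for the abstract class in src/datahandlers/morphology.py.

    The method names are hypothetical; the real class may differ.
    """

    @abstractmethod
    def lemma(self) -> str:
        """Return the word this entry describes."""

    @abstractmethod
    def morphemes(self) -> list[str]:
        """Return the word's morphological decomposition."""


class TsvMorphology(LemmaMorphology):
    """Parses one line of an assumed format: lemma<TAB>morph1+morph2+..."""

    def __init__(self, line: str):
        lemma, decomposition = line.rstrip("\n").split("\t")
        self._lemma = lemma
        self._morphemes = decomposition.split("+")

    def lemma(self) -> str:
        return self._lemma

    def morphemes(self) -> list[str]:
        return self._morphemes


def parse_lines(lines: Iterable[str]) -> Iterator[TsvMorphology]:
    """Yield one parsed entry per non-empty line."""
    for line in lines:
        if line.strip():
            yield TsvMorphology(line)


# Made-up sample lines in the assumed format, including a blank line to skip.
sample = ["boekenkast\tboek+en+kast\n", "\n", "huisdeur\thuis+deur\n"]
entries = list(parse_lines(sample))
for entry in entries:
    print(entry.lemma(), entry.morphemes())
```

Keeping the line parsing inside the subclass constructor means the rest of the pipeline only ever sees the abstract interface, so swapping in a different file format should only require a new subclass and a config that names it.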
File details
Details for the file bpe_knockout-2024.2.1.tar.gz.
File metadata
- Download URL: bpe_knockout-2024.2.1.tar.gz
- Upload date:
- Size: 67.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: Hatch/1.16.5 cpython/3.13.12 HTTPX/0.28.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 43c04e58b9b10215fa139f6333714f2dd00cf67c0ea73a980ddfff268fa3931a |
| MD5 | 31937be99c3192f2768cbce0c7d46b7a |
| BLAKE2b-256 | 0bd37b9627682c27005a8f40386c5d578a45cf1f23373206ffe5328c129be138 |
File details
Details for the file bpe_knockout-2024.2.1-py3-none-any.whl.
File metadata
- Download URL: bpe_knockout-2024.2.1-py3-none-any.whl
- Upload date:
- Size: 73.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: Hatch/1.16.5 cpython/3.13.12 HTTPX/0.28.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | db4c15c2802663d7654cf9b61aa3d9a714439ac827ab322f4d15984abbd8b1d2 |
| MD5 | d0b65dab849e5082f16001c38582900e |
| BLAKE2b-256 | 57c419e8825e183bf83bdece922c9f8edeb77a86c32c416a72a06ff15b5adcc7 |