R-BPE: Improving BPE-Tokenizers with Token Reuse
This repository accompanies the paper introducing R-BPE, a lightweight framework for adapting existing Byte-Pair Encoding (BPE) tokenizers to better support a specified target language. The method is demonstrated using Arabic as the target language. R-BPE reuses tokens from user-excluded languages and creates ID-based maps to resolve the new tokens of the chosen language. It is compatible with HuggingFace interfaces and thereby readily applicable to a wide range of existing models.
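The token-reuse idea can be sketched conceptually (an illustration only, not the actual R-BPE implementation; the vocabulary, token values, and helper names below are made up):

```python
# Conceptual sketch of ID reuse: tokens from a user-excluded script keep their
# IDs, but those IDs are remapped to new target-language tokens via an ID-based map.
original_vocab = {"the": 0, "и": 1, "в": 2}   # "и", "в" belong to an excluded script
new_target_tokens = ["مرحبا", "كتاب"]          # new Arabic tokens to introduce

# Reuse the IDs of the excluded tokens for the new target-language tokens.
reusable_ids = [1, 2]
id_map = dict(zip(reusable_ids, new_target_tokens))

def decode(token_id: int) -> str:
    """Resolve an ID, preferring the remapped meaning when the ID was reused."""
    if token_id in id_map:
        return id_map[token_id]
    inverse = {i: tok for tok, i in original_vocab.items()}
    return inverse[token_id]

print(decode(1))  # مرحبا  (reused ID)
print(decode(0))  # the    (untouched ID)
```

In the real framework these maps are built automatically and hidden behind the HuggingFace tokenizer interface; the sketch only shows why no model-side resizing is needed when IDs are reused.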
Overview
The RBPETokenizer orchestrates the entire process of:
- Classifying the languages of vocabulary tokens via TokenClassifier.
- Cleaning training data using DataCleaner.
- Training a new BPE tokenizer with BPETokenizerTrainer.
- Creating mappings between the original and new tokenizer with MappingTokenizer.
- Returning a final RBPETokenizer adapted to the target language.
Installation from GitHub
Using pip
```shell
pip install git+https://github.com/U4RASD/r-bpe.git
```
Using uv
```shell
uv add git+https://github.com/U4RASD/r-bpe.git
```
Installation from Local Directory
Using pip
- Create and activate a virtual environment:
```shell
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```
- Install the package:
```shell
pip install .
```
Using uv
- Install uv if you haven't already:
```shell
curl -LsSf https://astral.sh/uv/install.sh | sh
```
- Create and activate a virtual environment:
```shell
uv venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```
- Install the package:
```shell
uv sync
```
Creating an R-BPE Tokenizer
You can create an R-BPE tokenizer either through the command-line interface (CLI) or programmatically through the Python API.
Configuration Parameters
R-BPE uses the following configuration parameters:
| Parameter | Meaning | Necessity | Default Value |
|---|---|---|---|
| model_id | The HuggingFace model ID of the original tokenizer, e.g. meta-llama/Llama-3.1-8B. | Required | None |
| training_data_dir | The directory where the training data for the new tokenizer is stored. | Required | None |
| clean_data | Whether to clean the training data. Warning: only set to false if you are sure your training data does not include any non-preserved languages. | Required | True |
| cleaned_data_dir | The directory where the cleaned training data for the new tokenizer should be saved. | Optional | None |
| hf_token | The HuggingFace access token. | Required | None |
| min_reusable_count | The minimum number of tokens needed for reuse (threshold h in the paper). The new tokenizer's vocabulary size will be <= min_reusable_count, depending on how many reusable tokens are found in the specified original tokenizer. | Optional | 20000 |
| target_language_scripts | List of the Unicode script names or aliases of the target language. See the table under Specifying Language Scripts for possible values. | Optional | Arabic |
| preserved_languages_scripts | List of the Unicode script names or aliases of the languages that must be preserved. The target language scripts are preserved by default. See the table under Specifying Language Scripts for possible values. | Optional | Latin, Greek |
| special_tokens | Dictionary of custom special token values for the main special tokens: pad_token, unk_token, bos_token, mask_token, sep_token, cls_token. | Optional | None |
| additional_special_tokens | List of additional special tokens the new tokenizer will have. | Optional | None |
| apply_rbpe_arabic_norm | Whether to apply the R-BPE Arabic normalization during encoding. | Optional | True |
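Putting these parameters together, a config file might look like the following sketch (the keys mirror the parameter names above; the flat YAML layout and the placeholder values are assumptions, so adjust them to your setup):

```yaml
# Hypothetical config.yaml for rbpe create-tokenizer
model_id: meta-llama/Llama-3.1-8B
training_data_dir: ./data
clean_data: true
cleaned_data_dir: ./data_cleaned
hf_token: YOUR_TOKEN
min_reusable_count: 20000
target_language_scripts: [arabic]
preserved_languages_scripts: [latin, greek]
apply_rbpe_arabic_norm: true
```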
Using the CLI
You must supply output_dir, the path where the created RBPETokenizer will be saved.
```shell
rbpe create-tokenizer --config path/to/config.yaml --output_dir path/to/tokenizer_output_dir
```
or
```shell
rbpe create-tokenizer --output_dir path/to/tokenizer_output_dir --model_id meta-llama/Llama-3.1-8B --training_data_dir ./data --hf_token YOUR_TOKEN
```
Using the Python API
```python
from rbpe import RBPETokenizer

# From a YAML config file
tokenizer_factory = RBPETokenizer.from_config('path/to/config.yaml')

# Or with explicit parameters
tokenizer_factory = RBPETokenizer(
    model_id='meta-llama/Llama-3.1-8B',
    training_data_dir='./data',
    cleaned_data_dir='./data_cleaned',
    target_language_scripts=['arabic'],
    preserved_languages_scripts=['latin', 'greek'],
)

# Prepare the tokenizer
tokenizer = tokenizer_factory.prepare()
# You can use the tokenizer directly now

# Save the prepared R-BPE tokenizer for future use
tokenizer.save_pretrained('./rbpe_llama3_8b_tokenizer')
```
Using an R-BPE tokenizer
Once you have created your R-BPE tokenizer, you can use it the same way you use any HuggingFace tokenizer:
```python
from rbpe import RBPETokenizer

tokenizer = RBPETokenizer.from_pretrained('./rbpe_llama3_8b_tokenizer')

text = 'مرحبا'
encoded = tokenizer(text)
decoded = tokenizer.decode(encoded['input_ids'])
print('Encoded:', encoded)
print('Decoded:', decoded)
```
Specifying Language Scripts
Language script specification is case-insensitive. The following table shows all possible values you can use, derived from the Unicode 17 data:
| Script Name | Script Alias |
|---|---|
| adlam | adlm |
| ahom | ahom |
| anatolian_hieroglyphs | hluw |
| arabic | arab |
| armenian | armn |
| avestan | avst |
| balinese | bali |
| bamum | bamu |
| bassa_vah | bass |
| batak | batk |
| bengali | beng |
| beria_erfe | berf |
| bhaiksuki | bhks |
| bopomofo | bopo |
| brahmi | brah |
| braille | brai |
| buginese | bugi |
| buhid | buhd |
| canadian_aboriginal | cans |
| carian | cari |
| caucasian_albanian | aghb |
| chakma | cakm |
| cham | cham |
| cherokee | cher |
| chorasmian | chrs |
| common | zyyy |
| coptic | copt |
| cuneiform | xsux |
| cypriot | cprt |
| cypro_minoan | cpmn |
| cyrillic | cyrl |
| deseret | dsrt |
| devanagari | deva |
| dives_akuru | diak |
| dogra | dogr |
| duployan | dupl |
| egyptian_hieroglyphs | egyp |
| elbasan | elba |
| elymaic | elym |
| ethiopic | ethi |
| garay | gara |
| georgian | geor |
| glagolitic | glag |
| gothic | goth |
| grantha | gran |
| greek | grek |
| gujarati | gujr |
| gunjala_gondi | gong |
| gurmukhi | guru |
| gurung_khema | gukh |
| han | hani |
| hangul | hang |
| hanifi_rohingya | rohg |
| hanunoo | hano |
| hatran | hatr |
| hebrew | hebr |
| hiragana | hira |
| imperial_aramaic | armi |
| inherited | zinh |
| inscriptional_pahlavi | phli |
| inscriptional_parthian | prti |
| javanese | java |
| kaithi | kthi |
| kannada | knda |
| katakana | kana |
| katakana_or_hiragana | hrkt |
| kawi | kawi |
| kayah_li | kali |
| kharoshthi | khar |
| khitan_small_script | kits |
| khmer | khmr |
| khojki | khoj |
| khudawadi | sind |
| kirat_rai | krai |
| lao | laoo |
| latin | latn |
| lepcha | lepc |
| limbu | limb |
| linear_a | lina |
| linear_b | linb |
| lisu | lisu |
| lycian | lyci |
| lydian | lydi |
| mahajani | mahj |
| makasar | maka |
| malayalam | mlym |
| mandaic | mand |
| manichaean | mani |
| marchen | marc |
| masaram_gondi | gonm |
| medefaidrin | medf |
| meetei_mayek | mtei |
| mende_kikakui | mend |
| meroitic_cursive | merc |
| meroitic_hieroglyphs | mero |
| miao | plrd |
| modi | modi |
| mongolian | mong |
| mro | mroo |
| multani | mult |
| myanmar | mymr |
| nabataean | nbat |
| nag_mundari | nagm |
| nandinagari | nand |
| new_tai_lue | talu |
| newa | newa |
| nko | nkoo |
| nushu | nshu |
| nyiakeng_puachue_hmong | hmnp |
| ogham | ogam |
| ol_chiki | olck |
| ol_onal | onao |
| old_hungarian | hung |
| old_italic | ital |
| old_north_arabian | narb |
| old_permic | perm |
| old_persian | xpeo |
| old_sogdian | sogo |
| old_south_arabian | sarb |
| old_turkic | orkh |
| old_uyghur | ougr |
| oriya | orya |
| osage | osge |
| osmanya | osma |
| pahawh_hmong | hmng |
| palmyrene | palm |
| pau_cin_hau | pauc |
| phags_pa | phag |
| phoenician | phnx |
| psalter_pahlavi | phlp |
| rejang | rjng |
| runic | runr |
| samaritan | samr |
| saurashtra | saur |
| sharada | shrd |
| shavian | shaw |
| siddham | sidd |
| sidetic | sidt |
| signwriting | sgnw |
| sinhala | sinh |
| sogdian | sogd |
| sora_sompeng | sora |
| soyombo | soyo |
| sundanese | sund |
| sunuwar | sunu |
| syloti_nagri | sylo |
| syriac | syrc |
| tagalog | tglg |
| tagbanwa | tagb |
| tai_le | tale |
| tai_tham | lana |
| tai_viet | tavt |
| tai_yo | tayo |
| takri | takr |
| tamil | taml |
| tangsa | tnsa |
| tangut | tang |
| telugu | telu |
| thaana | thaa |
| thai | thai |
| tibetan | tibt |
| tifinagh | tfng |
| tirhuta | tirh |
| todhri | todr |
| tolong_siki | tols |
| toto | toto |
| tulu_tigalari | tutg |
| ugaritic | ugar |
| unknown | zzzz |
| vai | vaii |
| vithkuqi | vith |
| wancho | wcho |
| warang_citi | wara |
| yezidi | yezi |
| yi | yiii |
| zanabazar_square | zanb |
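As a quick illustration of the case-insensitive matching described above, a script name or alias can be resolved like this (a sketch only: SCRIPT_ALIASES holds just a few rows from the table, and resolve_script is a hypothetical helper, not part of the rbpe API):

```python
# A few (script name -> alias) rows from the table above.
SCRIPT_ALIASES = {
    "adlam": "adlm",
    "arabic": "arab",
    "greek": "grek",
    "latin": "latn",
    "cyrillic": "cyrl",
}

def resolve_script(value: str) -> str:
    """Return the four-letter alias for a script name or alias, ignoring case."""
    v = value.strip().lower()
    if v in SCRIPT_ALIASES:
        return SCRIPT_ALIASES[v]        # matched a script name
    if v in SCRIPT_ALIASES.values():
        return v                        # already an alias
    raise ValueError(f"unknown script: {value!r}")

print(resolve_script("Arabic"))  # arab
print(resolve_script("LATN"))    # latn
```

Because matching is case-insensitive, 'Arabic', 'arabic', and 'ARAB' all refer to the same script in target_language_scripts and preserved_languages_scripts.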