R-BPE: Improving BPE-Tokenizers with Token Reuse
This repository accompanies the paper introducing R-BPE, a lightweight framework for adapting existing Byte-Pair Encoding (BPE) tokenizers to better support a specified target language. The method is demonstrated using Arabic as the target language. R-BPE reuses tokens from user-excluded languages and creates ID-based maps to resolve the new tokens of the chosen language. It is compatible with HuggingFace interfaces and thereby readily applicable to a wide range of existing models.
Overview
The RBPETokenizer orchestrates the entire process of:
- Classifying the languages of vocabulary tokens via TokenClassifier.
- Cleaning training data using DataCleaner.
- Training a new BPE tokenizer with BPETokenizerTrainer.
- Creating mappings between the original and new tokenizer with MappingTokenizer.
- Returning a final RBPETokenizer adapted to the target language.
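To make the reuse idea concrete, here is a small self-contained sketch, not the package's implementation, of how tokens can be grouped by Unicode script so that tokens outside the preserved scripts become reuse candidates. The script-detection heuristic and the preserved set below are illustrative assumptions.
import unicodedata
# Scripts that must not be reused; mirrors the default target (Arabic) and
# preserved (Latin, Greek) scripts. This set is illustrative, not the package's.
PRESERVED_SCRIPTS = {'ARABIC', 'LATIN', 'GREEK', 'DIGIT'}
def token_scripts(token: str) -> set[str]:
    """Collect rough script hints for a token from Unicode character names."""
    scripts = set()
    for ch in token:
        try:
            # e.g. 'ARABIC LETTER MEEM' -> 'ARABIC', 'CYRILLIC SMALL LETTER EM' -> 'CYRILLIC'
            scripts.add(unicodedata.name(ch).split()[0])
        except ValueError:
            pass  # unnamed characters (e.g. some controls) are ignored
    return scripts
def is_reuse_candidate(token: str) -> bool:
    """A token can be reused if none of its characters belong to a preserved script."""
    scripts = token_scripts(token)
    return bool(scripts) and scripts.isdisjoint(PRESERVED_SCRIPTS)
print(is_reuse_candidate('мир'))    # True: Cyrillic is not preserved
print(is_reuse_candidate('hello'))  # False: Latin is preserved
print(is_reuse_candidate('مرحبا'))   # False: Arabic is the target language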
Prerequisites
Installation from PyPI
Using pip
pip install rbpe
Using uv
uv add rbpe
Installation from Local Directory
Using pip
- Create and activate a virtual environment:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install the package:
pip install .
Using uv
- Install uv if you haven't already:
curl -LsSf https://astral.sh/uv/install.sh | sh
- Create and activate a virtual environment:
uv venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install the package:
uv sync
Creating an R-BPE Tokenizer
You can create an R-BPE tokenizer either through the command-line interface (CLI) or programmatically through the Python API.
Configuration Parameters
R-BPE uses the following configuration parameters:
| Parameter | Meaning | Necessity | Default Value |
|---|---|---|---|
| model_id | The HuggingFace model ID of the original tokenizer, e.g. meta-llama/Llama-3.1-8B. | Required | None |
| training_data_dir | The directory where the training data for the new tokenizer is stored. | Required | None |
| clean_data | Whether to clean the training data. Warning: only set to false if you are sure that your training data does not include any non-preserved languages. | Required | True |
| cleaned_data_dir | The directory where the cleaned training data for the new tokenizer should be saved. | Optional | None |
| hf_token | The HuggingFace access token. | Required | None |
| min_reusable_count | The minimum number of tokens needed for reuse (threshold h in the paper). The size of the new tokenizer vocabulary will be <= min_reusable_count, depending on how many reusable tokens are found in the specified original tokenizer. | Optional | 20000 |
| target_language_scripts | List of the Unicode script names or aliases of the target language. See this table for possible values. | Optional | Arabic |
| preserved_languages_scripts | List of the Unicode script names or aliases of the languages that must be preserved. The target language scripts are preserved by default. See this table for possible values. | Optional | Latin, Greek |
| special_tokens | Dictionary of custom values for the main special tokens: pad_token, unk_token, bos_token, mask_token, sep_token, cls_token. | Optional | None |
| additional_special_tokens | List of additional special tokens the new tokenizer will have. | Optional | None |
| apply_rbpe_arabic_norm | Whether to apply the R-BPE Arabic normalization during encoding. | Optional | True |
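The CLI's --config option expects these parameters in a YAML file. A minimal example is sketched below; the flat key layout is an assumption based on the parameter names above, and the values are placeholders.
# config.yaml -- keys mirror the parameters above; layout and values are illustrative
model_id: meta-llama/Llama-3.1-8B
training_data_dir: ./data
cleaned_data_dir: ./data_cleaned
clean_data: true
hf_token: YOUR_TOKEN
min_reusable_count: 20000
target_language_scripts:
  - arabic
preserved_languages_scripts:
  - latin
  - greek
apply_rbpe_arabic_norm: true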
Using the CLI
You must supply output_dir, the path where the created RBPETokenizer will be saved.
rbpe create-tokenizer --config path/to/config.yaml --output_dir path/to/tokenizer_output_dir
or
rbpe create-tokenizer --output_dir path/to/tokenizer_output_dir --model_id meta-llama/Llama-3.1-8B --training_data_dir ./data --hf_token YOUR_TOKEN
Using the Python API
from rbpe import RBPETokenizer
# From a YAML config file
tokenizer_factory = RBPETokenizer.from_config('path/to/config.yaml')
# Or with explicit parameters
tokenizer_factory = RBPETokenizer(
model_id='meta-llama/Llama-3.1-8B',
training_data_dir='./data',
cleaned_data_dir='./data_cleaned',
target_language_scripts=['arabic'],
preserved_languages_scripts=['latin', 'greek'],
)
# Prepare the tokenizer
tokenizer = tokenizer_factory.prepare()
# You can directly use the tokenizer now
# Save the prepared R-BPE tokenizer for future use
tokenizer.save_pretrained('./rbpe_llama3_8b_tokenizer')
Using an R-BPE tokenizer
Once you have created your R-BPE tokenizer, you can use it the same way you use any HuggingFace tokenizer:
from rbpe import RBPETokenizer
tokenizer = RBPETokenizer.from_pretrained('./rbpe_llama3_8b_tokenizer')
text = 'مرحبا'
encoded = tokenizer(text)
decoded = tokenizer.decode(encoded['input_ids'])
print('Encoded:', encoded)
print('Decoded:', decoded)
Shipping an R-BPE tokenizer with a model
When publishing a model trained with an R-BPE tokenizer, copy the contents of the saved tokenizer directory into the model directory. The R-BPE tokenizer_config.json, tokenizer.json, and special_tokens_map.json files overwrite the originals saved with the model.
my-model/
├── config.json
├── model-00001-of-00004.safetensors
├── ...
├── tokenizer_config.json
├── tokenizer.json
├── special_tokens_map.json
├── metadata/
├── new_tokenizer/
└── old_tokenizer/
Once set up this way, AutoModelForCausalLM.from_pretrained(repo_id) and RBPETokenizer.from_pretrained(repo_id) both resolve from the same root.
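One way to perform the copy is with shutil; the paths below are illustrative, so adjust them to your local directories.
import shutil
# Illustrative paths: the saved R-BPE tokenizer and the model directory to publish.
tokenizer_dir = './rbpe_llama3_8b_tokenizer'
model_dir = './my-model'
# Copies metadata/, new_tokenizer/, and old_tokenizer/ into the model directory and
# overwrites tokenizer_config.json, tokenizer.json, and special_tokens_map.json there.
shutil.copytree(tokenizer_dir, model_dir, dirs_exist_ok=True)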
Loading from the Hugging Face Hub
RBPETokenizer.from_pretrained accepts a Hub repo id, just like AutoTokenizer.from_pretrained, and supports the usual cache_dir, token, revision, and local_files_only kwargs.
from rbpe import RBPETokenizer
from transformers import AutoModelForCausalLM
import torch
repo_id = 'user/repo'
model = AutoModelForCausalLM.from_pretrained(repo_id, dtype=torch.bfloat16, device_map='auto')
tokenizer = RBPETokenizer.from_pretrained(repo_id)
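The loading options mentioned above can be passed directly; for example (all values below are placeholders):
tokenizer = RBPETokenizer.from_pretrained(
    repo_id,
    revision='main',         # pin a branch, tag, or commit
    token='hf_...',          # access token for gated or private repos
    cache_dir='./hf_cache',  # where downloaded files are cached
    local_files_only=False,  # set to True to avoid any network access
)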
Citation
If you use R-BPE, please cite:
@inproceedings{hamdan-etal-2025-r,
title = "{R}-{BPE}: Improving {BPE}-Tokenizers with Token Reuse",
author = "Hamdan, Nancy and
Rakan Al Mraikhat, Osama and
Zaraket, Fadi A.",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.emnlp-main.1169/",
doi = "10.18653/v1/2025.emnlp-main.1169",
pages = "22951--22959",
ISBN = "979-8-89176-332-6",
abstract = "This paper presents R-BPE, a lightweight framework for adapting existing Byte-Pair Encoding (BPE) tokenizers to better support a specified target language. It reuses tokens from user-excluded languages and creates ID-based maps to resolve the new tokens of the chosen language. We evaluate R-BPE on Arabic as a target language. R-BPE reduced subword fertility by an average of 24.4{\%} across the LLaMA 3.1 8B, Command R 35B, and Qwen 3 8B models. Applied to LLaMA 3.1 8B in continued pretraining mode, R-BPE yields a 7.33{\%} reduction in training time. On the ArabicMMLU benchmark, the resulting model improved by 5.09 points on five in-domain topics and matched the original model{'}s overall performance. It also preserved performance on EnglishMMLU. R-BPE effectively leverages existing models' tokenizers, embedding layers, and performance to better support target languages without incurring model size changes. We release an R-BPE implementation that is compatible with HuggingFace interfaces and thereby readily applicable to a wide range of existing models at \url{https://acr.ps/1L9GPmL}."
}