R-BPE: Improving BPE-Tokenizers with Token Reuse
This repository accompanies the paper introducing R-BPE, a lightweight framework for adapting existing Byte-Pair Encoding (BPE) tokenizers to better support a specified target language. The method is demonstrated using Arabic as the target language. R-BPE reuses tokens from user-excluded languages and creates ID-based maps to resolve the new tokens of the chosen language. It is compatible with HuggingFace interfaces and thereby readily applicable to a wide range of existing models.
Overview
The RBPETokenizer orchestrates the entire process of:
- Classifying the languages of vocabulary tokens via TokenClassifier.
- Cleaning training data using DataCleaner.
- Training a new BPE tokenizer with BPETokenizerTrainer.
- Creating mappings between the original and new tokenizer with MappingTokenizer.
- Returning a final RBPETokenizer adapted to the target language.
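To make the reuse idea concrete, here is a small self-contained sketch, not the package's implementation, of how tokens can be grouped by Unicode script so that tokens outside the preserved scripts become reuse candidates. The script-detection heuristic and the preserved set below are illustrative assumptions.
import unicodedata
# Scripts that must not be reused; mirrors the default target (Arabic) and
# preserved (Latin, Greek) scripts. This set is illustrative, not the package's.
PRESERVED_SCRIPTS = {'ARABIC', 'LATIN', 'GREEK', 'DIGIT'}
def token_scripts(token: str) -> set[str]:
    """Collect rough script hints for a token from Unicode character names."""
    scripts = set()
    for ch in token:
        try:
            # e.g. 'ARABIC LETTER MEEM' -> 'ARABIC', 'CYRILLIC SMALL LETTER EM' -> 'CYRILLIC'
            scripts.add(unicodedata.name(ch).split()[0])
        except ValueError:
            pass  # unnamed characters (e.g. some controls) are ignored
    return scripts
def is_reuse_candidate(token: str) -> bool:
    """A token can be reused if none of its characters belong to a preserved script."""
    scripts = token_scripts(token)
    return bool(scripts) and scripts.isdisjoint(PRESERVED_SCRIPTS)
print(is_reuse_candidate('мир'))    # True: Cyrillic is not preserved
print(is_reuse_candidate('hello'))  # False: Latin is preserved
print(is_reuse_candidate('مرحبا'))   # False: Arabic is the target language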
Prerequisites
Installation from PyPI
Using pip
pip install rbpe
Using uv
uv add rbpe
Installation from Local Directory
Using pip
- Create and activate a virtual environment:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install the package:
pip install .
Using uv
- Install uv if you haven't already:
curl -LsSf https://astral.sh/uv/install.sh | sh
- Create and activate a virtual environment:
uv venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install the package:
uv sync
Creating an R-BPE Tokenizer
You can create an R-BPE tokenizer either through the command-line interface (CLI) or programmatically through the Python API.
Configuration Parameters
R-BPE uses the following configuration parameters:
| Parameter | Meaning | Necessity | Default Value |
|---|---|---|---|
| model_id | The HuggingFace model ID of the original tokenizer, e.g. meta-llama/Llama-3.1-8B. | Required | None |
| training_data_dir | The directory where the training data for the new tokenizer is stored. | Required | None |
| clean_data | Whether to clean the training data. Warning: only set to false if you are sure that your training data does not include any non-preserved languages. | Required | True |
| cleaned_data_dir | The directory where the cleaned training data for the new tokenizer should be saved. | Optional | None |
| hf_token | The HuggingFace access token. | Required | None |
| min_reusable_count | The minimum number of tokens needed for reuse (threshold h in the paper). The size of the new tokenizer vocabulary will be <= min_reusable_count, depending on how many reusable tokens are found in the specified original tokenizer. | Optional | 20000 |
| target_language_scripts | List of the Unicode script names or aliases of the target language. See this table for possible values. | Optional | Arabic |
| preserved_languages_scripts | List of the Unicode script names or aliases of the languages that must be preserved. The target language scripts are preserved by default. See this table for possible values. | Optional | Latin, Greek |
| special_tokens | Dictionary of custom values for the main special tokens: pad_token, unk_token, bos_token, mask_token, sep_token, cls_token. | Optional | None |
| additional_special_tokens | List of additional special tokens the new tokenizer will have. | Optional | None |
| apply_rbpe_arabic_norm | Whether to apply the R-BPE Arabic normalization during encoding. | Optional | True |
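The CLI's --config option expects these parameters in a YAML file. A minimal example is sketched below; the flat key layout is an assumption based on the parameter names above, and the values are placeholders.
# config.yaml -- keys mirror the parameters above; layout and values are illustrative
model_id: meta-llama/Llama-3.1-8B
training_data_dir: ./data
cleaned_data_dir: ./data_cleaned
clean_data: true
hf_token: YOUR_TOKEN
min_reusable_count: 20000
target_language_scripts:
  - arabic
preserved_languages_scripts:
  - latin
  - greek
apply_rbpe_arabic_norm: true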
Using the CLI
You must supply output_dir, the path where the created RBPETokenizer will be saved.
rbpe create-tokenizer --config path/to/config.yaml --output_dir path/to/tokenizer_output_dir
or
rbpe create-tokenizer --output_dir path/to/tokenizer_output_dir --model_id meta-llama/Llama-3.1-8B --training_data_dir ./data --hf_token YOUR_TOKEN
Using the Python API
from rbpe import RBPETokenizer
# From a YAML config file
tokenizer_factory = RBPETokenizer.from_config('path/to/config.yaml')
# Or with explicit parameters
tokenizer_factory = RBPETokenizer(
model_id='meta-llama/Llama-3.1-8B',
training_data_dir='./data',
cleaned_data_dir='./data_cleaned',
target_language_scripts=['arabic'],
preserved_languages_scripts=['latin', 'greek'],
)
# Prepare the tokenizer
tokenizer = tokenizer_factory.prepare()
# You can directly use the tokenizer now
# Save the prepared R-BPE tokenizer for future use
tokenizer.save_pretrained('./rbpe_llama3_8b_tokenizer')
Using an R-BPE tokenizer
Once you have created your R-BPE tokenizer, you can use it the same way you use any HuggingFace tokenizer:
from rbpe import RBPETokenizer
tokenizer = RBPETokenizer.from_pretrained('./rbpe_llama3_8b_tokenizer')
text = 'مرحبا'
encoded = tokenizer(text)
decoded = tokenizer.decode(encoded['input_ids'])
print('Encoded:', encoded)
print('Decoded:', decoded)
Shipping an R-BPE tokenizer with a model
When publishing a model trained with an R-BPE tokenizer, copy the contents of the saved tokenizer directory into the model directory. The R-BPE tokenizer_config.json, tokenizer.json, and special_tokens_map.json files overwrite the originals saved with the model.
my-model/
├── config.json
├── model-00001-of-00004.safetensors
├── ...
├── tokenizer_config.json
├── tokenizer.json
├── special_tokens_map.json
├── metadata/
├── new_tokenizer/
└── old_tokenizer/
Once set up this way, AutoModelForCausalLM.from_pretrained(repo_id) and RBPETokenizer.from_pretrained(repo_id) both resolve from the same root.
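One way to perform the copy is with shutil; the paths below are illustrative, so adjust them to your local directories.
import shutil
# Illustrative paths: the saved R-BPE tokenizer and the model directory to publish.
tokenizer_dir = './rbpe_llama3_8b_tokenizer'
model_dir = './my-model'
# Copies metadata/, new_tokenizer/, and old_tokenizer/ into the model directory and
# overwrites tokenizer_config.json, tokenizer.json, and special_tokens_map.json there.
shutil.copytree(tokenizer_dir, model_dir, dirs_exist_ok=True)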
Loading from the Hugging Face Hub
RBPETokenizer.from_pretrained accepts a Hub repo id, just like AutoTokenizer.from_pretrained, and supports the usual cache_dir, token, revision, and local_files_only kwargs.
from rbpe import RBPETokenizer
from transformers import AutoModelForCausalLM
import torch
repo_id = 'user/repo'
model = AutoModelForCausalLM.from_pretrained(repo_id, dtype=torch.bfloat16, device_map='auto')
tokenizer = RBPETokenizer.from_pretrained(repo_id)
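The loading options mentioned above can be passed directly; for example (all values below are placeholders):
tokenizer = RBPETokenizer.from_pretrained(
    repo_id,
    revision='main',         # pin a branch, tag, or commit
    token='hf_...',          # access token for gated or private repos
    cache_dir='./hf_cache',  # where downloaded files are cached
    local_files_only=False,  # set to True to avoid any network access
)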
Citation
If you use R-BPE, please cite:
@inproceedings{hamdan-etal-2025-r,
title = "{R}-{BPE}: Improving {BPE}-Tokenizers with Token Reuse",
author = "Hamdan, Nancy and
Rakan Al Mraikhat, Osama and
Zaraket, Fadi A.",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.emnlp-main.1169/",
doi = "10.18653/v1/2025.emnlp-main.1169",
pages = "22951--22959",
ISBN = "979-8-89176-332-6",
abstract = "This paper presents R-BPE, a lightweight framework for adapting existing Byte-Pair Encoding (BPE) tokenizers to better support a specified target language. It reuses tokens from user-excluded languages and creates ID-based maps to resolve the new tokens of the chosen language. We evaluate R-BPE on Arabic as a target language. R-BPE reduced subword fertility by an average of 24.4{\%} across the LLaMA 3.1 8B, Command R 35B, and Qwen 3 8B models. Applied to LLaMA 3.1 8B in continued pretraining mode, R-BPE yields a 7.33{\%} reduction in training time. On the ArabicMMLU benchmark, the resulting model improved by 5.09 points on five in-domain topics and matched the original model{'}s overall performance. It also preserved performance on EnglishMMLU. R-BPE effectively leverages existing models' tokenizers, embedding layers, and performance to better support target languages without incurring model size changes. We release an R-BPE implementation that is compatible with HuggingFace interfaces and thereby readily applicable to a wide range of existing models at \url{https://acr.ps/1L9GPmL}."
}