Fast Rust tokenization without Python garbage collection slowdowns
Reason this release was yanked:
imprecise python version spec
Project description
Rust-based Python Tokenizer to Arrow
This project provides a tiny Python extension written in Rust for tokenizing text data directly into Arrow arrays. It uses the tokenizers library from Hugging Face and serializes the output directly into a pyarrow.LargeListArray without exposing any Encoding objects to the Python interpreter.
The primary benefit of this approach is memory efficiency. Because no intermediate Python lists of integers are created, the Python garbage collector has far fewer objects to track, which makes the approach well suited to tokenizing very large datasets. Large batch tokenization jobs, which can otherwise bottleneck on Python GC, can see a major speedup.
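To make the memory argument concrete, here is a rough sketch of the offsets-plus-values layout that an Arrow large-list array uses, written with only the standard library. It is an illustration of the layout, not the extension's actual code: the helper names `to_list_layout` and `row` are hypothetical, the token IDs are made up, and the choice of an unsigned integer element type is an assumption.

```python
from array import array

# Toy token IDs for three documents (made-up values, not real tokenizer output).
token_lists = [[101, 7592, 2088], [], [19204, 102]]


def to_list_layout(rows):
    """Flatten rows into (offsets, values), mirroring a large-list layout.

    Hypothetical helper for illustration only.
    """
    offsets = array("q", [0])  # signed 64-bit offsets, as a LargeList uses
    values = array("I")        # one contiguous buffer of unsigned ints (assumed type)
    for r in rows:
        values.extend(r)
        offsets.append(len(values))  # end position of this row in the flat buffer
    return offsets, values


def row(offsets, values, i):
    """Recover row i as a slice of the flat values buffer."""
    return values[offsets[i]:offsets[i + 1]].tolist()


offsets, values = to_list_layout(token_lists)
print(offsets.tolist())  # row boundaries in the flat buffer
print(values.tolist())   # all token IDs, back to back
print(row(offsets, values, 2))
```

In this layout the whole batch lives in two buffers, however many rows it holds, so the garbage collector sees two objects rather than one list per row plus one integer object per token.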
Features
- Very simple API
- Provides a zero-copy data transfer of resulting Arrow arrays
Setup and Installation
Prerequisites
- Install Rust
- Install Python
Installation Steps
uv run maturin develop
Arrow-based Tokenization
Luxical bundles the arrow-tokenize Rust extension package, which exposes ArrowTokenizer, a fast tokenizer abstraction that takes Arrow arrays as input and returns Arrow arrays. This avoids Python-level overhead during large batch tokenization. Compared to the Python API of the tokenizers library, arrow-tokenize dramatically reduces pressure on the Python garbage collector and can deliver substantial speedups in bulk tokenization.
Key Python API functions:
load_arrow_tokenizer_from_pretrained(tokenizer_id: str) -> ArrowTokenizer
load_arrow_tokenizer_from_file(tokenizer_file: Path | str) -> ArrowTokenizer
arrow_tokenize_texts(texts: list[str], arrow_tokenizer: ArrowTokenizer, *, batch_size=4096, add_special_tokens=False, progress_bar=True) -> pa.ChunkedArray
Minimal example:
from luxical.tokenization import (
    load_arrow_tokenizer_from_pretrained,
    arrow_tokenize_texts,
)
tok = load_arrow_tokenizer_from_pretrained("google-bert/bert-base-uncased")
chunks = arrow_tokenize_texts([
    "hello world",
    "lexical embeddings are fast",
], tok, batch_size=2, add_special_tokens=False, progress_bar=False)
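The batch_size parameter suggests that texts are processed in fixed-size groups. The sketch below shows how such batching is commonly done in plain Python. Whether arrow_tokenize_texts splits its input this way internally, or emits one chunk of the result per batch, is an assumption; `iter_batches` is a hypothetical helper and not part of the package.

```python
from typing import Iterator, Sequence


def iter_batches(texts: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `texts` holding at most `batch_size` items.

    Hypothetical helper for illustration only.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


texts = ["a", "b", "c", "d", "e"]
batches = list(iter_batches(texts, batch_size=2))
print(batches)  # the last batch is shorter when len(texts) is not a multiple of batch_size
```

A larger batch size means fewer calls across the Python/Rust boundary, at the cost of more memory held per call.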
Release Notes
v1.0.1 - 2025-11-24
- Correct license metadata in pyproject.toml
v1.0.0 - 2025-09-22
- Initial release. Intended to be the only release until we need to bump dependency versions.
Download files
Built Distributions
File details
Details for the file arrow_tokenize-1.0.1-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
File metadata
- Download URL: arrow_tokenize-1.0.1-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Upload date:
- Size: 2.4 MB
- Tags: CPython 3.13, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | bb0835766bfe4cd8fa64f7d31d6200d4e6a01d4a1a7a8166933aaf6a81c04d1a |
| MD5 | 27455e9ad5d4fbc43af7ccd56e9fec82 |
| BLAKE2b-256 | a88db51df6d7ceaa6490df3c89b685a98a07534fbf8de8dbe314925c49c89c54 |
File details
Details for the file arrow_tokenize-1.0.1-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.
File metadata
- Download URL: arrow_tokenize-1.0.1-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
- Upload date:
- Size: 2.3 MB
- Tags: CPython 3.13, manylinux: glibc 2.17+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 164999ddd5442902e5e67abd9a23f4191156895fca63322cf0ede01715f8ab67 |
| MD5 | 9a4d09b674ff14f8f737132fe3bb9e77 |
| BLAKE2b-256 | 720ea466a694304350ef53660b2e92dab01d83a54e3d7af3d0650ccab4efa825 |
File details
Details for the file arrow_tokenize-1.0.1-cp313-cp313-macosx_11_0_arm64.whl.
File metadata
- Download URL: arrow_tokenize-1.0.1-cp313-cp313-macosx_11_0_arm64.whl
- Upload date:
- Size: 2.1 MB
- Tags: CPython 3.13, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b55e9e172ef82eb26a632331a1c08945944f2c8bb2687827abca358dab3ef7d0 |
| MD5 | 8a070f098a108b0844c3348af0600f75 |
| BLAKE2b-256 | 093df838fbfae97a6fea5f392ed5df7ee5ed6d3ce7dda9cd93e26c965dc1a1f9 |
File details
Details for the file arrow_tokenize-1.0.1-cp313-cp313-macosx_10_12_x86_64.whl.
File metadata
- Download URL: arrow_tokenize-1.0.1-cp313-cp313-macosx_10_12_x86_64.whl
- Upload date:
- Size: 2.2 MB
- Tags: CPython 3.13, macOS 10.12+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3e256bd737ef3b6f52f5eefa0b375d9e2a599098cab4da2146305117aaaefbcd |
| MD5 | 8dc4ac660a4ce7cf0a1d235233258c9c |
| BLAKE2b-256 | 814f9df1bfda36946e949dabc0cd08b3919d2b37fdc8847d4077acaf195465a8 |
File details
Details for the file arrow_tokenize-1.0.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
File metadata
- Download URL: arrow_tokenize-1.0.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Upload date:
- Size: 2.4 MB
- Tags: CPython 3.12, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 14f563c5515bd10257f26a09e19226187cd5c9a1900ef4b0725511722d034fad |
| MD5 | 1302abbbdb9aa54403e9b9aea8b30f0b |
| BLAKE2b-256 | 92a404836f4c019947456a36c616ebc7a94badeb956b7f33d675b6e32e37047c |
File details
Details for the file arrow_tokenize-1.0.1-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.
File metadata
- Download URL: arrow_tokenize-1.0.1-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
- Upload date:
- Size: 2.3 MB
- Tags: CPython 3.12, manylinux: glibc 2.17+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9217e7ea4a169e6a87c7d1147bf5a0f943693d03858912fe544c21b1c0c2310e |
| MD5 | af7b7105ab3defb5fe9000ea32ff8aa3 |
| BLAKE2b-256 | 21c1866f79ccc0fa7b878e66b42632b99357aaf20371869876c67616a78f9597 |
File details
Details for the file arrow_tokenize-1.0.1-cp312-cp312-macosx_11_0_arm64.whl.
File metadata
- Download URL: arrow_tokenize-1.0.1-cp312-cp312-macosx_11_0_arm64.whl
- Upload date:
- Size: 2.1 MB
- Tags: CPython 3.12, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 29acaee3b52f11a9fa0a77e539fe27ba8394488274788ad800f6de0fc93b46fe |
| MD5 | 38078740d990e84ce98db29766ae6585 |
| BLAKE2b-256 | 0e8514412344beaac92859e75a9f24120c59e0ab92f9ab9128adeeb32a8f49e4 |
File details
Details for the file arrow_tokenize-1.0.1-cp312-cp312-macosx_10_12_x86_64.whl.
File metadata
- Download URL: arrow_tokenize-1.0.1-cp312-cp312-macosx_10_12_x86_64.whl
- Upload date:
- Size: 2.2 MB
- Tags: CPython 3.12, macOS 10.12+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f5baf670894f592c9f3c628cb2da507ae72af47e0094fda0650c22a4950e366b |
| MD5 | 84ce4eb1f72149841cbd4707d28f6a06 |
| BLAKE2b-256 | 2f032345ecd4b70df9a4a29ea041ac52333ee9936e66a795640fd0bafa52e835 |
File details
Details for the file arrow_tokenize-1.0.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
File metadata
- Download URL: arrow_tokenize-1.0.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Upload date:
- Size: 2.4 MB
- Tags: CPython 3.11, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 986d81381a573282ebedfb3fbe7448a4598445b9b8b47beb666f65fe32db5048 |
| MD5 | acd0a6c94c4d78f416f5418de9444178 |
| BLAKE2b-256 | ce33eb7f42ee7bec827faa37dd779a2af684d884ed084feb1b611194268bab47 |
File details
Details for the file arrow_tokenize-1.0.1-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.
File metadata
- Download URL: arrow_tokenize-1.0.1-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
- Upload date:
- Size: 2.3 MB
- Tags: CPython 3.11, manylinux: glibc 2.17+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | fad820cb9a298bdd92ce9fe8f105dd3f0d297f6bb51f8dc4412c67e90631bf23 |
| MD5 | 2b3bcb9d508cc4439c606d4744510d4f |
| BLAKE2b-256 | 4b6d2ef0563eb0de3c0193543138f1d41a174b42f8e94acf7b198c0169628725 |
File details
Details for the file arrow_tokenize-1.0.1-cp311-cp311-macosx_11_0_arm64.whl.
File metadata
- Download URL: arrow_tokenize-1.0.1-cp311-cp311-macosx_11_0_arm64.whl
- Upload date:
- Size: 2.1 MB
- Tags: CPython 3.11, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f8a0ea1288d9de68dac090fd34652c64be8943519c762d9e2e3a8214233107b5 |
| MD5 | 0fcb20c3da91f6025e92ac34d8141fa6 |
| BLAKE2b-256 | 4beb128ac1dfc574307e36fade772327f25e97b784b3a0038798e63461413a88 |
File details
Details for the file arrow_tokenize-1.0.1-cp311-cp311-macosx_10_12_x86_64.whl.
File metadata
- Download URL: arrow_tokenize-1.0.1-cp311-cp311-macosx_10_12_x86_64.whl
- Upload date:
- Size: 2.2 MB
- Tags: CPython 3.11, macOS 10.12+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6b25b3e236935ba7f3371f9754638c63caa748bfba555ea03657380cd63606d3 |
| MD5 | bef73e77440d6ebcdd5064c89a3409ed |
| BLAKE2b-256 | fd9576a2ea7d4e36eecf5cb1a5c8e12b74d7933ca9b67b8641caf4bbd42a4118 |