Fast Rust tokenization without Python garbage collection slowdowns

Reason this release was yanked:

imprecise python version spec

Project description

Rust-based Python Tokenizer to Arrow

This project provides a tiny Python extension written in Rust for tokenizing text data directly into Arrow arrays. It uses the tokenizers library from Hugging Face and serializes the output directly into a pyarrow.LargeListArray without exposing any Encoding objects to the Python interpreter.

The primary benefit of this approach is memory efficiency. Because no intermediate Python lists of integers are created, the Python garbage collector has far less to track, which makes the approach well suited to tokenizing very large datasets. Large batch tokenization jobs that would otherwise bottleneck on Python GC can see a major speedup.
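The layout behind this can be sketched without Arrow at all. The example below is an illustration only, not the extension's implementation: it uses the standard-library array module as a stand-in for Arrow buffers, and the helper names (pack_token_lists, row) and the token IDs are made up. It shows how many variable-length token sequences fit into two flat buffers, 64-bit offsets plus values, which is the general shape of an Arrow large list.

```python
from array import array


def pack_token_lists(token_lists):
    """Pack a list of token-ID lists into flat (offsets, values) buffers."""
    offsets = array("q", [0])  # signed 64-bit offsets, as Arrow large lists use
    values = array("I")        # unsigned ints for token IDs (an assumption)
    for ids in token_lists:
        values.extend(ids)
        offsets.append(len(values))  # end of this row == start of the next
    return offsets, values


def row(offsets, values, i):
    """Return the i-th token sequence as a slice of the flat values buffer."""
    return values[offsets[i]:offsets[i + 1]]


# Three "documents", one of them empty; the IDs are illustrative only.
offsets, values = pack_token_lists([[101, 7592, 2088, 102], [], [101, 102]])
print(list(offsets))
print(list(row(offsets, values, 2)))
```

However many tokens are stored, only two Python objects exist here, and their contents are raw machine integers rather than individual Python int objects.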

Features

  • Very simple API
  • Zero-copy transfer of the resulting Arrow arrays to Python

Setup and Installation

Prerequisites

  • Install Rust
  • Install Python

Installation Steps

uv run maturin develop
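After the build finishes, a quick check that the extension is visible to the interpreter can save some debugging. The sketch below uses only the standard library. The import name arrow_tokenize is an assumption inferred from the wheel file names, and is_importable is a hypothetical helper, not part of the package.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


# "arrow_tokenize" is an assumed import name; adjust it if the package
# exposes a different top-level module.
print(is_importable("arrow_tokenize"))
```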

Arrow-based Tokenization

Luxical bundles the arrow-tokenize Rust extension package, which exposes ArrowTokenizer, a fast tokenizer abstraction that operates on Arrow arrays and returns Arrow arrays. This avoids Python-level overhead during large batch tokenization. Compared to the Python API of the tokenizers library, arrow-tokenize dramatically reduces pressure on the Python garbage collector and can deliver substantial speedups in bulk tokenization.

Key Python API functions:

  • load_arrow_tokenizer_from_pretrained(tokenizer_id: str) -> ArrowTokenizer
  • load_arrow_tokenizer_from_file(tokenizer_file: Path | str) -> ArrowTokenizer
  • arrow_tokenize_texts(texts: list[str], arrow_tokenizer: ArrowTokenizer, *, batch_size=4096, add_special_tokens=False, progress_bar=True) -> pa.ChunkedArray
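The batch_size parameter suggests that texts are processed in fixed-size slices. The following is a guess at that pattern, not the package's code: iter_batches is a hypothetical helper, and whether each batch maps to one chunk of the returned pa.ChunkedArray is an assumption.

```python
from typing import Iterator, List


def iter_batches(texts: List[str], batch_size: int = 4096) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size texts."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


# Five texts with batch_size=2 give batches of 2, 2 and 1.
batch_sizes = [len(batch) for batch in iter_batches(["a", "b", "c", "d", "e"], 2)]
print(batch_sizes)
```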

Minimal example:

from luxical.tokenization import (
    arrow_tokenize_texts,
    load_arrow_tokenizer_from_pretrained,
)

# Load a tokenizer by its model ID.
tok = load_arrow_tokenizer_from_pretrained("google-bert/bert-base-uncased")

# Tokenize two texts; the result is a pa.ChunkedArray.
chunks = arrow_tokenize_texts(
    [
        "hello world",
        "lexical embeddings are fast",
    ],
    tok,
    batch_size=2,
    add_special_tokens=False,
    progress_bar=False,
)

Release Notes

v1.0.1 - 2025-11-24

  • Correct license metadata in pyproject.toml

v1.0.0 - 2025-09-22

  • Initial release. Intended to be the only release until we need to bump dependency versions.

Download files

Source Distributions

No source distribution files available for this release.

Built Distributions

  • arrow_tokenize-1.0.1-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.4 MB): CPython 3.13, manylinux (glibc 2.17+), x86-64
  • arrow_tokenize-1.0.1-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (2.3 MB): CPython 3.13, manylinux (glibc 2.17+), ARM64
  • arrow_tokenize-1.0.1-cp313-cp313-macosx_11_0_arm64.whl (2.1 MB): CPython 3.13, macOS 11.0+, ARM64
  • arrow_tokenize-1.0.1-cp313-cp313-macosx_10_12_x86_64.whl (2.2 MB): CPython 3.13, macOS 10.12+, x86-64
  • arrow_tokenize-1.0.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.4 MB): CPython 3.12, manylinux (glibc 2.17+), x86-64
  • arrow_tokenize-1.0.1-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (2.3 MB): CPython 3.12, manylinux (glibc 2.17+), ARM64
  • arrow_tokenize-1.0.1-cp312-cp312-macosx_11_0_arm64.whl (2.1 MB): CPython 3.12, macOS 11.0+, ARM64
  • arrow_tokenize-1.0.1-cp312-cp312-macosx_10_12_x86_64.whl (2.2 MB): CPython 3.12, macOS 10.12+, x86-64
  • arrow_tokenize-1.0.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.4 MB): CPython 3.11, manylinux (glibc 2.17+), x86-64
  • arrow_tokenize-1.0.1-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (2.3 MB): CPython 3.11, manylinux (glibc 2.17+), ARM64
  • arrow_tokenize-1.0.1-cp311-cp311-macosx_11_0_arm64.whl (2.1 MB): CPython 3.11, macOS 11.0+, ARM64
  • arrow_tokenize-1.0.1-cp311-cp311-macosx_10_12_x86_64.whl (2.2 MB): CPython 3.11, macOS 10.12+, x86-64
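The wheel names above follow the standard wheel naming scheme: distribution, version, optional build tag, Python tag, ABI tag and platform tag, separated by hyphens. As a rough aid for choosing a file, the sketch below splits a name into those fields. parse_wheel_filename is a hypothetical helper written for this page, not part of arrow-tokenize or pip, and it does no validation beyond counting fields.

```python
def parse_wheel_filename(filename: str) -> dict:
    """Split a wheel file name into its hyphen-separated tag fields."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file name: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        name, version, python_tag, abi_tag, platform_tag = parts
        build_tag = None
    elif len(parts) == 6:
        name, version, build_tag, python_tag, abi_tag, platform_tag = parts
    else:
        raise ValueError(f"unexpected number of fields in {filename!r}")
    return {
        "name": name,
        "version": version,
        "build": build_tag,
        "python": python_tag,
        "abi": abi_tag,
        # A platform field may hold several dot-separated tags.
        "platforms": platform_tag.split("."),
    }


info = parse_wheel_filename(
    "arrow_tokenize-1.0.1-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl"
)
print(info["python"], info["platforms"])
```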

File details

Details for the file arrow_tokenize-1.0.1-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for arrow_tokenize-1.0.1-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 bb0835766bfe4cd8fa64f7d31d6200d4e6a01d4a1a7a8166933aaf6a81c04d1a
MD5 27455e9ad5d4fbc43af7ccd56e9fec82
BLAKE2b-256 a88db51df6d7ceaa6490df3c89b685a98a07534fbf8de8dbe314925c49c89c54

File details

Details for the file arrow_tokenize-1.0.1-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File hashes

Hashes for arrow_tokenize-1.0.1-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 164999ddd5442902e5e67abd9a23f4191156895fca63322cf0ede01715f8ab67
MD5 9a4d09b674ff14f8f737132fe3bb9e77
BLAKE2b-256 720ea466a694304350ef53660b2e92dab01d83a54e3d7af3d0650ccab4efa825

File details

Details for the file arrow_tokenize-1.0.1-cp313-cp313-macosx_11_0_arm64.whl.

File hashes

Hashes for arrow_tokenize-1.0.1-cp313-cp313-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 b55e9e172ef82eb26a632331a1c08945944f2c8bb2687827abca358dab3ef7d0
MD5 8a070f098a108b0844c3348af0600f75
BLAKE2b-256 093df838fbfae97a6fea5f392ed5df7ee5ed6d3ce7dda9cd93e26c965dc1a1f9

File details

Details for the file arrow_tokenize-1.0.1-cp313-cp313-macosx_10_12_x86_64.whl.

File hashes

Hashes for arrow_tokenize-1.0.1-cp313-cp313-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 3e256bd737ef3b6f52f5eefa0b375d9e2a599098cab4da2146305117aaaefbcd
MD5 8dc4ac660a4ce7cf0a1d235233258c9c
BLAKE2b-256 814f9df1bfda36946e949dabc0cd08b3919d2b37fdc8847d4077acaf195465a8

File details

Details for the file arrow_tokenize-1.0.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for arrow_tokenize-1.0.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 14f563c5515bd10257f26a09e19226187cd5c9a1900ef4b0725511722d034fad
MD5 1302abbbdb9aa54403e9b9aea8b30f0b
BLAKE2b-256 92a404836f4c019947456a36c616ebc7a94badeb956b7f33d675b6e32e37047c

File details

Details for the file arrow_tokenize-1.0.1-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File hashes

Hashes for arrow_tokenize-1.0.1-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 9217e7ea4a169e6a87c7d1147bf5a0f943693d03858912fe544c21b1c0c2310e
MD5 af7b7105ab3defb5fe9000ea32ff8aa3
BLAKE2b-256 21c1866f79ccc0fa7b878e66b42632b99357aaf20371869876c67616a78f9597

File details

Details for the file arrow_tokenize-1.0.1-cp312-cp312-macosx_11_0_arm64.whl.

File hashes

Hashes for arrow_tokenize-1.0.1-cp312-cp312-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 29acaee3b52f11a9fa0a77e539fe27ba8394488274788ad800f6de0fc93b46fe
MD5 38078740d990e84ce98db29766ae6585
BLAKE2b-256 0e8514412344beaac92859e75a9f24120c59e0ab92f9ab9128adeeb32a8f49e4

File details

Details for the file arrow_tokenize-1.0.1-cp312-cp312-macosx_10_12_x86_64.whl.

File hashes

Hashes for arrow_tokenize-1.0.1-cp312-cp312-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 f5baf670894f592c9f3c628cb2da507ae72af47e0094fda0650c22a4950e366b
MD5 84ce4eb1f72149841cbd4707d28f6a06
BLAKE2b-256 2f032345ecd4b70df9a4a29ea041ac52333ee9936e66a795640fd0bafa52e835

File details

Details for the file arrow_tokenize-1.0.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for arrow_tokenize-1.0.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 986d81381a573282ebedfb3fbe7448a4598445b9b8b47beb666f65fe32db5048
MD5 acd0a6c94c4d78f416f5418de9444178
BLAKE2b-256 ce33eb7f42ee7bec827faa37dd779a2af684d884ed084feb1b611194268bab47

File details

Details for the file arrow_tokenize-1.0.1-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File hashes

Hashes for arrow_tokenize-1.0.1-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 fad820cb9a298bdd92ce9fe8f105dd3f0d297f6bb51f8dc4412c67e90631bf23
MD5 2b3bcb9d508cc4439c606d4744510d4f
BLAKE2b-256 4b6d2ef0563eb0de3c0193543138f1d41a174b42f8e94acf7b198c0169628725

File details

Details for the file arrow_tokenize-1.0.1-cp311-cp311-macosx_11_0_arm64.whl.

File hashes

Hashes for arrow_tokenize-1.0.1-cp311-cp311-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 f8a0ea1288d9de68dac090fd34652c64be8943519c762d9e2e3a8214233107b5
MD5 0fcb20c3da91f6025e92ac34d8141fa6
BLAKE2b-256 4beb128ac1dfc574307e36fade772327f25e97b784b3a0038798e63461413a88

File details

Details for the file arrow_tokenize-1.0.1-cp311-cp311-macosx_10_12_x86_64.whl.

File hashes

Hashes for arrow_tokenize-1.0.1-cp311-cp311-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 6b25b3e236935ba7f3371f9754638c63caa748bfba555ea03657380cd63606d3
MD5 bef73e77440d6ebcdd5064c89a3409ed
BLAKE2b-256 fd9576a2ea7d4e36eecf5cb1a5c8e12b74d7933ca9b67b8641caf4bbd42a4118
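
To check a downloaded wheel against the SHA256 digests listed above, the file can be hashed locally with the standard library. This is a minimal sketch. The function names sha256_of_file and matches_digest are made up for this example, and the wheel path passed in is whatever location the file was saved to.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read fixed-size blocks until read() returns an empty bytes object.
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_digest(path: str, expected_sha256: str) -> bool:
    """Compare a file's SHA256 digest with an expected hex string."""
    return sha256_of_file(path) == expected_sha256.strip().lower()
```

Calling matches_digest with the local wheel path and the SHA256 value from its entry above returns True only if the file is byte-for-byte identical to the published one.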
