Fast Rust tokenization without Python garbage collection slowdowns

Reason this release was yanked: imprecise python version spec

Project description

Rust-based Python Tokenizer to Arrow

This project provides a tiny Python extension written in Rust for tokenizing text data directly into Arrow arrays. It uses the tokenizers library from Hugging Face and serializes the output directly into a pyarrow.LargeListArray without exposing any Encoding objects to the Python interpreter.

The primary benefit of this approach is memory efficiency. Because no intermediate Python lists of integers are created, the Python garbage collector has far fewer objects to track, which makes the approach well suited to tokenizing very large datasets. This can provide a major speedup for large batch tokenization jobs, which can otherwise become bottlenecked on Python GC.
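To make the memory argument concrete, here is a minimal pure-Python sketch of the list layout that Arrow list arrays use: one flat buffer of values plus one buffer of row offsets. It is only an illustration of the layout, not the extension's actual code; the helper names, the token IDs, and the unsigned-int width for token IDs are assumptions made for the example.

```python
from array import array


def to_offsets_and_values(token_lists):
    """Flatten nested token-id lists into (offsets, values) buffers.

    Hypothetical helper: it mimics the offsets-plus-values layout of an
    Arrow large list array using only the standard library.
    """
    offsets = array("q", [0])  # signed 64-bit offsets, as in Arrow "large" lists
    values = array("I")        # unsigned ints for token ids (assumed width)
    for ids in token_lists:
        values.extend(ids)
        offsets.append(len(values))  # end of this row == start of the next
    return offsets, values


def row(offsets, values, i):
    """Recover row i as a slice of the flat values buffer."""
    return values[offsets[i]:offsets[i + 1]]


# Made-up token ids; the last document is empty.
nested = [[101, 7592, 102], [101, 102], []]
offsets, values = to_offsets_and_values(nested)

print(list(offsets))                  # row boundaries
print(list(values))                   # all token ids, back to back
print(list(row(offsets, values, 1)))  # second document's ids
```

However many documents are tokenized, this layout holds two buffers rather than one Python list per document, which is why the garbage collector has so little to traverse.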

Features

  • Very simple API
  • Zero-copy transfer of the resulting Arrow arrays to Python

Setup and Installation

Prerequisites

  • Install Rust
  • Install Python

Installation Steps

uv run maturin develop

Arrow-based Tokenization

Luxical bundles the arrow-tokenize Rust extension package, which exposes ArrowTokenizer, a fast tokenizer abstraction that takes Arrow arrays as input and returns Arrow arrays. This avoids Python-level overhead during large batch tokenization. Compared to the Python API of the tokenizers library, arrow-tokenize dramatically reduces pressure on the Python garbage collector and can deliver substantial speedups in bulk tokenization.

Key Python API functions:

  • load_arrow_tokenizer_from_pretrained(tokenizer_id: str) -> ArrowTokenizer
  • load_arrow_tokenizer_from_file(tokenizer_file: Path | str) -> ArrowTokenizer
  • arrow_tokenize_texts(texts: list[str], arrow_tokenizer: ArrowTokenizer, *, batch_size=4096, add_special_tokens=False, progress_bar=True) -> pa.ChunkedArray

Minimal example:

from luxical.tokenization import (
    load_arrow_tokenizer_from_pretrained,
    arrow_tokenize_texts,
)

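# Load a pretrained tokenizer by its identifier.
tok = load_arrow_tokenizer_from_pretrained("google-bert/bert-base-uncased")

# Tokenize a small batch; per the signature above, the result is a pa.ChunkedArray.
chunks = arrow_tokenize_texts([
    "hello world",
    "lexical embeddings are fast",
], tok, batch_size=2, add_special_tokens=False, progress_bar=False)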

Release Notes

v1.0.0 - 2025-09-22

  • Initial release. Intended to be the only release until we need to bump dependency versions.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files are available for this release.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

  • arrow_tokenize-1.0.0-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.4 MB): CPython 3.13, manylinux (glibc 2.17+), x86-64
  • arrow_tokenize-1.0.0-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (2.3 MB): CPython 3.13, manylinux (glibc 2.17+), ARM64
  • arrow_tokenize-1.0.0-cp313-cp313-macosx_11_0_arm64.whl (2.1 MB): CPython 3.13, macOS 11.0+, ARM64
  • arrow_tokenize-1.0.0-cp313-cp313-macosx_10_12_x86_64.whl (2.2 MB): CPython 3.13, macOS 10.12+, x86-64
  • arrow_tokenize-1.0.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.4 MB): CPython 3.12, manylinux (glibc 2.17+), x86-64
  • arrow_tokenize-1.0.0-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (2.3 MB): CPython 3.12, manylinux (glibc 2.17+), ARM64
  • arrow_tokenize-1.0.0-cp312-cp312-macosx_11_0_arm64.whl (2.1 MB): CPython 3.12, macOS 11.0+, ARM64
  • arrow_tokenize-1.0.0-cp312-cp312-macosx_10_12_x86_64.whl (2.2 MB): CPython 3.12, macOS 10.12+, x86-64
  • arrow_tokenize-1.0.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.4 MB): CPython 3.11, manylinux (glibc 2.17+), x86-64
  • arrow_tokenize-1.0.0-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (2.3 MB): CPython 3.11, manylinux (glibc 2.17+), ARM64
  • arrow_tokenize-1.0.0-cp311-cp311-macosx_11_0_arm64.whl (2.1 MB): CPython 3.11, macOS 11.0+, ARM64
  • arrow_tokenize-1.0.0-cp311-cp311-macosx_10_12_x86_64.whl (2.2 MB): CPython 3.11, macOS 10.12+, x86-64

File details

Details for the file arrow_tokenize-1.0.0-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for arrow_tokenize-1.0.0-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 45da5384c7dfce1db9f921de03b41baabaf352b4e91c0f4f261a459056dd86b0
MD5 9c9882a7b85a616e20f371c8b21cacac
BLAKE2b-256 c609763a1877989cd9159663c16362aa251995c80e03b9b9f9d98a2d9f8f8daa

File details

Details for the file arrow_tokenize-1.0.0-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File hashes

Hashes for arrow_tokenize-1.0.0-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 b73437392e0baa58ef68ea373da60c7a98c305d49556b5b97fa23e4ce6f20608
MD5 b122ab0e5deb29c78412ebd83f761a2b
BLAKE2b-256 7e668cc736f767eb5e59771d083ea417d680ff94da657f87539faeef08a61aa5

File details

Details for the file arrow_tokenize-1.0.0-cp313-cp313-macosx_11_0_arm64.whl.

File hashes

Hashes for arrow_tokenize-1.0.0-cp313-cp313-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 b7fb8d45e994447d07c2276136f308f8d5981dfd43695857993d3a3d8eaca3a4
MD5 e15a5984f86774420566b49f6027c385
BLAKE2b-256 8f24c32f682ab6b5d1f6b4bfaae14dfb64563c550141883f63ddbee6d5e261f5

File details

Details for the file arrow_tokenize-1.0.0-cp313-cp313-macosx_10_12_x86_64.whl.

File hashes

Hashes for arrow_tokenize-1.0.0-cp313-cp313-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 a03e44c09f860e4b477e2b2328d33c40e7636ef32c286c8cb660b785768c0209
MD5 8a54358cbfbc7b2d11f35ae4ebc4f7b6
BLAKE2b-256 e8d6844aae1e482ce5f9a47ab2f4e4c97f2eba30455e74616d15fb6d67f268cc

File details

Details for the file arrow_tokenize-1.0.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for arrow_tokenize-1.0.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 44c829632f29717e482f5ea8c7f202bc5988938c838dbc9c253b347bcc8428d6
MD5 4181f3250c0ce29eefc5d10f2bebbd06
BLAKE2b-256 a59a16011f831aa91d19726df89bb6349490a8009472067991b1a9687ae8e235

File details

Details for the file arrow_tokenize-1.0.0-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File hashes

Hashes for arrow_tokenize-1.0.0-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 f6312973f48f012a8b8024e5679e3291cf238dd7d9cd01b3a87c9f835601f28a
MD5 2f4f701aa62a0f0d6f01bc5c060be724
BLAKE2b-256 4768e581f00f6a775ddf102771bc91cb26cef5ccbe2f8c95f501eddeb72dbd6f

File details

Details for the file arrow_tokenize-1.0.0-cp312-cp312-macosx_11_0_arm64.whl.

File hashes

Hashes for arrow_tokenize-1.0.0-cp312-cp312-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 14cc2523f27ce0d1c64c7d9c03d8cb6ca5d8926768c4ccbf793a2ab98585be5d
MD5 86c46f5ff7a69af51496b65c5bacf781
BLAKE2b-256 c26043e68747d6f8a8f3eb38dc285f5bbb17e55284d27e45ac453b5410d6a6e5

File details

Details for the file arrow_tokenize-1.0.0-cp312-cp312-macosx_10_12_x86_64.whl.

File hashes

Hashes for arrow_tokenize-1.0.0-cp312-cp312-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 5c4d45f85409912efd5a59f80a78afc3f9a1137a80ea1251ac905f4cf2d2aebe
MD5 407fcb4b04540daca9de71e774f0124e
BLAKE2b-256 468e43ced3ca42c0ca821d992749bb388cecf0de2ed2eb8a57fee8485dead6a6

File details

Details for the file arrow_tokenize-1.0.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for arrow_tokenize-1.0.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 69b3945b28e274d22dfcc7d72b8b6cad72a577e4289eb846fa35bb4f1048d375
MD5 983f653e2fc9c0c4cfff59ef305f7177
BLAKE2b-256 80f1e39f86efe90894013c31f5bb37a739b682fe0a2dcd37676f3e0b82ebcf97

File details

Details for the file arrow_tokenize-1.0.0-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File hashes

Hashes for arrow_tokenize-1.0.0-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 6c4623b4b74d67904d6a6a9b1b9951743b4b5ed48908002ad573c4a696d9b4d0
MD5 24bf825df77df47c58f64a53a34c6002
BLAKE2b-256 7e831ed469097b13a59e097bad7973f33d2e6426381102d860d01d7533ac0e12

File details

Details for the file arrow_tokenize-1.0.0-cp311-cp311-macosx_11_0_arm64.whl.

File hashes

Hashes for arrow_tokenize-1.0.0-cp311-cp311-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 7faab9e60f2fbcbf4b7fa4f7f46e69ee3c0cb40f6c8a90958a7f92619327f30d
MD5 020ebb3e6b1a581ea5db5a1cdcdc4366
BLAKE2b-256 10880bc61ad275c4989632a5c1441b738e982f1be38839a5f543d37f54b1ae0f

File details

Details for the file arrow_tokenize-1.0.0-cp311-cp311-macosx_10_12_x86_64.whl.

File hashes

Hashes for arrow_tokenize-1.0.0-cp311-cp311-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 d9931208d59de3af0bf9b75727ae070886d7d2c63c9a9334d4ff2d9289776f51
MD5 dba74f5fdeb1cb41afd81c92a4a4fec4
BLAKE2b-256 eeffaed15ce85496949b6b8087c16a780867c0bba4ff095538c648abfd5d6ed9
