Fast Rust tokenization without Python garbage collection slowdowns

Rust-based Python Tokenizer to Arrow

This project provides a tiny Python extension written in Rust for tokenizing text data directly into Arrow arrays. It uses the tokenizers library from Hugging Face and serializes the output directly into a pyarrow.LargeListArray without exposing any Encoding objects to the Python interpreter.

The primary benefit of this approach is memory efficiency. Because no intermediate Python lists of integers are created, the Python garbage collector has far fewer objects to manage, which makes the approach well suited to tokenizing very large datasets. Large batch tokenization jobs that would otherwise be bottlenecked on Python GC can see a major speedup.
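To make the memory argument concrete, the sketch below contrasts a nested Python list with the offsets-plus-values layout that Arrow list arrays use (a large list stores 64-bit offsets). It uses only the standard library `array` module as a stand-in for Arrow buffers. The helper names `to_offsets_and_values` and `row`, the element type chosen for token IDs, and the token IDs themselves are assumptions made up for illustration, not part of this package.

```python
from array import array


def to_offsets_and_values(token_lists):
    """Flatten a list of token-ID lists into two contiguous buffers.

    Row i occupies values[offsets[i]:offsets[i + 1]], which mirrors the
    layout of an Arrow large list.
    """
    offsets = array("q", [0])  # signed 64-bit offsets
    values = array("I")        # unsigned int token IDs (assumed element type)
    for ids in token_lists:
        values.extend(ids)
        offsets.append(len(values))
    return offsets, values


def row(offsets, values, i):
    """Recover row i as a plain Python list."""
    return values[offsets[i]:offsets[i + 1]].tolist()


# Made-up token IDs; the last text produced no tokens.
token_lists = [[11, 12, 13], [21], []]
offsets, values = to_offsets_and_values(token_lists)

# The nested form needs one list object per row plus the outer list.
# The flat form needs two buffers no matter how many rows there are.
python_containers = 1 + len(token_lists)
flat_buffers = 2
```

With millions of rows, the nested form means millions of small container objects for the interpreter to allocate and track, while the flat form stays at two buffers.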

Features

  • Very simple API
  • Zero-copy transfer of the resulting Arrow arrays to Python
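As a rough illustration of what "zero-copy" means, the following standard-library sketch shows a view that shares memory with a buffer instead of duplicating it. It is only an analogy: the mechanism this package uses to hand its buffers to pyarrow is not described here, and the variable names are hypothetical.

```python
from array import array

# A contiguous buffer standing in for memory produced on the Rust side.
values = array("q", [10, 20, 30, 40])

# A memoryview shares the underlying memory instead of copying it.
view = memoryview(values)

# A slice of the view is still a window onto the same memory.
window = view[1:3]

# A write through the original buffer is visible through the window,
# which shows that no element was duplicated.
values[1] = 99
shared = window.tolist()

# An explicit copy, by contrast, is detached from later writes.
copied = list(values)[1:3]
values[2] = -1
```

After the last assignment, `window` reflects the new value while `copied` does not, which is the difference between sharing a buffer and copying it.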

Setup and Installation

Prerequisites

  • Install Rust
  • Install Python

Installation Steps

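From the repository root, build the extension and install it into the current environment (this command requires uv):

uv run maturin develop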
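After the build finishes, a quick check along the following lines can confirm that the extension is visible to the interpreter. This is a minimal sketch: the import name `arrow_tokenize` is an assumption inferred from the distribution name, and `is_importable` is a hypothetical helper.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # "arrow_tokenize" is an assumed import name, not confirmed by the README.
    print("arrow_tokenize importable:", is_importable("arrow_tokenize"))
```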

Arrow-based Tokenization

Luxical bundles the arrow-tokenize Rust extension package, which exposes ArrowTokenizer, a fast tokenizer abstraction that takes Arrow arrays as input and returns Arrow arrays. This avoids Python-level overhead during large batch tokenization. Compared to the Python API of the tokenizers library, arrow-tokenize dramatically reduces pressure on the Python garbage collector and can deliver substantial speedups in bulk tokenization.

Key Python API functions:

  • load_arrow_tokenizer_from_pretrained(tokenizer_id: str) -> ArrowTokenizer
  • load_arrow_tokenizer_from_file(tokenizer_file: Path | str) -> ArrowTokenizer
  • arrow_tokenize_texts(texts: list[str], arrow_tokenizer: ArrowTokenizer, *, batch_size=4096, add_special_tokens=False, progress_bar=True) -> pa.ChunkedArray

Minimal example:

from luxical.tokenization import (
    load_arrow_tokenizer_from_pretrained,
    arrow_tokenize_texts,
)

# Load a tokenizer by its pretrained tokenizer ID.
tok = load_arrow_tokenizer_from_pretrained("google-bert/bert-base-uncased")

# Tokenize a small list of texts; the result is a pa.ChunkedArray.
chunks = arrow_tokenize_texts(
    ["hello world", "lexical embeddings are fast"],
    tok,
    batch_size=2,
    add_special_tokens=False,
    progress_bar=False,
)
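The batch_size parameter and the pa.ChunkedArray return type suggest that texts are tokenized in fixed-size groups. How arrow_tokenize_texts batches its input internally is not documented here, so the following is only a sketch of the general pattern, assuming one output chunk per batch. `iter_batches` is a hypothetical helper and not part of the luxical API.

```python
from typing import Iterator, Sequence


def iter_batches(texts: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size texts."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


texts = ["a", "b", "c", "d", "e"]

# Under the one-chunk-per-batch assumption, each slice here would
# correspond to one chunk of the resulting ChunkedArray.
batches = list(iter_batches(texts, batch_size=2))
```

A larger batch size means fewer, larger chunks and fewer crossings between Python and Rust; the default of 4096 in the signature above points in that direction.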

Release Notes

v1.0.2 - 2025-11-24

  • Clarify supported Python versions in pyproject.toml

v1.0.1 - 2025-11-24

  • Correct license metadata in pyproject.toml

v1.0.0 - 2025-09-22

  • Initial release. Intended to be the only release until we need to bump dependency versions.

Download files

No source distribution is available for this release. The following built distributions (wheels) are provided:

  • arrow_tokenize-1.0.2-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.4 MB): CPython 3.13, manylinux (glibc 2.17+), x86-64
  • arrow_tokenize-1.0.2-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (2.3 MB): CPython 3.13, manylinux (glibc 2.17+), ARM64
  • arrow_tokenize-1.0.2-cp313-cp313-macosx_11_0_arm64.whl (2.1 MB): CPython 3.13, macOS 11.0+, ARM64
  • arrow_tokenize-1.0.2-cp313-cp313-macosx_10_12_x86_64.whl (2.2 MB): CPython 3.13, macOS 10.12+, x86-64
  • arrow_tokenize-1.0.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.4 MB): CPython 3.12, manylinux (glibc 2.17+), x86-64
  • arrow_tokenize-1.0.2-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (2.3 MB): CPython 3.12, manylinux (glibc 2.17+), ARM64
  • arrow_tokenize-1.0.2-cp312-cp312-macosx_11_0_arm64.whl (2.1 MB): CPython 3.12, macOS 11.0+, ARM64
  • arrow_tokenize-1.0.2-cp312-cp312-macosx_10_12_x86_64.whl (2.2 MB): CPython 3.12, macOS 10.12+, x86-64
  • arrow_tokenize-1.0.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.4 MB): CPython 3.11, manylinux (glibc 2.17+), x86-64
  • arrow_tokenize-1.0.2-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (2.3 MB): CPython 3.11, manylinux (glibc 2.17+), ARM64
  • arrow_tokenize-1.0.2-cp311-cp311-macosx_11_0_arm64.whl (2.1 MB): CPython 3.11, macOS 11.0+, ARM64
  • arrow_tokenize-1.0.2-cp311-cp311-macosx_10_12_x86_64.whl (2.2 MB): CPython 3.11, macOS 10.12+, x86-64
