Syntax-aware Bash tokenizer — Rust core, Python bindings

gotoken

Syntax-aware tokenizer for Bash and formal languages, written in Rust with Python bindings.

Why gotoken?

Standard BPE tokenizers (tiktoken, Hugging Face tokenizers) can split Bash constructs such as 2>&1, && and --help into as many as 4-5 separate tokens each, which wastes context window and model parameters.

gotoken protects 130+ shell operators, coreutils commands and flags as atomic single tokens, then falls back to byte-level encoding for everything else.
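
The protect-then-fall-back idea can be sketched in a few lines of Python. This is a toy illustration, not gotoken's implementation: the `PROTECTED` table, its IDs, and the greedy longest-match strategy are all assumptions made for the example. It also ignores word boundaries, so it would match `find` inside `finder`, which a syntax-aware tokenizer would presumably avoid.

```python
# Hypothetical protected-token table: these IDs are made up for illustration
# and are not gotoken's real vocabulary.
PROTECTED = {
    "grep": 1, "chmod": 2, "find": 3,
    "2>&1": 4, "&&": 5, "||": 6, "-rf": 7, "--help": 8,
}
BYTE_BASE = 1000  # fallback IDs span [1000..1255], one per byte value

# Try longer tokens first so a long token is never shadowed by a shorter one.
_BY_LENGTH = sorted(
    ((tok.encode("utf-8"), tid) for tok, tid in PROTECTED.items()),
    key=lambda pair: len(pair[0]),
    reverse=True,
)


def encode(text: str) -> list[int]:
    """Greedy longest match on protected tokens, byte fallback otherwise."""
    data = text.encode("utf-8")
    ids, i = [], 0
    while i < len(data):
        for tok_bytes, tid in _BY_LENGTH:
            if data.startswith(tok_bytes, i):
                ids.append(tid)          # whole construct becomes one ID
                i += len(tok_bytes)
                break
        else:
            # No protected token starts here: emit one fallback ID for this byte.
            ids.append(BYTE_BASE + data[i])
            i += 1
    return ids


print(encode("grep -rf"))  # [1, 1032, 7]
```

In this sketch `grep` and `-rf` each cost one ID, while the space between them falls back to its byte value (32) offset by 1000.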

Features

  • grep, chmod, find, 2>&1, &&, ||, -rf → single ID, always
  • Zero OOV: every byte maps to a fallback ID in [1000..1255]
  • Perfect round-trip: decode(encode(s)) == s guaranteed
  • VOCAB_SIZE = 32768 (power-of-two, Tensor Core aligned)
  • Rayon parallel batch encoding, GIL released during tokenization
  • Python 3.9+ via PyO3, installable with pip install gotoken
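
The zero-OOV and round-trip claims can be made concrete with a small sketch. The helpers below (`byte_fallback_encode`, `byte_fallback_decode`, `round_trips`) are hypothetical names written for this example, not part of gotoken; they model only the byte-fallback layer, assuming the ID for a byte is `1000 + byte`.

```python
BYTE_BASE = 1000      # first fallback ID, from the range [1000..1255]
VOCAB_SIZE = 32768    # 2**15, as listed above


def byte_fallback_encode(data: bytes) -> list[int]:
    """Map every byte to its own ID, so no input is out of vocabulary."""
    return [BYTE_BASE + b for b in data]


def byte_fallback_decode(ids: list[int]) -> bytes:
    """Invert byte_fallback_encode; reject IDs outside the fallback range."""
    out = bytearray()
    for token_id in ids:
        if not BYTE_BASE <= token_id < BYTE_BASE + 256:
            raise ValueError(f"not a byte-fallback ID: {token_id}")
        out.append(token_id - BYTE_BASE)
    return bytes(out)


def round_trips(encode, decode, samples) -> bool:
    """Check decode(encode(s)) == s for every sample."""
    return all(decode(encode(s)) == s for s in samples)


samples = [bytes(range(256)), "naïve ∑ 2>&1".encode("utf-8")]
print(round_trips(byte_fallback_encode, byte_fallback_decode, samples))  # True
```

With gotoken installed, the same `round_trips` helper could be pointed at the `encode` and `decode` methods shown in the usage section, passing strings instead of bytes.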

Install

pip install gotoken       # Python
cargo add gotoken         # Rust

Usage (Python)

from gotoken import GoToken

tok = GoToken()

ids  = tok.encode("grep -rf /tmp 2>&1")
text = tok.decode(ids)
assert text == "grep -rf /tmp 2>&1"

# Parallel batch — GIL released, rayon saturates all cores
results = tok.encode_batch(["find /var -name '*.log'", "chmod 755 /bin/app"])

Usage (Rust)

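use gotoken::encoder::Encoder;

// Wrapped in main so the `?` operator compiles; this assumes gotoken's
// error type implements std::error::Error.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let enc  = Encoder::new();
    let ids  = enc.encode_str("grep -rf /tmp 2>&1", false)?;
    let text = enc.decode(&ids)?;
    assert_eq!(text, "grep -rf /tmp 2>&1");
    Ok(())
}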

Compression vs tiktoken

Command                               tiktoken tokens   gotoken tokens
grep -r 'TODO' . 2>&1                 12                7
chmod 755 /var/www/html               10                6
find /home -name '*.log' | wc -l      14                8
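
To read the table as relative savings, the per-command reduction can be computed directly from the token counts above. The snippet below only does that arithmetic; it does not call either tokenizer, and the `reduction` helper is a name introduced here for illustration.

```python
# Token counts copied from the comparison table: (command, tiktoken, gotoken).
rows = [
    ("grep -r 'TODO' . 2>&1", 12, 7),
    ("chmod 755 /var/www/html", 10, 6),
    ("find /home -name '*.log' | wc -l", 14, 8),
]


def reduction(before: int, after: int) -> float:
    """Fraction of tokens saved relative to the baseline count."""
    return (before - after) / before


for command, tiktoken_count, gotoken_count in rows:
    print(f"{command:<36} {reduction(tiktoken_count, gotoken_count):.1%}")

total_before = sum(r[1] for r in rows)   # 36
total_after = sum(r[2] for r in rows)    # 21
print(f"{'overall':<36} {reduction(total_before, total_after):.1%}")
```

For these three commands the savings are 41.7%, 40.0% and 42.9%, or 41.7% overall (36 tokens down to 21). Three samples are too few to generalize from, so the figure is best read as indicative.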

License

MIT

Download files

Source Distribution

gotoken-0.1.1.tar.gz (30.0 kB)

Built Distribution

gotoken-0.1.1-cp313-cp313-manylinux_2_34_x86_64.whl (752.9 kB)

CPython 3.13, manylinux (glibc 2.34+), x86-64
