Syntax-aware Bash tokenizer — Rust core, Python bindings
gotoken
Syntax-aware tokenizer for Bash and formal languages, written in Rust with Python bindings.
Why gotoken?
Standard BPE tokenizers (tiktoken, HuggingFace) fragment Bash constructs like 2>&1, &&, and --help into 4-5 separate tokens each, wasting context window and model capacity.
gotoken protects 130+ shell operators, coreutils commands and flags as atomic single tokens, then falls back to byte-level encoding for everything else.
Features
- grep, chmod, find, 2>&1, &&, ||, -rf → single ID, always
- Zero OOV: every byte maps to a fallback ID in [1000..1255]
- Perfect round-trip: decode(encode(s)) == s guaranteed
- VOCAB_SIZE = 32768 (power-of-two, Tensor Core aligned)
- Rayon parallel batch encoding, GIL released during tokenization
- Python 3.9+ via PyO3, installable with pip install gotoken
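To make the "protected tokens first, byte fallback otherwise" idea concrete, here is a minimal pure-Python sketch. It is not gotoken's implementation: the `PROTECTED` table, its IDs, and the greedy longest-match strategy are assumptions made for illustration. Only the fallback range [1000..1255] is taken from the feature list.

```python
# Hypothetical illustration only: not gotoken's actual vocabulary or algorithm.
BYTE_BASE = 1000  # fallback byte IDs occupy [1000..1255], per the feature list

# Assumed protected tokens with made-up IDs (the real table has 130+ entries).
PROTECTED = {"grep": 1, "chmod": 2, "find": 3, "2>&1": 4, "&&": 5, "||": 6, "-rf": 7}
ID_TO_BYTES = {i: t.encode("utf-8") for t, i in PROTECTED.items()}
# Try longer tokens first so the longest protected match wins.
BY_LENGTH = sorted(ID_TO_BYTES.items(), key=lambda kv: -len(kv[1]))


def encode(text: str) -> list[int]:
    """Greedy longest-match over protected tokens, one ID per byte otherwise."""
    data = text.encode("utf-8")
    ids, pos = [], 0
    while pos < len(data):
        for tok_id, tok_bytes in BY_LENGTH:
            if data.startswith(tok_bytes, pos):
                ids.append(tok_id)
                pos += len(tok_bytes)
                break
        else:
            # No protected token starts here: emit the raw byte's fallback ID.
            ids.append(BYTE_BASE + data[pos])
            pos += 1
    return ids


def decode(ids: list[int]) -> str:
    """Invert encode(): byte IDs map back to bytes, the rest to token bytes."""
    out = bytearray()
    for i in ids:
        if BYTE_BASE <= i < BYTE_BASE + 256:
            out.append(i - BYTE_BASE)
        else:
            out += ID_TO_BYTES[i]
    return out.decode("utf-8")
```

Because every byte has a fallback ID, this sketch never hits an unknown symbol, and decoding simply concatenates the bytes back together, which is why the round-trip property holds. Unlike a syntax-aware tokenizer, it ignores word boundaries, so it would also match `find` inside `finder`.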
Install
pip install gotoken # Python
cargo add gotoken # Rust
Usage (Python)
from gotoken import GoToken
tok = GoToken()
ids = tok.encode("grep -rf /tmp 2>&1")
text = tok.decode(ids)
assert text == "grep -rf /tmp 2>&1"
# Parallel batch — GIL released, rayon saturates all cores
results = tok.encode_batch(["find /var -name '*.log'", "chmod 755 /bin/app"])
Usage (Rust)
use gotoken::encoder::Encoder;
let enc = Encoder::new();
let ids = enc.encode_str("grep -rf /tmp 2>&1", false)?;
let text = enc.decode(&ids)?;
assert_eq!(text, "grep -rf /tmp 2>&1");
Compression vs tiktoken
| Command | tiktoken tokens | gotoken tokens |
|---|---|---|
| grep -r 'TODO' . 2>&1 | 12 | 7 |
| chmod 755 /var/www/html | 10 | 6 |
| find /home -name '*.log' \| wc -l | 14 | 8 |
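The relative savings implied by the table can be worked out directly. The short script below only does arithmetic on the counts listed above. It does not call tiktoken or gotoken, and the `reduction` helper is a name introduced here for illustration.

```python
# Token counts copied from the comparison table: (command, tiktoken, gotoken)
ROWS = [
    ("grep -r 'TODO' . 2>&1", 12, 7),
    ("chmod 755 /var/www/html", 10, 6),
    ("find /home -name '*.log' | wc -l", 14, 8),
]


def reduction(before: int, after: int) -> float:
    """Fraction of tokens saved when going from `before` to `after`."""
    return 1 - after / before


for command, tk, gt in ROWS:
    print(f"{command}: {tk} -> {gt} tokens ({reduction(tk, gt):.1%} fewer)")

total_tk = sum(row[1] for row in ROWS)  # 36
total_gt = sum(row[2] for row in ROWS)  # 21
print(f"overall: {total_tk} -> {total_gt} tokens ({reduction(total_tk, total_gt):.1%} fewer)")
```

For these three commands the per-row savings are 41.7%, 40.0%, and 42.9%, or 41.7% overall (36 tokens down to 21).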
License
MIT
File details
Details for the file gotoken-0.1.1.tar.gz.
File metadata
- Download URL: gotoken-0.1.1.tar.gz
- Upload date:
- Size: 30.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.14.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c8dd20b17da5de27f600176fed96d00903e1b39512450e85d19ea9ed453544d0 |
| MD5 | b26bfb18bbc540aa2f444351871a5ee9 |
| BLAKE2b-256 | 88ab563cb5ba07f685c93ebda13c61d020ec7154818b6f098892f1e1bc411661 |
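A downloaded file can be checked against a published SHA256 digest using only Python's standard library. The sketch below assumes the archive has already been saved locally. The helper names `sha256_of_file` and `matches` are introduced here for illustration and are not part of gotoken.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: str, expected_hex: str) -> bool:
    """True if the file's SHA256 equals the expected hex digest (case-insensitive)."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, `matches("gotoken-0.1.1.tar.gz", "<SHA256 from the table>")` should return `True` for an intact download of the source distribution.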
File details
Details for the file gotoken-0.1.1-cp313-cp313-manylinux_2_34_x86_64.whl.
File metadata
- Download URL: gotoken-0.1.1-cp313-cp313-manylinux_2_34_x86_64.whl
- Upload date:
- Size: 752.9 kB
- Tags: CPython 3.13, manylinux: glibc 2.34+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.14.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e809f8222cbcb375794d9512b992d5db8551b4147af76d7941811c3f0a31a5c7 |
| MD5 | 20f97539553c5dc12b5706ec8fbb01b7 |
| BLAKE2b-256 | 0ac55cef752ff2e03790181dd06d2a08cc541d4eb7738efc05c04d092874c855 |