A custom tokeniser with a 131,072-token vocabulary derived from 0.5B (validation) and 1B (validation + test) tokens of SlimPajama. It uses a novel token-generation algorithm and dynamic-programming-based segmentation for fast, interpretable tokenisation, and can also tokenise against custom token maps.

Project description

📄 README.md

🧠 Custom Tokeniser Library

A high-performance, fully custom tokeniser built from scratch: no BPE, no existing NLP tokenisation scheme. It is based on a unique, independently developed algorithm and trained on over 1 billion tokens from the SlimPajama dataset (validation + test splits), providing an efficient, interpretable, and extensible tokenisation pipeline.

🚀 What This Library Offers

  • Tokeniser built on a vocabulary of 131,072 tokens
  • Two versions of vocab:
    • 0.5B: Validation-only data
    • 1B: Validation + Test data
  • Token vocab built via a custom algorithm — no Byte Pair Encoding (BPE)
  • Tokenisation logic includes:
    • Token lookup from pre-generated token map
    • Dynamic programming-based segmentation for out-of-vocab tokens
    • One-hot encoding (NumPy or PyTorch)
    • Visualisation utilities for tokens and token IDs
  • Lightweight JSON format for token maps & token count maps
  • Ready for integration into any LLM pre-tokenisation pipeline
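The dynamic-programming segmentation for out-of-vocab spans mentioned above can be illustrated with a short pure-Python sketch. This is only an illustration of the general technique, not the library's actual implementation; `dp_segment`, `vocab`, and `max_token_len` are hypothetical names. It finds the fewest vocabulary tokens that cover the input, falling back to single characters for anything unknown:

```python
def dp_segment(text, vocab, max_token_len=8):
    """Split text into the fewest vocab tokens via dynamic programming.

    Illustrative sketch only. Single characters act as a fallback
    for out-of-vocabulary spans.
    """
    # best[i] = (token count, token list) for the cheapest segmentation of text[:i]
    best = [(0, [])] + [(float("inf"), None)] * len(text)
    for i in range(1, len(text) + 1):
        for j in range(max(0, i - max_token_len), i):
            piece = text[j:i]
            if piece in vocab or len(piece) == 1:
                cost = best[j][0] + 1
                if cost < best[i][0]:
                    best[i] = (cost, best[j][1] + [piece])
    return best[len(text)][1]
```

For example, with the toy vocabulary `{"token", "iser", "tok", "en"}`, the word "tokeniser" segments into two tokens, `["token", "iser"]`, rather than three.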

Note: The vocabulary files, chunked to under 2 GB each, are hosted on Hugging Face rather than GitHub because of Git LFS file-size constraints. Smaller chunks (under 100 MB each) are also available on GitHub.

📦 Installation

pip install tokeniser-py

🛠 Usage

from tokeniser import Tokeniser

# Load the 1B (validation + test) vocabulary with ordered token IDs
t = Tokeniser(ln='1b', token_ordered=True)

# Segment the input text, then map the tokens to their integer IDs
tokens, count = t.tokenise("Your input text here.")
token_ids = t.token_ids(tokens)

Use t.one_hot_tokens(token_ids) for NumPy-based one-hot encoding, or pass op='torch' for a PyTorch tensor.
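For reference, NumPy one-hot encoding of a token-ID sequence amounts to the following. This is a generic sketch of the technique, not the library's own one_hot_tokens implementation; the `one_hot` function name is hypothetical:

```python
import numpy as np

def one_hot(token_ids, vocab_size=131_072):
    """One-hot encode a sequence of token IDs.

    Rows are positions in the sequence; columns are vocabulary slots.
    Illustrative sketch, not the library's implementation.
    """
    ids = np.asarray(token_ids)
    out = np.zeros((ids.size, vocab_size), dtype=np.uint8)
    out[np.arange(ids.size), ids] = 1
    return out
```

With the full 131,072-token vocabulary this produces one 131,072-wide row per token, so dense one-hot arrays get large quickly for long inputs.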

📚 Data Sources

All token maps and token counts are generated from the SlimPajama dataset by Cerebras.

📁 Vocab Files

  • ordered_tokenizer_1b_val_test_data.json — Ordered tokens (1B data)
  • unordered_tokenizer_1b_val_test_data.json — Unordered tokens (1B)
  • count_tokenizer_1b_val_test_data.json — Token counts (1B)
  • (Similar structure for 0.5B val-only version)
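Because the maps are plain JSON, they can be loaded with the standard library alone. The snippet below assumes a flat token-to-count schema purely for illustration; the real files' exact layout is not documented on this page, and the sample values are made up:

```python
import json
import os
import tempfile

# Hypothetical schema: a flat {token: count} mapping with invented values.
count_map = {"the": 41_230_991, "token": 88_412}

# Write and re-read the map, as one would with the shipped JSON files.
path = os.path.join(tempfile.mkdtemp(), "count_map.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(count_map, f, ensure_ascii=False)

with open(path, encoding="utf-8") as f:
    loaded = json.load(f)
```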

📌 Design Philosophy

This tokeniser was built from scratch, before studying existing algorithms such as BPE. It is designed with the intent to understand, innovate, and compare with existing solutions from first principles.

Some parts may overlap with BPE/WordPiece in spirit — but the core algorithm was independently designed.

🤝 Contributions

Contributions of any kind are welcome via GitHub.

📖 License

MIT License

📄 CHANGELOG

📦 Changelog

[0.1.0] - 2025-03-22

Added

  • Initial release of custom tokeniser library
  • Tokeniser class with support for:
    • tokenise() using DP segmentation
    • Custom token map and count map loading
    • One-hot encoding support (NumPy & PyTorch)
    • Token and token ID visualisation functions
    • token_map(), token_count_map(), max_token_length() accessors
  • Full support for:
    • 0.5B val-only vocab
    • 1B val + test vocab
  • JSON-based token and count maps from SlimPajama corpus

Notes

  • Built on top of a custom token creation algorithm not based on any standard BPE/WordPiece method
  • SlimPajama dataset used for vocab extraction
  • Token count files are optimised to stay under 2 GB for compatibility with Git LFS (and Hugging Face storage)

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

tokeniser-py-0.1.0.tar.gz (4.9 MB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

tokeniser_py-0.1.0-py3-none-any.whl (4.9 MB)

Uploaded Python 3

File details

Details for the file tokeniser-py-0.1.0.tar.gz.

File metadata

  • Download URL: tokeniser-py-0.1.0.tar.gz
  • Upload date:
  • Size: 4.9 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.10.8

File hashes

Hashes for tokeniser-py-0.1.0.tar.gz
Algorithm Hash digest
SHA256 58a1122f761c2228318b47d33fa7651907f84718e058e6be4b7f773d4a24160c
MD5 836628d3338ecfce26fcfdf771d0999b
BLAKE2b-256 42ee8b07be439ace21eb0366be555d93261a7ba0dd3e6c3197d8bb6434e024ba

See more details on using hashes here.

File details

Details for the file tokeniser_py-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: tokeniser_py-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 4.9 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.10.8

File hashes

Hashes for tokeniser_py-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 a6ac8de6b5495a3774d3cfc1e4f98a026e9aedd70db68b0c65e7a0909bc52a39
MD5 8bf2e937a1839ee8a3be81e116a4855b
BLAKE2b-256 3b29bfd121b699cc3150e1e80d0ab1fd2e8e037880c0267242085a0c770f2d6d

See more details on using hashes here.
