A custom tokeniser with a 131,072-token vocabulary derived from 0.5B (validation) and 1B (validation + test) tokens of SlimPajama. It uses a novel token-generation algorithm and a dynamic programming-based segmentation method for fast, interpretable tokenisation, and can also be used for tokenisation on custom token maps.
Project description
📄 README.md
🔣 tokeniser-py
Important links: PyPI Library | PyPI Lite Library (tokeniser-py-lite) | Lite Library GitHub (tokeniser-py-lite) | Demo (HF Spaces) | Complete repo (unchunked) - HF | Complete repo (chunked) - GitHub | Important Files (GitHub)
A high-performance, fully custom tokeniser built from scratch — no BPE, no existing NLP tokenisation scheme. This tokeniser is based on a unique algorithm developed independently and trained on over 1 billion tokens from the SlimPajama dataset (Val + Test), providing an efficient, interpretable, and extendable tokenisation pipeline.
🚀 What This Library Offers
- Tokeniser built on a vocabulary of 131,072 tokens
- Two versions of vocab:
  - 0.5B: Validation-only data
  - 1B: Validation + Test data
- Token vocab built via a custom algorithm — no Byte Pair Encoding (BPE)
- Tokenisation logic includes:
- Token lookup from pre-generated token map
- Dynamic programming-based segmentation for out-of-vocab tokens (see the sketch after this list)
- One-hot encoding (NumPy or PyTorch)
- Visualisation utilities for tokens and token IDs
- Lightweight JSON format for token maps & token count maps
- Ready for integration into any LLM pre-tokenisation pipeline
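For intuition, here is a minimal, self-contained sketch of how dynamic programming-based segmentation over a fixed token map can work. It is not the library's actual implementation: the function name `dp_segment`, the fewest-tokens objective, and the single-character fallback are illustrative assumptions.

```python
# Minimal sketch of DP segmentation over a fixed vocabulary.
# NOTE: illustrative only -- dp_segment, its objective (fewest pieces),
# and the single-character fallback are assumptions, not this library's
# actual internals.
def dp_segment(text, vocab, max_token_length):
    """Split `text` into the fewest pieces such that each piece is either
    a vocabulary token or, as a fallback, a single character."""
    n = len(text)
    best = [None] * (n + 1)   # best[i] = (piece_count, pieces) for text[:i]
    best[0] = (0, [])
    for i in range(n):
        if best[i] is None:
            continue
        count, pieces = best[i]
        for j in range(i + 1, min(n, i + max_token_length) + 1):
            piece = text[i:j]
            if piece in vocab or j == i + 1:   # single chars always allowed
                if best[j] is None or count + 1 < best[j][0]:
                    best[j] = (count + 1, pieces + [piece])
    return best[n][1]

# Toy example: whole known tokens are preferred over character fallbacks.
toy_vocab = {"un", "happi", "ness", "token", "iser"}
print(dp_segment("unhappiness", toy_vocab, max_token_length=8))
# -> ['un', 'happi', 'ness']
```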
Note: Due to LFS file size constraints, files chunked to under 2 GB are stored on Hugging Face instead of GitHub; files chunked to under 100 MB are available on GitHub.
📦 Installation
pip install tokeniser-py
🛠 Usage
from tokeniser import Tokeniser
t = Tokeniser()
tokens, count = t.tokenise("Your input text here.")
token_ids = t.token_ids(tokens)
Use t.one_hot_tokens(token_ids) for NumPy-based one-hot encoding, or pass op='torch' for PyTorch (see the sketch below).
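A hedged, end-to-end example of the one-hot step described above; the `op` keyword comes from the sentence above, and the exact return types are assumptions.

```python
from tokeniser import Tokeniser

t = Tokeniser()
tokens, count = t.tokenise("Your input text here.")
token_ids = t.token_ids(tokens)

# One-hot encode the token IDs (NumPy-based by default).
one_hot = t.one_hot_tokens(token_ids)

# With PyTorch installed, request a tensor instead; the keyword name `op`
# is taken from this README, the exact signature is assumed.
one_hot_torch = t.one_hot_tokens(token_ids, op='torch')
```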
📚 Data Sources
All token maps and token counts are generated from the SlimPajama dataset by Cerebras.
📁 Vocab Files
- ordered_tokenizer_1b_val_test_data.json — Ordered tokens (1B data)
- unordered_tokenizer_1b_val_test_data.json — Unordered tokens (1B)
- count_tokenizer_1b_val_test_data.json — Token counts (1B)
- (Similar structure for the 0.5B val-only version)
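The loaded vocabulary can also be inspected from Python via the accessor methods listed in the changelog below. The sketch assumes token_map() and token_count_map() return dict-like maps and max_token_length() returns an int; treat these return types as assumptions.

```python
from tokeniser import Tokeniser

t = Tokeniser()

vocab = t.token_map()           # assumed: token -> ID mapping
counts = t.token_count_map()    # assumed: token -> corpus count mapping

print(len(vocab))               # expected vocabulary size: 131,072
print(t.max_token_length())     # length of the longest vocabulary token
```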
📌 Design Philosophy
This tokeniser was built from scratch, before studying existing algorithms such as BPE. It is designed with the intent to understand, innovate, and compare with existing solutions from first principles.
Some parts may overlap with BPE/WordPiece in spirit — but the core algorithm was independently designed.
🤝 Contributions
Contributions of all kinds are welcome via GitHub.
📖 License
MIT License
📄 CHANGELOG
📦 Changelog
[0.1.0] - 2025-03-22
Added
- Initial release of custom tokeniser library
- Tokeniser class with support for:
- tokenise() using DP segmentation
- Custom token map and count map loading
- One-hot encoding support (NumPy & PyTorch)
- Token and token ID visualisation functions
- token_map(), token_count_map(), max_token_length() accessors
- Full support for:
- 0.5B val-only vocab
- 1B val + test vocab
- JSON-based token and count maps from SlimPajama corpus
[0.1.1] - 2025-03-22
Changed
- Updated the default import example in the README to show class-instance creation with default parameters.
[0.1.2] - 2025-03-22
Changed
- Updated the project URL to point to the actual GitHub repository.
[0.1.3] - 2025-04-04
Changed
- Raise a ValueError instead of a generic Exception, for a clearer error message.
- Corrected the licence year to 2025.
[0.1.4] - 2025-04-04
Changed
- Updated README.md
Notes
- Built on top of a custom token creation algorithm not based on any standard BPE/WordPiece method
- SlimPajama dataset used for vocab extraction
- Token count files are optimized to stay under 2GB for compatibility with Git LFS (and Hugging Face storage)
Download files
File details
Details for the file tokeniser-py-0.1.4.tar.gz.
File metadata
- Download URL: tokeniser-py-0.1.4.tar.gz
- Upload date:
- Size: 4.9 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.10.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0df2262b897bb505b477fa8bf853502012b1d09897345189747621c11ea135d5 |
| MD5 | 9b8ed7f8f0dfa946c3f46ba00e9244a6 |
| BLAKE2b-256 | bb847555fdfb947fad34875c68fc3e1dc0b34a581d31b478ab9eb40c6e8e6c83 |
File details
Details for the file tokeniser_py-0.1.4-py3-none-any.whl.
File metadata
- Download URL: tokeniser_py-0.1.4-py3-none-any.whl
- Upload date:
- Size: 4.9 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.10.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ba38e8db4d7bd9763456cca4960cda41095a8e28ad90c2b78c4f70eda5c24ea3 |
| MD5 | a5a24c48a4776c794fc1b179cca31c7d |
| BLAKE2b-256 | 4ccdf321f52ada3e4c41541d05a84b51a83c71814170f00a8a23cabea2f6bfdc |