A custom tokeniser with a 131,072-token vocabulary derived from 0.5B (val) and 1B (val+test) tokens of SlimPajama. This is the lite version of the tokeniser-py library. It uses a novel token generation algorithm and a dynamic programming-based segmentation method for fast, interpretable tokenisation, and it can also tokenise against custom token maps.

📄 README.md

🧠 Custom Tokeniser Library

A high-performance, fully custom tokeniser built from scratch — no BPE, no existing NLP tokenisation scheme. This tokeniser is based on a unique algorithm developed independently and trained on over 1 billion tokens from the SlimPajama dataset (Val + Test), providing an efficient, interpretable, and extendable tokenisation pipeline.

🚀 What This Library Offers

  • Tokeniser built on a vocabulary of 131,072 tokens
  • Two versions of vocab:
    • 0.5B: Validation-only data
    • 1B: Validation + Test data
  • Token vocab built via a custom algorithm — no Byte Pair Encoding (BPE)
  • Tokenisation logic includes:
    • Token lookup from pre-generated token map
    • Dynamic programming-based segmentation for out-of-vocab tokens
    • One-hot encoding (NumPy or PyTorch)
    • Visualisation utilities for tokens and token IDs
  • Lightweight JSON format for token maps & token count maps
  • Ready for integration into any LLM pre-tokenisation pipeline
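
To make the segmentation bullet concrete, here is a minimal sketch of one common form of dynamic programming-based segmentation: pick the split that uses the fewest vocabulary pieces. It is a generic illustration, not the library's actual algorithm. The function name `segment`, the toy `vocab`, the fewest-pieces objective, and the single-character fallback are all assumptions made for this example.

```python
from typing import Dict, List


def segment(word: str, vocab: Dict[str, int], max_len: int) -> List[str]:
    """Split `word` into the fewest pieces found in `vocab` (hypothetical helper).

    Single characters are always accepted as a fallback, so a segmentation
    exists even when a character is missing from `vocab`.
    """
    n = len(word)
    best = [float("inf")] * (n + 1)  # best[i]: fewest pieces covering word[:i]
    back = [0] * (n + 1)             # back[i]: start of the last piece ending at i
    best[0] = 0
    for end in range(1, n + 1):
        # Only look back as far as the longest token in the vocabulary.
        for start in range(max(0, end - max_len), end):
            piece = word[start:end]
            if piece in vocab or end - start == 1:
                if best[start] + 1 < best[end]:
                    best[end] = best[start] + 1
                    back[end] = start
    # Walk the back-pointers from the end of the word to recover the pieces.
    pieces = []
    i = n
    while i > 0:
        pieces.append(word[back[i]:i])
        i = back[i]
    return pieces[::-1]


# Toy vocabulary (assumed for illustration only).
vocab = {"token": 0, "is": 1, "er": 2, "tok": 3, "en": 4, "iser": 5}
pieces = segment("tokeniser", vocab, max_len=5)
print(pieces)                      # ['token', 'iser']
print([vocab[p] for p in pieces])  # [0, 5]
```

Bounding the inner loop by the longest token length keeps the work per word at roughly `len(word) * max_len` substring checks, which is presumably why a maximum token length is useful to expose alongside the token map.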

Note: The vocab files are hosted on Hugging Face in chunks smaller than 2 GB, rather than on GitHub, because of GitHub's LFS file size constraints. Copies chunked below 100 MB are also available on GitHub.

📦 Installation

pip install tokeniser-py-lite

🛠 Usage

from tokeniser import Tokeniser

t = Tokeniser()
tokens, count = t.tokenise("Your input text here.")
token_ids = t.token_ids(tokens)

Use t.one_hot_tokens(token_ids) for NumPy-based one-hot encoding, or pass op='torch' to get PyTorch output instead.
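
For readers unfamiliar with the term, the sketch below shows what one-hot encoding of token IDs produces, using plain Python lists so it runs without NumPy or PyTorch. It is not the library's `one_hot_tokens` implementation; the helper names `one_hot` and `dense_one_hot_bytes`, and the 4-byte value size in the last call, are assumptions for illustration.

```python
from typing import List


def one_hot(token_ids: List[int], vocab_size: int) -> List[List[int]]:
    """Return one row per token ID, with a single 1 at that ID's index."""
    rows = []
    for tid in token_ids:
        if not 0 <= tid < vocab_size:
            raise ValueError(f"token ID {tid} is outside a vocabulary of size {vocab_size}")
        row = [0] * vocab_size
        row[tid] = 1
        rows.append(row)
    return rows


def dense_one_hot_bytes(num_tokens: int, vocab_size: int, bytes_per_value: int) -> int:
    """Memory needed to store a dense one-hot matrix."""
    return num_tokens * vocab_size * bytes_per_value


print(one_hot([2, 0], vocab_size=4))          # [[0, 0, 1, 0], [1, 0, 0, 0]]
# 1,000 tokens over a 131,072-token vocabulary, assuming 4 bytes per value:
print(dense_one_hot_bytes(1000, 131_072, 4))  # 524288000
```

With a 131,072-token vocabulary, a dense one-hot matrix grows quickly (about 0.5 GB per 1,000 tokens at 4 bytes per value), so it may be worth encoding text in small batches.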

📚 Data Sources

All token maps and token counts are generated from the SlimPajama dataset by Cerebras.

📁 Vocab Files

  • ordered_tokenizer_1b_val_test_data.json — Ordered tokens (1B data)
  • unordered_tokenizer_1b_val_test_data.json — Unordered tokens (1B)
  • count_tokenizer_1b_val_test_data.json — Token counts (1B)
  • (Similar structure for 0.5B val-only version)
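
The schema of these JSON files is not documented here, so the sketch below only shows how a token map could be loaded and inspected if it were a flat JSON object mapping token strings to integer IDs. That layout, the helper names `load_token_map` and `max_token_length`, and the demo file are all assumptions for illustration; check the real files before relying on any of this.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict


def load_token_map(path: Path) -> Dict[str, int]:
    """Load a token map, assuming a flat {token: id} JSON object."""
    with path.open(encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object mapping tokens to IDs")
    return data


def max_token_length(token_map: Dict[str, int]) -> int:
    """Length of the longest token, or 0 for an empty map."""
    return max(map(len, token_map), default=0)


# Write a tiny hypothetical token map to a temporary file, then load it back.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_token_map.json"
    demo_path.write_text(json.dumps({"the": 0, " token": 1, "iser": 2}), encoding="utf-8")
    demo_map = load_token_map(demo_path)

print(len(demo_map), max_token_length(demo_map))  # 3 6
```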

📌 Design Philosophy

This tokeniser was built from scratch, before the author had studied existing algorithms such as BPE. It is designed to understand the problem from first principles, to innovate, and to compare against existing solutions.

Some parts may overlap with BPE/WordPiece in spirit, but the core algorithm was designed independently.

🤝 Contributions

Feel free to contribute anything via GitHub.

📖 License

MIT License

📄 CHANGELOG

📦 Changelog

[0.1.0] - 2025-03-22

Added

  • Initial release of custom tokeniser library
  • Lightweight version of the original Python library tokeniser-py
  • Tokeniser class with support for:
    • tokenise() using DP segmentation
    • Custom token map and count map loading
    • One-hot encoding support (NumPy & PyTorch)
    • Token and token ID visualisation functions
    • token_map(), token_count_map(), max_token_length() accessors
  • Full support for:
    • 0.5B val-only vocab
    • 1B val + test vocab
  • JSON-based token and count maps from SlimPajama corpus

Notes

  • Built on top of a custom token creation algorithm not based on any standard BPE/WordPiece method
  • SlimPajama dataset used for vocab extraction
  • Token count files are chunked to stay under 2 GB for compatibility with Git LFS (and Hugging Face storage)
