tokeniser-py
A custom tokeniser with a 131,072-token vocabulary derived from 0.5B (val) and 1B (val+test) tokens in SlimPajama. Uses a novel token generation algorithm and a dynamic programming-based segmentation method for fast, interpretable tokenisation, and can also tokenise against custom token maps.
tokeniser-py-lite
A custom tokeniser with a 131,072-token vocabulary derived from 0.5B (val) and 1B (val+test) tokens in SlimPajama. Lite version of the tokeniser-py library. Uses a novel token generation algorithm and a dynamic programming-based segmentation method for fast, interpretable tokenisation, and can also tokenise against custom token maps.
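As a rough illustration of what dynamic programming-based segmentation over a custom token map can look like, here is a minimal sketch. The function name, parameters, and toy vocabulary are hypothetical and not the tokeniser-py API; it simply finds a segmentation of the input into the fewest vocabulary tokens, which is one common DP formulation.

```python
# Hypothetical sketch of DP segmentation over a custom token map.
# Not the tokeniser-py API; names and parameters are illustrative.

def dp_segment(text, vocab, max_token_len=16):
    """Split `text` into the fewest vocabulary tokens via dynamic programming."""
    n = len(text)
    INF = float("inf")
    best = [INF] * (n + 1)   # best[i] = min token count for text[:i]
    back = [0] * (n + 1)     # back[i] = start index of the last token in text[:i]
    best[0] = 0
    for i in range(1, n + 1):
        # Only candidate tokens up to max_token_len need to be checked.
        for j in range(max(0, i - max_token_len), i):
            if text[j:i] in vocab and best[j] + 1 < best[i]:
                best[i] = best[j] + 1
                back[i] = j
    if best[n] == INF:
        return None  # text cannot be segmented with this vocabulary
    tokens, i = [], n
    while i > 0:  # walk the backpointers to recover the segmentation
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1]

vocab = {"token", "iser", "tok", "en", "is", "er", "t"}
print(dp_segment("tokeniser", vocab))  # ['token', 'iser']
```

Because the table is filled left to right and each cell only looks back at most `max_token_len` characters, the pass is linear in the input length for a bounded token size, which is what makes this style of segmentation fast and easy to inspect.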