🚕 syntaxi
Make your tokenizer more syntax-friendly.
Syntaxi encodes capitalized words using a special shift token, which makes words effectively case-invariant: "Dog" and "dog" become the same word. Without Syntaxi, your language model has to learn both forms as if they were unrelated words.
Let your language model learn to think in terms of shift tokens, rather than learning words twice.
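To make the idea concrete, here is a minimal sketch of shift-style encoding on plain strings. It is not Syntaxi's implementation: the helper names `shift_encode` and `shift_decode` are hypothetical, the marker string is borrowed from the example output further down, and the sketch uses the standard-library `re` with ASCII letters only, whereas Syntaxi relies on `regex` for Unicode properties.

```python
import re

# Assumption: marker string chosen to mirror the '[SHIFT]' token in the example output.
SHIFT = "[SHIFT]"


def shift_encode(text: str) -> str:
    """Lowercase every word-initial ASCII capital and prefix it with SHIFT."""
    return re.sub(r"\b([A-Z])", lambda m: SHIFT + m.group(1).lower(), text)


def shift_decode(text: str) -> str:
    """Undo shift_encode: uppercase the letter that follows each SHIFT marker."""
    pattern = re.escape(SHIFT) + r"([a-z])"
    return re.sub(pattern, lambda m: m.group(1).upper(), text)


sentence = "My dog is a Dog, and my Dog is a dog."
encoded = shift_encode(sentence)
print(encoded)
# [SHIFT]my dog is a [SHIFT]dog, and my [SHIFT]dog is a dog.
print(shift_decode(encoded) == sentence)
# True
```

After encoding, every occurrence of "dog" is spelled the same way, so a tokenizer only needs one vocabulary entry for it plus the shift token. Note that this sketch would mis-decode input that already contains the literal marker string.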
Getting started
Requirements
Python 3.11+, it's 2024.
Syntaxi's only dependencies are regex, for Unicode property escapes, and HuggingFace's tokenizers, for convenience.
Installation
pip install syntaxi
Example
Load an existing, pre-trained HuggingFace tokenizer to be patched by Syntaxi.
Create directly using Tokenizer.from_pretrained
import syntaxi
tokenizer = syntaxi.huggingface_tokenizer("nilq/baby-tokenizer")
encoded = tokenizer.encode("My dog is a Dog, and my Dog is a dog.")
encoded.tokens
# ['[SHIFT]', '▁my', '▁dog', '▁is', '▁a', '▁', '[SHIFT]', '▁dog,', '▁and', '▁my', '▁', '[SHIFT]', '▁dog', '▁is', '▁a', '▁dog.']
tokenizer.decode(encoded.ids)
# "My dog is a Dog, and my Dog is a dog."
Manually patch tokenizer
import syntaxi
from tokenizers import Tokenizer
tokenizer: Tokenizer = ...
# The original `tokenizer` is left unchanged; the patched tokenizer is returned.
syntaxi_tokenizer = syntaxi.patched_tokenizer(tokenizer)