🚕 syntaxi

Make your tokenizer more syntax-friendly.

Syntaxi encodes capitalized words with a special shift token, making words effectively case-invariant: "Dog" and "dog" become the same token, with a [SHIFT] token marking the capitalized form. Without Syntaxi, your language model has to learn the two as if they were unrelated words.

Let your language model learn to think in terms of shift tokens, rather than learning words twice.
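
To make the idea concrete, here is a minimal sketch of shift-token encoding on plain strings. It is not Syntaxi's implementation: the helper names shift_encode and shift_decode are hypothetical, it uses the standard-library re module rather than regex, and it assumes the input text never contains the literal marker. Only the [SHIFT] marker name is taken from the example output further down.

```python
import re

# Marker name borrowed from the README's example output; the rest is illustrative.
SHIFT = "[SHIFT]"

# Runs of letters only: [^\W\d_] matches any word character except digits and "_".
_WORD = re.compile(r"[^\W\d_]+")


def shift_encode(text: str) -> str:
    """Lowercase the first letter of each capitalized word and prefix it with SHIFT."""

    def replace(match: re.Match) -> str:
        word = match.group(0)
        if word[0].isupper():
            return SHIFT + word[0].lower() + word[1:]
        return word

    return _WORD.sub(replace, text)


def shift_decode(text: str) -> str:
    """Undo shift_encode: drop each SHIFT marker and uppercase the character after it."""
    pattern = re.escape(SHIFT) + r"(.)"
    return re.sub(pattern, lambda m: m.group(1).upper(), text, flags=re.DOTALL)


sentence = "My dog is a Dog, and my Dog is a dog."
encoded = shift_encode(sentence)
print(encoded)
print(shift_decode(encoded) == sentence)
```

After encoding, every occurrence of "dog" has the same surface form, so a tokenizer trained on such text needs only one vocabulary entry for it. The round trip is exact for ASCII text; letters whose lowercase form spans several code points would need extra care.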

Getting started

Requirements

Python 3.11+, it's 2024.

Syntaxi depends only on regex, for Unicode property escapes, and uses HuggingFace's tokenizers for convenience.

Installation

pip install syntaxi

Example

Load an existing, pre-trained HuggingFace tokenizer to be patched by Syntaxi.

Create directly using Tokenizer.from_pretrained

import syntaxi

tokenizer = syntaxi.huggingface_tokenizer("nilq/baby-tokenizer")
encoded = tokenizer.encode("My dog is a Dog, and my Dog is a dog.")

encoded.tokens
# ['[SHIFT]', '▁my', '▁dog', '▁is', '▁a', '▁', '[SHIFT]', '▁dog,', '▁and', '▁my', '▁', '[SHIFT]', '▁dog', '▁is', '▁a', '▁dog.']

tokenizer.decode(encoded.ids)
# "My dog is a Dog, and my Dog is a dog."

Manually patch tokenizer

import syntaxi
from tokenizers import Tokenizer

tokenizer: Tokenizer = ...

# Original `tokenizer` stays the same.
syntaxi_tokenizer = syntaxi.patched_tokenizer(tokenizer)
