Make your tokenizer more syntax-friendly.

🚕 syntaxi

Syntaxi encodes capitalized words using a special shift token, which makes words effectively case-invariant: "Dog" and "dog" map to the same word token, with the capitalized form preceded by a shift token. Without Syntaxi, your language model has to learn the two forms as if they were unrelated words.

Let your language model learn to think in terms of shift tokens, rather than learning words twice.
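To make the idea concrete, here is a minimal text-level sketch of shift encoding. It is not Syntaxi's implementation: the names `shift_encode` and `shift_decode` are hypothetical, the marker string is taken from the example output further down, and the pattern only handles ASCII letters (Syntaxi itself relies on regex for Unicode property escapes). Only words with a single leading capital are rewritten, so all-caps and mixed-case words pass through unchanged.

```python
import re

# Marker string as it appears in the example output; treated here as plain text.
SHIFT = "[SHIFT]"

# A word made of one uppercase ASCII letter followed only by lowercase letters.
_CAPITALIZED = re.compile(r"\b([A-Z])([a-z]*)\b")

# The marker immediately followed by a lowercase ASCII letter.
_SHIFTED = re.compile(re.escape(SHIFT) + r"([a-z])")


def shift_encode(text: str) -> str:
    """Lowercase each capitalized word and put the shift marker in front of it."""
    return _CAPITALIZED.sub(
        lambda m: SHIFT + m.group(1).lower() + m.group(2), text
    )


def shift_decode(text: str) -> str:
    """Drop each shift marker and capitalize the letter that follows it."""
    return _SHIFTED.sub(lambda m: m.group(1).upper(), text)


sentence = "My dog is a Dog, and my Dog is a dog."
encoded = shift_encode(sentence)
print(encoded)
# [SHIFT]my dog is a [SHIFT]dog, and my [SHIFT]dog is a dog.
print(shift_decode(encoded) == sentence)
# True
```

After this transformation every occurrence of the word is spelled "dog", so a tokenizer sees one surface form plus an occasional shift marker.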

Getting started

Requirements

Python 3.11+, it's 2024.

Syntaxi depends only on regex (for Unicode property escapes) and uses Hugging Face's tokenizers library for convenience.

Installation

pip install syntaxi

Example

Load an existing, pre-trained HuggingFace tokenizer to be patched by Syntaxi.

Create directly using Tokenizer.from_pretrained

import syntaxi

tokenizer = syntaxi.huggingface_tokenizer("nilq/baby-tokenizer")
encoded = tokenizer.encode("My dog is a Dog, and my Dog is a dog.")

encoded.tokens
# ['[SHIFT]', '▁my', '▁dog', '▁is', '▁a', '▁', '[SHIFT]', '▁dog,', '▁and', '▁my', '▁', '[SHIFT]', '▁dog', '▁is', '▁a', '▁dog.']

tokenizer.decode(encoded.ids)
# "My dog is a Dog, and my Dog is a dog."
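One way to read the token list above is sketched below. It rebuilds the original sentence by hand from the printed tokens. The rules are assumptions inferred from this single example, not a description of how Syntaxi decodes: `[SHIFT]` capitalizes the next token, the token after a shift loses its leading `▁`, and every other `▁` stands for a space. The function name `reconstruct` is hypothetical.

```python
SHIFT = "[SHIFT]"
SPACE = "▁"  # word-boundary marker seen in the token list


def reconstruct(tokens: list[str]) -> str:
    """Rebuild text from shift-encoded tokens under the assumptions stated above."""
    pieces = []
    capitalize_next = False
    for token in tokens:
        if token == SHIFT:
            # Remember to capitalize whatever token comes next.
            capitalize_next = True
            continue
        if capitalize_next:
            # Assumption: the token after a shift carries no space of its own.
            word = token.removeprefix(SPACE)
            pieces.append(word[:1].upper() + word[1:])
            capitalize_next = False
        else:
            pieces.append(token.replace(SPACE, " "))
    # Drop the space contributed by a leading word-boundary marker, if any.
    return "".join(pieces).lstrip(" ")


tokens = ['[SHIFT]', '▁my', '▁dog', '▁is', '▁a', '▁', '[SHIFT]', '▁dog,',
          '▁and', '▁my', '▁', '[SHIFT]', '▁dog', '▁is', '▁a', '▁dog.']
print(reconstruct(tokens))
# My dog is a Dog, and my Dog is a dog.
```

In practice `tokenizer.decode` does this for you; the sketch is only meant to show why the shift token loses no information.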

Manually patch tokenizer

import syntaxi
from tokenizers import Tokenizer

tokenizer: Tokenizer = ...

# Original `tokenizer` stays the same.
syntaxi_tokenizer = syntaxi.patched_tokenizer(tokenizer)
