🚕 syntaxi

Make your tokenizer more syntax-friendly.

Syntaxi encodes capitalized words using a special shift token, making words effectively case-invariant: "Dog" and "dog" become the same word. Without Syntaxi, your language model has to learn them as two unrelated words.

Let your language model learn to think in terms of shift tokens, rather than learning words twice.
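To make the idea concrete, here is a minimal sketch of shift-style encoding on plain strings. It is not Syntaxi's actual implementation: the helper names `shift_encode` and `shift_decode` are hypothetical, the marker string simply mirrors the `[SHIFT]` token shown in the example output, and the sketch uses the standard-library `re` module and only shifts the first letter of each word.

```python
import re

# Hypothetical marker, mirroring the '[SHIFT]' token in the example output.
SHIFT = "[SHIFT]"

# Runs of letters (Unicode-aware: word characters minus digits and underscore).
_WORD = re.compile(r"[^\W\d_]+")


def shift_encode(text: str) -> str:
    """Prefix each word that starts with a capital by SHIFT and lowercase that letter."""

    def repl(match: re.Match) -> str:
        word = match.group(0)
        first, rest = word[0], word[1:]
        lowered = first.lower()
        # Only shift when lowercasing is a clean, reversible one-character change.
        if first != lowered and len(lowered) == 1 and lowered.upper() == first:
            return SHIFT + lowered + rest
        return word

    return _WORD.sub(repl, text)


def shift_decode(text: str) -> str:
    """Invert shift_encode: uppercase the character that follows each SHIFT marker."""
    return re.sub(re.escape(SHIFT) + r"(.)", lambda m: m.group(1).upper(), text)


sentence = "My dog is a Dog, and my Dog is a dog."
encoded = shift_encode(sentence)
print(encoded)
print(shift_decode(encoded) == sentence)
```

After encoding, every occurrence of the word is spelled "dog", so a tokenizer trained on the encoded text only needs one vocabulary entry for it, plus the shared shift marker. The sketch assumes the input does not already contain the literal marker string.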

Getting started

Requirements

Python 3.11+, it's 2024.

Syntaxi depends only on regex (for Unicode property escapes) and uses HuggingFace's tokenizers for convenience.

Installation

pip install syntaxi

Example

Load an existing, pre-trained HuggingFace tokenizer to be patched by Syntaxi.

Create directly using Tokenizer.from_pretrained

import syntaxi

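# Load a pre-trained HuggingFace tokenizer, patched by Syntaxi.
tokenizer = syntaxi.huggingface_tokenizer("nilq/baby-tokenizer")
# Capitalized words are encoded as '[SHIFT]' followed by their lowercase tokens.
encoded = tokenizer.encode("My dog is a Dog, and my Dog is a dog.")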

encoded.tokens
# ['[SHIFT]', '▁my', '▁dog', '▁is', '▁a', '▁', '[SHIFT]', '▁dog,', '▁and', '▁my', '▁', '[SHIFT]', '▁dog', '▁is', '▁a', '▁dog.']

tokenizer.decode(encoded.ids)
# "My dog is a Dog, and my Dog is a dog."

Manually patch tokenizer

import syntaxi
from tokenizers import Tokenizer

tokenizer: Tokenizer = ...

# The original `tokenizer` is left unchanged; `syntaxi_tokenizer` is the patched one.
syntaxi_tokenizer = syntaxi.patched_tokenizer(tokenizer)
