🚕 syntaxi

Make your tokenizer more syntax-friendly.

Syntaxi encodes capitalized words with a special shift token, making your vocabulary effectively case-invariant: "Dog" is encoded as a shift token followed by "dog", so both forms share the same word token. Without Syntaxi, your language model has to learn "Dog" and "dog" as if they were unrelated words.

Let your language model learn to think in terms of shift tokens, rather than learning words twice.
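
To make the idea concrete, here is a minimal text-level sketch of shift encoding. It is not Syntaxi's implementation: the helper names `shift_encode` and `shift_decode` are hypothetical, the marker string simply mirrors the `[SHIFT]` token shown in the example output, and the pattern only handles ASCII words with a single leading capital (all-caps and mixed-case words are left untouched).

```python
import re

# Hypothetical marker, mirroring the '[SHIFT]' token in the example output.
SHIFT = "[SHIFT]"

# A whole word made of one ASCII capital followed by lowercase letters only.
_CAPITALIZED = re.compile(r"\b([A-Z])([a-z]*)\b")

# The marker directly followed by a lowercase ASCII letter.
_SHIFTED = re.compile(re.escape(SHIFT) + r"([a-z])")


def shift_encode(text: str) -> str:
    """Lowercase each capitalized word and prefix it with the shift marker."""
    return _CAPITALIZED.sub(
        lambda m: SHIFT + m.group(1).lower() + m.group(2), text
    )


def shift_decode(text: str) -> str:
    """Undo shift_encode: drop each marker and capitalize the letter after it."""
    return _SHIFTED.sub(lambda m: m.group(1).upper(), text)


sentence = "My dog is a Dog, and my Dog is a dog."
encoded = shift_encode(sentence)
decoded = shift_decode(encoded)
```

After encoding, every occurrence of the word is spelled "dog", so a tokenizer trained on the encoded text only needs one vocabulary entry for it, and decoding restores the original capitalization.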

Getting started

Requirements

Python 3.11+, it's 2024.

Syntaxi only depends on regex for Unicode property escapes, and uses HuggingFace's tokenizers for convenience.
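
For context on why Unicode properties matter here: Python's built-in `re` module does not support property escapes such as `\p{Lu}` (uppercase letter), which is what the `regex` package adds. The sketch below illustrates the same property using only the standard library's `unicodedata` module. The helper names are hypothetical and this is not how Syntaxi detects capitals; it only shows that "uppercase" extends beyond A-Z.

```python
import unicodedata


def is_uppercase_letter(ch: str) -> bool:
    # General category "Lu" is the set of characters a \p{Lu} escape matches.
    return unicodedata.category(ch) == "Lu"


def capitalized_words(text: str) -> list[str]:
    """Return the whitespace-separated words whose first character is in Lu."""
    return [word for word in text.split() if is_uppercase_letter(word[0])]


words = capitalized_words("Élan dog Ünder Zebra 9am")
```

An ASCII-only character class like `[A-Z]` would miss "Élan" and "Ünder" in this input.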

Installation

pip install syntaxi

Example

Load an existing, pre-trained HuggingFace tokenizer to be patched by Syntaxi.

Create directly using Tokenizer.from_pretrained

import syntaxi

tokenizer = syntaxi.huggingface_tokenizer("nilq/baby-tokenizer")
encoded = tokenizer.encode("My dog is a Dog, and my Dog is a dog.")

encoded.tokens
# ['[SHIFT]', '▁my', '▁dog', '▁is', '▁a', '▁', '[SHIFT]', '▁dog,', '▁and', '▁my', '▁', '[SHIFT]', '▁dog', '▁is', '▁a', '▁dog.']

tokenizer.decode(encoded.ids)
# "My dog is a Dog, and my Dog is a dog."

Manually patch tokenizer

import syntaxi
from tokenizers import Tokenizer

tokenizer: Tokenizer = ...

# Original `tokenizer` stays the same.
syntaxi_tokenizer = syntaxi.patched_tokenizer(tokenizer)
