Make your tokenizer more syntax-friendly.

🚕 syntaxi

Syntaxi encodes capitalized words using a special shift token, making words effectively case-invariant: "Dog" becomes a shift token followed by "dog", so both spellings share the same word token. Without Syntaxi, your language model has to learn "Dog" and "dog" as two unrelated tokens.

Let your language model learn to think in terms of shift tokens, rather than learning words twice.
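The shift-token idea can be illustrated with a small, standalone sketch. This is not Syntaxi's implementation: the function names `shift_encode` and `shift_decode` are hypothetical, the marker is handled as plain text rather than as a vocabulary token, and only words with a single leading capital are shifted (all-caps words are left alone). It assumes the input does not already contain the literal marker string.

```python
import re

SHIFT = "[SHIFT]"  # marker string, mirroring the token shown in the example output

# A "word" is a run of letters (Unicode-aware: word characters minus digits and underscore).
_WORD = re.compile(r"[^\W\d_]+")


def shift_encode(text: str) -> str:
    """Lowercase each capitalized word and prefix it with the shift marker."""

    def repl(match: re.Match) -> str:
        word = match.group(0)
        first, rest = word[0], word[1:]
        lowered = first.lower()
        # Only shift when the change can be undone exactly by upper().
        reversible = len(lowered) == 1 and lowered.upper() == first
        if first.isupper() and rest == rest.lower() and reversible:
            return SHIFT + lowered + rest
        return word

    return _WORD.sub(repl, text)


def shift_decode(text: str) -> str:
    """Undo shift_encode: uppercase the character that follows each marker."""
    pattern = re.escape(SHIFT) + r"(.)"
    return re.sub(pattern, lambda m: m.group(1).upper(), text, flags=re.DOTALL)


encoded = shift_encode("My dog is a Dog.")
decoded = shift_decode(encoded)
```

After encoding, "Dog" and "dog" differ only by the marker in front of the word, which is the property a downstream tokenizer can exploit.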

Getting started

Requirements

Python 3.11+, it's 2024.

Syntaxi depends only on regex (for Unicode property escapes) and uses HuggingFace's tokenizers for convenience.
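Unicode property escapes matter because an ASCII class like `[A-Z]` misses capital letters in other scripts. As a rough illustration using only the standard library, the sketch below checks the Unicode general category `Lu` (uppercase letter), which is the category an escape such as `\p{Lu}` refers to. Which properties Syntaxi actually matches is an assumption here, and `is_uppercase_letter` is a hypothetical helper.

```python
import unicodedata


def is_uppercase_letter(ch: str) -> bool:
    """True if `ch` has Unicode general category Lu (uppercase letter)."""
    return unicodedata.category(ch) == "Lu"


samples = [
    "D",       # Latin capital D
    "d",       # Latin small d
    "\u00c9",  # Latin capital E with acute
    "\u03a9",  # Greek capital omega
    "\u0414",  # Cyrillic capital de
    "1",       # digit
]
flags = {ch: is_uppercase_letter(ch) for ch in samples}
```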

Installation

pip install syntaxi

Example

Load an existing, pre-trained HuggingFace tokenizer to be patched by Syntaxi.

Create directly using Tokenizer.from_pretrained

import syntaxi

tokenizer = syntaxi.huggingface_tokenizer("nilq/baby-tokenizer")
encoded = tokenizer.encode("My dog is a Dog, and my Dog is a dog.")

encoded.tokens
# ['[SHIFT]', '▁my', '▁dog', '▁is', '▁a', '▁', '[SHIFT]', '▁dog,', '▁and', '▁my', '▁', '[SHIFT]', '▁dog', '▁is', '▁a', '▁dog.']

tokenizer.decode(encoded.ids)
# "My dog is a Dog, and my Dog is a dog."

Manually patch tokenizer

import syntaxi
from tokenizers import Tokenizer

tokenizer: Tokenizer = ...

# Original `tokenizer` stays the same.
syntaxi_tokenizer = syntaxi.patched_tokenizer(tokenizer)
