Make your tokenizer more syntax-friendly.

🚕 syntaxi

Syntaxi encodes capitalized words using a special shift token, which makes words effectively case-invariant: "Dog" and "dog" are treated as the same word. Without Syntaxi, your language model needs to learn the two forms as if they were unrelated words.

Let your language model learn to think in terms of shift tokens, rather than learning words twice.
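To make the idea concrete, here is a minimal sketch of shift encoding as a plain string transformation. It is not Syntaxi's implementation: the function names `shift_encode` and `shift_decode` are hypothetical, the marker string simply mirrors the `[SHIFT]` token shown in this README, and the sketch only handles ASCII words whose first letter alone is uppercase.

```python
import re

SHIFT = "[SHIFT]"  # hypothetical marker string, mirroring the token shown in this README

# A word made of one uppercase ASCII letter followed only by lowercase letters.
_CAPITALIZED = re.compile(r"\b([A-Z])([a-z]*)\b")
# The marker followed by the lowercase letter it applies to.
_SHIFTED = re.compile(re.escape(SHIFT) + r"([a-z])")


def shift_encode(text: str) -> str:
    """Lowercase each capitalized word and put the shift marker in front of it."""
    return _CAPITALIZED.sub(lambda m: SHIFT + m.group(1).lower() + m.group(2), text)


def shift_decode(text: str) -> str:
    """Remove each shift marker and uppercase the letter that follows it."""
    return _SHIFTED.sub(lambda m: m.group(1).upper(), text)


sentence = "My dog is a Dog, and my Dog is a dog."
encoded = shift_encode(sentence)
print(encoded)
# [SHIFT]my dog is a [SHIFT]dog, and my [SHIFT]dog is a dog.
print(shift_decode(encoded) == sentence)
# True
```

After this transformation every occurrence of the word is spelled "dog", so a downstream tokenizer only has to learn it once, plus the single shift marker. Words such as "USA" or "iPhone" are left untouched by this sketch.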

Getting started

Load an existing, pre-trained HuggingFace tokenizer to be patched by Syntaxi.

Create directly using Tokenizer.from_pretrained

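import syntaxi

# Load the pre-trained tokenizer and patch it with Syntaxi in one step.
tokenizer = syntaxi.huggingface_tokenizer("nilq/baby-tokenizer")

# Capitalized words are emitted as a [SHIFT] token followed by the lowercase word.
encoded = tokenizer.encode("My dog is a Dog, and my Dog is a dog.")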

encoded.tokens
# ['[SHIFT]', '▁my', '▁dog', '▁is', '▁a', '▁', '[SHIFT]', '▁dog,', '▁and', '▁my', '▁', '[SHIFT]', '▁dog', '▁is', '▁a', '▁dog.']

tokenizer.decode(encoded.ids)
# "My dog is a Dog, and my Dog is a dog."

Manually patch tokenizer

import syntaxi

tokenizer = ...

# Original `tokenizer` stays the same.
syntaxi_tokenizer = syntaxi.patched_tokenizer(tokenizer)
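
The comment above says that the original tokenizer is left unchanged. One way such a non-mutating patch could be built is by composition, sketched below with a toy tokenizer. Everything here is an assumption made for illustration: `WhitespaceTokenizer` and `ShiftPatched` are hypothetical names, and nothing in this sketch reflects how `syntaxi.patched_tokenizer` actually works internally.

```python
from typing import List

SHIFT = "[SHIFT]"  # hypothetical marker, mirroring the token shown in this README


class WhitespaceTokenizer:
    """Toy stand-in for a real tokenizer: splits on whitespace, joins with spaces."""

    def encode(self, text: str) -> List[str]:
        return text.split()

    def decode(self, tokens: List[str]) -> str:
        return " ".join(tokens)


class ShiftPatched:
    """Hypothetical wrapper that adds shift handling around another tokenizer."""

    def __init__(self, inner: WhitespaceTokenizer) -> None:
        self._inner = inner  # held by reference and never modified

    def encode(self, text: str) -> List[str]:
        tokens: List[str] = []
        for word in self._inner.encode(text):
            # Shift only words whose first character is the sole uppercase one.
            if word[:1].isupper() and word[1:] == word[1:].lower():
                tokens.append(SHIFT)
                word = word.lower()
            tokens.append(word)
        return tokens

    def decode(self, tokens: List[str]) -> str:
        words: List[str] = []
        shift_next = False
        for token in tokens:
            if token == SHIFT:
                shift_next = True  # the next token gets its first letter uppercased
                continue
            words.append(token[:1].upper() + token[1:] if shift_next else token)
            shift_next = False
        return self._inner.decode(words)


base = WhitespaceTokenizer()
patched = ShiftPatched(base)

print(base.encode("My Dog"))
# ['My', 'Dog']
print(patched.encode("My Dog"))
# ['[SHIFT]', 'my', '[SHIFT]', 'dog']
print(patched.decode(patched.encode("My Dog")))
# My Dog
```

Because the wrapper only delegates to the inner object, `base` behaves exactly as it did before it was wrapped.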
