
A byte pair encoding tokenizer for chess portable game notation (PGN)

Project description

PGN Tokenizer

PGN Tokenizer Visualization

This is a Byte Pair Encoding (BPE) tokenizer for chess Portable Game Notation (PGN).
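To illustrate the core BPE idea, here is a dependency-free sketch (not the project's actual implementation) of a single merge step: count adjacent symbol pairs and merge the most frequent one, so that common sequences like e4 become single tokens.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent token pairs and return the most common one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Start from the individual characters of a few PGN moves.
tokens = list("e4e4e4Nf6")
pair = most_frequent_pair(tokens)   # ('e', '4') occurs three times
tokens = merge_pair(tokens, pair)
print(tokens)  # ['e4', 'e4', 'e4', 'N', 'f', '6']
```

A full BPE trainer simply repeats this step until it reaches a target vocabulary size.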

Installation

You can install it with your package manager of choice:

uv

uv add pgn-tokenizer

pip

pip install pgn-tokenizer

Usage

It exposes a simple interface with .encode() and .decode() methods and a .vocab_size property; you can also access the underlying PreTrainedTokenizerFast instance from the transformers library via the .tokenizer property.

from pgn_tokenizer import PGNTokenizer

# Initialize the tokenizer
tokenizer = PGNTokenizer()

# Tokenize a PGN string
tokens = tokenizer.encode("1.e4 Nf6 2.e5 Nd5 3.c4 Nb6")

# Decode the tokens back to a PGN string
decoded = tokenizer.decode(tokens)

# Get the vocabulary from the underlying tokenizer class
vocab = tokenizer.tokenizer.get_vocab()

Implementation

It uses the tokenizers library from Hugging Face to train the tokenizer, and the transformers library from Hugging Face to load the pretrained tokenizer model as a PreTrainedTokenizerFast for faster tokenization.
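As a minimal sketch of how such a tokenizer could be trained with the tokenizers library (the corpus, pre-tokenizer, special tokens, and vocabulary size below are illustrative assumptions, not the project's actual training setup):

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Build a BPE model; split on whitespace/punctuation before merging.
# (The project's actual pre-tokenization scheme is an assumption here.)
tokenizer = Tokenizer(models.BPE(unk_token="<|unk|>"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# Train toward a target vocabulary from an iterator of PGN strings.
games = ["1.e4 Nf6 2.e5 Nd5 3.c4 Nb6", "1.d4 d5 2.c4 e6 3.Nc3 Nf6"]
trainer = trainers.BpeTrainer(vocab_size=512, special_tokens=["<|unk|>"])
tokenizer.train_from_iterator(games, trainer)

print(tokenizer.encode("1.e4 Nf6").tokens)
```

The trained Tokenizer object can then be wrapped with transformers' PreTrainedTokenizerFast (via its tokenizer_object argument) to get the fast-tokenization interface the package exposes.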

Note: This is part of a work-in-progress project to investigate how language models might understand chess without an engine or any chess-specific knowledge.

Tokenizer Comparison

More traditional, language-focused BPE tokenizer implementations are poorly suited to PGN strings because they tend to split individual moves apart.

For example, 1.e4 Nf6 would likely be tokenized as 1, ., e, 4, N, f, 6 or 1, .e, 4, N, f, 6 (with the space attached to one of the adjacent tokens), depending on the tokenizer's vocabulary, whereas the specialized PGN tokenizer produces 1., e4, Nf6.

Visualization

Here is a visualization of the vocabulary of this specialized PGN tokenizer compared to the BPE tokenizer vocabularies of the cl100k_base (the vocabulary for the gpt-3.5-turbo and gpt-4 models' tokenizer) and the o200k_base (the vocabulary for the gpt-4o model's tokenizer):

PGN Tokenizer

PGN Tokenizer Visualization

Note: The tokenizer was trained on ~2.8 million chess games in PGN notation with a target vocabulary size of 4096.

GPT-3.5-turbo and GPT-4 Tokenizers

GPT-4 Tokenizer Visualization

GPT-4o Tokenizer

GPT-4o Tokenizer Visualization

Note: These visualizations were generated with a function adapted from an educational Jupyter Notebook in the tiktoken repository.

Acknowledgements

Project details


Download files


Source Distribution

pgn_tokenizer-0.1.5.tar.gz (1.1 MB)

Uploaded Source

Built Distribution


pgn_tokenizer-0.1.5-py3-none-any.whl (71.7 kB)

Uploaded Python 3

File details

Details for the file pgn_tokenizer-0.1.5.tar.gz.

File metadata

  • Download URL: pgn_tokenizer-0.1.5.tar.gz
  • Upload date:
  • Size: 1.1 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: uv/0.5.30

File hashes

Hashes for pgn_tokenizer-0.1.5.tar.gz
Algorithm Hash digest
SHA256 90c7211e6e15429300feb401a518e9c81e531101328b6e9a8ae334d1cccea8a9
MD5 4f4395b806d5281e6cf517413a7dffc4
BLAKE2b-256 fa144a7d4375d2fe93651ed7f2e3cd1524f2c64c6b90804755b2e9cd6922d7fc


File details

Details for the file pgn_tokenizer-0.1.5-py3-none-any.whl.

File metadata

File hashes

Hashes for pgn_tokenizer-0.1.5-py3-none-any.whl
Algorithm Hash digest
SHA256 48f2f1dfbd0996c1ed0af50e81ec003e93e1cbf7359b765226524c7ceb8be82b
MD5 9d3648f20dca131f933560fc1c1e30db
BLAKE2b-256 6706df10a8be66acc9742f5f4494df133533618656e07309fae162cf3c134b95

