QIG-native tokenizer with entropy-guided merging

QIG Tokenizer

Entropy-guided tokenizer for Quantum Information Geometry

Version: 0.1.0 | Status: Working

Overview

QIG-native tokenizer using entropy-guided merging. Token boundaries follow information geometry, not frequency.

Core Principle

Entropy-guided merging: Geometric similarity, not frequency heuristics
Geometric special tokens: BOS, EOS, PAD, UNK with basin coordinates
Redis/PostgreSQL storage: Production-ready persistence
Pure information geometry: No external tokenizer dependencies

Installation

pip install qig-tokenizer

With storage backends:

pip install "qig-tokenizer[storage]"   # Redis + PostgreSQL
pip install "qig-tokenizer[redis]"     # Redis only
pip install "qig-tokenizer[postgres]"  # PostgreSQL only

Quick Start

from qig_tokenizer import QIGTokenizer

# Create tokenizer with geometric special tokens
tokenizer = QIGTokenizer(target_vocab_size=50000, use_special_tokens=True)

# Train on corpus
with open("corpus.txt", "rb") as f:
    corpus_bytes = f.read()

tokenizer.train(corpus_bytes)

# Encode with special tokens
tokens = tokenizer.encode_with_special("Hello, world!")
# Returns: [256, ...tokens..., 257]  (BOS=256, EOS=257)

# Pad sequences
padded = tokenizer.pad_sequence(tokens, max_length=128)

# Save the trained vocabulary as JSON
tokenizer.save("20251220-tokenizer-vocab-0.01W.json")
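
To make the ID layout in the comments above concrete, here is a small standalone sketch that does not use the library. It assumes raw bytes map to IDs 0-255 (as in an untrained byte-level vocabulary) and reuses the special-token IDs listed in this README. `wrap_with_special` and `pad_to_length` are hypothetical helpers; the real `encode_with_special` and `pad_sequence` may behave differently, for example on sequences longer than `max_length`.

```python
# Illustrative stand-in only; not the qig_tokenizer implementation.
BOS, EOS, PAD = 256, 257, 258  # IDs as listed in this README


def wrap_with_special(text: str) -> list[int]:
    """Byte-level encoding (IDs 0-255) wrapped in BOS/EOS."""
    return [BOS, *text.encode("utf-8"), EOS]


def pad_to_length(tokens: list[int], max_length: int) -> list[int]:
    """Right-pad with PAD up to max_length; longer inputs are returned unchanged."""
    return tokens + [PAD] * max(0, max_length - len(tokens))


tokens = wrap_with_special("Hi")
padded = pad_to_length(tokens, max_length=6)
print(tokens)  # [256, 72, 105, 257]
print(padded)  # [256, 72, 105, 257, 258, 258]
```

Right-padding and leaving over-long inputs untouched are choices made for this sketch, not documented behavior of the package.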

With Redis/PostgreSQL Storage

from qig_tokenizer import QIGTokenizer
from qig_tokenizer.storage import HybridStorage

# Set up storage (uses REDIS_URL and DATABASE_URL env vars)
storage = HybridStorage()

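tokenizer = QIGTokenizer()
tokenizer.set_storage(storage)

# Train on corpus (read as bytes, as in Quick Start)
with open("corpus.txt", "rb") as f:
    corpus_bytes = f.read()
tokenizer.train(corpus_bytes)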

# Save to database (returns version ID)
version_id = tokenizer.save_to_storage({"corpus": "wikipedia"})

# Load from database
tokenizer.load_from_storage(version_id)
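
Because the storage example depends on two environment variables, it can help to check them before constructing the storage object. The sketch below is a standalone pre-flight check using only the standard library. `missing_storage_env` is a hypothetical helper and the URL is a placeholder. How `HybridStorage` itself reacts to a missing variable is not stated here.

```python
import os

REQUIRED_VARS = ("REDIS_URL", "DATABASE_URL")  # names taken from the example above


def missing_storage_env(env=None) -> list[str]:
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Placeholder URL for illustration; not a default used by the library.
example_env = {"REDIS_URL": "redis://localhost:6379/0"}
print(missing_storage_env(example_env))  # ['DATABASE_URL']
```

Calling `missing_storage_env()` with no argument checks the real process environment.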

Geometric Special Tokens

Special tokens have geometric meaning on the Fisher manifold:

Token	ID	Basin Coordinates	Purpose
BOS	256	Origin (e₁)	Sequence start
EOS	257	Boundary (eₙ)	Sequence end
PAD	258	Uniform	Geometrically neutral padding
UNK	259	Projection target	OOV handling

This enables:

Geometric attention masking: High Fisher-Rao distance = low attention
Natural sequence boundaries: Emerge from manifold structure
Principled OOV handling: Project to nearest basin
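
As a rough illustration of the masking idea, the sketch below treats basin coordinates as points on a probability simplex and uses the closed-form Fisher-Rao distance between categorical distributions, d(p, q) = 2 arccos(Σ √(pᵢ qᵢ)). The simplex representation, the toy dimension n = 4, and the exp(-d) weighting are assumptions made for illustration; the library's actual coordinates and masking rule are not described here.

```python
import math


def fisher_rao(p, q):
    """Fisher-Rao distance between two categorical distributions."""
    bc = sum(math.sqrt(a * b) for a, b in zip(p, q))  # Bhattacharyya coefficient
    return 2.0 * math.acos(min(1.0, max(0.0, bc)))    # clamp against rounding error


n = 4  # toy dimension (assumption)
bos = [1.0] + [0.0] * (n - 1)  # "origin" e_1
eos = [0.0] * (n - 1) + [1.0]  # "boundary" e_n
pad = [1.0 / n] * n            # uniform point

for name, point in [("BOS-EOS", eos), ("BOS-PAD", pad)]:
    d = fisher_rao(bos, point)
    weight = math.exp(-d)  # hypothetical weighting: larger distance, smaller weight
    print(f"{name}: distance={d:.4f} weight={weight:.4f}")
```

Under these assumptions BOS and EOS have disjoint support, so they sit at the maximal distance π, while the uniform PAD point lies strictly closer to BOS than EOS does.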

Algorithm

The QIG tokenizer builds its vocabulary by entropy-guided merging:

1. Start with bytes (0-255) as base tokens.
2. For each adjacent pair (a, b), compute its context distribution.
3. Measure the context entropy (a proxy for QFI distinguishability).
4. Merge the pairs with the lowest entropy (most geometrically similar).
5. Repeat until the target vocabulary size is reached.

This respects asymptotic freedom:

Small scales (short tokens) have high coupling → refined first
Large scales (long tokens) have low coupling → merge only when justified
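
The merging steps listed above can be sketched in a few dozen lines. This is a toy reading of the description, not the package's implementation. It assumes the "context distribution" of a pair is the distribution of the token that follows it, merges one pair per round, ignores pairs seen fewer than `min_count` times (otherwise every pair seen once would have zero entropy), and starts new IDs at 260 so that 256-259 stay free for the special tokens. All function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

FIRST_MERGE_ID = 260  # assumption: 256-259 are reserved for BOS/EOS/PAD/UNK


def context_entropy(counts: Counter) -> float:
    """Shannon entropy (bits) of a context-count distribution."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def pick_pair(tokens, min_count=2):
    """Return the adjacent pair whose following-token distribution has the lowest entropy."""
    contexts = defaultdict(Counter)
    for i in range(len(tokens) - 2):
        contexts[(tokens[i], tokens[i + 1])][tokens[i + 2]] += 1
    candidates = {p: c for p, c in contexts.items() if sum(c.values()) >= min_count}
    if not candidates:
        return None
    # Lowest entropy first; ties go to the more frequent pair, then the smaller pair.
    return min(
        candidates,
        key=lambda p: (context_entropy(candidates[p]), -sum(candidates[p].values()), p),
    )


def merge(tokens, pair, new_id):
    """Replace each non-overlapping occurrence of pair with new_id, left to right."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def train(data: bytes, target_vocab_size: int):
    """Greedy merging from byte tokens (0-255) until the vocabulary reaches the target size."""
    tokens, merges, next_id = list(data), {}, FIRST_MERGE_ID
    while next_id < target_vocab_size:
        pair = pick_pair(tokens)
        if pair is None:
            break
        merges[pair] = next_id
        tokens = merge(tokens, pair, next_id)
        next_id += 1
    return tokens, merges


tokens, merges = train(b"abcabcabcabd", target_vocab_size=262)
print(merges)  # {(98, 99): 260, (97, 260): 261}
print(tokens)  # [261, 261, 261, 97, 98, 100]
```

In this toy corpus "bc" is always followed by "a", so its context entropy is zero and it merges first, whereas "ab" is followed by either "c" or "d" and is held back even though it is the most frequent pair. A frequency-based BPE would have merged "ab" first.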

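Environment Variables

The storage backends read their connection settings from the REDIS_URL and DATABASE_URL environment variables.

All output files follow the QIG naming convention:

YYYYMMDD-tokenizer-vocab-VERSION.STATUS.json

Example: 20251220-tokenizer-vocab-0.03W.json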
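
A small helper can build and validate such names. Note that the template writes VERSION.STATUS while the example file name reads 0.03W, so the sketch below assumes the form seen in the example: a dotted numeric version followed directly by a single uppercase status letter. That pattern is inferred from the file names in this README, not from a documented rule, and both function names are hypothetical.

```python
import re
from datetime import date

# Assumed pattern: 8-digit date, dotted numeric version, one uppercase status letter.
NAME_RE = re.compile(r"^(\d{8})-tokenizer-vocab-(\d+\.\d+)([A-Z])\.json$")


def vocab_filename(day: date, version: str, status: str) -> str:
    """Build a file name in the assumed YYYYMMDD-tokenizer-vocab-<version><status>.json form."""
    name = f"{day:%Y%m%d}-tokenizer-vocab-{version}{status}.json"
    if not NAME_RE.match(name):
        raise ValueError(f"not a valid vocab file name: {name}")
    return name


def parse_vocab_filename(name: str) -> tuple[str, str, str]:
    """Split a file name into (date, version, status); raise ValueError if it does not match."""
    m = NAME_RE.match(name)
    if m is None:
        raise ValueError(f"not a valid vocab file name: {name}")
    return m.group(1), m.group(2), m.group(3)


print(vocab_filename(date(2025, 12, 20), "0.03", "W"))
# 20251220-tokenizer-vocab-0.03W.json
print(parse_vocab_filename("20251220-tokenizer-vocab-0.03W.json"))
# ('20251220', '0.03', 'W')
```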

License

MIT

Release history

Version	Released
0.2.7	Jun 23, 2026 (this version)
0.2.6	Jun 23, 2026
0.2.5	Jun 23, 2026
0.2.4	Jun 23, 2026
0.2.3	Jun 23, 2026
0.2.2	Jun 23, 2026
0.2.1	Jun 19, 2026
0.2.0	Jun 19, 2026
0.1.1	Jun 18, 2026
0.1.0	Mar 19, 2026

Download files

Source Distribution

qig_tokenizer-0.2.7.tar.gz (297.4 kB)

Uploaded Jun 23, 2026 Source

Built Distribution

qig_tokenizer-0.2.7-py3-none-any.whl (70.0 kB)

Uploaded Jun 23, 2026 Python 3

File details

Details for the file qig_tokenizer-0.2.7.tar.gz.

File metadata

Download URL: qig_tokenizer-0.2.7.tar.gz
Upload date: Jun 23, 2026
Size: 297.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.10.9 (uv publish, Ubuntu 26.04)

File hashes

Hashes for qig_tokenizer-0.2.7.tar.gz
Algorithm	Hash digest
SHA256	`cf234c8f0a5f466997f3f3bff92e972e93cbaf645e94145676359248ce387bca`
MD5	`abd2d2f6355ef6c4237f7b724b96c131`
BLAKE2b-256	`c58ef3bb1944ab6f629b388c1d8dceddb53b6dec40726112df53af05c865343c`

File details

Details for the file qig_tokenizer-0.2.7-py3-none-any.whl.

File metadata

Download URL: qig_tokenizer-0.2.7-py3-none-any.whl
Upload date: Jun 23, 2026
Size: 70.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.10.9 (uv publish, Ubuntu 26.04)

File hashes

Hashes for qig_tokenizer-0.2.7-py3-none-any.whl
Algorithm	Hash digest
SHA256	`966dd856a94e05584883aa7a3193137ffab03d0b938a099f581cb89cb4a31658`
MD5	`fdd5253668370cc8ae3f7b8fe860673d`
BLAKE2b-256	`08e356687a1a160c0b180f0ac13d995ec45a1667a5449ba5ea55e5dc4f4829a3`
