QIG-native tokenizer with entropy-guided merging

QIG Tokenizer

Entropy-guided tokenizer for Quantum Information Geometry

Version: 0.1.0 | Status: Working


Overview

A QIG-native tokenizer built on entropy-guided merging: token boundaries follow information geometry rather than raw pair frequency.

Core Principles

  • Entropy-guided merging: Geometric similarity, not frequency heuristics
  • Geometric special tokens: BOS, EOS, PAD, UNK with basin coordinates
  • Redis/PostgreSQL storage: Production-ready persistence
  • Pure information geometry: No external tokenizer dependencies

Installation

pip install qig-tokenizer

With storage backends:

pip install qig-tokenizer[storage]  # Redis + PostgreSQL
pip install qig-tokenizer[redis]    # Redis only
pip install qig-tokenizer[postgres] # PostgreSQL only

Quick Start

from qig_tokenizer import QIGTokenizer

# Create tokenizer with geometric special tokens
tokenizer = QIGTokenizer(target_vocab_size=50000, use_special_tokens=True)

# Train on corpus
with open("corpus.txt", "rb") as f:
    corpus_bytes = f.read()

tokenizer.train(corpus_bytes)

# Encode with special tokens
tokens = tokenizer.encode_with_special("Hello, world!")
# Returns: [256, ...tokens..., 257]  (BOS=256, EOS=257)

# Pad sequences
padded = tokenizer.pad_sequence(tokens, max_length=128)

# Save to JSON (the file name follows the QIG naming convention)
tokenizer.save("20251220-tokenizer-vocab-0.01W.json")
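
To make the special-token handling concrete, here is a small sketch that does not use the library. It uses the BOS, EOS, and PAD IDs from the special-token table in this README. The helper names (wrap_with_special, pad_to, attention_mask) are hypothetical, and right-padding is an assumption; the package's own encode_with_special and pad_sequence may behave differently, for example when a sequence exceeds max_length.

```python
BOS, EOS, PAD = 256, 257, 258  # IDs taken from the special-token table


def wrap_with_special(ids):
    """Surround a token-ID sequence with BOS and EOS."""
    return [BOS, *ids, EOS]


def pad_to(ids, max_length):
    """Right-pad with PAD up to max_length (assumption: no truncation)."""
    return list(ids) + [PAD] * max(0, max_length - len(ids))


def attention_mask(ids):
    """1 for real tokens, 0 for padding positions."""
    return [0 if t == PAD else 1 for t in ids]


wrapped = wrap_with_special([72, 105])   # the bytes of "Hi"
padded_ids = pad_to(wrapped, 6)
mask = attention_mask(padded_ids)
```

Because padding positions carry a dedicated ID, a mask like this can be derived from the padded sequence alone.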

With Redis/PostgreSQL Storage

from qig_tokenizer import QIGTokenizer
from qig_tokenizer.storage import HybridStorage

# Set up storage (uses REDIS_URL and DATABASE_URL env vars)
storage = HybridStorage()

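# Load the training corpus as raw bytes
with open("corpus.txt", "rb") as f:
    corpus_bytes = f.read()

tokenizer = QIGTokenizer()
tokenizer.set_storage(storage)
tokenizer.train(corpus_bytes)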

# Save to database (returns version ID)
version_id = tokenizer.save_to_storage({"corpus": "wikipedia"})

# Load from database
tokenizer.load_from_storage(version_id)
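
Since HybridStorage reads REDIS_URL and DATABASE_URL from the environment, it can help to check them before constructing the storage object. The sketch below uses only the standard library. The helper name missing_storage_vars is hypothetical, and treating both variables as required is an assumption, since the package may fall back to a single backend.

```python
import os

# Variable names taken from the storage example above.
STORAGE_VARS = ("REDIS_URL", "DATABASE_URL")


def missing_storage_vars(env=None):
    """Return the names of storage variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in STORAGE_VARS if not env.get(name)]


# Illustrative values only; real connection URLs depend on your deployment.
example_env = {"REDIS_URL": "redis://localhost:6379/0"}
still_missing = missing_storage_vars(example_env)
```

Calling missing_storage_vars() with no argument inspects the real process environment, so a script can stop with a clear message before any connection is attempted.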

Geometric Special Tokens

Special tokens have geometric meaning on the Fisher manifold:

Token  ID   Basin Coordinates   Purpose
BOS    256  Origin (e₁)         Sequence start
EOS    257  Boundary (eₙ)       Sequence end
PAD    258  Uniform             Geometrically neutral padding
UNK    259  Projection target   OOV handling

This enables:

  • Geometric attention masking: High Fisher-Rao distance = low attention
  • Natural sequence boundaries: Emerge from manifold structure
  • Principled OOV handling: Project to nearest basin
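
As an illustration of the distance these points rely on, the sketch below computes the Fisher-Rao distance between two discrete probability distributions, d(p, q) = 2·arccos(Σ √(pᵢ qᵢ)), and uses it to pick the nearest basin. The basin placements (e₁, eₙ, uniform on a 3-outcome simplex) mirror the table above but are assumptions for illustration, as is the helper name nearest_basin. This is not the package's internal representation.

```python
import math


def fisher_rao_distance(p, q):
    """Fisher-Rao geodesic distance between two discrete distributions."""
    bc = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))  # Bhattacharyya coefficient
    return 2.0 * math.acos(min(1.0, max(0.0, bc)))        # clamp against rounding


def nearest_basin(p, basins):
    """Name of the basin whose distribution is closest to p."""
    return min(basins, key=lambda name: fisher_rao_distance(p, basins[name]))


# Hypothetical basins on a 3-outcome simplex.
basins = {
    "BOS": [1.0, 0.0, 0.0],        # e_1
    "EOS": [0.0, 0.0, 1.0],        # e_n
    "PAD": [1 / 3, 1 / 3, 1 / 3],  # uniform
}
closest = nearest_basin([0.9, 0.05, 0.05], basins)
```

Distributions with disjoint support sit at the maximum distance π, which is the regime where a distance-based mask would assign the least attention.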

Algorithm

The QIG tokenizer uses entropy-guided merging:

  1. Start with the 256 byte values (0-255) as base tokens
  2. For each adjacent pair (a, b), compute its context distribution
  3. Measure the entropy of that distribution (a proxy for QFI distinguishability)
  4. Merge the pairs with the lowest entropy (most geometrically similar)
  5. Repeat until the target vocabulary size is reached

This respects asymptotic freedom:

  • Small scales (short tokens) have high coupling → refined first
  • Large scales (long tokens) have low coupling → merge only when justified
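
The merge loop described in the numbered steps can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the package's implementation. The "context" of a pair is taken to be the token that follows it, entropy is Shannon entropy in bits, one pair is merged per round, pairs seen fewer than min_count times are skipped, and new IDs start at 260 so they sit after the four special tokens. All function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def context_entropy(contexts):
    """Shannon entropy (bits) of a Counter of context counts."""
    total = sum(contexts.values())
    return -sum((c / total) * math.log2(c / total) for c in contexts.values())


def pick_merge(tokens, min_count=2):
    """Adjacent pair whose next-token distribution has the lowest entropy."""
    counts, contexts = Counter(), defaultdict(Counter)
    for i in range(len(tokens) - 1):
        pair = (tokens[i], tokens[i + 1])
        counts[pair] += 1
        if i + 2 < len(tokens):
            contexts[pair][tokens[i + 2]] += 1
    candidates = [p for p in counts if counts[p] >= min_count and contexts[p]]
    if not candidates:
        return None
    # Ties: prefer the more frequent pair, then the smaller pair value.
    return min(candidates, key=lambda p: (context_entropy(contexts[p]), -counts[p], p))


def apply_merge(tokens, pair, new_id):
    """Replace every non-overlapping occurrence of pair with new_id."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def train_sketch(data, target_vocab_size, first_merge_id=260):
    """Merge lowest-entropy pairs until the vocabulary reaches the target size."""
    tokens, merges, next_id = list(data), {}, first_merge_id
    while next_id < target_vocab_size:
        pair = pick_merge(tokens)
        if pair is None:
            break
        merges[pair] = next_id
        tokens = apply_merge(tokens, pair, next_id)
        next_id += 1
    return merges, tokens


merges, tokens = train_sketch(b"abcabcabd", target_vocab_size=261)
```

In the input b"abcabcabd", the pair "ab" is the most frequent, but it is followed by either "c" or "d". The pair "bc" is always followed by "a", so its context entropy is zero and it is merged first. A frequency-based BPE would have chosen "ab" instead.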

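Environment Variables

The storage backends read their connection settings from REDIS_URL and DATABASE_URL.

File Naming

All output files follow the QIG naming convention: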

YYYYMMDD-tokenizer-vocab-VERSION.STATUS.json

Example: 20251220-tokenizer-vocab-0.03W.json
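
A file name of this shape can be split into its parts with a regular expression. The pattern below is inferred from the example names in this README, where the version number and a single status letter appear fused (0.03W); that reading is an assumption, as is the helper name parse_vocab_name.

```python
import re

# Assumed layout: 8-digit date, fixed stem, MAJOR.MINOR version, one status letter.
VOCAB_NAME = re.compile(
    r"^(?P<date>\d{8})-tokenizer-vocab-(?P<version>\d+\.\d+)(?P<status>[A-Z])\.json$"
)


def parse_vocab_name(name):
    """Return the date, version, and status parts, or None if the name does not match."""
    match = VOCAB_NAME.match(name)
    return match.groupdict() if match else None


parts = parse_vocab_name("20251220-tokenizer-vocab-0.03W.json")
```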


License

MIT

Download files

Source Distributions

No source distribution files available for this release.

Built Distribution

qig_tokenizer-0.1.0-py3-none-any.whl (64.1 kB view details)

Uploaded Python 3

File details

Details for the file qig_tokenizer-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: qig_tokenizer-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 64.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for qig_tokenizer-0.1.0-py3-none-any.whl
Algorithm    Hash digest
SHA256       c21a6269437a9cbc13e5a952ac593ccfd1ea9a0f2bb9adcd2dcea974fc8970bf
MD5          f68491c095c7930eff737d40ea4ba5e3
BLAKE2b-256  57bf4cdb3e4ad1dba1dfb8ac56c5be3cc6d3fe78906d4d26e22f6cef19751e85
