QIG-native tokenizer with entropy-guided merging
QIG Tokenizer
Entropy-guided tokenizer for Quantum Information Geometry
Version: 0.1.0 | Status: Working
Overview
QIG-native tokenizer using entropy-guided merging. Token boundaries follow information geometry, not frequency.
Core Principles
- Entropy-guided merging: Geometric similarity, not frequency heuristics
- Geometric special tokens: BOS, EOS, PAD, UNK with basin coordinates
- Redis/PostgreSQL storage: Production-ready persistence
- Pure information geometry: No external tokenizer dependencies
Installation
pip install qig-tokenizer
With storage backends:
pip install qig-tokenizer[storage] # Redis + PostgreSQL
pip install qig-tokenizer[redis] # Redis only
pip install qig-tokenizer[postgres] # PostgreSQL only
Quick Start
from qig_tokenizer import QIGTokenizer
# Create tokenizer with geometric special tokens
tokenizer = QIGTokenizer(target_vocab_size=50000, use_special_tokens=True)
# Train on corpus
with open("corpus.txt", "rb") as f:
    corpus_bytes = f.read()
tokenizer.train(corpus_bytes)
# Encode with special tokens
tokens = tokenizer.encode_with_special("Hello, world!")
# Returns: [256, ...tokens..., 257] (BOS=256, EOS=257)
# Pad sequences
padded = tokenizer.pad_sequence(tokens, max_length=128)
# Save/load JSON
tokenizer.save("20251220-tokenizer-vocab-0.01W.json")
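To make the shapes concrete, here is a minimal pure-Python sketch of what wrapping and padding amount to. It does not call the library: `wrap_with_special`, `pad_right`, and `attention_mask` are hypothetical helpers, the IDs are the ones listed for the geometric special tokens (BOS=256, EOS=257, PAD=258), and right-padding without truncation is an assumption about `pad_sequence`, not confirmed behavior.

```python
BOS, EOS, PAD = 256, 257, 258  # IDs as listed for the geometric special tokens


def wrap_with_special(token_ids):
    """Hypothetical helper: surround a token list with BOS and EOS."""
    return [BOS, *token_ids, EOS]


def pad_right(token_ids, max_length):
    """Hypothetical helper: right-pad with PAD up to max_length.

    Assumption: sequences longer than max_length are returned unchanged.
    """
    missing = max(0, max_length - len(token_ids))
    return list(token_ids) + [PAD] * missing


def attention_mask(token_ids):
    """Return 1 for real tokens and 0 for padding positions."""
    return [0 if t == PAD else 1 for t in token_ids]


raw = list(b"Hi")                 # before any merges: one token per byte
wrapped = wrap_with_special(raw)  # BOS + byte tokens + EOS
padded = pad_right(wrapped, 8)    # PAD ids appended up to length 8
mask = attention_mask(padded)
```

The mask is the conventional companion to padding; whether the library exposes one directly is not stated here.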
With Redis/PostgreSQL Storage
from qig_tokenizer import QIGTokenizer
from qig_tokenizer.storage import HybridStorage
# Set up storage (uses REDIS_URL and DATABASE_URL env vars)
storage = HybridStorage()
tokenizer = QIGTokenizer()
tokenizer.set_storage(storage)
# Train on a corpus loaded as in Quick Start
tokenizer.train(corpus_bytes)
# Save to database (returns version ID)
version_id = tokenizer.save_to_storage({"corpus": "wikipedia"})
# Load from database
tokenizer.load_from_storage(version_id)
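Because the storage setup reads REDIS_URL and DATABASE_URL from the environment, a small pre-flight check can catch a missing setting before training starts. The sketch below uses only the standard library; `missing_storage_env` is a hypothetical helper, the connection string is a placeholder, and treating both variables as required is an assumption (the package may accept one backend alone).

```python
import os

REQUIRED_VARS = ("REDIS_URL", "DATABASE_URL")  # names from the storage example


def missing_storage_env(environ=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Placeholder values only; substitute your own connection strings.
example_env = {
    "REDIS_URL": "redis://localhost:6379/0",
    "DATABASE_URL": "",
}
missing = missing_storage_env(example_env)
if missing:
    print("Missing storage settings:", ", ".join(missing))
```

Calling `missing_storage_env()` with no argument checks the real process environment.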
Geometric Special Tokens
Special tokens have geometric meaning on the Fisher manifold:
| Token | ID | Basin Coordinates | Purpose |
|---|---|---|---|
| BOS | 256 | Origin (e₁) | Sequence start |
| EOS | 257 | Boundary (eₙ) | Sequence end |
| PAD | 258 | Uniform | Geometrically neutral padding |
| UNK | 259 | Projection target | OOV handling |
This enables:
- Geometric attention masking: High Fisher-Rao distance = low attention
- Natural sequence boundaries: Emerge from manifold structure
- Principled OOV handling: Project to nearest basin
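The geometry in the table can be illustrated with the standard Fisher-Rao distance between categorical distributions, d(p, q) = 2·arccos(Σ √(pᵢqᵢ)). The sketch below is an illustration under assumptions, not the package's implementation: it treats basin coordinates as probability vectors, uses a toy dimension of 4, and the exponential `attention_weight` mapping is a hypothetical choice that only demonstrates "high distance = low attention".

```python
import math


def fisher_rao_distance(p, q):
    """Fisher-Rao distance between two categorical distributions."""
    bc = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))  # Bhattacharyya coefficient
    return 2.0 * math.acos(min(1.0, max(0.0, bc)))        # clamp against rounding


def basis(n, i):
    """Unit vector with all probability mass on coordinate i."""
    return [1.0 if j == i else 0.0 for j in range(n)]


n = 4                  # toy dimension; the real basin dimension is not stated here
bos = basis(n, 0)      # e_1
eos = basis(n, n - 1)  # e_n
pad = [1.0 / n] * n    # uniform distribution

d_bos_eos = fisher_rao_distance(bos, eos)
d_bos_pad = fisher_rao_distance(bos, pad)
d_eos_pad = fisher_rao_distance(eos, pad)


def attention_weight(distance, temperature=1.0):
    """Illustrative mapping only: larger distance gives a smaller weight."""
    return math.exp(-distance / temperature)
```

In this toy setting BOS and EOS sit at the maximal distance π, while PAD is equidistant from both, which is one way to read "geometrically neutral".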
Algorithm
The QIG tokenizer uses entropy-guided merging:
- Start with the 256 byte values (0-255) as base tokens
- For each adjacent pair (a, b), compute its context distribution
- Measure the entropy of that distribution (a proxy for QFI distinguishability)
- Merge pairs with the lowest entropy (most geometrically similar)
- Repeat until the target vocabulary size is reached
This respects asymptotic freedom:
- Small scales (short tokens) have high coupling → refined first
- Large scales (long tokens) have low coupling → merge only when justified
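The merge loop described above can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not the package's code: the "context" of a pair is taken to be the single token that follows it, entropy is Shannon entropy in bits, pairs seen fewer than `min_count` times are skipped (otherwise every one-off pair would trivially score zero), ties are broken by count and then by pair value, and new IDs start at 260 so that 256-259 stay reserved for the special tokens. The names `train_sketch`, `pair_contexts`, and `merge_pair` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def context_entropy(counts):
    """Shannon entropy (bits) of a Counter of context tokens."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def pair_contexts(tokens):
    """Map each adjacent pair to a Counter of the tokens that follow it.

    A pair at the very end of the sequence has no following token and is
    therefore not counted.
    """
    contexts = defaultdict(Counter)
    for i in range(len(tokens) - 2):
        contexts[(tokens[i], tokens[i + 1])][tokens[i + 2]] += 1
    return contexts


def merge_pair(tokens, pair, new_id):
    """Replace every non-overlapping occurrence of pair with new_id."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def train_sketch(corpus, target_vocab_size, min_count=2, first_id=260):
    """Greedy lowest-entropy merging over a bytes corpus."""
    tokens = list(corpus)  # base tokens are the byte values 0-255
    merges = {}
    next_id = first_id
    while next_id < target_vocab_size:
        candidates = {
            pair: ctx
            for pair, ctx in pair_contexts(tokens).items()
            if sum(ctx.values()) >= min_count
        }
        if not candidates:
            break
        best = min(
            candidates,
            key=lambda p: (
                context_entropy(candidates[p]),  # lowest entropy first
                -sum(candidates[p].values()),    # then most frequent
                p,                               # then deterministic order
            ),
        )
        merges[best] = next_id
        tokens = merge_pair(tokens, best, next_id)
        next_id += 1
    return merges, tokens


merges, tokens = train_sketch(b"abcabcabd", target_vocab_size=261)
```

On this tiny corpus the pair `(a, b)` is the most frequent, but it is followed by both `c` and `d`, so a pair with a fully predictable context is merged first. That is the practical difference from frequency-based merging.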
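Environment Variables
The storage backends read their connection settings from the REDIS_URL and DATABASE_URL environment variables.
Output File Naming
All output files follow the QIG naming convention: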
YYYYMMDD-tokenizer-vocab-VERSION.STATUS.json
Example: 20251220-tokenizer-vocab-0.03W.json
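A small helper can keep filenames consistent with this convention. The sketch below follows the example filename rather than the pattern line: it assumes VERSION has the form digits.digits and STATUS is a single uppercase letter appended directly to it (as in `0.03W`). `build_vocab_filename` and `parse_vocab_filename` are hypothetical names, not part of the package.

```python
import re
from datetime import date

# Assumption: VERSION is digits.digits and STATUS is one uppercase letter
# appended directly to it, as in the example "0.03W".
NAME_RE = re.compile(r"^(\d{8})-tokenizer-vocab-(\d+\.\d+)([A-Z])\.json$")


def build_vocab_filename(day, version, status):
    """Format a vocab filename from a date, version string, and status letter."""
    return f"{day:%Y%m%d}-tokenizer-vocab-{version}{status}.json"


def parse_vocab_filename(name):
    """Split a vocab filename into its date, version, and status parts."""
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError(f"not a QIG vocab filename: {name!r}")
    stamp, version, status = match.groups()
    return {"date": stamp, "version": version, "status": status}


name = build_vocab_filename(date(2025, 12, 20), "0.03", "W")
info = parse_vocab_filename(name)
```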
License
MIT