Vocabulary-free, training-free, deterministic and reversible text embeddings via harmonic modular projection (HTP).
🎵 Harmonic Token Projection (HTP)
A vocabulary-free, training-free, deterministic and reversible text-embedding methodology.
HTP encodes each token analytically as a harmonic trajectory derived from its Unicode integer representation, establishing a bijective and interpretable mapping between discrete symbols and a continuous vector space — with no learned parameters, no corpus, and no randomness.
📘 Reference
Schmitz, T. (2025). Harmonic Token Projection (HTP): A Vocabulary-Free, Training-Free, Deterministic, and Reversible Embedding Methodology.
arXiv: 2511.20665 · DOI: 10.5281/zenodo.17575155
🔖 Key Features
- 🚫 No training, no vocabulary — pure analytic transform, works on any Unicode string
- 🔁 Fully reversible — exact token recovery via the Chinese Remainder Theorem
- 🎯 Deterministic — identical input always yields identical output (no randomness)
- 🪶 Lightweight — sub-megabyte footprint, sub-millisecond per sentence pair, CPU-only
- 🔍 Interpretable — every coordinate is a harmonic of a modular residue
- 🌍 Language-agnostic — ρ ≈ 0.68–0.70 (EN) and ρ ≈ 0.64 averaged over 10 languages on STS-B
- 🧩 Minimal dependencies: only `numpy` (optional: `jieba`, `scipy`, `datasets`)
📦 Installation
```shell
pip install harmonic-token-projection
```

Optional extras:

```shell
pip install 'harmonic-token-projection[zh]'    # jieba segmenter for Chinese
pip install 'harmonic-token-projection[eval]'  # scipy + datasets for STS evaluation
pip install 'harmonic-token-projection[dev]'   # test / lint / build tooling
```
⚙️ How It Works
For a token t = [c₁, …, c_ℓ]:
| Step | Equation | Description |
|---|---|---|
| 1. Unicode | uᵢ = ord(cᵢ) | character → code point |
| 2. Padding | ũ = [u₁, …, u_ℓ, 0, …, 0] | zero-pad to fixed length L_max |
| 3. Integer | Nₜ = Σ ũⱼ·B^(L_max−j), B = 2¹⁶ | read as a base-B number |
| 4. Residues | rᵢ = Nₜ mod mᵢ | decompose over pairwise-coprime moduli |
| 5. Harmonics | Eᵢ = [sin(2πrᵢ/mᵢ), cos(2πrᵢ/mᵢ)] | project each residue → E(t) ∈ ℝ²ᵏ |
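Inversion recovers each residue from its phase, r̃ᵢ = round(atan2(sᵢ, cᵢ) / 2π · mᵢ) mod mᵢ, where (sᵢ, cᵢ) is the sine/cosine pair Eᵢ. The final mod mᵢ maps the negative angles that atan2 returns back into the range 0…mᵢ−1. The residues are then combined into Nₜ with the Chinese Remainder Theorem, and the base-B digits of Nₜ are decoded back to characters.

By default HTP uses the first k = D/2 primes as moduli, where D is the embedding dimension (`dim`). These are pairwise coprime, and their product M is large enough that every token of up to `model.reversible_max_len` characters is exactly reversible.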
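The five steps and their inversion can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the package's implementation: the constants `L_MAX = 10` and `K = 32`, the helper `first_primes`, and all function bodies are chosen here only to keep the example small, and code points that do not fit in a single base-B digit are simply rejected.

```python
import math

B = 2 ** 16   # base from step 3
L_MAX = 10    # illustrative padded length (assumption, not a library default)
K = 32        # number of prime moduli, i.e. an embedding dimension of 2*K = 64 (assumption)


def first_primes(k):
    """Return the first k primes by trial division (fine for small k)."""
    primes, n = [], 2
    while len(primes) < k:
        if all(n % p for p in primes):
            primes.append(n)
        n += 1
    return primes


MODULI = first_primes(K)
M = math.prod(MODULI)
# Unique CRT recovery needs every padded integer to be smaller than M.
assert M > B ** L_MAX


def token_to_int(token):
    """Steps 1-3: code points, zero padding, base-B integer."""
    if len(token) > L_MAX:
        raise ValueError("token longer than L_MAX")
    digits = [ord(c) for c in token] + [0] * (L_MAX - len(token))
    n = 0
    for d in digits:
        if d >= B:
            raise ValueError("code point does not fit in one base-B digit")
        n = n * B + d  # Horner form of sum(d_j * B**(L_MAX - j))
    return n


def encode_token(token):
    """Steps 4-5: residues modulo each prime, then (sin, cos) pairs."""
    n = token_to_int(token)
    vec = []
    for m in MODULI:
        angle = 2 * math.pi * (n % m) / m
        vec += [math.sin(angle), math.cos(angle)]
    return vec


def decode_token(vec):
    """Invert: phase -> residue, CRT -> integer, base-B digits -> characters."""
    n = 0
    for i, m in enumerate(MODULI):
        s, c = vec[2 * i], vec[2 * i + 1]
        r = round(math.atan2(s, c) / (2 * math.pi) * m) % m
        Mi = M // m
        n += r * Mi * pow(Mi, -1, m)  # pow(x, -1, m) is the modular inverse
    n %= M
    digits = []
    for _ in range(L_MAX):
        n, d = divmod(n, B)
        digits.append(d)
    digits.reverse()
    return "".join(chr(d) for d in digits).rstrip("\x00")


print(decode_token(encode_token("harmonic")))
```

With these constants the modulus product exceeds B¹⁰, so any token of up to ten characters whose code points are below 2¹⁶ round-trips; adding more primes extends that length.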
🧪 Detailed Examples
1️⃣ Token-level: deterministic & reversible
```python
from htp import HTP

model = HTP(dim=512, max_len=32)

vec = model.encode_token("harmonic")    # numpy array, shape (512,)
print(vec.shape)                        # (512,)
print(model.decode_token(vec))          # -> 'harmonic' (lossless)
print(model.token_to_int("harmonic"))   # -> deterministic integer Nₜ
print(model.reversible_max_len)         # -> 143 (chars guaranteed to round-trip)
```
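One way to reason about a round-trip bound of this kind: a padded token of L characters is an integer below B^L, and the Chinese Remainder Theorem recovers it uniquely whenever B^L does not exceed the modulus product M. The sketch below computes the largest such L for the first k primes. The helper names are hypothetical, and how the package derives `reversible_max_len` is not specified here, so its reported value may be defined differently.

```python
import math


def first_primes(k):
    """Return the first k primes by trial division (fine for small k)."""
    primes, n = [], 2
    while len(primes) < k:
        if all(n % p for p in primes):
            primes.append(n)
        n += 1
    return primes


def max_reversible_len(moduli, base=2 ** 16):
    """Largest L such that base**L <= prod(moduli)."""
    M = math.prod(moduli)
    L = 0
    while base ** (L + 1) <= M:
        L += 1
    return L


print(max_reversible_len(first_primes(16)))  # 4
print(max_reversible_len(first_primes(32)))  # 10
```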
2️⃣ Sentence-level: harmonic pooling & similarity
```python
from htp import HTP

model = HTP(dim=512)

emb = model.encode("the cat sat on the mat")    # (512,) L2-normalized
mat = model.encode_batch(["first sentence",
                          "second one"])        # (2, 512)

print(model.similarity("a man is playing a guitar",
                       "a person plays the guitar"))   # ~0.44
print(model.similarity("a man is playing a guitar",
                       "the stock market fell"))       # ~-0.03
```
3️⃣ Frequency-aware pooling (ITF / TF-IDF)
```python
from htp import HTP

corpus = ["the cat sat on the mat", "a dog ran in the park", "the bird flew away"]

model = HTP(dim=512, pooling="tfidf")
model.fit(corpus)                     # collects token frequencies; trains NO parameters
emb = model.encode("the rare cat")    # common words ("the") down-weighted
```
Pooling strategies (pooling=...):
| Strategy | Weighting |
|---|---|
| `"itf"` (default) | Inverse Token Frequency, w = 1/log(1+f(t)) |
| `"tfidf"` | TF-IDF (call `model.fit(corpus)` first) |
| `"mean"` | uniform |
| `"stopword"` | drop stopwords, then mean |
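To make the ITF formula concrete, the sketch below computes w = 1/log(1+f(t)) from raw token counts and uses the weights in a weighted mean followed by L2 normalization. Several things here are assumptions for illustration: whitespace tokenization, the natural logarithm, f(t) taken as the corpus count, the helper names `itf_weights` and `pool`, and the two-dimensional `toy_vecs` that stand in for real HTP token vectors. The package's own pooling may differ in these details.

```python
import math
from collections import Counter

import numpy as np


def itf_weights(corpus):
    """Map each token to 1 / log(1 + f), with f its count in the corpus."""
    counts = Counter(tok for sentence in corpus for tok in sentence.split())
    return {tok: 1.0 / math.log(1.0 + f) for tok, f in counts.items()}


def pool(token_vecs, weights):
    """Weighted mean of token vectors, then L2 normalization."""
    vecs = np.asarray(token_vecs, dtype=float)
    w = np.asarray(weights, dtype=float)
    pooled = (w[:, None] * vecs).sum(axis=0) / w.sum()
    norm = np.linalg.norm(pooled)
    return pooled / norm if norm > 0 else pooled


corpus = ["the cat sat on the mat", "a dog ran in the park", "the bird flew away"]
weights = itf_weights(corpus)

# Toy stand-ins for token embeddings (not real HTP vectors).
toy_vecs = {"the": [1.0, 0.0], "cat": [0.0, 1.0]}
tokens = ["the", "cat"]
emb = pool([toy_vecs[t] for t in tokens], [weights[t] for t in tokens])

print(weights["the"], weights["cat"])  # "the" occurs 4 times, "cat" once
print(emb)
```

Because "the" is four times as frequent as "cat" in this corpus, its weight is smaller and the pooled vector leans toward the "cat" direction.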
4️⃣ Multilingual round-trip & STS evaluation
```python
from htp import HTP
from htp.evaluate import evaluate_pairs   # requires the [eval] extra

model = HTP(dim=512, max_len=32)

# Reversible across scripts
for t in ["représentation", "Schlüssel", "coração", "язык", "日本語"]:
    assert model.decode_token(model.encode_token(t)) == t

# Correlate against human similarity judgments
pairs = [("a man is eating food", "a man eats something"),
         ("a plane is taking off", "a dog is running")]
gold = [4.2, 0.5]
print(evaluate_pairs(model, pairs, gold))   # {'spearman': ..., 'pearson': ...}
```
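The two reported statistics can be illustrated without the package. The sketch below computes Pearson correlation with NumPy and Spearman correlation as the Pearson correlation of average ranks. The internals of `evaluate_pairs` are not shown in this README, so the helper names and the made-up scores are assumptions; with the `[eval]` extra installed, `scipy.stats.spearmanr` and `scipy.stats.pearsonr` would be the usual choices.

```python
import numpy as np


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    return float(np.corrcoef(x, y)[0, 1])


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up model similarities and gold scores, for illustration only.
predicted = [0.44, -0.03, 0.90]
gold = [4.2, 0.5, 5.0]
print({"spearman": spearman(predicted, gold), "pearson": pearson(predicted, gold)})
```

Spearman depends only on the ordering of the scores, which is why it is the customary headline metric when model similarities and human ratings live on different scales.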
🧰 API Reference
```python
HTP(dim=512, max_len=32, moduli=None, pooling="itf",
    tokenizer="regex", lowercase=False, stopwords="en")

model.encode_token(token)          # str -> ndarray (dim,)
model.decode_token(vector)         # ndarray -> str
model.token_to_int(token)          # str -> int (Nₜ)
model.int_to_token(value)          # int -> str
model.encode(text, pooling=None)   # str -> ndarray (dim,)
model.encode_batch(texts)          # list -> ndarray (n, dim)
model.similarity(a, b)             # str, str -> float
model.fit(corpus)                  # collect ITF/TF-IDF statistics
model.reversible_max_len           # max token length guaranteed to round-trip
```
🔬 Properties
| Property | HTP |
|---|---|
| Training | none (analytic) |
| Vocabulary | none (any Unicode string) |
| Determinism | identical input → identical output |
| Reversibility | exact token recovery via CRT |
| Footprint | sub-megabyte, sub-millisecond, CPU-only |
| Interpretability | each coordinate is a harmonic of a modular residue |
📖 Citation
```bibtex
@article{schmitz2025htp,
  title   = {Harmonic Token Projection (HTP): A Vocabulary-Free, Training-Free,
             Deterministic, and Reversible Embedding Methodology},
  author  = {Schmitz, Tcharlies},
  journal = {arXiv preprint arXiv:2511.20665},
  year    = {2025},
  doi     = {10.5281/zenodo.17575155}
}
```
📝 License
MIT © 2025 Tcharlies Schmitz — Data Science, PX.Center