Vocabulary-free, training-free, deterministic and reversible text embeddings via harmonic modular projection (HTP).

🎵 Harmonic Token Projection (HTP)

A vocabulary-free, training-free, deterministic and reversible text-embedding methodology.
HTP encodes each token analytically as a harmonic trajectory derived from its Unicode integer representation, establishing a bijective and interpretable mapping between discrete symbols and a continuous vector space — with no learned parameters, no corpus, and no randomness.

📘 Reference
Schmitz, T. (2025). Harmonic Token Projection (HTP): A Vocabulary-Free, Training-Free, Deterministic, and Reversible Embedding Methodology.
arXiv: 2511.20665 · DOI: 10.5281/zenodo.17575155

🔖 Key Features

🚫 No training, no vocabulary — pure analytic transform, works on any Unicode string
🔁 Fully reversible — exact token recovery via the Chinese Remainder Theorem
🎯 Deterministic — identical input always yields identical output (no randomness)
🪶 Lightweight — sub-megabyte footprint, sub-millisecond per sentence pair, CPU-only
🔍 Interpretable — every coordinate is a harmonic of a modular residue
🌍 Language-agnostic — ρ ≈ 0.68–0.70 (EN) and ρ ≈ 0.64 averaged over 10 languages on STS-B
🧩 Minimal dependencies — only numpy (optional jieba, scipy, datasets)

📦 Installation

pip install harmonic-token-projection

Optional extras:

pip install 'harmonic-token-projection[zh]'    # jieba segmenter for Chinese
pip install 'harmonic-token-projection[eval]'  # scipy + datasets for STS evaluation
pip install 'harmonic-token-projection[dev]'   # test / lint / build tooling

⚙️ How It Works

For a token t = [c₁, …, c_ℓ]:

| Step | Equation | Description |
|---|---|---|
| 1. Unicode | `uⱼ = ord(cⱼ)` | character → code point |
| 2. Padding | `ũ = [u₁,…,u_ℓ,0,…,0]` | zero-pad to fixed length `L_max` |
| 3. Integer | `Nₜ = Σⱼ ũⱼ·B^(L_max−j)`, `B = 2¹⁶` | read as a base-`B` number |
| 4. Residues | `rᵢ = Nₜ mod mᵢ` | decompose over pairwise-coprime moduli `m₁,…,m_k` |
| 5. Harmonics | `Eᵢ = [sin(2πrᵢ/mᵢ), cos(2πrᵢ/mᵢ)]` | project each residue → `E(t) ∈ ℝ²ᵏ` |
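
The five steps above can be sketched in plain Python. This is an illustrative reading of the table, not the package's implementation: the helper names `first_primes`, `base_b_integer` and `harmonic_encode` are hypothetical, and the sketch assumes every code point is below `B = 2¹⁶` (how the library treats characters outside that range is not stated here).

```python
import math

B = 2 ** 16  # base from step 3: one base-B digit per character


def first_primes(k):
    """Return the first k primes by trial division (adequate for small k)."""
    primes = []
    n = 2
    while len(primes) < k:
        if all(n % p for p in primes if p * p <= n):
            primes.append(n)
        n += 1
    return primes


def base_b_integer(token, max_len):
    """Steps 1-3: code points, zero-padding, then a base-B integer."""
    codes = [ord(c) for c in token]
    if len(codes) > max_len:
        raise ValueError("token longer than max_len")
    if any(u >= B for u in codes):
        # Assumption of this sketch: only code points below 2**16 are handled.
        raise ValueError("code point does not fit in one base-B digit")
    padded = codes + [0] * (max_len - len(codes))
    n = 0
    for u in padded:  # Horner's rule: first character is the top digit
        n = n * B + u
    return n


def harmonic_encode(token, moduli, max_len):
    """Steps 4-5: residues modulo each m_i, then a (sin, cos) pair per residue."""
    n = base_b_integer(token, max_len)
    vec = []
    for m in moduli:
        angle = 2 * math.pi * (n % m) / m
        vec.extend([math.sin(angle), math.cos(angle)])
    return vec


vec = harmonic_encode("hi", first_primes(8), max_len=4)
print(len(vec))  # 2 coordinates per modulus
```

Each `(sin, cos)` pair lies on the unit circle, so a vector built this way from k moduli has Euclidean norm √k before any normalization.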

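Inversion recovers each residue from its phase, r̃ᵢ = round(atan2(sᵢ, cᵢ)/2π · mᵢ) mod mᵢ, where sᵢ and cᵢ here denote the sine and cosine components of Eᵢ (not characters) and the final mod mᵢ maps the negative angles returned by atan2 back into [0, mᵢ). The residues are then combined into Nₜ via the Chinese Remainder Theorem, and the base-B digits of Nₜ are decoded back to characters. By default HTP uses the first k = D/2 primes as moduli, where D is the embedding dimension (dim). Distinct primes are pairwise coprime, and their product M is large enough that every token of up to model.reversible_max_len characters is exactly reversible.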
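
The inversion path can be illustrated with a small round trip. The sketch below is an assumption-laden illustration rather than the library's code: `MODULI` is just the first 20 primes, `max_len` is 4, and the helpers `encode_int`, `crt` and `decode_vector` are hypothetical names. It recovers each residue from `atan2`, folds negative angles back with a modulo, rebuilds the integer incrementally with the Chinese Remainder Theorem, and reads off the base-B digits.

```python
import math

BASE = 2 ** 16
MODULI = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29,
          31, 37, 41, 43, 47, 53, 59, 61, 67, 71]  # first 20 primes


def token_to_digits_int(token, max_len):
    """Zero-pad the code points on the right and read them as a base-BASE integer."""
    n = 0
    for u in [ord(c) for c in token] + [0] * (max_len - len(token)):
        n = n * BASE + u
    return n


def encode_int(n, moduli):
    """Forward map: one (sin, cos) pair per residue n mod m."""
    vec = []
    for m in moduli:
        angle = 2 * math.pi * (n % m) / m
        vec += [math.sin(angle), math.cos(angle)]
    return vec


def crt(residues, moduli):
    """Return the unique x in [0, prod(moduli)) with x = r_i (mod m_i)."""
    x, M = 0, 1
    for r, m in zip(residues, moduli):
        # Choose t so that x + M*t = r (mod m); pow(M, -1, m) needs Python 3.8+.
        t = ((r - x) * pow(M, -1, m)) % m
        x += M * t
        M *= m
    return x


def decode_vector(vec, moduli, max_len):
    """Phases -> residues -> integer (CRT) -> base-BASE digits -> string."""
    residues = []
    for i, m in enumerate(moduli):
        s, c = vec[2 * i], vec[2 * i + 1]
        # atan2 returns angles in (-pi, pi], so fold negatives with % m.
        residues.append(round(math.atan2(s, c) / (2 * math.pi) * m) % m)
    n = crt(residues, moduli)
    digits = []
    for _ in range(max_len):
        n, d = divmod(n, BASE)
        digits.append(d)
    digits.reverse()
    while digits and digits[-1] == 0:  # drop the zero padding
        digits.pop()
    return "".join(chr(d) for d in digits)


n = token_to_digits_int("язык", 4)
print(decode_vector(encode_int(n, MODULI), MODULI, 4))  # round trip of "язык"
```

The round trip is only exact while the integer stays below the product of the moduli; with these 20 primes that holds for any 4-character token whose code points are below 2¹⁶.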

🧪 Detailed Examples

1️⃣ Token-level: deterministic & reversible

from htp import HTP

model = HTP(dim=512, max_len=32)

vec = model.encode_token("harmonic")   # numpy array, shape (512,)
print(vec.shape)                        # (512,)
print(model.decode_token(vec))          # -> 'harmonic'   (lossless)
print(model.token_to_int("harmonic"))   # -> deterministic integer Nₜ
print(model.reversible_max_len)         # -> 143 (chars guaranteed to round-trip)
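
One plausible way to arrive at a guaranteed round-trip length is to ask how many base-B digits fit below the modulus product M: an integer with at most L base-B digits is smaller than B^L, so B^L ≤ M is sufficient for the Chinese Remainder Theorem to recover it uniquely. The sketch below computes that bound. Whether `model.reversible_max_len` is defined exactly this way is an assumption, and `first_primes` and `capacity_in_chars` are hypothetical helper names.

```python
import math


def first_primes(k):
    """Return the first k primes by trial division (adequate for small k)."""
    primes = []
    n = 2
    while len(primes) < k:
        if all(n % p for p in primes if p * p <= n):
            primes.append(n)
        n += 1
    return primes


def capacity_in_chars(moduli, base=2 ** 16):
    """Largest L with base**L <= prod(moduli) (assumed definition of the bound)."""
    M = math.prod(moduli)  # Python integers are arbitrary precision
    length = 0
    while base ** (length + 1) <= M:
        length += 1
    return length


# Small case that is easy to check by hand: 2*3*5*7 = 210, and 10**2 <= 210 < 10**3.
print(capacity_in_chars([2, 3, 5, 7], base=10))  # 2

# dim=512 corresponds to k = 256 prime moduli under the default described above.
print(capacity_in_chars(first_primes(256)))
```

Under this reading, the bound grows with `dim`, because each additional pair of coordinates contributes one more prime factor to M.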

2️⃣ Sentence-level: harmonic pooling & similarity

from htp import HTP

model = HTP(dim=512)

emb = model.encode("the cat sat on the mat")        # (512,) L2-normalized
mat = model.encode_batch(["first sentence",
                          "second one"])             # (2, 512)

print(model.similarity("a man is playing a guitar",
                       "a person plays the guitar")) # ~0.44
print(model.similarity("a man is playing a guitar",
                       "the stock market fell"))     # ~-0.03

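The similarity scores above lie in a range consistent with cosine similarity, and for L2-normalized embeddings cosine similarity reduces to a dot product. The following is a minimal sketch under that assumption; it does not show what `model.similarity` does internally, and `cosine` is a hypothetical helper.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine([1.0, 0.0], [0.0, 1.0]))   # 0.0  (orthogonal)
print(cosine([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite)
```
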
3️⃣ Frequency-aware pooling (ITF / TF-IDF)

from htp import HTP

corpus = ["the cat sat on the mat", "a dog ran in the park", "the bird flew away"]

model = HTP(dim=512, pooling="tfidf")
model.fit(corpus)        # collects token frequencies — trains NO parameters
emb = model.encode("the rare cat")   # common words ("the") down-weighted

Pooling strategies (pooling=...):

| Strategy | Weighting |
|---|---|
| `"itf"` (default) | Inverse Token Frequency `w = 1/log(1+f(t))` |
| `"tfidf"` | TF-IDF (call `model.fit(corpus)` first) |
| `"mean"` | uniform |
| `"stopword"` | drop stopwords, then mean |

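As a worked illustration of the ITF weighting `w = 1/log(1+f(t))`, the sketch below weights token vectors by inverse frequency, sums them, and L2-normalizes the result. The source of the counts `f(t)`, the weighted-sum form of the pooling, and the helper names `itf_weights` and `weighted_pool` are assumptions made for the example, not the package's documented behavior.

```python
import math
from collections import Counter


def itf_weights(tokens, freq):
    """w = 1 / log(1 + f(t)); requires f(t) >= 1 so the logarithm is positive."""
    return [1.0 / math.log(1.0 + freq[t]) for t in tokens]


def weighted_pool(token_vectors, weights):
    """Weighted sum of token vectors followed by L2 normalization."""
    dim = len(token_vectors[0])
    pooled = [0.0] * dim
    for vec, w in zip(token_vectors, weights):
        for i, x in enumerate(vec):
            pooled[i] += w * x
    norm = math.sqrt(sum(x * x for x in pooled))
    return [x / norm for x in pooled] if norm > 0 else pooled


# Toy counts: "the" occurs twice, "cat" once, so "the" gets the smaller weight.
freq = Counter("the cat sat on the mat".split())
weights = itf_weights(["the", "cat"], freq)
print(weights)

# Two stand-in token vectors (a real token vector would have `dim` coordinates).
print(weighted_pool([[1.0, 0.0], [0.0, 1.0]], weights))
```

A token seen once gets weight 1/log 2 ≈ 1.44, and the weight shrinks as the count grows, which is how frequent words end up contributing less to the pooled vector.
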
4️⃣ Multilingual round-trip & STS evaluation

from htp import HTP
from htp.evaluate import evaluate_pairs   # requires the [eval] extra

model = HTP(dim=512, max_len=32)

# Reversible across scripts
for t in ["représentation", "Schlüssel", "coração", "язык", "日本語"]:
    assert model.decode_token(model.encode_token(t)) == t

# Correlate against human similarity judgments
pairs = [("a man is eating food", "a man eats something"),
         ("a plane is taking off", "a dog is running")]
gold  = [4.2, 0.5]
print(evaluate_pairs(model, pairs, gold))   # {'spearman': ..., 'pearson': ...}
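
For readers without the `[eval]` extra, the two statistics named in the result dictionary can be computed with the standard library alone. The sketch below is a generic implementation of Pearson correlation and of Spearman correlation (Pearson on average ranks); it is not the code behind `htp.evaluate`, and the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


predicted = [0.62, 0.10, 0.35]  # made-up similarity scores for illustration
gold = [4.2, 0.5, 3.0]          # made-up human ratings for illustration
print(spearman(predicted, gold), pearson(predicted, gold))
```

With only two pairs, as in the snippet above, any correlation coefficient is trivially +1 or −1, so a meaningful evaluation needs a full benchmark split.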

🧰 API Reference

HTP(dim=512, max_len=32, moduli=None, pooling="itf",
    tokenizer="regex", lowercase=False, stopwords="en")

model.encode_token(token)        # str      -> ndarray (dim,)
model.decode_token(vector)       # ndarray  -> str
model.token_to_int(token)        # str      -> int  (Nₜ)
model.int_to_token(value)        # int      -> str
model.encode(text, pooling=None) # str      -> ndarray (dim,)
model.encode_batch(texts)        # list     -> ndarray (n, dim)
model.similarity(a, b)           # str, str -> float
model.fit(corpus)                # collect ITF/TF-IDF statistics
model.reversible_max_len         # max token length guaranteed to round-trip

🔬 Properties

| Property | HTP |
|---|---|
| Training | none (analytic) |
| Vocabulary | none (any Unicode string) |
| Determinism | identical input → identical output |
| Reversibility | exact token recovery via CRT |
| Footprint | sub-megabyte, sub-millisecond, CPU-only |
| Interpretability | each coordinate is a harmonic of a modular residue |

📖 Citation

@article{schmitz2025htp,
  title   = {Harmonic Token Projection (HTP): A Vocabulary-Free, Training-Free,
             Deterministic, and Reversible Embedding Methodology},
  author  = {Schmitz, Tcharlies},
  journal = {arXiv preprint arXiv:2511.20665},
  year    = {2025},
  doi     = {10.5281/zenodo.17575155}
}

📝 License

Project details

Version 0.1.0, released May 26, 2026
