
Vocabulary-free, training-free, deterministic and reversible text embeddings via harmonic modular projection (HTP).


🎵 Harmonic Token Projection (HTP)

A vocabulary-free, training-free, deterministic and reversible text-embedding methodology.
HTP encodes each token analytically as a harmonic trajectory derived from its Unicode integer representation, establishing a bijective and interpretable mapping between discrete symbols and a continuous vector space — with no learned parameters, no corpus, and no randomness.

📘 Reference
Schmitz, T. (2025). Harmonic Token Projection (HTP): A Vocabulary-Free, Training-Free, Deterministic, and Reversible Embedding Methodology.
arXiv: 2511.20665 · DOI: 10.5281/zenodo.17575155


🔖 Key Features

  • 🚫 No training, no vocabulary — pure analytic transform, works on any Unicode string
  • 🔁 Fully reversible — exact token recovery via the Chinese Remainder Theorem
  • 🎯 Deterministic — identical input always yields identical output (no randomness)
  • 🪶 Lightweight — sub-megabyte footprint, sub-millisecond per sentence pair, CPU-only
  • 🔍 Interpretable — every coordinate is a harmonic of a modular residue
  • 🌍 Language-agnostic — ρ ≈ 0.68–0.70 (EN) and ρ ≈ 0.64 averaged over 10 languages on STS-B
  • 🧩 Minimal dependencies — only numpy (optional jieba, scipy, datasets)

📦 Installation

pip install harmonic-token-projection

Optional extras:

pip install 'harmonic-token-projection[zh]'    # jieba segmenter for Chinese
pip install 'harmonic-token-projection[eval]'  # scipy + datasets for STS evaluation
pip install 'harmonic-token-projection[dev]'   # test / lint / build tooling

⚙️ How It Works

For a token t = [c₁, …, c_ℓ]:

| Step | Equation | Description |
|---|---|---|
| 1. Unicode | uᵢ = ord(cᵢ) | character → code point |
| 2. Padding | ũ = [u₁, …, u_ℓ, 0, …, 0] | zero-pad to fixed length L_max |
| 3. Integer | Nₜ = Σⱼ ũⱼ·B^(L_max−j), B = 2¹⁶ | read as a base-B number |
| 4. Residues | rᵢ = Nₜ mod mᵢ | decompose over pairwise-coprime moduli |
| 5. Harmonics | Eᵢ = [sin(2πrᵢ/mᵢ), cos(2πrᵢ/mᵢ)] | project each residue → E(t) ∈ ℝ²ᵏ |
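
The five steps above can be sketched in plain Python. This is a minimal illustration, not the package's implementation: the helper names (`first_primes`, `token_to_int`, `encode_token`), the small `dim` and `max_len` defaults, and the interleaved `[sin, cos]` layout of the output are assumptions made for the example. It also rejects code points of 2¹⁶ or above, since a single base-B digit cannot hold them.

```python
import math

B = 2 ** 16  # base from step 3


def first_primes(k):
    """Return the first k primes by trial division (fine for small k)."""
    primes = []
    n = 2
    while len(primes) < k:
        if all(n % p for p in primes if p * p <= n):
            primes.append(n)
        n += 1
    return primes


def token_to_int(token, max_len):
    """Steps 1-3: code points, zero-padding, then one base-B integer."""
    codes = [ord(c) for c in token]
    if len(codes) > max_len or any(u >= B for u in codes):
        raise ValueError("sketch needs len(token) <= max_len and code points < 2**16")
    padded = codes + [0] * (max_len - len(codes))
    n = 0
    for u in padded:
        n = n * B + u  # Horner form of sum(u_j * B**(max_len - j))
    return n


def encode_token(token, dim=16, max_len=4):
    """Steps 4-5: residues over the first dim/2 primes, then sin/cos pairs."""
    moduli = first_primes(dim // 2)  # assumed default, hypothetical small dim
    n = token_to_int(token, max_len)
    vec = []
    for m in moduli:
        angle = 2 * math.pi * (n % m) / m
        vec.extend([math.sin(angle), math.cos(angle)])  # assumed interleaving
    return vec


vec = encode_token("hi")
print(len(vec))  # one (sin, cos) pair per modulus
```

Each pair lies on the unit circle, so the token vector has norm √k before any normalization.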

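Inversion recovers each residue from its phase, r̃ᵢ = round(atan2(sᵢ, cᵢ)/(2π) · mᵢ) mod mᵢ, where (sᵢ, cᵢ) is the sine/cosine pair stored for modulus mᵢ; the final reduction maps the negative angles returned by atan2 back into [0, mᵢ). It then reconstructs Nₜ via the Chinese Remainder Theorem and decodes the base-B digits back to characters. By default HTP uses the first k = D/2 primes as moduli, where D is the embedding dimension (dim). These are pairwise coprime, and their product M is large enough that every token of up to model.reversible_max_len characters is exactly reversible.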
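
A round trip through phase recovery and the Chinese Remainder Theorem might look like the following. It is a self-contained sketch under stated assumptions, not the library's code: all function names are hypothetical, the moduli are the first 24 primes (so that their product exceeds B⁴ for a 4-character toy limit), and only code points below 2¹⁶ are handled.

```python
import math

B = 2 ** 16
MAX_LEN = 4  # hypothetical toy limit so that N < B**4 = 2**64


def first_primes(k):
    primes, n = [], 2
    while len(primes) < k:
        if all(n % p for p in primes if p * p <= n):
            primes.append(n)
        n += 1
    return primes


def token_to_int(token):
    codes = [ord(c) for c in token] + [0] * (MAX_LEN - len(token))
    n = 0
    for u in codes:
        n = n * B + u
    return n


def harmonics(n, moduli):
    """Forward map: one (sin, cos) pair per modulus."""
    pairs = []
    for m in moduli:
        a = 2 * math.pi * (n % m) / m
        pairs.append((math.sin(a), math.cos(a)))
    return pairs


def recover_residues(pairs, moduli):
    """Phase -> residue; '% m' folds atan2's negative angles into [0, m)."""
    return [round(math.atan2(s, c) / (2 * math.pi) * m) % m
            for (s, c), m in zip(pairs, moduli)]


def crt(residues, moduli):
    """Chinese Remainder Theorem for pairwise-coprime moduli."""
    M = math.prod(moduli)
    n = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        n += r * Mi * pow(Mi, -1, m)  # modular inverse, Python 3.8+
    return n % M


def int_to_token(n):
    digits = []
    for _ in range(MAX_LEN):
        n, d = divmod(n, B)
        digits.append(d)
    digits.reverse()
    return "".join(chr(d) for d in digits).rstrip("\x00")  # drop zero padding


moduli = first_primes(24)
assert math.prod(moduli) > B ** MAX_LEN  # required for exact recovery

n = token_to_int("язык")
decoded = int_to_token(crt(recover_residues(harmonics(n, moduli), moduli), moduli))
print(decoded == "язык")
```

The assertion on the modulus product is the condition that makes recovery exact: CRT determines Nₜ only modulo M, so M must exceed every integer the encoder can produce.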


🧪 Detailed Examples

1️⃣ Token-level: deterministic & reversible

from htp import HTP

model = HTP(dim=512, max_len=32)

vec = model.encode_token("harmonic")   # numpy array, shape (512,)
print(vec.shape)                        # (512,)
print(model.decode_token(vec))          # -> 'harmonic'   (lossless)
print(model.token_to_int("harmonic"))   # -> deterministic integer Nₜ
print(model.reversible_max_len)          # -> 143 (chars guaranteed to round-trip)

2️⃣ Sentence-level: harmonic pooling & similarity

from htp import HTP

model = HTP(dim=512)

emb = model.encode("the cat sat on the mat")        # (512,) L2-normalized
mat = model.encode_batch(["first sentence",
                          "second one"])             # (2, 512)

print(model.similarity("a man is playing a guitar",
                       "a person plays the guitar")) # ~0.44
print(model.similarity("a man is playing a guitar",
                       "the stock market fell"))     # ~-0.03
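
The scores above behave like cosine similarities of L2-normalized sentence vectors. The source does not spell out the metric, so the sketch below assumes cosine similarity; the helper names are hypothetical and it uses plain lists rather than numpy arrays.

```python
import math


def l2_normalize(v):
    """Scale v to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a, b):
    """Dot product of the two unit-length vectors, a value in [-1, 1]."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


same = cosine([1.0, 0.0], [2.0, 0.0])       # parallel vectors
orthogonal = cosine([1.0, 0.0], [0.0, 3.0])  # perpendicular vectors
```

For vectors that are already L2-normalized, as `model.encode` is documented to return, the normalization step is a no-op and the similarity reduces to a single dot product.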

3️⃣ Frequency-aware pooling (ITF / TF-IDF)

from htp import HTP

corpus = ["the cat sat on the mat", "a dog ran in the park", "the bird flew away"]

model = HTP(dim=512, pooling="tfidf")
model.fit(corpus)        # collects token frequencies — trains NO parameters
emb = model.encode("the rare cat")   # common words ("the") down-weighted

Pooling strategies (pooling=...):

| Strategy | Weighting |
|---|---|
| "itf" (default) | Inverse Token Frequency, w = 1/log(1+f(t)) |
| "tfidf" | TF-IDF (call model.fit(corpus) first) |
| "mean" | uniform |
| "stopword" | drop stopwords, then mean |
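
To make the ITF weight w = 1/log(1+f(t)) concrete, here is a small sketch of frequency-weighted pooling. It makes several assumptions that the source does not confirm: f(t) is a raw corpus count, unseen tokens are floored at a count of 1 to avoid dividing by log(1) = 0, and pooling is a weighted sum followed by L2 normalization. The function names and the two-dimensional toy vectors are purely illustrative.

```python
import math
from collections import Counter


def itf_weights(tokens, counts):
    """w(t) = 1 / log(1 + f(t)); the floor of 1 for unseen tokens is an assumption."""
    return [1.0 / math.log(1.0 + max(counts.get(t, 0), 1)) for t in tokens]


def weighted_pool(vectors, weights):
    """Weighted sum of token vectors, then L2 normalization (assumed scheme)."""
    dim = len(vectors[0])
    pooled = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in pooled))
    return [x / norm for x in pooled] if norm > 0 else pooled


corpus = ["the cat sat on the mat", "a dog ran in the park", "the bird flew away"]
counts = Counter(tok for line in corpus for tok in line.split())

tokens = ["the", "cat"]
weights = itf_weights(tokens, counts)       # "the" is frequent, so its weight is smaller
toy_vectors = [[1.0, 0.0], [0.0, 1.0]]      # stand-ins for token embeddings
pooled = weighted_pool(toy_vectors, weights)
```

Because "the" occurs four times in this corpus and "cat" once, the pooled vector leans toward the "cat" direction.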

4️⃣ Multilingual round-trip & STS evaluation

from htp import HTP
from htp.evaluate import evaluate_pairs   # requires the [eval] extra

model = HTP(dim=512, max_len=32)

# Reversible across scripts
for t in ["représentation", "Schlüssel", "coração", "язык", "日本語"]:
    assert model.decode_token(model.encode_token(t)) == t

# Correlate against human similarity judgments
pairs = [("a man is eating food", "a man eats something"),
         ("a plane is taking off", "a dog is running")]
gold  = [4.2, 0.5]
print(evaluate_pairs(model, pairs, gold))   # {'spearman': ..., 'pearson': ...}
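
For readers without the [eval] extra, the two reported correlations can be approximated with the standard library alone. The sketch below is not the code behind `evaluate_pairs`; it is a textbook Pearson correlation plus a Spearman correlation computed as Pearson over average ranks, with hypothetical function names.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant inputs."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rho: Pearson correlation of the rank-transformed values."""
    return pearson(average_ranks(x), average_ranks(y))


predicted = [0.10, 0.50, 0.90]   # e.g. model similarities (made-up values)
gold_scores = [1.0, 2.0, 3.0]    # e.g. human judgments (made-up values)
rho = spearman(predicted, gold_scores)
```

With only a handful of pairs, as in the example above, either coefficient is too noisy to interpret; the figures quoted under Key Features refer to the full STS-B evaluation.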

🧰 API Reference

HTP(dim=512, max_len=32, moduli=None, pooling="itf",
    tokenizer="regex", lowercase=False, stopwords="en")

model.encode_token(token)        # str      -> ndarray (dim,)
model.decode_token(vector)       # ndarray  -> str
model.token_to_int(token)        # str      -> int  (Nₜ)
model.int_to_token(value)        # int      -> str
model.encode(text, pooling=None) # str      -> ndarray (dim,)
model.encode_batch(texts)        # list     -> ndarray (n, dim)
model.similarity(a, b)           # str, str -> float
model.fit(corpus)                # collect ITF/TF-IDF statistics
model.reversible_max_len         # max token length guaranteed to round-trip

🔬 Properties

| Property | HTP |
|---|---|
| Training | none (analytic) |
| Vocabulary | none (any Unicode string) |
| Determinism | identical input → identical output |
| Reversibility | exact token recovery via CRT |
| Footprint | sub-megabyte, sub-millisecond, CPU-only |
| Interpretability | each coordinate is a harmonic of a modular residue |

📖 Citation

@article{schmitz2025htp,
  title   = {Harmonic Token Projection (HTP): A Vocabulary-Free, Training-Free,
             Deterministic, and Reversible Embedding Methodology},
  author  = {Schmitz, Tcharlies},
  journal = {arXiv preprint arXiv:2511.20665},
  year    = {2025},
  doi     = {10.5281/zenodo.17575155}
}

📝 License

MIT © 2025 Tcharlies Schmitz — Data Science, PX.Center
