# dictcollision

Calibrate dictionary hit rates in computational decipherment. Detects when short decoded strings collide with dictionary entries by chance.
Your decipherment reports a 43% dictionary hit rate. Is that real?
dictcollision answers this question. When decoded strings are short
(2–4 characters) and dictionaries are large (≥10K words), chance
collisions produce matches at rates that approach genuine decipherment
rates. This package predicts the collision rate and separates real
signal from noise.
## Install

```bash
pip install dictcollision
```

For plotting support (quoted so the brackets survive shells like zsh):

```bash
pip install "dictcollision[viz]"
```
## Quick start

### One-line noise floor check

```python
from dictcollision import noise_floor

predicted = noise_floor(decoded_tokens, dictionary)
print(f"Chance collisions alone: {predicted:.1%}")
print("Your observed rate: 43.0%")
print(f"Genuine signal: {0.43 - predicted:.1%}")
```
### Full four-category analysis

```python
from dictcollision import classify

result = classify(decoded_tokens, dictionary)
print(f"Signal: {result.signal:.1%}")
print(f"Shared hit: {result.shared_hit:.1%}")
print(f"Anti-signal: {result.anti_signal:.1%}")
print(f"Net signal: {result.net_signal:.1%}")
print(f"Apparent rate: {result.apparent_hit_rate:.1%}")
```
### Rank candidate dictionaries

```python
from dictcollision import recommend

ranked = recommend(
    decoded_tokens,
    {"latin_10k": latin_words, "german_50k": german_words},
    objective="excess",
)
for r in ranked:
    print(f"{r.name}: excess={r.excess:.3f}, snr={r.snr:.1f}")
```
## The core equation

The predicted noise floor for dictionary $D$ against decoded text with character distribution $p$ is:

$$\hat{r} = \sum_{w \in D} \; \prod_{i=1}^{|w|} p(w_i)$$

For each word $w$ in the dictionary, multiply together the frequencies of its characters under your decoded output's character distribution, then sum those products over the whole dictionary. The result is the fraction of your tokens expected to match by accident.
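The equation can be computed directly. The sketch below is an illustration of the formula, not the package's internal implementation (use `noise_floor` above for real work):

```python
from collections import Counter
from math import prod

def predicted_noise_floor(decoded_tokens, dictionary):
    """Sum, over dictionary words, of the product of per-character
    frequencies in the decoded output (the r-hat formula above)."""
    chars = [c for tok in decoded_tokens for c in tok]
    counts = Counter(chars)
    total = sum(counts.values())
    p = {c: n / total for c, n in counts.items()}
    return sum(prod(p.get(c, 0.0) for c in w) for w in dictionary)

# Toy example: chars a,b with p(a)=2/3, p(b)=1/3;
# "ab" contributes 2/9, "bb" contributes 1/9, so r-hat = 1/3.
rate = predicted_noise_floor(["ab", "ba", "aa"], {"ab", "bb"})
```

Words containing characters that never appear in the decoded output contribute zero, as the formula requires.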
## How it works
The four-category framework classifies every decoded token type:
| Category | In dictionary? | In real text? | In null corpora? |
|---|---|---|---|
| Signal | yes | yes | no |
| Shared hit | yes | yes | yes |
| Anti-signal | yes | no | yes |
| Shared miss | no | — | — |
Net signal = Signal − Anti-signal is the calibrated metric.
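The table's rules can be sketched as a simple tally. This is a hypothetical helper, not the package's `classify()`; token types that are in the dictionary but absent from both the real-text and null vocabularies are lumped with anti-signal here as a simplifying assumption:

```python
def classify_types(types, dictionary, real_vocab, null_vocab):
    """Tally decoded token types into the four categories of the
    table above. Illustrative sketch only."""
    counts = {"signal": 0, "shared_hit": 0, "anti_signal": 0, "shared_miss": 0}
    for t in types:
        if t not in dictionary:
            counts["shared_miss"] += 1   # row 4: not in dictionary
        elif t in real_vocab and t not in null_vocab:
            counts["signal"] += 1        # row 1: yes / yes / no
        elif t in real_vocab:
            counts["shared_hit"] += 1    # row 2: yes / yes / yes
        else:
            counts["anti_signal"] += 1   # row 3: yes / no / yes (assumption: also yes/no/no)
    return counts

# One token type per category:
n = classify_types(
    types=["aa", "bb", "cc", "dd"],
    dictionary={"aa", "bb", "cc"},
    real_vocab={"aa", "bb"},
    null_vocab={"bb", "cc"},
)
net_signal = (n["signal"] - n["anti_signal"]) / 4  # 0.0 here: signal cancels anti-signal
```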
Null corpora are generated from the decoded text's character bigram distribution, preserving character-pair frequencies and token lengths while destroying word identity.
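A minimal sketch of such a null-corpus generator, assuming plain-string tokens (hypothetical helper; the package's generator may differ in sampling details):

```python
import random
from collections import Counter, defaultdict

def null_corpus(tokens, seed=0):
    """Sample tokens of the same lengths from the input's character
    bigram distribution, preserving character-pair statistics while
    destroying word identity."""
    rng = random.Random(seed)
    starts = Counter(tok[0] for tok in tokens if tok)
    bigrams = defaultdict(Counter)
    for tok in tokens:
        for a, b in zip(tok, tok[1:]):
            bigrams[a][b] += 1

    def sample(counter):
        items, weights = zip(*counter.items())
        return rng.choices(items, weights)[0]

    out = []
    for tok in tokens:
        word = [sample(starts)]
        for _ in range(len(tok) - 1):
            nxt = bigrams.get(word[-1])
            # Fall back to the start distribution for dead-end characters.
            word.append(sample(nxt) if nxt else sample(starts))
        out.append("".join(word))
    return out

fake = null_corpus(["abba", "bab", "ab"])  # same lengths, scrambled identities
```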
## Citation

If you use this package in your research, please cite:

```bibtex
@article{ruckman2026dictcollision,
  title={The Dictionary Collision Effect in Computational Decipherment},
  author={Ruckman, Matthew},
  year={2026}
}
```
## License

MIT