dictcollision
Calibrate dictionary hit rates. Given a list of short strings and a reference dictionary, separate real matches from chance collisions.
The general problem
You have a stream of short strings and a big reference dictionary. Some fraction match. How many are real matches, and how many are just the dictionary being large enough that anything would match?
| Domain | "Decoded tokens" | "Dictionary" | "Real signal" means |
|---|---|---|---|
| Decipherment / cryptanalysis | candidate plaintext (¹) | language wordlist | the decode works |
| OCR validation | extracted strings | lexicon | the OCR read correctly |
| Spell-check eval | candidate corrections | target vocabulary | the correction fired |
| Autocomplete ranking | prefix expansions | vocab | the candidate is meaningful |
| Password audit | cracked-string attempts | common-words list | the password was weak, not a random collision |
| Fuzzy dedup | near-match candidates | canonical set | they are actually duplicates |
| RNG / fuzzer QA | generated strings | wordlist | did your generator accidentally emit real words? |
If the input is short (2–4 chars) and the dictionary is large (10K+), naive hit-rate metrics are badly inflated by chance. This package fixes that.
(¹) When the candidate plaintext was produced by a stochastic search over a key space (SA, hill-climbing), the cipher symbols themselves are also a relevant input — see When your decode came from a search.
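To see the inflation concretely, here is a toy simulation, independent of the package (the dictionary here is a random sample, purely illustrative): random 2-character tokens hit a 300-entry "dictionary" of 2-character strings at roughly the dictionary's coverage of the string space, despite containing zero signal.

```python
import itertools
import random

rng = random.Random(0)
alphabet = "abcdefghijklmnopqrstuvwxyz"
all_pairs = ["".join(p) for p in itertools.product(alphabet, repeat=2)]

# A toy "dictionary" holding 300 of the 676 possible 2-char strings.
dictionary = set(rng.sample(all_pairs, 300))

# Tokens that are pure noise: uniformly random 2-char strings.
tokens = ["".join(rng.choices(alphabet, k=2)) for _ in range(10_000)]

hit_rate = sum(t in dictionary for t in tokens) / len(tokens)
print(f"hit rate of pure noise: {hit_rate:.1%}")  # ~44% (≈ 300/676)
```

A naive evaluator would report that 44% as evidence; all of it is chance.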
Install
pip install dictcollision
pip install "dictcollision[viz]" # with matplotlib
Or with uv:
uv add dictcollision # into a uv project
uv pip install dictcollision # into the active venv
uv tool install dictcollision # install the CLI globally
Quick start
from dictcollision import noise_floor, classify, recommend

# 1. One-line prediction: how much of my 43% hit rate is chance?
predicted = noise_floor(decoded_tokens, dictionary)
print(f"Chance alone predicts {predicted:.1%}; observed 43.0%; "
      f"excess {0.43 - predicted:.1%}")

# 2. Full four-category analysis.
result = classify(decoded_tokens, dictionary)
print(result.summary())

# 3. Rank candidate dictionaries when you don't know the language.
ranked = recommend(decoded_tokens,
                   {"latin_10k": latin_words, "german_50k": german_words})
print(ranked[0].name, ranked[0].excess)
result.summary() prints:
ClassifyResult (n=5000 tokens)
apparent hit rate : 99.6%
net signal : 70.1% <- calibrated metric
correction : 29.5% <- amount subtracted
signal 70.1% ████████████████░░░░░░░ real matches
shared_hit 19.4% ████░░░░░░░░░░░░░░░░░░░ chance collisions
anti_signal 0.6% ░░░░░░░░░░░░░░░░░░░░░░░ phantom matches
shared_miss 9.9% ██░░░░░░░░░░░░░░░░░░░░░ non-dict tokens
Interpretation: strong signal — dictionary is a good fit
Command line
No Python required:
python -m dictcollision --tokens decoded.txt --dict latin_50k.txt
python -m dictcollision --tokens decoded.txt --dict latin.txt --baselines
python -m dictcollision --tokens decoded.txt --dict latin.txt --json > report.json
Supported dictionary formats: one word per line, word-count pairs (hermitdave
FrequencyWords), or CSV. See python -m dictcollision --help.
Input and output contract
Input:
decoded_tokens : list[str] # e.g. ["the", "cat", "ab", "cd", ...]
# any Unicode; no preprocessing assumed
dictionary : set[str] | list[str] # reference words, same encoding
Output (classify) → ClassifyResult dataclass:
| Field | Type | Range | Meaning |
|---|---|---|---|
| net_signal | float | [-1, 1] | The calibrated metric: signal − anti_signal |
| signal | float | [0, 1] | real hits |
| shared_hit | float | [0, 1] | chance collisions that happen to also be real words |
| anti_signal | float | [0, 1] | phantom matches (null-only) |
| shared_miss | float | [0, 1] | non-dictionary tokens |
| apparent_hit_rate | float | [0, 1] | what a naive evaluator would report |
| correction | float | ≥ 0 | apparent − net_signal |
| signal_words | list[str] | — | types driving real signal |
| anti_signal_words | list[str] | — | types that inflate chance; inspect to diagnose |
| n_tokens | int | — | total count |
Interpreting net_signal
| Range | Meaning |
|---|---|
| ≥ 0.20 | Strong signal — dictionary is a good fit |
| 0.05 – 0.20 | Partial signal — possibly correct with caveats |
| ≈ 0 | No signal beyond chance |
| < 0 | Worse than random — wrong language or wrong decode key |
The core equation
The predicted noise floor for dictionary $D$ against decoded text with character distribution $p$ is:
$$\hat{r} \;=\; \sum_{w \in D}\; \prod_{i=1}^{|w|} p(w_i)$$
For every word in the dictionary, multiply together the character frequencies of its letters, measured on your decoded output. Sum over all words. That sum is the fraction of your tokens expected to match by accident.
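The equation is short enough to compute from scratch. A minimal sketch, independent of the package (the function name here is mine, not the library API):

```python
from collections import Counter

def predicted_noise_floor(tokens, dictionary):
    """Chance-match rate: for each dictionary word, the product of the
    per-character frequencies observed in the decoded tokens; summed."""
    chars = [c for t in tokens for c in t]
    total = len(chars)
    p = {c: n / total for c, n in Counter(chars).items()}
    rate = 0.0
    for w in dictionary:
        prod = 1.0
        for ch in w:
            prod *= p.get(ch, 0.0)  # characters absent from the decode can't collide
        rate += prod
    return rate

# Worked example: chars are a,b,b,a,a,a so p(a)=4/6, p(b)=2/6.
# "ab" contributes (4/6)(2/6) = 2/9; "bb" contributes (2/6)^2 = 1/9.
print(predicted_noise_floor(["ab", "ba", "aa"], {"ab", "bb"}))  # 1/3
```

Note the result is a rate per token, so comparing it to an observed hit rate needs no further normalization.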
Four-category framework
| Category | In dictionary? | In real text? | In null corpora? |
|---|---|---|---|
| Signal | yes | yes | no |
| Shared hit | yes | yes | yes |
| Anti-signal | yes | no | yes |
| Shared miss | no | — | — |
Net signal = Signal − Anti-signal is the calibrated metric.
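The truth table above can be turned into rates in more than one way; the sketch below shows one plausible normalization (real-token categories as fractions of the real tokens, anti-signal as a fraction of the null tokens). It is illustrative only, not the package's internals:

```python
def four_categories(real_tokens, null_tokens, dictionary):
    """Illustrative four-category split over observed token streams."""
    # Types the null corpus also resolves into dictionary words.
    null_hit_types = {t for t in null_tokens if t in dictionary}
    real_types = set(real_tokens)
    n = len(real_tokens)
    signal = sum(t in dictionary and t not in null_hit_types for t in real_tokens) / n
    shared_hit = sum(t in dictionary and t in null_hit_types for t in real_tokens) / n
    shared_miss = sum(t not in dictionary for t in real_tokens) / n
    # Dictionary hits that only the null corpus produces: phantom matches.
    anti = sum(t in dictionary and t not in real_types
               for t in null_tokens) / len(null_tokens)
    return {"signal": signal, "shared_hit": shared_hit,
            "anti_signal": anti, "shared_miss": shared_miss,
            "net_signal": signal - anti}

cats = four_categories(["the", "cat", "xq"], ["cat", "zz", "dog"],
                       {"the", "cat", "dog"})
```

In this toy run, "the" is signal (dictionary hit the null never produces), "cat" is a shared hit, "xq" is a shared miss, and "dog" is anti-signal (a null-only phantom).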
Null corpora are generated from the decoded text's character bigram distribution (configurable: unigram / bigram / trigram), preserving character-pair frequencies and token lengths while destroying word identity. On wrong-language evaluations the four-category framework is the only method among six tested that correctly reports signal as ≤ 0.
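Generating such a null corpus is straightforward. A minimal bigram-order sketch (again mine, not the package's generator): sample each token character-by-character from the observed bigram transitions, keeping the original token lengths.

```python
import random
from collections import defaultdict

def bigram_null_corpus(tokens, seed=0):
    """Sample tokens from the character-bigram distribution of the input,
    preserving token lengths while destroying word identity."""
    rng = random.Random(seed)
    trans = defaultdict(list)  # previous char -> list of observed successors
    for t in tokens:
        prev = "^"             # '^' marks token start
        for ch in t:
            trans[prev].append(ch)
            prev = ch
    out = []
    for t in tokens:           # one null token per real token, same length
        prev, word = "^", []
        for _ in range(len(t)):
            if not trans[prev]:     # dead-end char (only seen token-final)
                prev = "^"          # fall back to the start distribution
            ch = rng.choice(trans[prev])
            word.append(ch)
            prev = ch
        out.append("".join(word))
    return out

null = bigram_null_corpus(["the", "cat", "the"])
```

Scoring the dictionary against such corpora gives the "in null corpora?" column of the table above.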
When your decode came from a search
If your decoded tokens are the output of a stochastic search over a
key space (simulated annealing on a substitution alphabet, hill-climbing,
AZdecrypt, etc.), net_signal alone can mislead. The search itself can
manufacture apparent signal: a quadgram-optimised key on a short cipher
will find local optima that resolve into a handful of high-frequency
dictionary words even when the cipher has no underlying linguistic
structure. The Dorabella case (Ruckman 2026) documents this failure
mode at +0.55 net_signal.
The fix is to give the same search procedure the same matched-budget opportunity on shuffles of the cipher — multiset-preserving permutations that destroy positional content but keep the character budget constant. If the search finds materially more signal on the real cipher than on its shuffles, that excess is the calibrated signal.
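The matched-budget loop is easy to sketch generically. Everything below is illustrative scaffolding (the function and its arguments are mine, not the package API); `search_fn` stands in for your optimiser and `score_fn` for whatever signal metric you optimise:

```python
import random
from statistics import mean, stdev

def shuffle_calibrated_z(cipher, search_fn, score_fn, n_shuffles=30, seed=0):
    """Run the same search on the real cipher and on multiset-preserving
    shuffles of it; return a z-score of the real score against the nulls."""
    rng = random.Random(seed)
    real = score_fn(search_fn(cipher))
    null_scores = []
    for _ in range(n_shuffles):
        shuffled = list(cipher)
        rng.shuffle(shuffled)        # preserves the symbol multiset exactly
        null_scores.append(score_fn(search_fn(shuffled)))
    mu, sd = mean(null_scores), stdev(null_scores)
    return (real - mu) / sd if sd > 0 else float("inf")
```

The key point is that the shuffles get the *same* search budget as the real cipher, so any signal the optimiser can manufacture from scratch appears in the null scores too.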
from dictcollision import search_calibrated_signal

result = search_calibrated_signal(
    cipher_symbols=cipher,   # the cipher itself, not decoded tokens
    search_fn=my_sa_search,  # cipher -> decoded tokens
    dictionary=word_set,
    n_shuffles=30,
)
print(result.summary())
# z_score >= 3 → search finds real signal
# −1 ≤ z < 1  → indistinguishable from a shuffle baseline
search_calibrated_signal and null_distribution solve different
problems:
| Question | Use |
|---|---|
| "is this fixed decode's signal distinguishable from a bigram null?" | null_distribution() |
| "does my search procedure find more signal on the real cipher than on shuffles of it?" | search_calibrated_signal() |
Both can be informative; reach for search_calibrated_signal whenever
the decoded tokens were chosen by a key-space optimiser.
Full API
from dictcollision import (
    noise_floor,                   # analytical collision prediction
    classify,                      # four-category classification
    classify_by_length,            # per-length-bucket breakdown
    recommend,                     # rank candidate dictionaries
    null_distribution,             # Monte Carlo null distribution
    bootstrap_ci,                  # bootstrap CI on net_signal
    search_calibrated_signal,      # matched-budget shuffle calibration
    load_dictionary, load_tokens,  # file loaders
)

from dictcollision.baselines import (
    apparent_hit_rate,  # no correction
    subtract_null,      # naive baseline
    permutation_test,   # per-word Poisson test
    bh_fdr,             # Benjamini-Hochberg
    blast_evalue,       # BLAST-style E-value
    all_methods,        # all six in one dict (Table 2 style)
)

from dictcollision.viz import (
    plot_decomposition,      # paper Figure 1
    plot_size_sweep,         # paper Figure 2
    plot_method_comparison,  # paper Figure 5
    plot_length_stratified,  # paper Figure 13
)
Examples
Self-contained scripts in examples/:
- 01_vigenere.py — evaluate a Vigenère candidate key
- 02_paper_table2.py — reproduce the six-method comparison
- 03_dictionary_recommender.py — pick the right dictionary without knowing the language
- 04_search_calibrated.py — calibrate a stochastic search against shuffle baseline
Paper
Methodology, experiments, validation:
Ruckman, M. (2026). The Dictionary Collision Effect in Computational Decipherment. Source, figures, and reproduction code: https://github.com/mruckman1/signal-isolation-paper
Citation
@article{ruckman2026dictcollision,
  title={The Dictionary Collision Effect in Computational Decipherment},
  author={Ruckman, Matthew},
  year={2026},
  url={https://github.com/mruckman1/signal-isolation-paper}
}
License
MIT