
dictcollision


Calibrate dictionary hit rates. Given a list of short strings and a reference dictionary, separate real matches from chance collisions.


The general problem

You have a stream of short strings and a big reference dictionary. Some fraction match. How many are real matches vs. the dictionary being large enough that anything would match?

| Domain | "Decoded tokens" | "Dictionary" | "Real signal" means |
|---|---|---|---|
| Decipherment / cryptanalysis | candidate plaintext (¹) | language wordlist | the decode works |
| OCR validation | extracted strings | lexicon | the OCR read correctly |
| Spell-check eval | candidate corrections | target vocabulary | the correction fired |
| Autocomplete ranking | prefix expansions | vocab | the candidate is meaningful |
| Password audit | cracked-string attempts | common-words list | the password was weak, not a random collision |
| Fuzzy dedup | near-match candidates | canonical set | they are actually duplicates |
| RNG / fuzzer QA | generated strings | wordlist | did your generator accidentally emit real words? |

If the input is short (2–4 chars) and the dictionary is large (10K+), naive hit-rate metrics are badly inflated by chance. This package fixes that.

(¹) When the candidate plaintext was produced by a stochastic search over a key space (SA, hill-climbing), the cipher symbols themselves are also a relevant input — see When your decode came from a search.

Install

pip install dictcollision
pip install "dictcollision[viz]"   # with matplotlib

Or with uv:

uv add dictcollision                 # into a uv project
uv pip install dictcollision         # into the active venv
uv tool install dictcollision        # install the CLI globally

Quick start

from dictcollision import noise_floor, classify, recommend

# 1. One-line prediction: how much of my 43% hit rate is chance?
predicted = noise_floor(decoded_tokens, dictionary)
print(f"Chance alone predicts {predicted:.1%}; observed 43.0%; "
      f"excess {0.43 - predicted:.1%}")

# 2. Full four-category analysis.
result = classify(decoded_tokens, dictionary)
print(result.summary())

# 3. Rank candidate dictionaries when you don't know the language.
ranked = recommend(decoded_tokens,
                   {"latin_10k": latin_words, "german_50k": german_words})
print(ranked[0].name, ranked[0].excess)

result.summary() prints:

ClassifyResult (n=5000 tokens)
  apparent hit rate :  99.6%
  net signal        :  70.1%   <- calibrated metric
  correction        :  29.5%   <- amount subtracted

  signal       70.1%  ████████████████░░░░░░░  real matches
  shared_hit   19.4%  ████░░░░░░░░░░░░░░░░░░░  chance collisions
  anti_signal   0.6%  ░░░░░░░░░░░░░░░░░░░░░░░  phantom matches
  shared_miss   9.9%  ██░░░░░░░░░░░░░░░░░░░░░  non-dict tokens

  Interpretation: strong signal — dictionary is a good fit

Command line

No Python code required:

python -m dictcollision --tokens decoded.txt --dict latin_50k.txt
python -m dictcollision --tokens decoded.txt --dict latin.txt --baselines
python -m dictcollision --tokens decoded.txt --dict latin.txt --json > report.json

Supported dictionary formats: one word per line, word count (hermitdave FrequencyWords), or CSV. See python -m dictcollision --help.
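For reference, the word-count format takes only a few lines to parse. Below is an illustrative sketch — the hypothetical `load_wordlist` is not the package's `load_dictionary`, and it handles only the one-word-per-line and word-count layouts, not CSV:

```python
def load_wordlist(path, min_count=1):
    """Parse "word" or "word count" lines (hermitdave FrequencyWords style)."""
    words = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue  # skip blank lines
            # If a count column is present, apply the frequency threshold.
            if len(parts) >= 2 and parts[1].isdigit() and int(parts[1]) < min_count:
                continue
            words.add(parts[0])
    return words
```

A `min_count` threshold is handy with frequency lists, where the long tail of rare entries is exactly what inflates chance collisions.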

Input and output contract

Input:

decoded_tokens : list[str]          # e.g. ["the", "cat", "ab", "cd", ...]
                                    # any Unicode; no preprocessing assumed
dictionary     : set[str] | list[str]   # reference words, same encoding

Output: classify returns a ClassifyResult dataclass:

| Field | Type | Range | Meaning |
|---|---|---|---|
| net_signal | float | [-1, 1] | The calibrated metric: signal − anti_signal |
| signal | float | [0, 1] | real hits |
| shared_hit | float | [0, 1] | chance collisions that happen to also be real words |
| anti_signal | float | [0, 1] | phantom matches (null-only) |
| shared_miss | float | [0, 1] | non-dictionary tokens |
| apparent_hit_rate | float | [0, 1] | what a naive evaluator would report |
| correction | float | ≥ 0 | apparent − net_signal |
| signal_words | list[str] | — | types driving real signal |
| anti_signal_words | list[str] | — | types that inflate chance — inspect to diagnose |
| n_tokens | int | — | total count |

Interpreting net_signal

| Range | Meaning |
|---|---|
| ≥ 0.20 | Strong signal — dictionary is a good fit |
| 0.05 – 0.20 | Partial signal — possibly correct with caveats |
| ≈ 0 | No signal beyond chance |
| < 0 | Worse than random — wrong language or wrong decode key |
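These thresholds can be wrapped in a small helper. A sketch — the cut-offs mirror the table, but the width of the "≈ 0" band (`zero_band`) is an assumption, not a package default:

```python
def interpret_net_signal(net_signal: float, zero_band: float = 0.02) -> str:
    """Map a net_signal value to the verdicts in the interpretation table."""
    if net_signal >= 0.20:
        return "strong signal"
    if net_signal >= 0.05:
        return "partial signal"
    if net_signal >= -zero_band:  # assumed width for the "approximately zero" band
        return "no signal beyond chance"
    return "worse than random"
```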

The core equation

The predicted noise floor for dictionary $D$ against decoded text with character distribution $p$ is:

$$\hat{r} \;=\; \sum_{w \in D}\; \prod_{i=1}^{|w|} p(w_i)$$

For every word in the dictionary, take the product of the character frequencies of your decoded output over that word's characters, then sum across the dictionary. The result is the fraction of tokens expected to match by accident.
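The equation is a one-liner in practice. A minimal re-derivation from raw tokens — an illustrative sketch, not the packaged `noise_floor`:

```python
import math
from collections import Counter

def noise_floor_sketch(tokens, dictionary):
    """Predicted chance hit rate: sum over D of prod_i p(w_i)."""
    # p: character distribution of the decoded output.
    counts = Counter(c for t in tokens for c in t)
    total = sum(counts.values())
    p = {c: n / total for c, n in counts.items()}
    # Unseen characters get probability 0, so those words contribute nothing.
    return sum(math.prod(p.get(c, 0.0) for c in w) for w in dictionary)
```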

Four-category framework

| Category | In dictionary? | In real text? | In null corpora? |
|---|---|---|---|
| Signal | yes | yes | no |
| Shared hit | yes | yes | yes |
| Anti-signal | yes | no | yes |
| Shared miss | no | — | — |

Net signal = Signal − Anti-signal is the calibrated metric.

Null corpora are generated from the decoded text's character bigram distribution (configurable: unigram / bigram / trigram), preserving character-pair frequencies and token lengths while destroying word identity. On wrong-language evaluations the four-category framework is the only method among six tested that correctly reports signal as ≤ 0.
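A null corpus of this kind can be sketched as a bigram sampler that preserves token lengths and character-pair statistics. This is an illustrative sketch, not the package's generator:

```python
import random
from collections import Counter, defaultdict

def bigram_null_corpus(tokens, rng=None):
    """Sample a null corpus from the tokens' bigram distribution, keeping lengths."""
    rng = rng or random.Random(0)
    # Estimate start-character and bigram transition counts from the tokens.
    starts = Counter(t[0] for t in tokens if t)
    trans = defaultdict(Counter)
    for t in tokens:
        for a, b in zip(t, t[1:]):
            trans[a][b] += 1

    def sample(length):
        c = rng.choices(list(starts), weights=list(starts.values()))[0]
        out = [c]
        for _ in range(length - 1):
            nxt = trans[out[-1]] or starts  # dead end: restart from a start char
            c = rng.choices(list(nxt), weights=list(nxt.values()))[0]
            out.append(c)
        return "".join(out)

    # Same token lengths, same pair statistics, no word identity.
    return [sample(len(t)) for t in tokens if t]
```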

When your decode came from a search

If your decoded tokens are the output of a stochastic search over a key space (simulated annealing on a substitution alphabet, hill-climbing, AZdecrypt, etc.), net_signal alone can mislead. The search itself can manufacture apparent signal: a quadgram-optimised key on a short cipher will find local optima that resolve into a handful of high-frequency dictionary words even when the cipher has no underlying linguistic structure. The Dorabella case (Ruckman 2026) documents this failure mode at +0.55 net_signal.

The fix is to give the same search procedure the same matched-budget opportunity on shuffles of the cipher — multiset-preserving permutations that destroy positional content but keep the character budget constant. If the search finds materially more signal on the real cipher than on its shuffles, that excess is the calibrated signal.

from dictcollision import search_calibrated_signal

result = search_calibrated_signal(
    cipher_symbols=cipher,        # the cipher itself, not decoded tokens
    search_fn=my_sa_search,       # cipher -> decoded tokens
    dictionary=word_set,
    n_shuffles=30,
)
print(result.summary())
# z_score >= 3 → search finds real signal
# −1 ≤ z < 1   → indistinguishable from a shuffle baseline

search_calibrated_signal and null_distribution solve different problems:

Question Use
"is this fixed decode's signal distinguishable from a bigram null?" null_distribution()
"does my search procedure find more signal on the real cipher than on shuffles of it?" search_calibrated_signal()

Both can be informative; reach for search_calibrated_signal whenever the decoded tokens were chosen by a key-space optimiser.
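Under the hood, matched-budget calibration amounts to re-running the search on permutations of the cipher and standardising the real score against that null. A sketch with hypothetical names (`search_fn` and `score_fn` stand in for whatever your pipeline provides; this is not the package's implementation):

```python
import random
import statistics

def shuffle_calibrated_excess(cipher_symbols, search_fn, score_fn,
                              n_shuffles=30, seed=0):
    """Excess of the real-cipher score over a shuffle null, plus a z-score."""
    rng = random.Random(seed)
    # Score the search on the real cipher.
    real = score_fn(search_fn(list(cipher_symbols)))
    # Matched budget: run the same search on multiset-preserving shuffles.
    null_scores = []
    for _ in range(n_shuffles):
        shuffled = list(cipher_symbols)
        rng.shuffle(shuffled)  # destroys position, keeps the character budget
        null_scores.append(score_fn(search_fn(shuffled)))
    mu = statistics.mean(null_scores)
    sd = statistics.stdev(null_scores)
    z = (real - mu) / sd if sd > 0 else float("nan")
    return real - mu, z
```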

Full API

from dictcollision import (
    noise_floor,                  # analytical collision prediction
    classify,                     # four-category classification
    classify_by_length,           # per-length-bucket breakdown
    recommend,                    # rank candidate dictionaries
    null_distribution,            # Monte Carlo null distribution
    bootstrap_ci,                 # bootstrap CI on net_signal
    search_calibrated_signal,     # matched-budget shuffle calibration
    load_dictionary, load_tokens, # file loaders
)

from dictcollision.baselines import (
    apparent_hit_rate,            # no correction
    subtract_null,                # naive baseline
    permutation_test,             # per-word Poisson test
    bh_fdr,                       # Benjamini-Hochberg
    blast_evalue,                 # BLAST-style E-value
    all_methods,                  # all six in one dict (Table 2 style)
)

from dictcollision.viz import (
    plot_decomposition,           # paper Figure 1
    plot_size_sweep,              # paper Figure 2
    plot_method_comparison,       # paper Figure 5
    plot_length_stratified,       # paper Figure 13
)

Examples

Self-contained scripts live in examples/.

Paper

Methodology, experiments, validation:

Ruckman, M. (2026). The Dictionary Collision Effect in Computational Decipherment. Source, figures, and reproduction code: https://github.com/mruckman1/signal-isolation-paper

Citation

@article{ruckman2026dictcollision,
  title={The Dictionary Collision Effect in Computational Decipherment},
  author={Ruckman, Matthew},
  year={2026},
  url={https://github.com/mruckman1/signal-isolation-paper}
}

License

MIT
