
dictcollision


Calibrate dictionary hit rates. Given a list of short strings and a reference dictionary, separate real matches from chance collisions.


The general problem

You have a stream of short strings and a big reference dictionary. Some fraction match. How many are real matches vs. the dictionary being large enough that anything would match?

| Domain | "Decoded tokens" | "Dictionary" | "Real signal" means |
| --- | --- | --- | --- |
| Decipherment / cryptanalysis | candidate plaintext | language wordlist | the decode works |
| OCR validation | extracted strings | lexicon | the OCR read correctly |
| Spell-check eval | candidate corrections | target vocabulary | the correction fired |
| Autocomplete ranking | prefix expansions | vocab | the candidate is meaningful |
| Password audit | cracked-string attempts | common-words list | the password was weak, not a random collision |
| Fuzzy dedup | near-match candidates | canonical set | they are actually duplicates |
| RNG / fuzzer QA | generated strings | wordlist | the generator accidentally emitted real words |

If the input is short (2–4 chars) and the dictionary is large (10K+), naive hit-rate metrics are badly inflated by chance. This package fixes that.

Install

pip install dictcollision
pip install "dictcollision[viz]"   # optional matplotlib plots

Quick start

from dictcollision import noise_floor, classify, recommend

# 1. One-line prediction: how much of my 43% hit rate is chance?
predicted = noise_floor(decoded_tokens, dictionary)
print(f"Chance alone predicts {predicted:.1%}; observed 43.0%; "
      f"excess {0.43 - predicted:.1%}")

# 2. Full four-category analysis.
result = classify(decoded_tokens, dictionary)
print(result.summary())

# 3. Rank candidate dictionaries when you don't know the language.
ranked = recommend(decoded_tokens,
                   {"latin_10k": latin_words, "german_50k": german_words})
print(ranked[0].name, ranked[0].excess)

result.summary() prints:

ClassifyResult (n=5000 tokens)
  apparent hit rate :  99.6%
  net signal        :  70.1%   <- calibrated metric
  correction        :  29.5%   <- amount subtracted

  signal       70.1%  ████████████████░░░░░░░  real matches
  shared_hit   19.4%  ████░░░░░░░░░░░░░░░░░░░  chance collisions
  anti_signal   0.6%  ░░░░░░░░░░░░░░░░░░░░░░░  phantom matches
  shared_miss   9.9%  ██░░░░░░░░░░░░░░░░░░░░░  non-dict tokens

  Interpretation: strong signal — dictionary is a good fit

Command line

No Python code required:

python -m dictcollision --tokens decoded.txt --dict latin_50k.txt
python -m dictcollision --tokens decoded.txt --dict latin.txt --baselines
python -m dictcollision --tokens decoded.txt --dict latin.txt --json > report.json

Supported dictionary formats: one word per line, word count (hermitdave FrequencyWords), or CSV. See python -m dictcollision --help.

Input and output contract

Input:

decoded_tokens : list[str]          # e.g. ["the", "cat", "ab", "cd", ...]
                                    # any Unicode; no preprocessing assumed
dictionary     : set[str] | list[str]   # reference words, same encoding

Output: classify returns a ClassifyResult dataclass with these fields:

| Field | Type | Range | Meaning |
| --- | --- | --- | --- |
| net_signal | float | [-1, 1] | the calibrated metric: signal − anti_signal |
| signal | float | [0, 1] | real hits |
| shared_hit | float | [0, 1] | chance collisions that happen to also be real words |
| anti_signal | float | [0, 1] | phantom matches (null-only) |
| shared_miss | float | [0, 1] | non-dictionary tokens |
| apparent_hit_rate | float | [0, 1] | what a naive evaluator would report |
| correction | float | ≥ 0 | apparent − net_signal |
| signal_words | list[str] | | types driving real signal |
| anti_signal_words | list[str] | | types that inflate chance; inspect to diagnose |
| n_tokens | int | | total count |

Interpreting net_signal

| Range | Meaning |
| --- | --- |
| ≥ 0.20 | Strong signal: dictionary is a good fit |
| 0.05 – 0.20 | Partial signal: possibly correct, with caveats |
| ≈ 0 | No signal beyond chance |
| < 0 | Worse than random: wrong language or wrong decode key |
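These bands can be encoded as a small helper. This is illustrative only (the package's summary() does its own labeling), and the exact width of the "≈ 0" band is an assumption; the table above gives no precise cutoff:

```python
def interpret_net_signal(net_signal):
    """Map a net_signal value to the verdict bands in the table above.
    The +/-0.05 window for 'no signal' is an assumed boundary."""
    if net_signal >= 0.20:
        return "strong signal: dictionary is a good fit"
    if net_signal >= 0.05:
        return "partial signal: possibly correct, with caveats"
    if net_signal > -0.05:
        return "no signal beyond chance"
    return "worse than random: wrong language or wrong decode key"
```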

The core equation

The predicted noise floor for dictionary $D$ against decoded text with character distribution $p$ is:

$$\hat{r} \;=\; \sum_{w \in D}\; \prod_{i=1}^{|w|} p(w_i)$$

For each dictionary word, multiply together the frequencies (measured on your decoded output) of its characters, then sum over the whole dictionary. The result is the expected fraction of your tokens that would match by accident.
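The formula fits in a few lines of Python. The package's noise_floor presumably does more (smoothing, length handling); this standalone sketch just implements the equation literally, and the name noise_floor_estimate is illustrative:

```python
from collections import Counter

def noise_floor_estimate(tokens, dictionary):
    """Analytical chance-collision rate r-hat: for each dictionary
    word, multiply the per-character frequencies observed in the
    decoded tokens, then sum over the dictionary."""
    counts = Counter(ch for tok in tokens for ch in tok)
    total = sum(counts.values())
    p = {ch: n / total for ch, n in counts.items()}
    r_hat = 0.0
    for word in dictionary:
        prob = 1.0
        for ch in word:
            prob *= p.get(ch, 0.0)  # unseen character -> zero probability
        r_hat += prob
    return r_hat

# Uniform distribution over {a, b}: each 2-char word has probability 0.25.
print(noise_floor_estimate(["ab", "ba", "aa", "bb"], {"ab", "ba"}))  # 0.5
```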

Four-category framework

| Category | In dictionary? | In real text? | In null corpora? |
| --- | --- | --- | --- |
| Signal | yes | yes | no |
| Shared hit | yes | yes | yes |
| Anti-signal | yes | no | yes |
| Shared miss | no | — | — |

Net signal = Signal − Anti-signal is the calibrated metric.

Null corpora are generated from the decoded text's character bigram distribution (configurable: unigram / bigram / trigram), preserving character-pair frequencies and token lengths while destroying word identity. On wrong-language evaluations, the four-category framework is the only one of the six tested methods that correctly reports net signal ≤ 0.
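As a rough sketch of what the default bigram null generator does (the package's actual sampler is not shown here, so the details below are assumptions), each output token keeps an input token's length while its characters are resampled from observed character pairs:

```python
import random
from collections import defaultdict

def bigram_null_corpus(tokens, seed=0):
    """Resample tokens from the character-bigram distribution of the
    input: token lengths are preserved, characters are drawn from
    observed (previous char -> next char) pairs, word identity is lost."""
    rng = random.Random(seed)
    follow = defaultdict(list)      # previous char -> observed next chars
    for tok in tokens:
        prev = "^"                  # "^" marks token start
        for ch in tok:
            follow[prev].append(ch)
            prev = ch
    null = []
    for tok in tokens:
        prev, out = "^", []
        for _ in range(len(tok)):
            pool = follow.get(prev) or follow["^"]  # back off to start dist
            ch = rng.choice(pool)
            out.append(ch)
            prev = ch
        null.append("".join(out))
    return null
```

Sampling from raw observation lists (rather than normalized probabilities) keeps the sketch short while drawing from the same empirical distribution.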

Full API

from dictcollision import (
    noise_floor,                  # analytical collision prediction
    classify,                     # four-category classification
    classify_by_length,           # per-length-bucket breakdown
    recommend,                    # rank candidate dictionaries
    null_distribution,            # Monte Carlo null distribution
    bootstrap_ci,                 # bootstrap CI on net_signal
    load_dictionary, load_tokens, # file loaders
)

from dictcollision.baselines import (
    apparent_hit_rate,            # no correction
    subtract_null,                # naive baseline
    permutation_test,             # per-word Poisson test
    bh_fdr,                       # Benjamini-Hochberg
    blast_evalue,                 # BLAST-style E-value
    all_methods,                  # all six in one dict (Table 2 style)
)

from dictcollision.viz import (
    plot_decomposition,           # paper Figure 1
    plot_size_sweep,              # paper Figure 2
    plot_method_comparison,       # paper Figure 5
    plot_length_stratified,       # paper Figure 13
)

Examples

Self-contained scripts live in the examples/ directory of the repository.

Paper

Methodology, experiments, validation:

Ruckman, M. (2026). The Dictionary Collision Effect in Computational Decipherment. Source, figures, and reproduction code: https://github.com/mruckman1/signal-isolation-paper

Citation

@article{ruckman2026dictcollision,
  title={The Dictionary Collision Effect in Computational Decipherment},
  author={Ruckman, Matthew},
  year={2026},
  url={https://github.com/mruckman1/signal-isolation-paper}
}

License

MIT
