
dictcollision


Calibrate dictionary hit rates. Given a list of short strings and a reference dictionary, separate real matches from chance collisions.


The general problem

You have a stream of short strings and a big reference dictionary. Some fraction match. How many are real matches vs. the dictionary being large enough that anything would match?

| Domain | "Decoded tokens" | "Dictionary" | "Real signal" means |
|---|---|---|---|
| Decipherment / cryptanalysis | candidate plaintext (¹) | language wordlist | the decode works |
| OCR validation | extracted strings | lexicon | the OCR read correctly |
| Spell-check eval | candidate corrections | target vocabulary | the correction fired |
| Autocomplete ranking | prefix expansions | vocab | the candidate is meaningful |
| Password audit | cracked-string attempts | common-words list | the password was weak, not a random collision |
| Fuzzy dedup | near-match candidates | canonical set | they are actually duplicates |
| RNG / fuzzer QA | generated strings | wordlist | did your generator accidentally emit real words? |

If the input is short (2–4 chars) and the dictionary is large (10K+), naive hit-rate metrics are badly inflated by chance. This package fixes that.

(¹) When the candidate plaintext was produced by a stochastic search over a key space (SA, hill-climbing), the cipher symbols themselves are also a relevant input — see When your decode came from a search.

Install

pip install dictcollision
pip install "dictcollision[viz]"   # with matplotlib

Or with uv:

uv add dictcollision                 # into a uv project
uv pip install dictcollision         # into the active venv
uv tool install dictcollision        # install the CLI globally

Quick start

from dictcollision import noise_floor, classify, recommend

# 1. One-line prediction: how much of my 43% hit rate is chance?
predicted = noise_floor(decoded_tokens, dictionary)
print(f"Chance alone predicts {predicted:.1%}; observed 43.0%; "
      f"excess {0.43 - predicted:.1%}")

# 2. Full four-category analysis.
result = classify(decoded_tokens, dictionary)
print(result.summary())

# 3. Rank candidate dictionaries when you don't know the language.
ranked = recommend(decoded_tokens,
                   {"latin_10k": latin_words, "german_50k": german_words})
print(ranked[0].name, ranked[0].excess)

result.summary() prints:

ClassifyResult (n=5000 tokens)
  apparent hit rate :  99.6%
  net signal        :  70.1%   <- calibrated metric
  correction        :  29.5%   <- amount subtracted

  signal       70.1%  ████████████████░░░░░░░  real matches
  shared_hit   19.4%  ████░░░░░░░░░░░░░░░░░░░  chance collisions
  anti_signal   0.6%  ░░░░░░░░░░░░░░░░░░░░░░░  phantom matches
  shared_miss   9.9%  ██░░░░░░░░░░░░░░░░░░░░░  non-dict tokens

  Interpretation: strong signal — dictionary is a good fit

Command line

No Python code required:

python -m dictcollision --tokens decoded.txt --dict latin_50k.txt
python -m dictcollision --tokens decoded.txt --dict latin.txt --baselines
python -m dictcollision --tokens decoded.txt --dict latin.txt --json > report.json

Supported dictionary formats: one word per line, word count (hermitdave FrequencyWords), or CSV. See python -m dictcollision --help.
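For reference, the word-count format takes only a few lines to parse. Below is an illustrative sketch — the hypothetical `load_wordlist` is not the package's `load_dictionary`, and it handles only the one-word-per-line and word-count layouts, not CSV:

```python
def load_wordlist(path, min_count=1):
    """Parse "word" or "word count" lines (hermitdave FrequencyWords style)."""
    words = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue  # skip blank lines
            # If a count column is present, apply the frequency threshold.
            if len(parts) >= 2 and parts[1].isdigit() and int(parts[1]) < min_count:
                continue
            words.add(parts[0])
    return words
```

A `min_count` threshold is handy with frequency lists, where the long tail of rare entries is exactly what inflates chance collisions.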

Input and output contract

Input:

decoded_tokens : list[str]          # e.g. ["the", "cat", "ab", "cd", ...]
                                    # any Unicode; no preprocessing assumed
dictionary     : set[str] | list[str]   # reference words, same encoding

Output: classify returns a ClassifyResult dataclass:

| Field | Type | Range | Meaning |
|---|---|---|---|
| net_signal | float | [-1, 1] | The calibrated metric: signal − anti_signal |
| signal | float | [0, 1] | real hits |
| shared_hit | float | [0, 1] | chance collisions that happen to also be real words |
| anti_signal | float | [0, 1] | phantom matches (null-only) |
| shared_miss | float | [0, 1] | non-dictionary tokens |
| apparent_hit_rate | float | [0, 1] | what a naive evaluator would report |
| correction | float | ≥ 0 | apparent − net_signal |
| signal_words | list[str] | — | types driving real signal |
| anti_signal_words | list[str] | — | types that inflate chance — inspect to diagnose |
| n_tokens | int | — | total count |

Interpreting net_signal

| Range | Meaning |
|---|---|
| ≥ 0.20 | Strong signal — dictionary is a good fit |
| 0.05 – 0.20 | Partial signal — possibly correct with caveats |
| ≈ 0 | No signal beyond chance |
| < 0 | Worse than random — wrong language or wrong decode key |
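These thresholds can be wrapped in a small helper. A sketch — the cut-offs mirror the table, but the width of the "≈ 0" band (`zero_band`) is an assumption, not a package default:

```python
def interpret_net_signal(net_signal: float, zero_band: float = 0.02) -> str:
    """Map a net_signal value to the verdicts in the interpretation table."""
    if net_signal >= 0.20:
        return "strong signal"
    if net_signal >= 0.05:
        return "partial signal"
    if net_signal >= -zero_band:  # assumed width for the "approximately zero" band
        return "no signal beyond chance"
    return "worse than random"
```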

The core equation

The predicted noise floor for dictionary $D$ against decoded text with character distribution $p$ is:

$$\hat{r} \;=\; \sum_{w \in D}\; \prod_{i=1}^{|w|} p(w_i)$$

For every word in the dictionary, take the product of the character frequencies of your decoded output over that word's characters, then sum across the dictionary. The result is the fraction of tokens expected to match by accident.
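The equation is a one-liner in practice. A minimal re-derivation from raw tokens — an illustrative sketch, not the packaged `noise_floor`:

```python
import math
from collections import Counter

def noise_floor_sketch(tokens, dictionary):
    """Predicted chance hit rate: sum over D of prod_i p(w_i)."""
    # p: character distribution of the decoded output.
    counts = Counter(c for t in tokens for c in t)
    total = sum(counts.values())
    p = {c: n / total for c, n in counts.items()}
    # Unseen characters get probability 0, so those words contribute nothing.
    return sum(math.prod(p.get(c, 0.0) for c in w) for w in dictionary)
```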

Four-category framework

| Category | In dictionary? | In real text? | In null corpora? |
|---|---|---|---|
| Signal | yes | yes | no |
| Shared hit | yes | yes | yes |
| Anti-signal | yes | no | yes |
| Shared miss | no | — | — |

Net signal = Signal − Anti-signal is the calibrated metric.

Null corpora are generated from the decoded text's character bigram distribution (configurable: unigram / bigram / trigram), preserving character-pair frequencies and token lengths while destroying word identity. On wrong-language evaluations the four-category framework is the only method among six tested that correctly reports signal as ≤ 0.
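A null corpus of this kind can be sketched as a bigram sampler that preserves token lengths and character-pair statistics. This is an illustrative sketch, not the package's generator:

```python
import random
from collections import Counter, defaultdict

def bigram_null_corpus(tokens, rng=None):
    """Sample a null corpus from the tokens' bigram distribution, keeping lengths."""
    rng = rng or random.Random(0)
    # Estimate start-character and bigram transition counts from the tokens.
    starts = Counter(t[0] for t in tokens if t)
    trans = defaultdict(Counter)
    for t in tokens:
        for a, b in zip(t, t[1:]):
            trans[a][b] += 1

    def sample(length):
        c = rng.choices(list(starts), weights=list(starts.values()))[0]
        out = [c]
        for _ in range(length - 1):
            nxt = trans[out[-1]] or starts  # dead end: restart from a start char
            c = rng.choices(list(nxt), weights=list(nxt.values()))[0]
            out.append(c)
        return "".join(out)

    # Same token lengths, same pair statistics, no word identity.
    return [sample(len(t)) for t in tokens if t]
```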

When your decode came from a search

If your decoded tokens are the output of a stochastic search over a key space (simulated annealing on a substitution alphabet, hill-climbing, AZdecrypt, etc.), net_signal alone can mislead. The search itself can manufacture apparent signal: a quadgram-optimised key on a short cipher will find local optima that resolve into a handful of high-frequency dictionary words even when the cipher has no underlying linguistic structure. The Dorabella case (Ruckman 2026) documents this failure mode at +0.55 net_signal.

The fix is to give the same search procedure the same matched-budget opportunity on shuffles of the cipher — multiset-preserving permutations that destroy positional content but keep the character budget constant. If the search finds materially more signal on the real cipher than on its shuffles, that excess is the calibrated signal.

from dictcollision import search_calibrated_signal

result = search_calibrated_signal(
    cipher_symbols=cipher,        # the cipher itself, not decoded tokens
    search_fn=my_sa_search,       # cipher -> decoded tokens
    dictionary=word_set,
    n_shuffles=30,
)
print(result.summary())
# z_score >= 3 → search finds real signal
# −1 ≤ z < 1   → indistinguishable from a shuffle baseline

search_calibrated_signal and null_distribution solve different problems:

Question Use
"is this fixed decode's signal distinguishable from a bigram null?" null_distribution()
"does my search procedure find more signal on the real cipher than on shuffles of it?" search_calibrated_signal()

Both can be informative; reach for search_calibrated_signal whenever the decoded tokens were chosen by a key-space optimiser.
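Under the hood, matched-budget calibration amounts to re-running the search on permutations of the cipher and standardising the real score against that null. A sketch with hypothetical names (`search_fn` and `score_fn` stand in for whatever your pipeline provides; this is not the package's implementation):

```python
import random
import statistics

def shuffle_calibrated_excess(cipher_symbols, search_fn, score_fn,
                              n_shuffles=30, seed=0):
    """Excess of the real-cipher score over a shuffle null, plus a z-score."""
    rng = random.Random(seed)
    # Score the search on the real cipher.
    real = score_fn(search_fn(list(cipher_symbols)))
    # Matched budget: run the same search on multiset-preserving shuffles.
    null_scores = []
    for _ in range(n_shuffles):
        shuffled = list(cipher_symbols)
        rng.shuffle(shuffled)  # destroys position, keeps the character budget
        null_scores.append(score_fn(search_fn(shuffled)))
    mu = statistics.mean(null_scores)
    sd = statistics.stdev(null_scores)
    z = (real - mu) / sd if sd > 0 else float("nan")
    return real - mu, z
```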

Full API

from dictcollision import (
    noise_floor,                  # analytical collision prediction
    classify,                     # four-category classification
    classify_by_length,           # per-length-bucket breakdown
    recommend,                    # rank candidate dictionaries
    null_distribution,            # Monte Carlo null distribution
    bootstrap_ci,                 # bootstrap CI on net_signal
    search_calibrated_signal,     # matched-budget shuffle calibration
    load_dictionary, load_tokens, # file loaders
)

from dictcollision.baselines import (
    apparent_hit_rate,            # no correction
    subtract_null,                # naive baseline
    permutation_test,             # per-word Poisson test
    bh_fdr,                       # Benjamini-Hochberg
    blast_evalue,                 # BLAST-style E-value
    all_methods,                  # all six in one dict (Table 2 style)
)

from dictcollision.viz import (
    plot_decomposition,           # paper Figure 1
    plot_size_sweep,              # paper Figure 2
    plot_method_comparison,       # paper Figure 5
    plot_length_stratified,       # paper Figure 13
)

Examples

Self-contained scripts live in examples/.

Paper

Methodology, experiments, validation:

Ruckman, M. (2026). The Dictionary Collision Effect in Computational Decipherment. Source, figures, and reproduction code: https://github.com/mruckman1/signal-isolation-paper

Citation

@article{ruckman2026dictcollision,
  title={The Dictionary Collision Effect in Computational Decipherment},
  author={Ruckman, Matthew},
  year={2026},
  url={https://github.com/mruckman1/signal-isolation-paper}
}

License

MIT
