Corpus keyness, rank-turbulence divergence, and allotaxonographs
keyflux
Corpus keyness, rank-turbulence divergence, and allotaxonographs — in pure Python.
keyflux owns the whole comparison arc that diachronic and comparative discourse analysis usually splits across tools and languages. It derives keywords and lockwords from a focus-versus-reference comparison using proper corpus-linguistic measures (log-likelihood for significance, log ratio for effect size — not just chi-square), compares the resulting ranked lists with rank-turbulence divergence (RTD), and renders the allotaxonograph: the rank-rank map plus the ranked list of which exact words drove the shift. No JavaScript runtime — figures are matplotlib.
It replaces the usual "Jaccard overlap on the top-N keywords" summary — one opaque number that throws away rank, everything below the cutoff, and any account of which words moved — with a transparent, pip-installable pipeline.
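To see why a top-N Jaccard score is opaque, here is a minimal sketch in plain Python. It does not use keyflux; the two keyword lists and the helper `jaccard` are hypothetical and exist only for illustration. The lists have identical members in reversed order, so the overlap is perfect even though every word changed rank.

```python
def jaccard(a, b):
    """Jaccard overlap of two collections, treated as sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Two hypothetical top-4 keyword lists: same members, reversed order.
top_2019 = ["climate", "carbon", "policy", "market"]
top_2024 = ["market", "policy", "carbon", "climate"]

overlap = jaccard(top_2019, top_2024)

# Per-word rank change (positive = moved down the list in 2024).
rank_2019 = {w: i + 1 for i, w in enumerate(top_2019)}
rank_2024 = {w: i + 1 for i, w in enumerate(top_2024)}
shifts = {w: rank_2024[w] - rank_2019[w] for w in top_2019}

print(overlap)  # 1.0, although the ordering is completely inverted
print(shifts)   # per-word movement that the single number hides
```

The set-based score reports no change at all, while the per-word shifts show that the top and bottom of the list swapped places. A rank-sensitive measure is meant to capture exactly that information.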
Installation
uv add keyflux
Quickstart
from collections import Counter
from keyflux import Keyness, RankedList, rtd, allotaxonograph
# 1. Keyness: focus vs reference
focus = Counter({"climate": 30, "carbon": 12, "the": 80, "policy": 9})
reference = Counter({"climate": 3, "carbon": 1, "the": 78, "market": 15})
k = Keyness(focus, reference, measure="log_likelihood")
keywords = k.keywords(top=20)
lockwords = k.lockwords()
# 2. Rank-turbulence divergence between two ranked lists
r1 = RankedList.from_counts(focus, label="2019")
r2 = RankedList.from_counts(reference, label="2024")
result = rtd(r1, r2, alpha=1 / 3)
print(result.divergence)
# 3. Allotaxonograph (returns a matplotlib Figure)
fig = allotaxonograph(r1, r2, alpha=1 / 3, labels=("2019", "2024"))
fig.savefig("allotaxonograph.png")
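For readers who want to see what step 1 measures, the sketch below computes log-likelihood and log ratio by hand for the quickstart counts. It is an illustration under stated assumptions, not keyflux's implementation: the function names are hypothetical, the log-likelihood uses the two-cell form common in corpus linguistics, and zero counts are replaced by a floor of 0.5 in the log ratio (the value the roadmap mentions as the current default).

```python
import math
from collections import Counter

focus = Counter({"climate": 30, "carbon": 12, "the": 80, "policy": 9})
reference = Counter({"climate": 3, "carbon": 1, "the": 78, "market": 15})
n_focus, n_ref = sum(focus.values()), sum(reference.values())

def log_likelihood(a, b, n_focus, n_ref):
    """Two-cell log-likelihood (G2): a = count in focus, b = count in reference."""
    total = n_focus + n_ref
    expected_focus = n_focus * (a + b) / total
    expected_ref = n_ref * (a + b) / total
    g2 = 0.0
    for observed, expected in ((a, expected_focus), (b, expected_ref)):
        if observed > 0:  # a zero cell contributes nothing
            g2 += observed * math.log(observed / expected)
    return 2 * g2

def log_ratio(a, b, n_focus, n_ref, floor=0.5):
    """Binary log of the ratio of relative frequencies (assumed zero floor)."""
    a = a if a > 0 else floor
    b = b if b > 0 else floor
    return math.log2((a / n_focus) / (b / n_ref))

for word in ("climate", "the"):
    a, b = focus[word], reference[word]
    ll = log_likelihood(a, b, n_focus, n_ref)
    lr = log_ratio(a, b, n_focus, n_ref)
    print(f"{word}: LL={ll:.2f}, log ratio={lr:.2f}")
```

With these toy counts, "climate" clears the 3.84 critical value (chi-square, one degree of freedom, p < 0.05) with a large positive log ratio, while "the" stays below it with a log ratio near zero, which is the kind of word a lockword zone is meant to hold.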
Features
- Keyness measures: log-likelihood (Dunning), log ratio, Simple Maths, %DIFF, and chi-square (for contrast) — significance flagged against the chi-square thresholds
- Keywords and lockwords: positive / negative keywords plus the stable lockword zone
- Rank-turbulence divergence: tunable, rank-sensitive comparison of any two rankings (frequency, keyness score, …) with per-type contributions and an explicit alpha-to-zero log limit
- Allotaxonographs: a two-panel view (allotaxonograph) and the full Dodds (2020) diamond (allotaxonometer), with rank-rank histogram, iso-divergence contours, and wordshift; publication-quality matplotlib, no JS runtime
- Reproducibility records: every keyness result emits its reference, cutoffs, and measure
Documentation
Full documentation — quickstart, the keyness and allotaxonograph tutorials,
troubleshooting, and the complete API reference — is at
keyflux.readthedocs.io. The sources live in docs/.
Research direction: comparing many rankings
Rank-turbulence divergence and the allotaxonograph are pairwise — they compare two rankings at a time. This is true of the whole allotaxonometry line, including the 2025 tooling suite (arXiv:2506.21808). But the questions we care about are often many-way: how does presidential vocabulary drift across all eleven eras at once? Which of a dozen speaker groups is the outlier? Comparing many rankings simultaneously is an open problem we intend to research and, eventually, support.
The nearest existing framework is rank aggregation: finding a consensus ranking that best agrees with a set of input rankings. The classic formulation is the Kemeny median (minimise total pairwise disagreement), which is NP-hard, with squared-distance and set-wise / k-wise generalisations (Kemeny aggregation; squared Kemeny; set-wise Kemeny). Candidate directions for keyflux: a pairwise RTD matrix (all-pairs divergence plus clustering/MDS of systems), consensus-vs-each allotaxonographs (compare every ranking against an aggregate), and time-series flipbooks of successive allotaxonographs. If you work on this, we'd love to hear from you.
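To make the Kemeny median concrete, here is a brute-force sketch in plain Python. It is not a keyflux feature; the function names and the three toy rankings are hypothetical. The search enumerates every permutation, which is only feasible for a handful of items and is one way to see why the general problem is hard.

```python
from itertools import combinations, permutations

def kendall_tau_distance(order_a, order_b):
    """Number of item pairs that the two orderings rank in opposite order."""
    pos_a = {item: i for i, item in enumerate(order_a)}
    pos_b = {item: i for i, item in enumerate(order_b)}
    return sum(
        1
        for x, y in combinations(order_a, 2)
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )

def total_disagreement(candidate, rankings):
    """Sum of pairwise disagreements between a candidate and every input ranking."""
    return sum(kendall_tau_distance(candidate, r) for r in rankings)

def kemeny_median(rankings):
    """Exhaustive search: the permutation with the least total disagreement."""
    items = rankings[0]  # assumes every ranking orders the same items
    return min(permutations(items), key=lambda c: total_disagreement(c, rankings))

# Three hypothetical rankings of the same four words.
rankings = [
    ("climate", "carbon", "policy", "market"),
    ("climate", "policy", "carbon", "market"),
    ("carbon", "climate", "policy", "market"),
]
consensus = kemeny_median(rankings)
print(consensus, total_disagreement(consensus, rankings))
```

A consensus found this way could serve as the aggregate in the consensus-vs-each comparison described above, though real vocabularies are far too large for exhaustive search and would need a heuristic or approximation.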
Roadmap
Planned for the next iteration. The robustness items are analysed in detail in
PRE-MORTEM.md, and the open modelling choices are listed in
CHANGES_SUMMARY.md.
Robustness / API decisions
- Revisit the zero-cell floor default (0.5): it sets the effect size of every exclusive keyword and reorders the top of the list.
- Decide whether min_focus_freq / min_reference_freq should default asymmetrically (keep focus-exclusive keywords while demanding more reference evidence).
- Add Cohen's d (dispersion-aware effect size) once the corpus input can carry sub-corpus structure.
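The sensitivity behind the zero-cell floor item can be shown with a few lines of arithmetic. This sketch assumes the floor simply replaces a zero count before the log ratio is taken; the function `log_ratio`, the corpus sizes, and the counts are hypothetical and do not come from keyflux.

```python
import math

def log_ratio(a, b, n_focus, n_ref, floor):
    """Log ratio with zero counts replaced by `floor` (an assumed convention)."""
    a = a if a > 0 else floor
    b = b if b > 0 else floor
    return math.log2((a / n_focus) / (b / n_ref))

n_focus, n_ref = 10_000, 10_000  # hypothetical corpus sizes

# Hypothetical (focus, reference) counts.
exclusive = (5, 0)   # never occurs in the reference corpus
shared = (40, 5)     # occurs in both, eight times more often in focus

shared_score = log_ratio(*shared, n_focus, n_ref, floor=0.5)
exclusive_scores = {
    floor: log_ratio(*exclusive, n_focus, n_ref, floor=floor)
    for floor in (1.0, 0.5, 0.1)
}

print("shared:", round(shared_score, 2))
for floor, score in exclusive_scores.items():
    print(f"exclusive, floor={floor}:", round(score, 2))
```

The shared word's score does not depend on the floor, but the exclusive word moves from below it to above it as the floor shrinks, so the choice of default decides which of the two tops the list.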
Proposed features
- Rank by any score, not just frequency (RankedList.from_scores): compare keyword rankings, keyness scores, or any metric.
- Comparing many rankings at once: see Research direction above.
- Optional self-contained interactive HTML+JS allotaxonograph export (an alpha slider), gated behind an extra so the core stays pure Python.
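Ranking by an arbitrary score raises the question of ties, which are rarer with raw frequencies in large corpora but common with rounded or thresholded scores. The sketch below shows one possible convention, mean ranks for tied items; it is an assumption for illustration and says nothing about how RankedList.from_scores will behave. The function `fractional_ranks` and the scores are hypothetical.

```python
def fractional_ranks(scores):
    """Map item -> rank (1 = highest score); tied items share their mean rank."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    ranks = {}
    i = 0
    while i < len(ordered):
        # Find the end of the run of items tied with ordered[i].
        j = i
        while j < len(ordered) and ordered[j][1] == ordered[i][1]:
            j += 1
        mean_rank = (i + 1 + j) / 2  # mean of positions i+1 .. j
        for item, _ in ordered[i:j]:
            ranks[item] = mean_rank
        i = j
    return ranks

# Hypothetical keyness scores with one tie.
keyness_scores = {"climate": 18.3, "carbon": 7.1, "policy": 7.1, "the": 3.0}
ranks = fractional_ranks(keyness_scores)
print(ranks)
```

Mean ranks keep the comparison independent of the arbitrary order in which tied items happen to be stored, which matters once those ranks feed a rank-sensitive divergence.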
Maintenance
- Publish to PyPI and wire up ReadTheDocs.
Made by
keyflux is made by Crow Intelligence.
License
MIT