Comparative corpus analysis for Python: keyness, collocations, semantic shift, temporal trajectories with changepoints + causal inference.

pycorpdiff

Comparative corpus analysis for modern Python workflows.

pycorpdiff is the missing comparative layer between R's quanteda, the closed-source SketchEngine platform, and the fragmented Python NLP stack (nltk/spaCy/gensim/sentence-transformers). Three public verbs — compare(a, b), track(c, term), compare.before_after(c, event) — consolidate keyness, collocations, dispersion, temporal trajectories, changepoint detection, interrupted time series, causal-impact analysis, forecasting, online changepoint detection, and embedding-based semantic shift under a single notebook-native API. Keyness and collocation results carry their own KWIC evidence: .explain(term) returns the source-text concordances behind any ranked term.

The package answers questions that corpus linguistics, digital humanities, and computational social science routinely ask:

How does corpus A differ from corpus B? — compare(a, b).keyness()
How has discourse around X evolved over time? — track(c, "x").over_time()
What did "migrant" mean in 2005 vs 2023? — compare(...).semantic_shift("migrant", embedder=...)
Did this event actually shift the conversation? — track(...).causal_impact(event_date=...)
Where is the discourse heading? — track(...).forecast(horizon=4)

pycorpdiff is positioned as orchestration, not reinvention. Tokenizers (spaCy, Stanza, jieba, fugashi) and embedders (any SBERT-compatible model) plug in via two typing.Protocol extension points — one-line adapters, no plugin registry. The base install's direct runtime dependencies are numpy, pandas, scipy, and pyarrow; everything else is opt-in via extras.
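As an illustration of how a typing.Protocol extension point lets a plain function or class act as an adapter without a plugin registry, here is a minimal sketch. The protocol names `Tokenizer` and `Embedder`, their method signatures, and the toy `LengthEmbedder` are assumptions made for this example; pycorpdiff's actual protocol definitions may differ.

```python
from typing import Protocol, Sequence, runtime_checkable


@runtime_checkable
class Tokenizer(Protocol):
    """Hypothetical tokenizer protocol: text in, tokens out."""

    def __call__(self, text: str) -> list[str]: ...


@runtime_checkable
class Embedder(Protocol):
    """Hypothetical embedder protocol: texts in, one vector per text out."""

    def encode(self, texts: Sequence[str]) -> list[list[float]]: ...


def whitespace_tokenizer(text: str) -> list[str]:
    """A one-line adapter: any callable with this signature fits Tokenizer."""
    return text.lower().split()


class LengthEmbedder:
    """Toy embedder that only demonstrates structural typing; not a real model."""

    def encode(self, texts: Sequence[str]) -> list[list[float]]:
        # Two made-up features per text: character count and word count.
        return [[float(len(t)), float(len(t.split()))] for t in texts]


def count_tokens(text: str, tokenizer: Tokenizer) -> int:
    """Accepts anything that satisfies the protocol; no inheritance required."""
    return len(tokenizer(text))


tokens = whitespace_tokenizer("Comparative corpus analysis")
vectors = LengthEmbedder().encode(["two words"])
n_tokens = count_tokens("Comparative corpus analysis", whitespace_tokenizer)
```

Because protocols are structural, `LengthEmbedder` never subclasses `Embedder`; it only needs a matching `encode` method. Wrapping spaCy or jieba would follow the same pattern, with the wrapper body calling the third-party tokenizer.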

Status: alpha (0.1.0a27). The public API is stable for the features described below, and the package is on PyPI (pip install pycorpdiff). Alpha releases are intentionally rapid and audit-driven, each shipping fixes and tests behind the published version; dependency pins will tighten at beta.

The three-layer architecture

Layer	Purpose	Key surface
1 — Ingestion + `Corpus`	get text in, slice it, hash it	`from_dataframe`, `read_csv`, `read_parquet`, `read_txt`, `read_duckdb`, `from_huggingface`, `fetch_hansard`, `Corpus.slice/by_time/__hash__/doc_term_counts(_sparse)/to_polars`
2 — Pure math	statistics with no I/O	`keyness.{log_likelihood,chi_squared,log_ratio,percent_diff,bayes_factor,permutation_pvalues,keyness_multi,juilland_d,benjamini_hochberg}`; `collocation.{logdice,pmi,t_score,mi_three,collocation_shift,cooccurrence_network}`; `semantic.{HashEmbedder,SBERTEmbedder,semantic_trajectory,neighborhood_drift}`; `temporal.{changepoints,interrupted_time_series,forecast,causal_impact,bocpd}`
3 — Verbs + Results	public API	`compare`, `track`, `compare.before_after`, `keyness_multi`, plus 9 frozen-dataclass Result types each implementing the relevant subset of `.to_df() / .plot() / .explain() / .summary() / .to_html() / .to_json()`
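To make the "pure math" layer concrete, the sketch below computes the two log-likelihood variants commonly used for keyness: the two-cell form used by Rayson's LL Wizard and the full four-cell G² of Dunning (1993). The function names `ll_rayson` and `g2_dunning` are hypothetical and the code is written from the textbook formulas, not taken from pycorpdiff's source.

```python
import math


def _term(observed: float, expected: float) -> float:
    """One O * ln(O / E) term; an empty cell contributes zero."""
    return observed * math.log(observed / expected) if observed > 0 else 0.0


def ll_rayson(a: int, b: int, c: int, d: int) -> float:
    """Two-cell log-likelihood.

    a, b: term frequency in corpus A and corpus B
    c, d: total tokens in corpus A and corpus B
    """
    e1 = c * (a + b) / (c + d)  # expected frequency in A
    e2 = d * (a + b) / (c + d)  # expected frequency in B
    return 2.0 * (_term(a, e1) + _term(b, e2))


def g2_dunning(a: int, b: int, c: int, d: int) -> float:
    """Four-cell G2 over the 2x2 table (term / other tokens, by corpus)."""
    n = c + d
    rows = (a + b, n - a - b)        # term total, non-term total
    cols = (c, d)                    # corpus sizes
    cells = ((a, b), (c - a, d - b)) # observed counts
    total = 0.0
    for i in range(2):
        for j in range(2):
            total += _term(cells[i][j], rows[i] * cols[j] / n)
    return 2.0 * total


# Same relative frequency in both corpora: no keyness signal.
flat = ll_rayson(10, 20, 1000, 2000)

# Term over-represented in corpus A.
two_cell = ll_rayson(50, 10, 1000, 1000)
four_cell = g2_dunning(50, 10, 1000, 1000)
```

The four-cell statistic adds the two "all other tokens" cells, so it is never smaller than the two-cell value; for rare terms in large corpora the difference is usually small.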

Quick start

pip install "pycorpdiff[viz]"

import pycorpdiff as pcd

# Bundled synthetic Hansard-style sample — runs offline, no data download.
corpus = pcd.load_hansard_sample()
immigration = corpus.slice(topic="immigration")

# Which words separate the humanising and criminalising frames?
keyness = pcd.compare(
    immigration.slice(frame="humanising"),
    immigration.slice(frame="criminalising"),
).keyness(min_count=3)

keyness.plot()                # volcano plot — picture the result
# keyness.table.head(10)      # or look at the ranked table directly
# keyness.explain("criminal") # KWIC concordances showing the textual evidence

That's the core workflow in four steps: load a corpus, slice it, compare two slices, plot the result. Every other analytical method (collocation shifts, semantic drift, temporal trajectories, changepoint detection, causal-impact analysis, forecasting, co-occurrence networks, N-way keyness) follows the same shape. See the showcase notebook for the full feature tour, or the cheat sheet below for one-line API previews.
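The quick start mentions that .explain(term) returns KWIC (keyword-in-context) concordances. For readers new to the idea, a minimal standard-library sketch of a KWIC view follows. The function `kwic` and its `window` parameter are hypothetical and do not describe how pycorpdiff builds its concordances.

```python
def kwic(tokens: list[str], term: str, window: int = 3) -> list[tuple[str, str, str]]:
    """Return (left context, keyword, right context) for every hit of `term`."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == term.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits


text = "the minister said the criminal gangs exploit people and criminal networks profit"
lines = kwic(text.split(), "criminal", window=2)

# Print each hit with the keyword aligned in a centre column.
for left, keyword, right in lines:
    print(f"{left:>20} | {keyword} | {right}")
```

Reading the aligned contexts is what lets an analyst check that a high keyness score reflects the usage they think it does.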

Cheat sheet — every analytical surface in one block

# Compare verbs (returns Result objects; methods exposed vary by Result)
pcd.compare(a, b).keyness()                                                   # default formula="rayson" (LL Wizard)
pcd.compare(a, b).keyness(formula="dunning")                                  # full 4-cell G² (Dunning 1993; same family as quanteda / NLTK, edge-case tolerance not certified)
pcd.compare(a, b).keyness(ci="bootstrap", n_boot=999)                         # adds g2_ci_lower / g2_ci_upper columns
pcd.compare(a, b).collocation_shift("immigrant")
pcd.compare(a, b).semantic_shift("immigrant", embedder=pcd.SBERTEmbedder())   # [semantic]
# SBERTEmbedder downloads a sentence-transformers model on first call;
# use pcd.HashEmbedder() for offline / deterministic-test settings.

# Reference-baseline keyness (bundled or user-built)
pcd.against_baseline(corpus, "gutenberg_fiction")                             # vs bundled 19th-c. fiction baseline
pcd.against_baseline(corpus, pcd.baseline_from_corpus(reference_corpus))      # vs your own reference

# Sub-corpus balancing — Coarsened Exact Matching before keyness
m = pcd.match(a, b, on=["year", "party"], seed=0)                             # balances A and B on covariates
pcd.compare(m.a_matched, m.b_matched).keyness()                               # like-for-like comparison

# Lexical diversity (TTR, MATTR, MTLD, HD-D) — pooled and over time
pcd.lexical_diversity(corpus)                                                 # pooled corpus-level values
pcd.lexical_diversity(corpus, freq="Y", ci="bootstrap", n_boot=199)           # per-year trajectory + CIs

# Track over time (requires [temporal] for the changepoint + ITS + forecast + causal_impact methods).
# Note: ITS / causal_impact require sufficient pre/post-event periods to fit (min_pre_periods=15,
# min_post_periods=8 by default); the bundled Hansard sample is too small to exercise these
# lines literally -- they are shown here as API previews. See examples/jss_case_study.ipynb
# for a full-corpus run.
tr = pcd.track(corpus, "immigrant").over_time(freq="Y")
tr.changepoints()                                  # offline PELT
tr.changepoints_online(hazard=1/24)                # Bayesian online (Adams & MacKay 2007)
tr.burstiness()                                    # Kleinberg 1999 multi-state HMM — burst-intensity states
# tr.interrupted_time_series(event_date="2016")    # segmented OLS [needs >=15 pre-periods]
# tr.causal_impact(event_date="2016")              # Bayesian counterfactual (Brodersen 2015) [needs >=15 pre-periods]
tr.forecast(horizon=4)                             # 4 periods at the over_time freq (state-space ETS)

# Before / after a known event
pcd.compare.before_after(corpus, event_date="2016-06-23").keyness()

# N-way (≥ 2 corpora): `a, b, c, d` are illustrative placeholders. `a` and `b` are the
# corpora from the keyness lines above; you supply `c` and `d`.
# pcd.keyness_multi([a, b, c, d], labels=["A", "B", "C", "D"])

# The discourse as a graph
pcd.cooccurrence_network(corpus, top_n=30).plot()

See examples/pycorpdiff_showcase.ipynb for a walkthrough on the synthetic Hansard-style corpus exercising every analytical surface.
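The cheat sheet lists TTR and MATTR among the lexical-diversity measures. As a rough illustration of what those two numbers mean, here is a standard-library sketch. The function names, the default window of 50, and the fallback for short texts are assumptions for this example, not pycorpdiff's implementation.

```python
def ttr(tokens: list[str]) -> float:
    """Type-token ratio: distinct types divided by total tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def mattr(tokens: list[str], window: int = 50) -> float:
    """Moving-average TTR: mean TTR over every full window of `window` tokens."""
    if len(tokens) <= window:
        # Too short for more than one window: fall back to plain TTR.
        return ttr(tokens)
    n_windows = len(tokens) - window + 1
    scores = [ttr(tokens[i:i + window]) for i in range(n_windows)]
    return sum(scores) / n_windows


sample = "a b a b c c".split()
plain = ttr(sample)              # 3 types over 6 tokens
moving = mattr(sample, window=2)
```

Plain TTR falls as texts get longer, which is why windowed measures such as MATTR are preferred when comparing corpora of different sizes.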

Installation

pip install pycorpdiff                       # lexical-comparative core (MIT)
pip install "pycorpdiff[viz]"                # + altair / matplotlib / networkx
pip install "pycorpdiff[semantic]"           # + sentence-transformers
pip install "pycorpdiff[temporal]"           # + ruptures / statsmodels
pip install "pycorpdiff[notebooks]"          # + jupyter / vl-convert
pip install "pycorpdiff[all]"                # everything MIT-compatible
pip install "pycorpdiff[all,showcase]"       # + pysofra (GPL-3.0-or-later) for the JAMA-style showcase

The base install's direct runtime dependencies are numpy, pandas, scipy, and pyarrow; optional extras land per analytical layer so you only pay for what you use. [showcase] is broken out separately because pysofra is GPL-3.0-or-later — pure pycorpdiff use without that extra remains MIT-only.
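Since the extras are opt-in, it can help to check which optional layers are importable in the current environment before calling a method that needs them. The sketch below uses only the standard library. The mapping from extras to module names is an assumption read off the install comments above, and `available_extras` is a hypothetical helper, not part of pycorpdiff.

```python
from importlib.util import find_spec

# Assumed mapping from extras to importable top-level modules; the project's
# packaging metadata is the authoritative source.
EXTRA_MODULES = {
    "viz": ["altair", "matplotlib", "networkx"],
    "semantic": ["sentence_transformers"],
    "temporal": ["ruptures", "statsmodels"],
}


def available_extras() -> dict[str, bool]:
    """Report, per extra, whether all of its modules can be found."""
    return {
        extra: all(find_spec(module) is not None for module in modules)
        for extra, modules in EXTRA_MODULES.items()
    }


status = available_extras()
for extra, ok in status.items():
    print(f"[{extra}] {'installed' if ok else 'missing'}")
```

`find_spec` only locates a module without importing it, so the check stays cheap even when a heavy dependency such as sentence-transformers is installed.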

To work from source:

git clone https://github.com/jturner-uofl/pycorpdiff
cd pycorpdiff
pip install -e ".[dev]"
pytest -q

Cross-validation receipts

The math is checked against standard tools by automated tests. The fast tier runs on every push (matrix CI); the slow tier needs heavy optional dependencies (NLTK, Scattertext, Stanford SNAP downloads) and runs on pushes to main only.

Fast tier:

Rayson's LL Wizard — hand-derived contingency-table reference triples (tests/integration/test_crossval_rayson.py)

Slow tier:

NLTK BigramAssocMeasures — PMI + t-score agreement to ≤ 1e-12 on every adjacent bigram
Scattertext (Kessler 2017) — behavioural agreement on the 2012 US Conventions corpus
HistWords (Hamilton et al. 2016) — known-shifter / stable-word sanity check on Stanford SNAP COHA decade embeddings (skips gracefully if the archive isn't reachable)
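To show the kind of quantity a per-bigram cross-check compares, the sketch below computes PMI and t-score for every adjacent bigram of a token list using the textbook formulas. The function name `bigram_scores` and the choice of total token count as N are assumptions for this example; this is not the project's test code.

```python
import math
from collections import Counter


def bigram_scores(tokens: list[str]) -> dict[tuple[str, str], dict[str, float]]:
    """PMI (log base 2) and t-score for every adjacent bigram in `tokens`."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), observed in bigrams.items():
        # Expected co-occurrence count if w1 and w2 were independent.
        expected = unigrams[w1] * unigrams[w2] / n
        scores[(w1, w2)] = {
            "pmi": math.log2(observed / expected),
            "t_score": (observed - expected) / math.sqrt(observed),
        }
    return scores


tokens = ["new", "york", "new", "york", "city", "new"]
scores = bigram_scores(tokens)
new_york = scores[("new", "york")]
```

In a cross-validation test, each value would be compared against the reference tool's output for the same bigram with a tight absolute tolerance.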

Citation

If you use pycorpdiff in academic work, please cite the software via the CITATION.cff file in this repository — GitHub renders a "Cite this repository" widget directly from it.

License

MIT — see LICENSE.

Case studies and demos (rendered)

GitHub's in-browser notebook renderer is unreliable on larger notebooks with embedded SVG outputs. The links below point to the pre-rendered HTML artefacts (the canonical read versions) and to nbviewer fallbacks for the .ipynb source. Notebook sources still live under examples/ for re-execution.

Asylum case study: lexicalising asylum in UK Parliament, 2010-2023. 📊 rendered HTML · nbviewer · .ipynb source
Full feature tour (showcase). 📊 rendered HTML · nbviewer · .ipynb source
Tutorial. 📊 rendered HTML · .ipynb source
Hansard demo. 📊 rendered HTML · .ipynb source

Release history

Version	Status	Released
0.1.0a27 (this version)	pre-release	May 31, 2026
0.1.0a26	pre-release	May 31, 2026
0.1.0a25	pre-release	May 30, 2026
0.1.0a18	pre-release	May 27, 2026
0.1.0a17	pre-release	May 27, 2026
0.1.0a16	pre-release	May 27, 2026
0.1.0a15	pre-release	May 27, 2026
0.1.0a14	pre-release	May 27, 2026
0.1.0a13	pre-release	May 26, 2026
0.1.0a12	pre-release	May 26, 2026

Download files

File	Type	Size	Uploaded	Tags	Trusted Publishing	Uploaded via
pycorpdiff-0.1.0a27.tar.gz	Source distribution	321.7 kB	May 31, 2026	Source	No	uv/0.11.3
pycorpdiff-0.1.0a27-py3-none-any.whl	Built distribution	259.9 kB	May 31, 2026	Python 3	No	uv/0.11.3

Hashes for pycorpdiff-0.1.0a27.tar.gz
Algorithm	Hash digest
SHA256	`56cff9cc55859c861c180da529394dab37350aaf0ff815438a39672a6023c55d`
MD5	`ce370ac2673480263b7e02e1b11aed5e`
BLAKE2b-256	`5f38240a88deab5c487d2a4581870f3cda899e4f5d5f953591ee2d5f037cfb03`

Hashes for pycorpdiff-0.1.0a27-py3-none-any.whl
Algorithm	Hash digest
SHA256	`07ece1ed6dcd4dafeae05b4eebc143ce3685a7e28609514037138bb9720bfa55`
MD5	`26ea6de0442eb3ebf92e0335f39af0b5`
BLAKE2b-256	`9c4607e3f1b4fda62f9922f8b468be25e35d9280bd9abe0c48f717eb9484791b`
