Comparative corpus analysis for Python: keyness, collocations, semantic shift, temporal trajectories with changepoints + causal inference.

pycorpdiff

Comparative corpus analysis for modern Python workflows.

pycorpdiff is the missing comparative layer between R's quanteda, the closed-source SketchEngine platform, and the fragmented Python NLP stack (nltk/spaCy/gensim/sentence-transformers). Three public verbs — compare(a, b), track(c, term), compare.before_after(c, event) — consolidate keyness, collocations, dispersion, temporal trajectories, changepoint detection, interrupted time series, causal-impact analysis, forecasting, online changepoint detection, and embedding-based semantic shift under a single notebook-native API. Every result carries its own KWIC evidence: .explain(term) returns the source-text concordances behind any ranked term.
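To make the KWIC (keyword-in-context) idea concrete, here is a minimal standard-library sketch of a concordance. The function name `kwic`, its `window` parameter, and the whitespace tokenization are illustrative assumptions; this is not pycorpdiff's `.explain()` implementation.

```python
def kwic(docs, term, window=3):
    """Return (left context, keyword, right context) for each hit of `term`.

    Hypothetical helper: splits on whitespace and strips common
    punctuation before matching case-insensitively.
    """
    rows = []
    for doc in docs:
        tokens = doc.split()
        for i, tok in enumerate(tokens):
            if tok.lower().strip(".,;:!?\"'") == term:
                left = " ".join(tokens[max(0, i - window):i])
                right = " ".join(tokens[i + 1:i + 1 + window])
                rows.append((left, tok, right))
    return rows


docs = [
    "The migrant crisis dominated the debate.",
    "Ministers met; no migrant policy was agreed.",
]
for left, keyword, right in kwic(docs, "migrant"):
    # Right-align the left context so the keywords line up in a column.
    print(f"{left:>25} [{keyword}] {right}")
```

Attaching rows like these to every ranked term lets a reader check a statistic against the sentences that produced it.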

The package answers the questions corpus linguistics, digital humanities, and computational social science routinely have:

How does corpus A differ from corpus B? — compare(a, b).keyness()
How has discourse around X evolved over time? — track(c, "x").over_time()
What did "migrant" mean in 2005 vs 2023? — compare(...).semantic_shift("migrant", embedder=...)
Did this event actually shift the conversation? — track(...).causal_impact(event_date=...)
Where is the discourse heading? — track(...).forecast(horizon=4)

pycorpdiff is positioned as orchestration, not reinvention. Tokenizers (spaCy, Stanza, jieba, fugashi) and embedders (any SBERT-compatible model) plug in via two typing.Protocol extension points — one-line adapters, no plugin registry. The base install pulls only numpy, pandas, scipy, and pyarrow; everything else is opt-in via extras.
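A `typing.Protocol` extension point of this kind might look roughly like the sketch below. The names `Tokenizer`, `WhitespaceTokenizer`, and `count_tokens` are hypothetical and are not taken from pycorpdiff's source; the sketch only shows why structural typing removes the need for a plugin registry.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Tokenizer(Protocol):
    """Hypothetical protocol: any callable mapping a string to tokens."""

    def __call__(self, text: str) -> list[str]: ...


class WhitespaceTokenizer:
    """Adapter that satisfies the protocol structurally.

    It neither inherits from Tokenizer nor registers itself anywhere.
    """

    def __call__(self, text: str) -> list[str]:
        return text.lower().split()


def count_tokens(texts: list[str], tokenizer: Tokenizer) -> dict[str, int]:
    """Count tokens across texts using whichever tokenizer is passed in."""
    counts: dict[str, int] = {}
    for text in texts:
        for token in tokenizer(text):
            counts[token] = counts.get(token, 0) + 1
    return counts


tok = WhitespaceTokenizer()
counts = count_tokens(["The migrant crisis", "the crisis deepens"], tok)
print(isinstance(tok, Tokenizer), counts)
```

Under this design, wrapping a spaCy or jieba pipeline would only require a small class or function with the matching call signature.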

Status: pre-release alpha (0.1.0a0). Public API is stable for the features described below; PyPI publication is the next milestone.

The three-layer architecture

Layer	Purpose	Key surface
1 — Ingestion + `Corpus`	get text in, slice it, hash it	`from_dataframe`, `read_csv`, `read_parquet`, `read_txt`, `read_duckdb`, `from_huggingface`, `fetch_hansard`, `Corpus.slice/by_time/__hash__/doc_term_counts(_sparse)/to_polars`
2 — Pure math	statistics with no I/O	`keyness.{log_likelihood,chi_squared,log_ratio,percent_diff,bayes_factor,permutation_pvalues,keyness_multi,juilland_d,benjamini_hochberg}`; `collocation.{logdice,pmi,t_score,mi_three,collocation_shift,cooccurrence_network}`; `semantic.{HashEmbedder,SBERTEmbedder,semantic_trajectory,neighborhood_drift}`; `temporal.{changepoints,interrupted_time_series,forecast,causal_impact,bocpd}`
3 — Verbs + Results	public API	`compare`, `track`, `compare.before_after`, `keyness_multi`, plus 9 frozen-dataclass Result types each with `.to_df() / .plot() / .explain() / .summary() / .to_html() / .to_json()`
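As an illustration of what a Layer 2 function computes, the sketch below implements log-likelihood keyness in the two-term form used by Rayson's LL Wizard, together with a log ratio effect size. The function signatures are assumptions made for this example, and whether pycorpdiff uses the two-term or the full four-cell G² is not stated here.

```python
import math


def log_likelihood(a: int, b: int, c: int, d: int) -> float:
    """Two-term log-likelihood for one term.

    a, b: frequency of the term in corpus A and corpus B
    c, d: total token counts of corpus A and corpus B
    """
    e1 = c * (a + b) / (c + d)  # expected frequency in A
    e2 = d * (a + b) / (c + d)  # expected frequency in B
    total = 0.0
    if a > 0:  # a zero count contributes nothing (0 * ln 0 is taken as 0)
        total += a * math.log(a / e1)
    if b > 0:
        total += b * math.log(b / e2)
    return 2.0 * total


def log_ratio(a: int, b: int, c: int, d: int) -> float:
    """Binary log of the ratio of relative frequencies.

    This sketch omits smoothing, so it raises an error when a or b is zero.
    """
    return math.log2((a / c) / (b / d))


# A term seen 20 times in 1,000 tokens versus 10 times in 1,000 tokens.
print(log_likelihood(20, 10, 1000, 1000))
print(log_ratio(20, 10, 1000, 1000))  # 1.0: twice as frequent in corpus A
```

Keeping such functions free of I/O means they can be tested directly against published reference values.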

Quick start

import pycorpdiff as pcd

# df: an existing DataFrame with "body", "outlet", and "date" columns
news = pcd.from_dataframe(df, text_col="body", meta_cols=("outlet", "date"))

# Compare two slices: keyness, collocation shift, semantic shift
a = news.slice(outlet="Guardian")
b = news.slice(outlet="Mail")
k = pcd.compare(a, b).keyness()
c = pcd.compare(a, b).collocation_shift("migrant")
s = pcd.compare(a, b).semantic_shift("migrant", embedder=pcd.SBERTEmbedder())

# Track over time
tr = pcd.track(news, "migrant").over_time(freq="Y")
tr.changepoints()                                     # offline PELT
tr.changepoints_online(hazard=1/24)                   # Bayesian online (Adams & MacKay 2007)
tr.interrupted_time_series(event_date="2016-06-23")   # segmented OLS
tr.causal_impact(event_date="2016-06-23")             # Bayesian counterfactual (Brodersen 2015)
tr.forecast(horizon=4)                                # state-space ETS

# Before / after a known event
pcd.compare.before_after(news, event_date="2016-06-23").keyness()

# N-way (≥ 2 corpora); gu, ma, te, mi are Corpus objects, e.g. news.slice(outlet="Guardian")
pcd.keyness_multi([gu, ma, te, mi], labels=["Guardian", "Mail", "Telegraph", "Mirror"])

# The discourse as a graph
pcd.cooccurrence_network(news, top_n=50).plot()

# Every Result: .to_df() · .plot() · .explain() · .summary() · .to_html() · .to_json()

See examples/pycorpdiff_showcase.ipynb (rendered HTML) for a walkthrough on a synthetic UK Hansard corpus exercising every analytical surface.

Installation

Currently a pre-release alpha. From a local clone:

git clone https://github.com/jturner-uofl/pycorpdiff
cd pycorpdiff
pip install -e ".[dev]"
pytest -q                          # 519 default tests, ~7s

Optional extras: [viz] (altair + matplotlib + networkx), [semantic] (sentence-transformers + scikit-learn), [temporal] (ruptures + statsmodels), [polars], [duckdb], [huggingface], [nlp] (spaCy), [notebooks] (jupyter + vl-convert + pysofra, for the showcase), or [all].

Cross-validation receipts

The math agrees with the standard tools — by automated test:

Rayson's LL Wizard — 15 hand-derived contingency-table reference triples
NLTK BigramAssocMeasures — PMI + t-score to ≤ 1e-12 on every adjacent bigram
Scattertext (Kessler 2017) — behavioural agreement on the 2012 US Conventions corpus
quanteda (R) via rpy2 — byte-for-byte G² agreement (slow tier)
HistWords (Hamilton et al. 2016) — diachronic cosine displacements on COHA (slow tier)
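For readers unfamiliar with the quantities being cross-checked, the sketch below computes PMI and t-score for every adjacent bigram from their textbook definitions. The function name `bigram_scores` is hypothetical, and using the total token count as N is an assumption of this example rather than a statement about pycorpdiff's internals.

```python
import math
from collections import Counter


def bigram_scores(tokens: list[str]) -> dict[tuple[str, str], tuple[float, float]]:
    """Map each adjacent bigram to its (PMI, t-score)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)  # assumption: N is the number of tokens
    scores = {}
    for (w1, w2), observed in bigrams.items():
        # Expected co-occurrence count if w1 and w2 were independent.
        expected = unigrams[w1] * unigrams[w2] / n
        pmi = math.log2(observed / expected)
        t_score = (observed - expected) / math.sqrt(observed)
        scores[(w1, w2)] = (pmi, t_score)
    return scores


scores = bigram_scores("a b a b c".split())
for bigram, (pmi, t_score) in sorted(scores.items()):
    print(bigram, round(pmi, 4), round(t_score, 4))
```

An agreement test of the kind listed above would compare values like these against a reference implementation within a stated tolerance.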

Citation

If you use pycorpdiff in academic work, please cite the software via the CITATION.cff file in this repository — GitHub renders a "Cite this repository" widget directly from it.

License

MIT — see LICENSE.

This version

0.1.0a0 pre-release

May 24, 2026

Download files

Source Distribution

pycorpdiff-0.1.0a0.tar.gz (162.5 kB view details)

Uploaded May 24, 2026 Source

Built Distribution

pycorpdiff-0.1.0a0-py3-none-any.whl (122.8 kB view details)

Uploaded May 24, 2026 Python 3

File details

Details for the file pycorpdiff-0.1.0a0.tar.gz.

File metadata

Download URL: pycorpdiff-0.1.0a0.tar.gz
Upload date: May 24, 2026
Size: 162.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.11.3 {"installer":{"name":"uv","version":"0.11.3","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for pycorpdiff-0.1.0a0.tar.gz
Algorithm	Hash digest
SHA256	`751305298f59ef2786de4e9e66c81b7782bc3516476edb46314e2e418e4c58bb`
MD5	`0db3f673adc07ce6bcee03f3123ffccc`
BLAKE2b-256	`21a0d02ae2e747f8167f36a205675acb689ac96e349a1c63f4d32a873d5f3026`

File details

Details for the file pycorpdiff-0.1.0a0-py3-none-any.whl.

File metadata

Download URL: pycorpdiff-0.1.0a0-py3-none-any.whl
Upload date: May 24, 2026
Size: 122.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.11.3 {"installer":{"name":"uv","version":"0.11.3","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for pycorpdiff-0.1.0a0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c93664a06d546d64bb87b70bfd9304bfcf524c0a2835b4c9af7622a9fef949fd`
MD5	`3192e9bd89ea3e5425cdeb16c47b4206`
BLAKE2b-256	`8d7b43f205da1e0beb0f5fe49a9a19622c622d5242fd45a2e78d1ca8f8297dae`
