Comparative corpus analysis for Python: keyness, collocations, semantic shift, temporal trajectories with changepoints + causal inference.
pycorpdiff
Comparative corpus analysis for modern Python workflows.
pycorpdiff is the missing comparative layer between R's
quanteda, the closed-source SketchEngine
platform, and the fragmented Python NLP stack
(nltk/spaCy/gensim/sentence-transformers). Three public verbs
— compare(a, b), track(c, term), compare.before_after(c, event) —
consolidate keyness, collocations, dispersion, temporal trajectories,
changepoint detection, interrupted time series, causal-impact analysis,
forecasting, online changepoint detection, and embedding-based semantic
shift under a single notebook-native API. Every result carries its own
KWIC evidence: .explain(term) returns the source-text concordances
behind any ranked term.
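To make the KWIC idea concrete, here is a minimal keyword-in-context sketch in plain Python. It is not pycorpdiff's implementation: the `kwic` function, its `width` parameter, and the sample sentence are hypothetical, and the real `.explain(term)` may return a richer structure.

```python
import re

def kwic(text: str, term: str, width: int = 30) -> list[tuple[str, str, str]]:
    """Return (left context, match, right context) for each whole-word hit."""
    pattern = re.compile(rf"\b{re.escape(term)}\b", flags=re.IGNORECASE)
    rows = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]   # up to `width` chars before
        right = text[m.end():m.end() + width]              # up to `width` chars after
        rows.append((left, m.group(0), right))
    return rows

# Hypothetical sample text, for illustration only
sample = "The migrant crisis dominated debate. Every migrant story was contested."
for left, hit, right in kwic(sample, "migrant", width=20):
    print(f"{left:>20} [{hit}] {right}")
```

Each row keeps the hit aligned in a central column, which is what lets a reader check a ranked term against its source text at a glance.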
The package answers questions that corpus linguistics, digital humanities, and computational social science routinely ask:
- How does corpus A differ from corpus B? Use compare(a, b).keyness()
- How has discourse around X evolved over time? Use track(c, "x").over_time()
- What did "migrant" mean in 2005 vs 2023? Use compare(...).semantic_shift("migrant", embedder=...)
- Did this event actually shift the conversation? Use track(...).causal_impact(event_date=...)
- Where is the discourse heading? Use track(...).forecast(horizon=4)
pycorpdiff is positioned as orchestration, not reinvention.
Tokenizers (spaCy, Stanza, jieba, fugashi) and embedders (any
SBERT-compatible model) plug in via two typing.Protocol extension
points — one-line adapters, no plugin registry. The base install pulls
only numpy, pandas, scipy, and pyarrow; everything else is opt-in
via extras.
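The sketch below shows what a typing.Protocol extension point with a one-line adapter can look like. The `Tokenizer` protocol, its `__call__` signature, and `RegexTokenizer` are assumptions made for illustration; the actual protocol names and method signatures in pycorpdiff are not given in this README.

```python
import re
from typing import Protocol, runtime_checkable

@runtime_checkable
class Tokenizer(Protocol):
    """Hypothetical extension point: anything callable as text -> tokens."""
    def __call__(self, text: str) -> list[str]: ...

class RegexTokenizer:
    """Adapter with no base class: it satisfies Tokenizer structurally."""
    def __init__(self, pattern: str = r"\w+") -> None:
        self._pattern = re.compile(pattern)

    def __call__(self, text: str) -> list[str]:
        return [t.lower() for t in self._pattern.findall(text)]

def count_tokens(text: str, tokenizer: Tokenizer) -> int:
    """Any object matching the protocol can be passed in; no registry needed."""
    return len(tokenizer(text))

tok = RegexTokenizer()
print(isinstance(tok, Tokenizer))                       # structural check at runtime
print(count_tokens("Migrants, migrants everywhere.", tok))
```

Because the match is structural, a wrapper around spaCy or jieba would only need to expose the same call signature, with no import of the host package required.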
Status: pre-release alpha (0.1.0a0). Public API is stable for the features described below; PyPI publication is the next milestone.
The three-layer architecture
| Layer | Purpose | Key surface |
|---|---|---|
| 1 — Ingestion + Corpus | get text in, slice it, hash it | from_dataframe, read_csv, read_parquet, read_txt, read_duckdb, from_huggingface, fetch_hansard, Corpus.slice/by_time/__hash__/doc_term_counts(_sparse)/to_polars |
| 2 — Pure math | statistics with no I/O | keyness.{log_likelihood,chi_squared,log_ratio,percent_diff,bayes_factor,permutation_pvalues,keyness_multi,juilland_d,benjamini_hochberg}; collocation.{logdice,pmi,t_score,mi_three,collocation_shift,cooccurrence_network}; semantic.{HashEmbedder,SBERTEmbedder,semantic_trajectory,neighborhood_drift}; temporal.{changepoints,interrupted_time_series,forecast,causal_impact,bocpd} |
| 3 — Verbs + Results | public API | compare, track, compare.before_after, keyness_multi, plus 9 frozen-dataclass Result types each with .to_df() / .plot() / .explain() / .summary() / .to_html() / .to_json() |
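As an illustration of what "pure math with no I/O" means for the collocation layer, here are the textbook definitions of three of the listed measures, written from raw counts. The function names and argument order below are assumptions for this sketch and may not match the signatures of pycorpdiff's own collocation functions.

```python
import math

def pmi(f_xy: int, f_x: int, f_y: int, n: int) -> float:
    """Pointwise mutual information: log2 of observed over expected co-occurrence."""
    return math.log2(f_xy * n / (f_x * f_y))

def t_score(f_xy: int, f_x: int, f_y: int, n: int) -> float:
    """t-score: (observed - expected) / sqrt(observed)."""
    expected = f_x * f_y / n
    return (f_xy - expected) / math.sqrt(f_xy)

def log_dice(f_xy: int, f_x: int, f_y: int) -> float:
    """logDice: 14 + log2 of the Dice coefficient; independent of corpus size."""
    return 14 + math.log2(2 * f_xy / (f_x + f_y))

# Hypothetical counts: the pair occurs 8 times, the words 16 and 32 times,
# in a corpus of 1024 bigram positions.
f_xy, f_x, f_y, n = 8, 16, 32, 1024
print(round(pmi(f_xy, f_x, f_y, n), 3))
print(round(t_score(f_xy, f_x, f_y, n), 3))
print(round(log_dice(f_xy, f_x, f_y), 3))
```

Functions of this shape take counts and return floats, so they can be tested against reference tools without building a corpus first.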
Quick start
import pycorpdiff as pcd

# df: a pandas DataFrame with "body", "outlet", and "date" columns
news = pcd.from_dataframe(df, text_col="body", meta_cols=("outlet", "date"))

# Compare two corpora
a = news.slice(outlet="Guardian")
b = news.slice(outlet="Mail")
k = pcd.compare(a, b).keyness()
c = pcd.compare(a, b).collocation_shift("migrant")
s = pcd.compare(a, b).semantic_shift("migrant", embedder=pcd.SBERTEmbedder())

# Track over time
tr = pcd.track(news, "migrant").over_time(freq="Y")
tr.changepoints()                                     # offline PELT
tr.changepoints_online(hazard=1/24)                   # Bayesian online (Adams & MacKay 2007)
tr.interrupted_time_series(event_date="2016-06-23")   # segmented OLS
tr.causal_impact(event_date="2016-06-23")             # Bayesian counterfactual (Brodersen et al. 2015)
tr.forecast(horizon=4)                                # state-space ETS

# Before / after a known event
pcd.compare.before_after(news, event_date="2016-06-23").keyness()

# N-way (≥ 2 corpora)
gu, ma, te, mi = (news.slice(outlet=o) for o in ("Guardian", "Mail", "Telegraph", "Mirror"))
pcd.keyness_multi([gu, ma, te, mi], labels=["Guardian", "Mail", "Telegraph", "Mirror"])

# The discourse as a graph
pcd.cooccurrence_network(news, top_n=50).plot()

# Every Result: .to_df() · .plot() · .explain() · .summary() · .to_html() · .to_json()
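For readers unfamiliar with the "segmented OLS" behind an interrupted time series, the following numpy sketch fits the standard four-term model: intercept, pre-event trend, level change at the event, and slope change after it. The `segmented_ols` helper and the synthetic series are hypothetical; this is plain least squares with no autocorrelation correction, and it may differ from what tr.interrupted_time_series does internally.

```python
import numpy as np

def segmented_ols(y, event_index):
    """Fit y = b0 + b1*t + b2*post + b3*post*(t - event_index) by least squares."""
    y = np.asarray(y, dtype=float)
    t = np.arange(len(y), dtype=float)
    post = (t >= event_index).astype(float)      # 1 from the event onward, else 0
    X = np.column_stack([
        np.ones_like(t),                         # intercept
        t,                                       # pre-event trend
        post,                                    # immediate level change
        post * (t - event_index),                # change in slope after the event
    ])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    names = ("intercept", "pre_slope", "level_change", "slope_change")
    return dict(zip(names, coef.tolist()))

# Synthetic series: a jump of 3 and an extra slope of 1 starting at index 5
t = np.arange(10)
y = 2.0 + 0.5 * t + np.where(t >= 5, 3.0 + 1.0 * (t - 5), 0.0)
fit = segmented_ols(y, event_index=5)
print({k: round(v, 3) for k, v in fit.items()})
```

The level_change and slope_change coefficients are the two quantities usually reported for an event: whether the series jumped, and whether its trajectory bent.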
See examples/pycorpdiff_showcase.ipynb
(rendered HTML) for a
walkthrough on a synthetic UK Hansard corpus exercising every analytical
surface.
Installation
Currently a pre-release alpha. From a local clone:
git clone https://github.com/jturner-uofl/pycorpdiff
cd pycorpdiff
pip install -e ".[dev]"
pytest -q # 519 default tests, ~7s
Optional extras: [viz] (altair + matplotlib + networkx), [semantic]
(sentence-transformers + scikit-learn), [temporal] (ruptures +
statsmodels), [polars], [duckdb], [huggingface], [nlp] (spaCy),
[notebooks] (jupyter + vl-convert + pysofra, for the showcase),
or [all].
Cross-validation receipts
The math agrees with the standard tools — by automated test:
- Rayson's LL Wizard: 15 hand-derived contingency-table reference triples
- NLTK BigramAssocMeasures: PMI + t-score to ≤ 1e-12 on every adjacent bigram
- Scattertext (Kessler 2017): behavioural agreement on the 2012 US Conventions corpus
- quanteda (R) via rpy2: byte-for-byte G² agreement (slow tier)
- HistWords (Hamilton et al. 2016): diachronic cosine displacements on COHA (slow tier)
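As a reference point for the log-likelihood checks above, this is a sketch of the two-term log-likelihood formula published with Rayson's LL Wizard, where a and b are the term's frequencies in the two corpora and c and d are the corpus sizes. The function name is hypothetical, and which G² variant pycorpdiff computes by default is not stated here; the full 2x2 form adds the two "all other words" cells.

```python
import math

def log_likelihood(a: int, b: int, c: int, d: int) -> float:
    """Two-term LL: term count a in a corpus of c tokens vs count b in d tokens."""
    e1 = c * (a + b) / (c + d)        # expected count in corpus 1
    e2 = d * (a + b) / (c + d)        # expected count in corpus 2
    total = 0.0
    for observed, expected in ((a, e1), (b, e2)):
        if observed > 0:              # 0 * ln(0) is taken as 0
            total += observed * math.log(observed / expected)
    return 2 * total

# Hypothetical counts: same relative frequency, then a skewed case
print(round(log_likelihood(10, 20, 1000, 2000), 6))
print(round(log_likelihood(0, 10, 1000, 1000), 3))
```

A term with identical relative frequency in both corpora scores zero, and the score grows as the observed counts move away from what the pooled rate predicts.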
Citation
If you use pycorpdiff in academic work, please cite the software via
the CITATION.cff file in this repository — GitHub renders a "Cite this
repository" widget directly from it.
License
MIT — see LICENSE.
Further reading
- docs/design.md: three-layer architecture
- docs/statistical-methods.md: every metric's formula + citation
- examples/pycorpdiff_showcase.ipynb: full feature tour as a notebook
- docs/rendered/: self-contained HTML renders of the example notebooks