Supervised Semantic Differential (SSD): interpretable, embedding-based analysis of concept meaning in text.
Supervised Semantic Differential (SSD)
SSD lets you recover interpretable semantic directions related to specific concepts directly from open-ended text and relate them to numeric outcomes (e.g., psychometric scales, judgments) or categorical groups (e.g., clinical diagnosis, experimental condition). It builds per-document concept vectors from local contexts around seed words, learns a semantic gradient (beta) that best predicts the outcome, and then provides multiple interpretability layers:
- Nearest neighbors of each pole (+beta / -beta)
- Clustering of neighbors into themes
- Text snippets: top sentences whose local contexts align with each cluster centroid or the beta axis
- Per-document scores (cosine alignments) for further analysis
- Cross-group comparisons with permutation inference
The method has been presented in the following preprint: https://doi.org/10.31234/osf.io/gvrsb_v3
No-code option: a GUI desktop application for SSD is available at hplisiecki/SSD_APP. It wraps this package into a point-and-click interface with a guided three-stage workflow, interactive lexicon builder, and APA-formatted export — pre-built binaries for Windows, Linux, and macOS are available with no Python installation required.
Table of Contents
- Installation
- Quickstart
- Core Concepts
- Word Embeddings
- Preprocessing (Corpus)
- Lexicon Utilities
- Fitting SSD
- Neighbors & Clustering
- Interpreting with Snippets
- Per-Document SSD Scores
- API Summary
- Citing & License
Installation
pip install ssdiff
Python: 3.10 – 3.14.
Core dependencies (installed automatically): numpy, spacy.
Optional extras:
- ssdiff[results]: pandas / openpyxl / python-docx / matplotlib for to_df(), .xlsx/.docx export, and plot_sweep().
- ssdiff[gensim]: only needed to save embeddings in .kv format.
Loading .kv files works without gensim (handled by an internal unpickler shim).
Quickstart
Below is an end-to-end minimal example. Adjust paths and column names to your data.
from ssdiff import Embeddings, Corpus, SSD
import numpy as np
# 1) Load and normalize embeddings
emb = Embeddings.load("path/to/embeddings.txt", verbose=True)
emb.normalize(l2=True, abtt=1)
# 2) Load your data
texts = [...] # list of raw text strings
scores = np.array([...]) # numeric outcome
# 3) Tokenize texts
corpus = Corpus(texts, lang="en") # spaCy tokenization + lemmatization
# 4) Define a lexicon (tokens must match lemmatized forms)
lexicon = ["happy", "sad", "joy", "anger"]
# 5) Build SSD and fit
ssd = SSD(emb, corpus, y=scores, lexicon=lexicon)
result = ssd.fit_pls() # or ssd.fit_ols() for PCA+OLS
# 6) Inspect
print(result.stats) # r², p-value, n_kept, β-norm, IQR effect
print(result.words.pos) # top β-positive neighbours
print(result.words.neg) # top β-negative neighbours
print(result.clusters.pos(topn=100)) # cluster the 100 nearest +β neighbours
result.report().save("report.md")
Every result attribute is a view: print it, slice with (n), dispatch
to one side with .pos / .neg, or export with .to_df() / .save(...).
Core Concepts
- Seed lexicon: a small set of tokens (lemmas) indicating the concept of interest (e.g., {climate, warming, change}).
- Per-document vector: SIF-weighted average of the context vectors around each seed occurrence (±3 tokens by default), then averaged across all occurrences in the document.
- SSD fitting: Learn a semantic gradient (beta) that best predicts the outcome y. Two backends are available:
- PLS: Partial Least Squares regression directly in embedding space.
- PCA+OLS: PCA dimensionality reduction followed by OLS regression (matches original SSD paper).
- Interpretation: nearest neighbors to +beta/-beta, clustering neighbors into themes, and showing original sentences whose local context aligns with centroids or beta.
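The per-document vector described above can be sketched in a few lines of NumPy. This is an illustration only, not the package's internal code: the function names `sif_weight` and `document_vector`, the toy vectors, and the toy word probabilities are all hypothetical, and excluding the seed token itself from its own context is an assumption.

```python
import numpy as np

def sif_weight(prob, a=1e-3):
    # SIF weight: frequent words (high unigram probability) are down-weighted
    return a / (a + prob)

def document_vector(tokens, seeds, vectors, word_prob, window=3, a=1e-3):
    """Average the SIF-weighted context vectors of all seed occurrences."""
    occurrence_vectors = []
    for i, tok in enumerate(tokens):
        if tok not in seeds:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        # Assumption: the seed token itself is left out of its own context
        context = [t for j, t in enumerate(tokens[lo:hi], start=lo)
                   if j != i and t in vectors]
        if not context:
            continue
        w = np.array([sif_weight(word_prob.get(t, 0.0), a) for t in context])
        V = np.stack([vectors[t] for t in context])
        occurrence_vectors.append(w @ V / w.sum())
    if not occurrence_vectors:
        return None  # no usable seed occurrence in this document
    return np.mean(occurrence_vectors, axis=0)

# Toy 2-d embeddings and unigram probabilities (made up for the example)
vectors = {
    "urgent": np.array([0.0, 1.0]),
    "crisis": np.array([0.0, 1.0]),
    "the": np.array([1.0, 0.0]),
}
word_prob = {"urgent": 1e-4, "crisis": 1e-4, "the": 5e-2}
doc = ["the", "urgent", "climate", "crisis"]
vec = document_vector(doc, {"climate"}, vectors, word_prob)
no_seed = document_vector(["the", "urgent"], {"climate"}, vectors, word_prob)
```

In the toy document the frequent word "the" contributes very little to `vec`, which is the point of SIF weighting; a document without any seed occurrence yields no vector.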
Word Embeddings
The method requires pre-trained word embeddings in one of the supported formats:
| Format | Extension | Notes |
|---|---|---|
| SSD native | .ssdembed | Fastest to load (pickle + .vectors.npy sidecar) |
| gensim KeyedVectors | .kv | Loads without gensim via internal shim |
| word2vec binary | .bin | Standard binary format |
| Text | .txt, .vec | One word per line + floats |
| Compressed | .txt.gz, .vec.gz, .bin.gz | Gzip-compressed versions of the above |
To capture semantic information without frequency-based artifacts, apply L2 normalization and All-But-The-Top (ABTT) transformation:
from ssdiff import Embeddings
emb = Embeddings.load("path/to/model.bin", verbose=True)
emb.normalize(l2=True, abtt=1) # L2 + ABTT (remove top-1 PC)
Calling normalize() with no arguments applies both L2 and ABTT (m=1) by default.
Processing state is tracked — calling it again safely skips already-applied steps.
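For readers who want to see what these two steps do to the vectors, here is a minimal NumPy sketch of L2 normalization and All-But-The-Top. It is not the package's implementation: the helper names `l2_normalize` and `abtt` are hypothetical, and the order used here (L2, then ABTT, then L2 again) is an assumption suggested by the re_normalize option rather than something confirmed by the source.

```python
import numpy as np

def l2_normalize(X, eps=1e-12):
    # Scale every row to unit Euclidean length
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, eps)

def abtt(X, m=1):
    # All-But-The-Top: center the vectors, then remove their projection
    # onto the top-m principal directions
    Xc = X - X.mean(axis=0, keepdims=True)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)  # rows of Vt = principal directions
    top = Vt[:m]
    return Xc - (Xc @ top.T) @ top

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
X[:, 0] *= 10.0  # one artificially dominant direction

# Assumed order: L2 -> ABTT (m=1) -> L2 again
Z = l2_normalize(abtt(l2_normalize(X), m=1))
```

After the transformation every row of `Z` has unit length and no longer carries a component along the removed dominant direction.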
Tip: Save normalized embeddings as .ssdembed to preserve both vectors and processing metadata (L2, ABTT state). Other formats (.kv, .bin, .txt) only store raw vectors.
The model is not included in the package and will differ depending on your language and domain. Look for pre-trained static word embeddings in your language with good vocabulary coverage for your domain. GloVe and word2vec trained on large general corpora are a reliable starting point.
For Polish, the nkjp+wiki-lemmas-all-300-cbow-hs.txt.gz (no. 25) from the Polish Word2Vec model list was found to work well.
Preprocessing (Corpus)
The Corpus class encapsulates the full spaCy preprocessing pipeline — tokenization, lemmatization, and stopword removal.
from ssdiff import Corpus
corpus = Corpus(texts, lang="en") # auto-downloads spaCy model if needed
corpus.docs # list[list[str]] — lemmatized tokens per document
corpus.pre_docs # list[PreprocessedDoc] — for snippet extraction
corpus.n_texts # number of documents
You can also pass a pre-loaded spaCy pipeline or pre-tokenized data:
# Custom spaCy pipeline
import spacy
nlp = spacy.load("en_core_web_lg", disable=["ner"])
corpus = Corpus(texts, nlp=nlp)
# Pre-tokenized input
docs = [["happy", "day", "sunshine"], ["sad", "rain", "cold"], ...]
corpus = Corpus(docs, pretokenized=True, lang="en")
Supported languages (20): ca, da, de, el, en, es, fr, hr, it, lt, mk, nb, nl, pl, pt, ro, ru, sl, sv, uk.
CJK languages (Chinese, Japanese, Korean) are not included due to fundamental differences in tokenization and lemmatization. If you need CJK support, you can pass a custom spaCy pipeline via nlp= and pre-trained embeddings with matching vocabulary.
spaCy models for various languages are listed on the spaCy models page. To install a model manually:
python -m spacy download en_core_web_sm
Lexicon Utilities
These helpers make lexicon selection transparent and data-driven (you can also hand-pick tokens). They are methods on Corpus — they operate on the already-lemmatized tokens, so what they score is exactly what SSD will consume.
corpus.suggest_lexicon(y, ...)
Rank tokens by balanced coverage with a mild penalty for strong association with the outcome. Returns a LexiconResult view (printable, exportable, sliceable).
corpus = Corpus(texts, lang="en")
result = corpus.suggest_lexicon(y, top_k=30)
print(result) # tabular view
ssd = SSD(emb, corpus, y, lexicon=result.tokens)
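To make the idea of "balanced coverage" concrete, the sketch below computes how often a token appears in each quantile bin of the outcome. It is a simplified illustration under stated assumptions, not the scoring formula used by suggest_lexicon: the helper `coverage_by_bin` is hypothetical, and summarizing balance as the minimum coverage across bins is an assumption made for the example.

```python
import numpy as np

def coverage_by_bin(docs, y, token, n_bins=4):
    """Share of documents containing `token` within each quantile bin of y."""
    y = np.asarray(y, dtype=float)
    # Interior quantile edges split y into n_bins roughly equal groups
    edges = np.quantile(y, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.searchsorted(edges, y, side="right")  # bin index in 0..n_bins-1
    has_token = np.array([token in doc for doc in docs])
    return np.array([has_token[bins == b].mean() for b in range(n_bins)])

docs = [
    ["happy", "day"], ["happy"], ["happy", "rain"], ["rain"],
    ["happy"], ["cold"], ["happy", "sad"], ["sad"],
]
y = [1, 2, 3, 4, 5, 6, 7, 8]

cov_happy = coverage_by_bin(docs, y, "happy")  # present in every bin
cov_sad = coverage_by_bin(docs, y, "sad")      # present only in the top bin

# Assumption: use the worst bin as a simple balance summary
balanced_happy = cov_happy.min()
balanced_sad = cov_sad.min()
```

A token such as "sad" that occurs only at one end of the outcome has zero coverage in the remaining bins, so it would anchor the concept for only part of the sample; "happy" covers all four bins.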
| Argument | Type | Default | Description |
|---|---|---|---|
| y | array-like | (required) | Outcome variable (numeric or categorical) |
| top_k | int | 30 | Maximum number of words to return |
| min_docs | int | 5 | Minimum document frequency |
| n_bins | int | 4 | Quantile bins for balanced coverage |
| corr_cap | float | 0.30 | Penalty threshold for outcome association |
| var_type | str | "continuous" | "continuous" or "categorical" |
corpus.evaluate_lexicon(y, lexicon, ...)
Score an existing lexicon against an outcome. Returns a LexiconResult bundling per-token diagnostics (.suggestions) and an aggregate coverage summary (.summary) — both saveable, with .report() producing a narrative markdown overview.
corpus = Corpus(texts, lang="en")
lex = corpus.evaluate_lexicon(y, lexicon=["happy", "sad", "anger"])
print(lex) # tabular view
lex.suggestions.save("tokens.csv") # per-token rows
lex.summary.save("coverage.csv") # aggregate stats
lex.report().save("lexicon.md") # narrative overview
| Argument | Type | Default | Description |
|---|---|---|---|
| y | array-like | (required) | Outcome variable (numeric or categorical) |
| lexicon | iterable[str] | (required) | Tokens to evaluate (matched against lemmatized corpus) |
| n_bins | int | 4 | Quantile bins for balanced coverage |
| corr_cap | float | 0.30 | Penalty threshold for outcome association |
| var_type | str | "continuous" | "continuous" or "categorical" |
.suggestions columns: token, freq, cov_all, cov_bal, corr, pvalue, direction, rank.
.summary fields: docs_any, cov_all, q1, q4, corr_any, hits_mean, hits_median, types_mean, types_median (plus group_cov for categorical y).
Fitting SSD
Create an SSD instance with embeddings, corpus, outcome, and lexicon.
The constructor builds document vectors but does not fit a model — call fit_pls(), fit_multipls(), fit_ols(), or fit_groups() explicitly.
from ssdiff import Embeddings, Corpus, SSD
emb = Embeddings.load("model.ssdembed")
emb.normalize(l2=True, abtt=1)
corpus = Corpus(texts, lang="en")
ssd = SSD(
    emb, corpus, y=scores,
    lexicon=["word1", "word2", "word3"],
    window=3,            # context window +/-3 tokens around lexicon hits
    sif_a=1e-3,          # SIF weighting parameter
    use_full_doc=False,  # False = seed context windows (default)
)
PCA + OLS
Original SSD algorithm from the paper.
result = ssd.fit_ols(
    fixed_k=None,  # None = auto-select via interpretability+stability sweep
    k_min=2,
    k_max=120,
    k_step=2,
    verbose=False,
)
| Argument | Type | Default | Description |
|---|---|---|---|
| fixed_k | int or None | None | Fixed PCA components. None = auto-select via sweep |
| k_min | int | 2 | Minimum PCA-K for sweep |
| k_max | int | 120 | Maximum PCA-K for sweep |
| k_step | int | 2 | Step size |
| verbose | bool | False | Print progress |
Automatic K selection (PCA sweep)
Selecting the number of PCA components (fixed_k = K) can be a researcher degree of freedom. Pass fixed_k=None (the default) to run an automatic PCA sweep that evaluates a range of K values and selects the most robust solution.
For each candidate PCA dimensionality K, the sweep fits SSD and tracks:
- Interpretability quality: based on clustering the nearest neighbors at each pole of the semantic gradient and computing aggregate cluster coherence and alignment with beta.
- Stability of the semantic gradient: measured as the cosine change between consecutive gradients, beta_delta = 1 - cos(gradient(K-1), gradient(K)). Smaller values mean more stable gradients.
These signals are smoothed using an AUCK window.
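The stability signal is simple to compute by hand. The sketch below applies the beta_delta formula to gradients from consecutive candidate K values; the function name `beta_delta`, the `gradients` dictionary, and its toy values are hypothetical and stand in for the gradients a real sweep would produce.

```python
import numpy as np

def beta_delta(g_prev, g_curr):
    # 1 - cosine similarity between two gradient vectors
    cos = float(g_prev @ g_curr / (np.linalg.norm(g_prev) * np.linalg.norm(g_curr)))
    return 1.0 - cos

# Hypothetical gradients obtained at three consecutive candidate K values
gradients = {
    2: np.array([1.0, 0.0, 0.0]),
    4: np.array([0.0, 1.0, 0.0]),  # orthogonal to the K=2 gradient
    6: np.array([0.0, 2.0, 0.0]),  # same direction as the K=4 gradient
}
ks = sorted(gradients)
deltas = {k: beta_delta(gradients[k_prev], gradients[k])
          for k_prev, k in zip(ks, ks[1:])}
```

Here the gradient changes completely between the first two candidates (delta of 1) and not at all between the last two (delta of 0); only direction matters, not length.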
result = ssd.fit_ols(fixed_k=None, k_min=2, k_max=120, verbose=True)
print(f"Selected K = {result.n_components}")
print(result.stats)
result.plot_sweep("sweep.png") # save sweep plot
result.plot_sweep() # display interactively
In the sweep plot, the blue curve shows detrended interpretability as a function of K, the orange curve shows solution stability, and the red vertical line marks the selected K.
PLS
PLS regression operates directly in the full embedding space, finding latent directions that maximize covariance between document vectors and the outcome without a separate dimensionality-reduction step. With a single component it recovers one semantic gradient in a single pass. With k="auto" (the default), the number of components is selected by plskit.pls1_find_k_optimal (selector r2_se). The reported p-value is always the k=1 split-half statistic, independent of selection.
result = ssd.fit_pls(
    k="auto",     # int, or "auto" for find_k_optimal
    k_max=5,      # cap for "auto"
    n_splits=50,  # split_nb iterations
    random_state=2137,
    verbose=False,
)
| Argument | Type | Default | Description |
|---|---|---|---|
| k | int or "auto" | "auto" | Number of PLS components. An int fits at exactly that k. "auto" calls plskit.pls1_find_k_optimal (selector r2_se, diagnostic split_nb); the p-value is the honest k=1 confirmatory split_nb statistic. |
| k_max | int | 5 | Cap for k="auto", clamped to min(k_max, n-1, D). Ignored when k is an int. |
| n_splits | int | 50 | Random splits for the split_nb test. |
| random_state | int | 2137 | Random seed. |
| verbose | bool | False | Print K-selection chain and confirmatory test progress. |
To re-run the test with different settings, call result.test(n_splits=200) — it overwrites result.stats.pvalue and result.test.pvalue in place.
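As a rough illustration of what a single PLS component does, the sketch below computes the first PLS1 weight vector on synthetic data. It uses the textbook construction (the unit direction proportional to Xᵀy on centered data, which maximizes covariance with the outcome) and is not the plskit code that ssdiff calls; the function name `pls1_first_component` and the synthetic data are hypothetical, and the split-based significance test is not reproduced.

```python
import numpy as np

def pls1_first_component(X, y):
    """First PLS1 weight vector and the matching regression direction."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    w = Xc.T @ yc
    w /= np.linalg.norm(w)   # unit direction of maximal covariance with y
    t = Xc @ w               # document scores on that direction
    q = (t @ yc) / (t @ t)   # regress y on the scores
    return w, q * w          # (weights, beta in embedding space)

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))              # stand-in for document vectors
true_dir = np.zeros(10)
true_dir[0] = 1.0
y = X @ true_dir + 0.1 * rng.normal(size=300)

w, beta = pls1_first_component(X, y)
gradient = beta / np.linalg.norm(beta)      # unit-length version of beta
```

On this synthetic example the recovered gradient points almost entirely along the first coordinate, which is the direction that generated the outcome.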
Multi-component PLS (in development)
When you expect more than one interpretable semantic axis related to the outcome, fit_multipls() fits k PLS components and rotates the W-subspace ("varimax" or "raw"). The returned MultiPLSResult is a container of per-dim leaves keyed by "dim-1", "dim-2", … (one per rotated axis).
result = ssd.fit_multipls(
    k="auto",          # or an int
    k_max=5,
    rotate="varimax",  # or "raw"
    rotation_vocab=50_000,
    n_splits=50,
    random_state=2137,
    verbose=False,
)
print(result.stats) # container-level r², pvalue, n_components
print(result.test) # honest k=1 confirmatory split_nb
result.words # pivoted top-words view across rotated dims
result["dim-1"].words # zoom into rotated axis 1
result["dim-1"].clusters.pos # cluster +β neighbours on that axis
| Argument | Type | Default | Description |
|---|---|---|---|
| k | int or "auto" | "auto" | Number of PLS components. Same semantics as fit_pls. |
| k_max | int | 5 | Cap for k="auto", clamped to min(k_max, n-1, D). |
| rotate | "raw" or "varimax" | "varimax" | Rotation applied to the W-subspace. |
| rotation_vocab | int or None | 50_000 | Leading vocabulary rows fed to varimax as the simple-structure target. Assumes frequency-ranked vocab. None uses the full matrix. No-op for rotate="raw". |
| n_splits, random_state, verbose | see fit_pls | see fit_pls | Same meaning and defaults as fit_pls. |
Container-level p-value follows fit_pls semantics (honest k=1 confirmatory). Each rotated leaf carries a diagnostic per-dim p-value remapped via the mpls_fit rotation order.
Status. API is stable for research use; feature parity with PLSResult (per-leaf docs, snippets, misdiagnosed) is still being rolled out. See examples/demo_multipls.py and docs/api_reference.md. RAM-efficient embeddings (Embeddings.load(ram_efficient=True)) are not supported by fit_multipls: it needs the full vocabulary as a rotation target.
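For intuition about what a varimax rotation does to a set of component weights, here is a small self-contained sketch using Kaiser's classic pairwise planar rotations. It is not the rotation code used by fit_multipls (which may use a different algorithm and rotates against the leading vocabulary rows); the functions `varimax` and `varimax_criterion` and the toy loading matrix are hypothetical.

```python
import numpy as np

def varimax_criterion(L):
    # Sum over columns of the variance of the squared loadings
    return float(np.var(L**2, axis=0).sum())

def varimax(W, max_sweeps=50, tol=1e-10):
    """Raw varimax via Kaiser's pairwise planar rotations."""
    L = np.array(W, dtype=float)
    p, k = L.shape
    for _ in range(max_sweeps):
        largest = 0.0
        for i in range(k - 1):
            for j in range(i + 1, k):
                x, y = L[:, i].copy(), L[:, j].copy()
                u, v = x**2 - y**2, 2.0 * x * y
                num = 2.0 * (u @ v) - 2.0 * u.sum() * v.sum() / p
                den = (u @ u - v @ v) - (u.sum()**2 - v.sum()**2) / p
                phi = 0.25 * np.arctan2(num, den)  # optimal angle for this pair
                c, s = np.cos(phi), np.sin(phi)
                L[:, i], L[:, j] = c * x + s * y, -s * x + c * y
                largest = max(largest, abs(phi))
        if largest < tol:  # no pair needed a meaningful rotation
            break
    return L

# A clean two-axis structure, deliberately rotated by 30 degrees
simple = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
theta = np.deg2rad(30)
mix = np.array([[np.cos(theta), np.sin(theta)],
                [-np.sin(theta), np.cos(theta)]])
W = simple @ mix

rotated = varimax(W)
```

The rotation undoes the artificial mixing: each row of `rotated` loads on one axis only, which is the "simple structure" that makes rotated axes easier to interpret one at a time.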
Cross-Group Comparison
When your research question involves categorical groups rather than a continuous outcome, use ssd.fit_groups().
| Scenario | Use |
|---|---|
| Continuous outcome (scale score, rating) | fit_pls() or fit_ols() |
| Categorical groups (diagnosis, condition) | fit_groups() |
| Continuous outcome AND group labels | Both — fit_pls() for the continuous analysis, fit_groups() for the group comparison |
# Categorical groups
ssd = SSD(emb, corpus, y=group_labels, lexicon=lexicon)
result = ssd.fit_groups(n_perm=5000, correction="holm")
# Or: median split on continuous y
ssd = SSD(emb, corpus, y=scores, lexicon=lexicon)
result = ssd.fit_groups(median_split=True, n_perm=5000)
| Argument | Type | Default | Description |
|---|---|---|---|
| median_split | bool | False | Split continuous y into "low"/"high" at median |
| n_perm | int | 5000 | Permutation iterations |
| correction | str | "holm" | P-value correction: "holm", "bonferroni", "fdr_bh", "none" |
| random_state | int | 2137 | Random seed |
Groups with fewer than 20 documents are automatically dropped.
Groups are canonicalised internally — original labels are remapped to "g1", "g2", … (in sorted order). The original-label mapping is exposed on result.group_labels.
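The logic of permutation inference for a two-group contrast can be sketched as follows. This is a generic illustration, not the test implemented in fit_groups: the statistic used here (the norm of the difference between group mean vectors) is an assumption, and the function name `permutation_pvalue` and the synthetic data are hypothetical.

```python
import numpy as np

def permutation_pvalue(X, labels, n_perm=2000, seed=2137):
    """Permutation p-value for the distance between two group centroids."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    a, b = np.unique(labels)  # this sketch handles exactly two groups

    def stat(lab):
        # Assumed test statistic: norm of the difference of group means
        return np.linalg.norm(X[lab == a].mean(axis=0) - X[lab == b].mean(axis=0))

    observed = stat(labels)
    hits = sum(stat(rng.permutation(labels)) >= observed for _ in range(n_perm))
    # Add-one correction keeps the p-value strictly above zero
    return observed, (hits + 1) / (n_perm + 1)

rng = np.random.default_rng(0)
group_a = rng.normal(size=(40, 5))
group_b = rng.normal(size=(40, 5))
group_b[:, 0] += 1.5  # shift one dimension so the groups really differ
X = np.vstack([group_a, group_b])
labels = np.array(["g1"] * 40 + ["g2"] * 40)

T_obs, p_shifted = permutation_pvalue(X, labels)
```

Shuffling the labels destroys any real group structure, so the share of shuffled statistics that reach the observed one estimates how surprising the observed separation is under the null.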
Interpreting group results
print(result) # header + view directory
print(result.stats) # G, n_kept, n_perm, correction, omnibus pvalue
print(result.test) # omnibus pvalue + pairwise contrasts block
# Pairwise rows (T, p_raw, p_corrected, cohens_d, n_g1, n_g2 per contrast)
result.pairs # PairsView — exports via .to_df() / .save()
# Pivoted interpretation across all contrasts (adds a "contrast" column)
result.words.pos
result.clusters.pos(topn=100)
result.snippets.pos
# Zoom into one pair → PairResult (canonical keys: "g1", "g2", ...)
pair = result[("g1", "g2")]
pair.words.pos
pair.clusters.pos
pair.snippets
# Re-run the permutation test with different settings
result.test(n_perm=10_000, correction="fdr_bh")
Key attributes:
- result.G: number of retained groups (after the 20-doc minimum filter)
- result.n_kept: total documents across retained groups
- result.group_labels: dict mapping canonical keys ("g1", …) to original labels
- result.test.omnibus_T, result.test.omnibus_p: omnibus statistic and permutation p
- result.pairs: list-like view of Pair rows with per-contrast T, p_raw, p_corrected, cohens_d, n_g1, n_g2, contrast_norm
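To show how p_raw relates to p_corrected under correction="holm", here is the standard Holm step-down adjustment in plain Python. The function name `holm_correction` and the three example p-values are hypothetical; the package's own implementation is not reproduced here.

```python
def holm_correction(p_values):
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p is multiplied by m, the next by m - 1, and so on
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: an adjusted p never drops below an earlier one
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted

# Three pairwise contrasts, as with three groups (made-up raw p-values)
p_raw = [0.01, 0.04, 0.03]
p_corrected = holm_correction(p_raw)  # approximately [0.03, 0.06, 0.06]
```

With three contrasts, the smallest raw p-value is tripled, the next is doubled, and the largest is carried up to match, so a contrast that looks significant on its own (0.04) may no longer be after correction.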
Inspecting results
Both PLSResult and PCAOLSResult share the same interpretation API — everything is a printable, exportable view:
print(result) # header + view/array directory
print(result.stats) # backend, r², r²_adj (OLS only), p, n_kept,
# β-norm, Δ (per +0.10 cosine), IQR effect,
# |corr(y, ŷ)|, y_mean, y_std
print(result.fit_info) # n_components, p_at_k, random_state,
# plus PCA-K sweep info for OLS
# Direct array attributes (numpy ndarrays)
result.beta # raw direction in embedding space
result.gradient # unit-length version of beta
result.beta_norm # ||beta|| (effect-size summary)
result.alignment_scores # per-doc cosine to gradient
result.n_components # number of PLS / PCA components
# Comprehensive narrative report — every section is on by default; pass
# section=False to drop one.
print(result.report(clusters={"n": 100, "n_words": 10, "n_snippets": 2},
extreme_docs={"n": 30}, misdiagnosed={"n": 20}))
result.report().save("report.md") # also .html / .docx / .tex
# Re-run the significance test in place
result.test(n_splits=200) # PLSResult — overwrites stats.pvalue
For MultiPLSResult and GroupResult, see the "Multi-component PLS" and "Cross-Group Comparison" sections above.
Neighbors & Clustering
Nearest neighbors
result.words is a tabular view with columns side, rank, word, cos_beta:
result.words # default: top 20 per pole
result.words.pos # one-sided, default 20 rows
result.words.pos(50) # resize to 50 rows
result.words.neg(None) # all available rows on this side
# Standard view exports
result.words.to_df() # pandas DataFrame
result.words.save("words.csv") # csv / json / md / xlsx / docx / tex
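Conceptually, the cos_beta column is the cosine between each vocabulary vector and beta, with the highest values forming the positive pole and the lowest the negative pole. The sketch below reproduces that ranking on a toy vocabulary; the function `pole_neighbors` and the four 2-d vectors are hypothetical.

```python
import numpy as np

def pole_neighbors(words, vectors, beta, topn=2):
    """Rank words by cosine with beta; return the +beta and -beta poles."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    cos_beta = unit @ (beta / np.linalg.norm(beta))
    order = np.argsort(cos_beta)  # ascending cosine
    pos = [(words[i], float(cos_beta[i])) for i in order[::-1][:topn]]
    neg = [(words[i], float(cos_beta[i])) for i in order[:topn]]
    return pos, neg

words = ["hope", "calm", "fear", "loss"]
vectors = np.array([
    [1.0, 0.1],
    [0.8, 0.6],
    [-1.0, 0.2],
    [-0.6, 0.8],
])
beta = np.array([1.0, 0.0])

pos, neg = pole_neighbors(words, vectors, beta)
```

In this toy case "hope" and "calm" end up at the positive pole and "fear" and "loss" at the negative pole, each with its cosine to beta.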
Clustering neighbors into themes
result.clusters k-means clusters the top neighbours per pole (k auto-selected via silhouette unless pinned):
result.clusters.pos # default topn=100
result.clusters.pos(topn=200, k=4) # recompute with different params
result.clusters.pos(cluster_id=0).words # zoom into one cluster
result.clusters.pos(cluster_id=0).snippets # snippets aligned with that centroid
result.clusters.words # flat per-side cluster-words table
# Columns: cluster_id, side, size, coherence, centroid_cos_beta
result.clusters.pos.to_df()
result.clusters.save("clusters.csv")
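The idea of choosing the number of clusters by silhouette can be illustrated with a compact NumPy version of k-means. This is a sketch under simplifying assumptions, not the package's clustering code: it uses Euclidean distance on toy 2-d points, a deterministic farthest-point initialisation chosen for reproducibility, and hypothetical helper names (`kmeans`, `mean_silhouette`, `pick_k`).

```python
import numpy as np

def kmeans(X, k, n_iter=50):
    # Deterministic farthest-point initialisation (assumption, for reproducibility)
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels

def mean_silhouette(X, labels):
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = []
    for i in range(len(X)):
        own = labels == labels[i]
        if own.sum() == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        a = D[i, own].sum() / (own.sum() - 1)  # mean distance within own cluster
        b = min(D[i, labels == j].mean()       # nearest other cluster
                for j in set(labels.tolist()) if j != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

def pick_k(X, k_values):
    scored = {k: mean_silhouette(X, kmeans(X, k)) for k in k_values}
    return max(scored, key=scored.get), scored

rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + 0.5 * rng.normal(size=(20, 2)) for c in blob_centers])

best_k, scores = pick_k(X, [2, 3, 4, 5])
```

With three well-separated groups of points, the mean silhouette peaks at k=3; merging two groups or splitting one both lower it.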
Interpreting with Snippets
After fitting, SSD lets you link the abstract directions in embedding space back to actual language by inspecting text snippets near seed-word occurrences. Snippets are pulled from the Corpus attached at fit time — no need to pass pre_docs manually.
result.snippets # default: top 30 per pole
result.snippets.pos # SnippetsViewSided, top 30
result.snippets.pos(50) # resize
result.snippets(top_per_side=200, min_cosine=0.1) # recompute extraction
# Snippets aligned with a specific cluster centroid
result.clusters.pos(cluster_id=0).snippets
# Columns: snippet_id, side, doc_id, cosine, seed, start/end indices,
# text_window, text_surface, text_lemmas, cluster_id, contrast
result.snippets.to_df()
result.snippets.save("snippets.xlsx")
The snippet extraction:
- Locates each occurrence of a seed word in the corpus.
- Extracts a small window of surrounding context.
- Represents that window as a SIF-weighted context vector.
- Computes cosine similarity between the context vector and β, ranking snippets by alignment.
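The final ranking step above can be sketched as follows, starting from context vectors that have already been computed. The function `rank_windows` and the toy windows are hypothetical, and treating min_cosine as a threshold on absolute alignment is an assumption about how such a filter might behave, not a description of the package.

```python
import numpy as np

def rank_windows(windows, beta, top_per_side=2, min_cosine=0.1):
    """windows: list of (doc_id, text_window, context_vector) tuples."""
    beta_unit = beta / np.linalg.norm(beta)
    scored = []
    for doc_id, text, vec in windows:
        cosine = float(vec @ beta_unit / np.linalg.norm(vec))
        # Assumption: weakly aligned windows are dropped on either side
        if abs(cosine) >= min_cosine:
            scored.append((cosine, doc_id, text))
    scored.sort(key=lambda row: row[0])
    neg = [row for row in scored if row[0] < 0][:top_per_side]
    pos = [row for row in reversed(scored) if row[0] > 0][:top_per_side]
    return pos, neg

# Toy context windows with made-up 2-d context vectors
windows = [
    (0, "a bright future ahead", np.array([1.0, 0.1])),
    (1, "some unrelated remark", np.array([0.0, 1.0])),
    (2, "a looming threat", np.array([-1.0, 0.2])),
    (3, "cautious optimism", np.array([0.5, 0.5])),
]
beta = np.array([1.0, 0.0])

pos, neg = rank_windows(windows, beta)
```

The window orthogonal to beta is filtered out, and the remaining ones are split into the most positively and most negatively aligned snippets.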
Per-Document SSD Scores
result.docs exposes per-document predictions and the cosine alignment score (the SSD score, ⟨d_i, gradient⟩):
result.docs # all rows; columns: doc_id, y_true,
# y_hat, residual, alignment_score
result.docs.pos(20) # 20 most β-positive (highest y_hat)
result.docs.neg(20) # 20 most β-negative
result.docs.id(42) # single-doc detail (incl. raw text)
# Misdiagnosed — largest |residual|
result.docs.misdiagnosed(20) # both over and under
result.docs.misdiagnosed(20, direction="over") # y_hat > y_true
result.docs.misdiagnosed(20, direction="under") # y_hat < y_true
result.docs.to_df()
result.docs.save("docs.csv")
The full per-document alignment vector is also available directly:
result.alignment_scores # ndarray of shape (n_kept,)
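The two quantities in this section, the alignment score and the residual-based "misdiagnosed" ranking, can be reproduced by hand from arrays. The sketch below is illustrative only: the helper names mirror the view methods but are hypothetical, the data are made up, and the residual sign (y_hat minus y_true) is chosen to match the over/under description above.

```python
import numpy as np

def alignment_scores(doc_vectors, beta):
    # Cosine between each document vector and the unit-length gradient
    gradient = beta / np.linalg.norm(beta)
    unit_docs = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    return unit_docs @ gradient

def misdiagnosed(y_true, y_hat, n, direction=None):
    """Indices of the n documents with the largest absolute residual."""
    residual = y_hat - y_true  # positive = over-prediction
    idx = np.arange(len(residual))
    if direction == "over":
        idx = idx[residual > 0]
    elif direction == "under":
        idx = idx[residual < 0]
    return idx[np.argsort(-np.abs(residual[idx]))][:n]

doc_vectors = np.array([[2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]])
beta = np.array([4.0, 0.0])
scores = alignment_scores(doc_vectors, beta)

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.5, 2.0, 1.0, 4.2])
worst = misdiagnosed(y_true, y_hat, 2)
over = misdiagnosed(y_true, y_hat, 2, direction="over")
under = misdiagnosed(y_true, y_hat, 2, direction="under")
```

Because both sides are normalized, the alignment score depends only on direction: a document pointing along beta scores 1, an orthogonal one 0, and an opposite one -1.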
API Summary
The ssdiff top-level package exports three primary classes plus result and view classes:
from ssdiff import Embeddings, Corpus, SSD
# Result / view classes (re-exported for type hints, isinstance checks, pickling):
from ssdiff import (
    PLSResult, PCAOLSResult, GroupResult, LexiconResult,
    WordsView, WordsViewSided, ClustersView, ClustersViewSided,
    ClusterWordsView, ClusterWordsViewSided, SnippetsView, SnippetsViewSided,
)
# In-development; not exported at top level:
from ssdiff.results.multi_pls_result import MultiPLSResult
Embeddings
- Embeddings.load(path, *, verbose=False, parallel=False, ram_efficient=False): load .ssdembed, .kv, .bin, .txt, .vec (and .gz variants)
- .normalize(l2=True, abtt=1, re_normalize=True): in-place L2 + ABTT; tracks state, safe to call repeatedly
- .save(filename=None, fmt="ssdembed"): save to native, text, binary, or gensim format
- emb["word"] / emb.get_vector("word", norm=False): vector lookup
- "word" in emb: membership check
- len(emb) / .vocab_size: vocabulary size
- .vector_size (alias .dim): embedding dimensionality
- .similar_by_vector(vec, topn=10, restrict_vocab=None): nearest neighbor search
Corpus
- Corpus(texts, *, lang=None, model=None, nlp=None, stopwords=None, pretokenized=False, auto_download=None)
- .docs: lemmatized tokens per document
- .pre_docs: sentence-level structure for snippet extraction
- .n_texts: number of documents
- .suggest_lexicon(y, *, top_k=30, ...) -> LexiconResult: data-driven seed word selection
- .evaluate_lexicon(y, lexicon, ...) -> LexiconResult: score an existing lexicon (per-token + aggregate)
SSD
- SSD(embeddings, corpus, y, lexicon, *, window=3, sif_a=1e-3, use_full_doc=False)
- .fit_pls(*, k="auto", k_max=5, n_splits=50, random_state=2137, verbose=False) -> PLSResult
- .fit_multipls(*, k="auto", k_max=5, rotate="varimax", rotation_vocab=50_000, n_splits=50, ...) -> MultiPLSResult (in development)
- .fit_ols(*, fixed_k=None, k_min=2, k_max=120, k_step=2, verbose=False) -> PCAOLSResult
- .fit_groups(*, median_split=False, n_perm=5000, correction="holm", random_state=2137, verbose=False) -> GroupResult
PLSResult / PCAOLSResult
Direct array attributes: beta, gradient, beta_norm, alignment_scores, n_components, x, y. PLS adds component_scores, component_weights, find_k_result, cv_scores. PCA+OLS adds pca_components, pca_weights, pca_k, sweep_result.
Scalar views (all expose .r2, .pvalue, … as attributes; print to read, export with .to_df() / .save(...)):
- .stats: backend, r2, r2_adj (OLS only), pvalue, n_raw, n_kept, n_dropped, y_mean, y_std, beta_norm, delta, iqr_effect, y_corr_pred
- .fit_info: n_components, p_at_k, n_splits, random_state, plus PCA-K sweep info for OLS
Tabular views:
- .words → WordsView (with .pos / .neg → WordsViewSided, callable (n) to resize)
- .clusters → ClustersView (with .pos / .neg → ClustersViewSided, callable (topn=…, k=…) to recompute or (cluster_id) to zoom)
- .snippets → SnippetsView (with .pos / .neg, callable (top_per_side=…) to recompute)
- .docs → DocsView with .pos(k), .neg(k), .misdiagnosed(k, direction=…), .id(doc_id)
- .sweep → SweepView (PCA+OLS only): per-K interpretability/stability rows
- .test → TestView: callable to re-run the test in place (result.test(n_splits=200) overwrites stats.pvalue and test.pvalue)
Methods:
- .report(clusters=True, top_words=True, extreme_docs=True, misdiagnosed=True) -> Report: every section is on by default; pass section=False to drop one. Each section toggle is True / False / None / dict (e.g. clusters={"n": 20, "n_words": 5, "n_snippets": 1}). Stats + Fit info are always included. Use .to_text(), .to_html(), .save("report.md").
- .attach(corpus=None, embeddings=None): re-attach after un-pickling
- .plot_sweep(path=None): PCA-K sweep chart (PCAOLSResult only)
GroupResult
Direct attributes: G, n_kept, n_perm, correction, random_state, group_labels (canonical → original label dict), x, groups, beta, gradient, beta_norm, alignment_scores.
Views: .stats, .test (omnibus pvalue, omnibus_T, omnibus_p), .pairs (per-contrast T, p_raw, p_corrected, cohens_d, n_g1, n_g2), .words, .clusters, .snippets (all pivoted across contrasts, add a contrast column).
Pair access: result[("g1", "g2")] → PairResult (canonical keys only) with its own .words, .clusters, .snippets, .gradient, .beta. Use result.keys() to list available pair keys; result.group_labels to map canonical → original.
Methods: .report(clusters=True, top_words=True) — both sections on by default; pass section=False to drop one. Each toggle is True / False / None / dict (e.g. clusters={"n": 20, "n_words": 5, "n_snippets": 1}). Omnibus + Group labels + Pairwise contrasts are always included. .test(n_perm=…, correction=…) (re-runs in place); .attach(...).
Lexicon utilities
The lexicon helpers are methods on Corpus, not standalone imports:
corpus = Corpus(texts, lang="en")
suggestions = corpus.suggest_lexicon(y, top_k=30) # → LexiconResult
lex = corpus.evaluate_lexicon(y, lexicon=["happy", "sad"]) # → LexiconResult
LexiconResult views (.suggestions, .summary) and .report() support .to_df() (requires ssdiff[results]), .to_dict(), .to_records(), and .save("file.{csv,json,md,xlsx,docx,tex,html}").
Citing & License
- License: GPL v3 (see LICENSE).
- If you use SSD in published work, please cite the associated paper.
- A suggested citation:
Plisiecki, H., Lenartowicz, P., Pokropek, A., Malyska, K., & Flakus, M. (2025). Measuring Individual Differences in Meaning: The Supervised Semantic Differential. PsyArXiv. https://doi.org/10.31234/osf.io/gvrsb_v1
Questions / Contributions
- File issues and feature requests on the repo's Issues page.
- Pull requests welcome — especially for:
- Robustness diagnostics and visualization helpers
- Documentation improvements
Contact: hplisiecki@gmail.com
Project was funded by the National Science Centre, Poland (grant no. 2020/38/E/HS6/00302).