SCPTM: Structural Contextual Probabilistic Topic Model — a VAE-GNN topic model with syntactic dependency graphs, contextual word embeddings, and beta temperature scaling.

SCPTM — Structural Contextual Probabilistic Topic Model

A VAE-based topic model that combines heterogeneous graph neural networks over syntactic dependency graphs with contextual SBERT word embeddings.

Architecture overview

Documents ──SBERT──► doc embeddings ┐
                                    ├─► HeteroConv/GAT ──► μ, logσ² ──► z ──► θ (topic mix)
Vocabulary ──SBERT──► word embeddings ┘                                          │
                           │                                                     │
                     K-means init                                                 │
                           │                                                     ▼
                     topic_embeddings ──cosine/T──► β (topic×vocab) ──θ·β──► recon loss
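
The bottom row of the diagram (topic embeddings, cosine/T, β, θ·β, reconstruction loss) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not SCPTM's implementation: the helper names (`cosine`, `softmax`, `beta_row`, `reconstruct`), the toy 2-dimensional embeddings, and the use of a multinomial negative log-likelihood as the reconstruction loss are all chosen for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def beta_row(topic_emb, word_embs, temperature=0.1):
    """One row of beta: softmax over the vocabulary of cosine / T."""
    return softmax([cosine(topic_emb, w) / temperature for w in word_embs])

def reconstruct(theta, beta):
    """Word distribution for one document: theta (K,) times beta (K, V)."""
    vocab_size = len(beta[0])
    return [sum(theta[k] * beta[k][v] for k in range(len(theta)))
            for v in range(vocab_size)]

# Toy setup: 2 topics, 3 vocabulary words, 2-d embeddings (hypothetical values).
word_embs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
topic_embs = [[1.0, 0.1], [0.1, 1.0]]
beta = [beta_row(t, word_embs) for t in topic_embs]

theta = [0.8, 0.2]                    # topic mixture for one document
p_words = reconstruct(theta, beta)    # sums to 1 because theta and each beta row do

bow = [3, 0, 1]                       # observed word counts for the document
recon_loss = -sum(c * math.log(p) for c, p in zip(bow, p_words))
```

Because every row of β and θ itself are probability distributions, θ·β is a valid distribution over the vocabulary, which is what lets it be scored directly against the bag-of-words counts.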

Key design choices:

Component	What it does
HeteroConv / GAT encoder	Propagates information through doc→word, word→word (syntax) and word→doc edges to produce per-document latent representations
Contextual beta	At evaluation: per-word topic affinity computed via attention pooling over SBERT sentence embeddings. At training: differentiable cosine similarity with temperature scaling
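Beta temperature (T=0.1)	Cosine similarities between 384-dimensional embeddings concentrate near 0 (for random directions, std ≈ 1/√384 ≈ 0.051), so an unscaled softmax over them is almost uniform. Dividing by T stretches the logit range from [−1, +1] to [−10, +10], which makes the softmax discriminative and keeps gradients from vanishing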
Word k-means init	Topic embeddings are initialised from k-means centroids of the word embedding space (not of the document embeddings), so each topic starts with high cosine similarity to a cluster of nearby vocabulary words from epoch 1
VAE with KL annealing	Linear/cyclical schedule + free bits (per-dimension KL floor) to prevent posterior collapse
Topic diversity loss	Cosine repulsion between topic embedding pairs to prevent topic collapse
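
As an illustration of the cosine repulsion idea in the last row, the sketch below penalises the mean squared cosine similarity over all distinct topic pairs. The exact form of SCPTM's penalty is not stated here, so the function name `diversity_penalty`, the squared form, and the toy vectors are assumptions made for the example.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def diversity_penalty(topic_embs):
    """Mean squared cosine similarity over all distinct topic pairs."""
    pairs = list(combinations(topic_embs, 2))
    return sum(cosine(u, v) ** 2 for u, v in pairs) / len(pairs)

# Hypothetical topic embeddings: three near-duplicates vs. three orthogonal topics.
collapsed = [[1.0, 0.0, 0.0], [0.99, 0.01, 0.0], [0.98, 0.02, 0.0]]
spread = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

penalty_collapsed = diversity_penalty(collapsed)   # close to 1
penalty_spread = diversity_penalty(spread)         # exactly 0

# The penalty would be added to the training loss with a weight such as
# topic_diversity_weight = 0.1 (the config default shown in this README).
weighted = 0.1 * penalty_collapsed
```

Squaring the similarity pushes topic pairs toward orthogonality rather than toward opposite directions; a variant that only penalises positive similarity would be an equally plausible design.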

Installation

From PyPI:

pip install scptm

# With comparison benchmarks (BERTopic, CTM)
pip install "scptm[benchmark]"

# All optional dependencies
pip install "scptm[full]"

For development (editable install):

git clone https://github.com/a-meneghini/scptm.git
cd scptm
pip install -e ".[dev]"

Required spaCy models:

python -m spacy download en_core_web_sm   # English
python -m spacy download it_core_news_sm  # Italian

Note on torch-geometric: SCPTM depends on PyTorch Geometric (torch-geometric>=2.4), which is available on standard PyPI. If you need CUDA-accelerated graph operations, install the CUDA-specific wheel first following the official PyG installation guide before installing SCPTM. CPU-only installs work out of the box with pip install scptm.

Quick start

from scptm import SCPTM, SCPTMConfig

documents = [
    "Machine learning is transforming healthcare diagnostics.",
    "Deep neural networks achieve state-of-the-art performance in NLP.",
    "Climate change accelerates biodiversity loss in tropical regions.",
    # ... hundreds more
]

# One-liner with defaults (10 topics, filtered syntax graph, English)
model = SCPTM()
theta = model.fit_transform(documents)    # (n_docs, K) topic mixtures

# Topic overview
model.get_topic_info(top_k=10)

# Out-of-sample inference
new_theta = model.transform(["A new document about AI research."])

# Evaluation
metrics = model.evaluate()
print(metrics)
# → {'npmi_coherence': 0.12, 'topic_diversity': 0.87, ...}

# Persist and reload
model.save("my_model.pkl")
model2 = SCPTM.load("my_model.pkl")

Configuration

All hyper-parameters live in SCPTMConfig. Passing keyword arguments to SCPTM() directly is a shorthand for SCPTM(config=SCPTMConfig(...)).

from scptm import SCPTM, SCPTMConfig

cfg = SCPTMConfig(
    # ── Model ──────────────────────────────────────────────────────────────
    num_topics          = 10,
    hidden_channels     = 64,       # GNN/MLP hidden size per attention head

    # ── Graph ──────────────────────────────────────────────────────────────
    graph_mode          = "filtered",
    # "none"      — no graph; pure MLP encoder (CTM-like baseline)
    # "no_syntax" — doc-word edges only, no word-word edges
    # "full_dep"  — all content dependency types
    # "filtered"  — informative dependency types only (default, recommended)

    # ── Training ───────────────────────────────────────────────────────────
    epochs              = 50,
    lr                  = 5e-3,
    batch_size          = 256,
    kl_max              = 1.0,
    kl_warmup_epochs    = 20,
    kl_strategy         = "linear",   # "linear" | "cyclical" | "constant"
    free_bits           = 0.1,        # per-dimension KL floor
    n_mc_samples        = 1,          # >1 enables MC uncertainty report

    # ── Beta ───────────────────────────────────────────────────────────────
    beta_temperature    = 0.1,        # softmax sharpening (lower = sharper)
    beta_refresh_epochs = 5,          # recompute contextual beta every N epochs
    max_ctx_occurrences = 50,         # max SBERT contexts stored per word

    # ── Regularisation ─────────────────────────────────────────────────────
    topic_diversity_weight = 0.1,     # cosine repulsion between topic embeddings

    # ── Corpus ─────────────────────────────────────────────────────────────
    lang                = "eng",      # "eng" | "ita"
    min_df              = 5,
    max_features        = 15_000,
    apply_chunking      = True,
    max_chunk_chars     = 800,

    # ── Keyword extraction ─────────────────────────────────────────────────
    bow_normalization   = "tf",       # "none" | "tf" | "log1p"
    keyword_method      = "cosine",   # "cosine" | "ctfidf"

    # ── Hardware ───────────────────────────────────────────────────────────
    use_mixed_precision = True,       # AMP on CUDA
    use_neighbor_sampling = False,    # NeighborLoader for large corpora

    # ── Reproducibility ────────────────────────────────────────────────────
    random_state        = 42,
)

model = SCPTM(config=cfg)
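
To make the `kl_strategy`, `kl_warmup_epochs`, `kl_max` and `free_bits` settings concrete, here is a minimal sketch of how such a schedule and KL floor are commonly computed. It is not taken from SCPTM's source: the function names are hypothetical, and the "cyclical" branch uses a simple sawtooth restart, which is only one of several cyclical variants.

```python
import math

def kl_weight(epoch, kl_max=1.0, warmup_epochs=20, strategy="linear"):
    """KL weight for a 0-indexed epoch under the three named strategies."""
    if strategy == "constant":
        return kl_max
    if strategy == "linear":
        # Ramp from kl_max / warmup_epochs up to kl_max, then hold.
        return kl_max * min(1.0, (epoch + 1) / warmup_epochs)
    if strategy == "cyclical":
        # Assumed variant: restart the linear ramp every warmup_epochs epochs.
        return kl_max * ((epoch % warmup_epochs) + 1) / warmup_epochs
    raise ValueError(f"unknown strategy: {strategy!r}")

def gaussian_kl_per_dim(mu, logvar):
    """KL(N(mu, sigma^2) || N(0, 1)) for each latent dimension."""
    return [0.5 * (m * m + math.exp(lv) - 1.0 - lv) for m, lv in zip(mu, logvar)]

def free_bits_kl(mu, logvar, free_bits=0.1):
    """Sum of per-dimension KL terms, each clamped from below at free_bits."""
    return sum(max(kl, free_bits) for kl in gaussian_kl_per_dim(mu, logvar))

# A posterior identical to the prior has zero KL, so the floor takes over.
collapsed_kl = free_bits_kl([0.0, 0.0], [0.0, 0.0])   # 2 dimensions x 0.1 floor
informative_kl = free_bits_kl([2.0], [0.0])           # 0.5 * 2^2 = 2.0, above the floor
```

With the floor in place, a dimension whose KL is already below `free_bits` contributes a constant to the loss, so the optimiser gains nothing by shrinking it further; that is the mechanism by which free bits discourage posterior collapse.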

Parse and embedding cache

spaCy lemmatisation, dependency parsing, and contextual SBERT embeddings are the dominant cost on large corpora. Passing edge_cache_path persists all of them to a single pickle file and skips re-computation on subsequent runs.

# First run — parses corpus, encodes contextual embeddings, writes cache
theta = model.fit_transform(documents, edge_cache_path="corpus.pkl")

# Subsequent runs — skips spaCy and SBERT contextual pass entirely
model2 = SCPTM(config=cfg)
theta2 = model2.fit_transform(documents, edge_cache_path="corpus.pkl")

The cache stores: vocabulary, BoW matrix, dependency edge lists, and the per-word contextual SBERT embeddings. If the corpus size or vocabulary changes, the stale cache is detected automatically and rebuilt.
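
How SCPTM detects a stale cache is not described here. One plausible approach, sketched below, is to store a fingerprint of the corpus size and vocabulary next to the cached payload and rebuild when it no longer matches. The names `fingerprint` and `load_or_build` and the cache layout are hypothetical.

```python
import hashlib
import os
import pickle

def fingerprint(n_docs, vocabulary):
    """Digest that changes when the corpus size or the vocabulary changes."""
    h = hashlib.sha256()
    h.update(str(n_docs).encode("utf-8"))
    for word in vocabulary:
        h.update(b"\x00" + word.encode("utf-8"))
    return h.hexdigest()

def load_or_build(path, n_docs, vocabulary, build):
    """Return (payload, was_cached); rebuild when the cache is missing or stale."""
    key = fingerprint(n_docs, vocabulary)
    if os.path.exists(path):
        with open(path, "rb") as f:
            cached = pickle.load(f)
        if cached.get("fingerprint") == key:
            return cached["payload"], True
    payload = build()  # the expensive step, e.g. parsing and embedding
    with open(path, "wb") as f:
        pickle.dump({"fingerprint": key, "payload": payload}, f)
    return payload, False
```

Pickle files can execute arbitrary code when loaded, so a cache file should only be reused if it comes from a trusted source.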

Keyword extraction methods

# Set globally
cfg = SCPTMConfig(keyword_method="ctfidf")

# Or override per call
model.get_topic_info(top_k=10, method="cosine")
model.get_topic_info(top_k=10, method="ctfidf")
model.get_topics_dict(top_k=5)          # returns single words + bigrams/trigrams

Method	Ranks by	Best for
`"cosine"` (default)	Cosine similarity between topic embedding and context-pooled word embedding	Semantically central terms
`"ctfidf"`	Class-based TF-IDF (each topic treated as a document class)	Discriminative / distinctive terms
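
For readers unfamiliar with class-based TF-IDF, the sketch below follows the formulation popularised by BERTopic: the score of a word in a class is its count in that class times log(1 + A / f), where A is the average number of tokens per class and f is the word's count across all classes. Whether SCPTM uses exactly this weighting is an assumption; the function names and toy tokens are illustrative only.

```python
import math
from collections import Counter

def ctfidf(class_tokens):
    """Map {class_id: [token, ...]} to {class_id: {token: score}}."""
    tf = {c: Counter(tokens) for c, tokens in class_tokens.items()}
    total = Counter()
    for counts in tf.values():
        total.update(counts)
    avg_words = sum(total.values()) / len(tf)   # A: mean tokens per class
    return {
        c: {w: n * math.log(1.0 + avg_words / total[w]) for w, n in counts.items()}
        for c, counts in tf.items()
    }

def top_words(scores, k):
    """The k highest-scoring words, ties broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:k]]

# Each topic is treated as one class made of the tokens assigned to it.
classes = {
    0: "model model neural network data".split(),
    1: "climate climate species forest data".split(),
}
scores = ctfidf(classes)
best_0 = top_words(scores[0], 1)
best_1 = top_words(scores[1], 1)
```

In this toy corpus "data" appears in both classes, so it is down-weighted relative to words such as "neural" that occur in only one, which is the discriminative behaviour the table describes.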

Iterative refinement

Iterative refinement alternates between standard training and blending each document embedding toward the centroid of its dominant topic. It is useful when the initial embedding space lacks clear cluster structure.

theta = model.fit(
    documents,
    iterative_refinement = True,
    n_refinement_steps   = 3,     # train → refine → train → ... (N steps)
    refinement_blend     = 0.2,   # alpha: 0 = no blend, 1 = full centroid
).theta
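
The blending step controlled by `refinement_blend` can be pictured as a linear interpolation between each document embedding and a topic centroid. The sketch below assumes the centroid is the mean embedding of the documents sharing a dominant topic; that definition and the function name `blend_toward_centroids` are assumptions, not SCPTM's documented behaviour.

```python
def blend_toward_centroids(doc_embs, dominant_topic, alpha=0.2):
    """Move each embedding a fraction alpha toward the centroid of its topic."""
    dim = len(doc_embs[0])
    groups = {}
    for emb, k in zip(doc_embs, dominant_topic):
        groups.setdefault(k, []).append(emb)
    # Centroid = mean embedding of all documents assigned to the same topic.
    centroids = {
        k: [sum(e[d] for e in embs) / len(embs) for d in range(dim)]
        for k, embs in groups.items()
    }
    return [
        [(1.0 - alpha) * e[d] + alpha * centroids[k][d] for d in range(dim)]
        for e, k in zip(doc_embs, dominant_topic)
    ]

# Toy 2-d embeddings: two documents in topic 0, one in topic 1.
embs = [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0]]
topics = [0, 0, 1]
refined = blend_toward_centroids(embs, topics, alpha=0.2)
```

With alpha = 0 the embeddings are unchanged and with alpha = 1 every document sits exactly on its centroid, matching the comment on `refinement_blend` above.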

Uncertainty quantification (Monte Carlo)

cfg = SCPTMConfig(n_mc_samples=20)
model = SCPTM(config=cfg)
model.fit(documents)

# Per-document uncertainty regime
df = model.get_uncertainty_report()
# Columns: doc_id, regime, mean_std_mc, entropy_theta, dominant_topic, ...
# Regimes: CERTAIN | MODERATE | AMBIGUOUS | POORLY_ENCODED
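
A rough picture of what Monte Carlo sampling adds: draw several latent vectors z from the encoder's Gaussian posterior, map each to a topic mixture, and summarise how much the mixtures vary. The sketch below assumes θ = softmax(z) and computes quantities named after the report columns; the actual definitions and the thresholds that separate the regimes are not given in this README, so none are reproduced here.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mc_theta(mu, logvar, n_samples=20, seed=0):
    """Sample z ~ N(mu, sigma^2) n_samples times and map each to softmax(z)."""
    rng = random.Random(seed)
    sigma = [math.exp(0.5 * lv) for lv in logvar]
    return [
        softmax([m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)])
        for _ in range(n_samples)
    ]

def summarize(samples):
    """Mean per-topic std across samples, entropy of the mean theta, argmax topic."""
    n, k = len(samples), len(samples[0])
    mean = [sum(s[j] for s in samples) / n for j in range(k)]
    std = [math.sqrt(sum((s[j] - mean[j]) ** 2 for s in samples) / n)
           for j in range(k)]
    entropy = -sum(p * math.log(p) for p in mean if p > 0)
    return {
        "mean_std_mc": sum(std) / k,
        "entropy_theta": entropy,
        "dominant_topic": max(range(k), key=mean.__getitem__),
    }

# A sharply peaked, low-variance posterior vs. a flat, high-variance one.
confident = summarize(mc_theta([5.0, 0.0, 0.0], [-6.0] * 3))
noisy = summarize(mc_theta([0.0, 0.0, 0.0], [1.0] * 3))
```

The two statistics capture different things: a high sample-to-sample standard deviation means the encoder itself is unsure, while high entropy of the mean mixture means the document is spread over several topics even when the encoder is stable.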

Comparison with baselines

To compare SCPTM against a CTM-like baseline and a TriTopic-like baseline:

import pandas as pd
from scptm import SCPTM, SCPTMConfig

docs = documents  # list of raw document strings, e.g. `documents` from Quick start
BASE = dict(num_topics=10, lang='eng', epochs=50, apply_chunking=False)

# CTM-like (no graph, MLP encoder only)
m_ctm = SCPTM(**BASE, graph_mode='none')
m_ctm.fit_transform(docs)
r_ctm = m_ctm.evaluate()

# TriTopic-like (no graph + iterative embedding refinement)
m_tri = SCPTM(**BASE, graph_mode='none')
m_tri.fit(docs, iterative_refinement=True, n_refinement_steps=3, refinement_blend=0.2)
r_tri = m_tri.evaluate()

# SCPTM with filtered syntax graph
m_full = SCPTM(**BASE, graph_mode='filtered')
m_full.fit_transform(docs)
r_full = m_full.evaluate()

# SCPTM + refinement
m_best = SCPTM(**BASE, graph_mode='filtered')
m_best.fit(docs, iterative_refinement=True, n_refinement_steps=3, refinement_blend=0.2)
r_best = m_best.evaluate()

rows = [
    ("CTM (no graph)",                r_ctm),
    ("TriTopic-like (no graph+refine)", r_tri),
    ("SCPTM (GNN filtered)",          r_full),
    ("SCPTM + refine",                r_best),
]
df = pd.DataFrame([
    {"model": name,
     "npmi": round(r.get("npmi_coherence", float("nan")), 3),
     "diversity": round(r.get("topic_diversity", float("nan")), 3)}
    for name, r in rows
])
print(df.to_string(index=False))

For a full sweep across all four graph modes:

results = SCPTM.run_ablation_study(documents, epochs=50)

Visualisations

model.plot_training()     # loss + KL annealing + NPMI + diversity curves
model.visualize_3d()      # interactive Plotly 3D semantic constellation
model.visualize_2d()      # high-res PNG for papers (300 dpi)

Architecture comparison

	LDA	BERTopic	CTM	TriTopic	SCPTM
Model type	Generative (BoW)	Clustering	VAE	Clustering + refinement	VAE-GNN
Input signal	Co-occurrence	Embeddings	SBERT	SBERT	SBERT + syntax
Syntactic graph	✗	✗	✗	✗	✓
Contextual word embeddings	✗	✗	✓	✓	✓
Out-of-sample inference	✓	✓	✓	✓	✓
MC uncertainty	✗	✗	✗	✗	✓
Iterative refinement	✗	✗	✗	✓	✓ (optional)
Multilingual	✗	✓	✓	partial	✓ (eng/ita)
Embedding cache	✗	✗	✗	✗	✓

When does the syntax graph help? On formal corpora (scientific papers, news, legal documents) syntactic dependencies carry strong discriminative signal. On short informal text (social media, chat) the gap over a CTM baseline is smaller; use graph_mode="none" as a fast sanity-check.

Notes on metrics

NPMI coherence (normalised pointwise mutual information, ranging from −1 to +1) measures how much more often a topic's top words co-occur in documents than they would by chance. Typical target: > 0.10. Scores < 0 are common on short informal text (Reddit, chat, social media), where words appear in isolation rather than in recurring co-occurrence patterns; this is a property of the corpus, not a model failure.
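
For reference, a minimal document-level NPMI computation is sketched below. It uses the standard definition, NPMI(a, b) = log(P(a,b) / (P(a)P(b))) / (−log P(a,b)), averaged over all pairs of a topic's top words. The function name, the handling of the two degenerate cases, and the toy corpus are choices made for this sketch; `model.evaluate()` may use a different reference corpus or windowing.

```python
import math
from itertools import combinations

def npmi_coherence(top_words, documents):
    """Mean NPMI over all pairs of top words, using document co-occurrence."""
    doc_sets = [set(doc) for doc in documents]
    n = len(doc_sets)

    def prob(*words):
        """Fraction of documents containing every given word."""
        return sum(all(w in d for w in words) for d in doc_sets) / n

    scores = []
    for a, b in combinations(top_words, 2):
        p_ab = prob(a, b)
        if p_ab == 0.0:
            scores.append(-1.0)   # never co-occur: lower bound by convention
        elif p_ab == 1.0:
            scores.append(0.0)    # both words in every document: no information
        else:
            pmi = math.log(p_ab / (prob(a) * prob(b)))
            scores.append(pmi / -math.log(p_ab))
    return sum(scores) / len(scores)

# Toy tokenised corpus with two clean themes.
corpus = [
    ["neural", "network"],
    ["neural", "network"],
    ["climate", "forest"],
    ["climate", "forest"],
]
coherent = npmi_coherence(["neural", "network"], corpus)
incoherent = npmi_coherence(["neural", "climate"], corpus)
```

Words that always appear together score +1 and words that never share a document score −1, which is why corpora of very short texts tend to pull the average below zero.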

Topic diversity is the fraction of unique words across all topics' top-word lists. The score lies in (0, 1]; values above 0.70 are generally considered good.
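
The diversity definition translates directly into a few lines of Python. The function name and the example word lists are illustrative, not part of the SCPTM API.

```python
def topic_diversity(topics):
    """Unique words divided by total words across all top-word lists."""
    all_words = [word for words in topics for word in words]
    return len(set(all_words)) / len(all_words)

# Three topics with three top words each; "data" is shared by two topics.
topics = [
    ["neural", "network", "data"],
    ["climate", "forest", "data"],
    ["court", "law", "judge"],
]
diversity = topic_diversity(topics)   # 8 unique words out of 9
```

A score of 1 means no word is shared between topics, while K identical lists give the minimum of 1/K.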

Citation

@software{meneghini2026scptm,
  author  = {Meneghini, Alessandro},
  title   = {{SCPTM}: Structural Contextual Probabilistic Topic Model},
  year    = {2026},
  url     = {https://github.com/a-meneghini/scptm}
}

License

MIT

This version

0.2.0

Jun 12, 2026
