Fast, low-memory spectral topic modeling via on-the-fly rectification

anchortopics

A fast, low-memory Python implementation of On-the-Fly Rectification (OTFR) spectral topic modeling, as described in Moontae Lee, Sungjun Cho, Kun Dong, David Mimno, and David Bindel, On-the-Fly Rectification for Robust Large-Vocabulary Topic Inference (ICML 2021). It builds on the Joint-Stochastic Matrix Factorization / Anchor Word framework (Arora et al. 2013; Lee et al. 2015, JSMF / pyJSMF-RAW).

Spectral ("anchor word") topic models recover topics from low-order word co-occurrence statistics in one shot — no sampling, no EM, just an eigendecomposition and some linear algebra. The catch is that empirical co-occurrence matrices are noisy, indefinite, and dense once rectified, so the classical pipeline costs O(N^2) space and O(N^2 K) time in the vocabulary size N. OTFR avoids ever forming the N x N matrix: it maintains the rectified co-occurrence as an implicit low-rank-plus-sparse operator (Y Y^T + E + r * 1 1^T) and recovers anchor words directly from the low-rank factor Y, bringing the cost down to O(N K) space and O(N K^2) time.
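To make the "implicit operator" idea concrete, here is a minimal NumPy sketch of a matrix-vector product with Y Y^T + E + r * 1 1^T that never allocates an N x N array. The function name, the coordinate-list representation of E, and the toy sizes are assumptions made for illustration; they are not the package's internal API.

```python
import numpy as np

def implicit_matvec(Y, E_rows, E_cols, E_vals, r, v):
    """Apply (Y Y^T + E + r * 1 1^T) to v without forming an N x N array.

    E is given in coordinate form: E[E_rows[i], E_cols[i]] = E_vals[i].
    (Hypothetical helper, for illustration only.)
    """
    out = Y @ (Y.T @ v)                          # low-rank part, O(N K)
    np.add.at(out, E_rows, E_vals * v[E_cols])   # sparse part, O(nnz(E))
    out += r * v.sum()                           # rank-one part: r * 1 (1^T v)
    return out

rng = np.random.default_rng(0)
N, K = 6, 2
Y = rng.standard_normal((N, K))
E_rows = np.array([0, 3, 3])
E_cols = np.array([3, 0, 5])
E_vals = np.array([0.5, 0.5, -0.2])
r = 0.1
v = rng.standard_normal(N)

# Dense reference, feasible only because N is tiny in this toy example.
E_dense = np.zeros((N, N))
E_dense[E_rows, E_cols] = E_vals
dense = Y @ Y.T + E_dense + r * np.ones((N, N))

result = implicit_matvec(Y, E_rows, E_cols, E_vals, r, v)
```

Each term costs at most O(N K) or O(nnz(E)) work, which is where the memory and time savings over the dense pipeline come from.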

Install

uv add anchortopics   # or: pip install anchortopics

Quickstart

from sklearn.feature_extraction.text import CountVectorizer
from anchortopics import OTFR

# `documents` is any iterable of raw text strings, one per document.
vectorizer = CountVectorizer(max_features=20000, stop_words="english")
X = vectorizer.fit_transform(documents)  # (n_documents, n_vocab) sparse counts

model = OTFR(n_topics=20).fit(X)

for k, words in enumerate(model.top_words(vectorizer.get_feature_names_out(), n_words=10)):
    print(k, words)

model.components_ is the (n_topics, n_vocab) word-topic distribution matrix (p(word | topic)), model.topic_correlation_ is the (n_topics, n_topics) topic-topic correlation matrix, and model.anchors_ holds the vocabulary indices selected as anchor words.

If you already have a co-occurrence matrix C (dense or sparse, N x N), use model.fit_cooccurrence(C) instead of fit(X).
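For small vocabularies, one way to build such a C by hand is the unbiased joint-stochastic estimator from the anchor-word literature: each document contributes (x x^T - diag(x)) / (n (n - 1)), where x is its count vector and n its length, and the contributions are averaged. The sketch below is a dense illustration only; `dense_cooccurrence` is a hypothetical helper, and whether fit_cooccurrence expects exactly this normalization is an assumption.

```python
import numpy as np

def dense_cooccurrence(X):
    """Joint-stochastic co-occurrence estimate from a dense
    (n_documents, n_vocab) count array (hypothetical helper).

    Documents with fewer than two tokens contain no word pairs
    and are skipped.
    """
    X = np.asarray(X, dtype=float)
    lengths = X.sum(axis=1)
    keep = lengths >= 2
    X, lengths = X[keep], lengths[keep]
    weights = 1.0 / (lengths * (lengths - 1.0))   # 1 / (ordered pairs per doc)
    # sum over documents of w * (x x^T - diag(x)), then average
    C = (X * weights[:, None]).T @ X - np.diag(weights @ X)
    return C / X.shape[0]

X = np.array([[2, 1, 0, 0],
              [0, 1, 3, 1],
              [1, 0, 1, 2]])
C = dense_cooccurrence(X)
```

The result is symmetric and its entries sum to one, because every document's contribution sums to one before averaging.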

For very large vocabularies or corpora, pass randomized_init=True to initialize the rectification with a one-pass randomized eigendecomposition computed directly from the word-document counts (Halko, Martinsson & Tropp 2011) instead of using ARPACK.
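As a rough illustration of the randomized approach, the sketch below applies the basic range-finder idea from Halko, Martinsson & Tropp to a symmetric matrix. Note that this is the simpler two-pass variant (it multiplies by the matrix twice) applied to an explicit array, not the one-pass, counts-based version the package describes; `randomized_eigh` and its parameters are hypothetical names.

```python
import numpy as np

def randomized_eigh(A, k, oversample=10, seed=0):
    """Approximate top-k eigenpairs of a symmetric matrix A
    (two-pass randomized sketch, for illustration only)."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    omega = rng.standard_normal((n, k + oversample))  # random test matrix
    Q, _ = np.linalg.qr(A @ omega)       # orthonormal basis for the sampled range
    small = Q.T @ (A @ Q)                # compress A to (k + oversample) dimensions
    w, V = np.linalg.eigh(small)         # cheap dense eigendecomposition
    top = np.argsort(w)[::-1][:k]        # indices of the k largest eigenvalues
    return w[top], Q @ V[:, top]

rng = np.random.default_rng(1)
G = rng.standard_normal((200, 5))
A = G @ G.T                              # symmetric PSD, exactly rank 5
w, U = randomized_eigh(A, k=5)
exact = np.linalg.eigvalsh(A)[::-1][:5]  # reference: 5 largest eigenvalues
```

On an exactly low-rank matrix like this toy A, the sketch recovers the leading eigenpairs to floating-point accuracy; on noisy matrices the quality depends on the oversampling and on how quickly the spectrum decays.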

How it works

  1. CooccurrenceOperator (anchortopics.cooccurrence) builds the unbiased joint-stochastic co-occurrence estimator of Arora et al. as an implicit linear operator over a sparse word-document count matrix, with O(nnz(X)) matrix-vector products — the dense matrix is never formed.
  2. enn_rectify (anchortopics.enn) runs Epsilon Non-Negative (ENN) rectification: iteratively project toward the nearest joint-stochastic, rank-K, (epsilon-)non-negative matrix, representing the non-negativity correction as a sparse matrix E rather than a dense one.
  3. law (anchortopics.law) runs the Low-rank Anchor Word algorithm: selects anchor words via column-pivoted QR performed in the K-dimensional compressed space (equivalent to pivoting on the full matrix, per Lemma 1 of the paper, but O(N K^2) instead of O(N^2 K)), then recovers the word-topic matrix B and topic correlation matrix A.
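The pivoting in step 3 can be pictured with a small greedy routine: repeatedly take the row with the largest residual norm, then project that direction out of every row. In exact arithmetic this selects the same pivots as column-pivoted QR applied to the transposed matrix. The sketch assumes a plain (N, K) array and a hypothetical function name; any normalization the library applies before pivoting is not shown.

```python
import numpy as np

def greedy_anchors(Y, k):
    """Select k rows of Y by greedy pivoting (illustrative sketch)."""
    R = np.array(Y, dtype=float)              # working copy of residual rows
    anchors = []
    for _ in range(k):
        norms = np.einsum("ij,ij->i", R, R)   # squared residual norm of each row
        p = int(np.argmax(norms))             # pivot: row farthest from current span
        anchors.append(p)
        q = R[p] / np.sqrt(norms[p])          # unit vector along the pivot row
        R -= np.outer(R @ q, q)               # remove that direction from all rows
    return anchors

# Synthetic check: rows 5, 0 and 9 are the "anchor" rows; every other row
# is a strict convex combination of them.
rng = np.random.default_rng(0)
vertices = np.diag([3.0, 2.0, 1.0])
W = rng.dirichlet(np.ones(3), size=12)
W[[5, 0, 9]] = np.eye(3)
Y = W @ vertices
anchors = greedy_anchors(Y, k=3)
```

Each pass touches only the N x K array, which is why working in the compressed space costs O(N K^2) overall rather than O(N^2 K).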

anchortopics.model.OTFR wires these together behind a small, sklearn-style fit API.
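For intuition about what the rectification in step 2 is aiming for, the sketch below shows the dense alternating-projection scheme associated with the earlier JSMF work: project onto rank-K positive semidefinite matrices, then onto matrices whose entries sum to one, then onto non-negative matrices. This is not the ENN algorithm itself, which keeps the non-negativity correction sparse and avoids the dense matrix; `rectify_dense` and the toy data are assumptions for illustration.

```python
import numpy as np

def rectify_dense(C, k, n_iter=20):
    """Dense alternating projections toward a rank-k, joint-stochastic,
    non-negative matrix (illustrative; O(N^2) memory)."""
    N = C.shape[0]
    for _ in range(n_iter):
        # 1. nearest rank-k positive semidefinite matrix
        w, V = np.linalg.eigh(C)                  # eigenvalues in ascending order
        w, V = np.clip(w[-k:], 0.0, None), V[:, -k:]
        C = (V * w) @ V.T
        # 2. shift all entries equally so that they sum to one
        C = C + (1.0 - C.sum()) / N**2
        # 3. clip negative entries to zero
        C = np.maximum(C, 0.0)
    return C

rng = np.random.default_rng(0)
N, K = 30, 3
B = rng.dirichlet(np.ones(N), size=K).T           # (N, K), columns sum to one
A = np.eye(K) + 0.2
A /= A.sum()                                      # topic-topic matrix, sums to one
C_true = B @ A @ B.T                              # clean joint-stochastic matrix
Z = 1e-5 * rng.standard_normal((N, N))
C_noisy = C_true + (Z + Z.T) / 2                  # symmetric perturbation
C_rect = rectify_dense(C_noisy, K)
```

The eigendecomposition of the full N x N matrix in every iteration is exactly the cost that the low-rank-plus-sparse representation is designed to avoid.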

Diagnostics

anchortopics.diagnostics computes everything below from the fitted model's own attributes — no held-out data, human judgments, or extra passes over the corpus required:

from anchortopics import diagnostics

vocab = vectorizer.get_feature_names_out()
report = diagnostics.summary(model, vocabulary=vocab)
print(report["specificity"], report["dissimilarity"])
for c in report["stopword_candidates"]:
    print(c["word"], c["topic_entropy"])

  • stopword_candidates: ranks frequent words whose posterior over topics is close to uniform (high entropy) as candidates to add to a stoplist. On a real abstracts corpus this reliably surfaces academic boilerplate ("furthermore", "state-of-the-art", "leveraging", "address") rather than topical words.
  • relative_approximation: ‖C − BAB^T‖_F / ‖C‖_F against the original (unrectified) co-occurrence, estimated in O(NK) with a Hutchinson trace estimator, so the dense matrix is never materialized.
  • relative_recovery: how well the selected anchors reconstruct the rest of the normalized co-occurrence space.
  • relative_dominancy: how concentrated the topic-correlation matrix A is on its diagonal (independence) vs. off-diagonal (correlation).
  • specificity: average KL divergence of each topic from the corpus unigram distribution; low values flag generic, high-frequency-word-driven topics.
  • dissimilarity: fraction of each topic's top words that don't recur in any other topic's top words; low values flag redundant topics.
  • eigengap: relative gap between the K-th and (K+1)-th eigenvalues of the rectified spectrum; a shrinking gap as K grows is a sign you're past the number of topics the data actually supports.
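Three of the diagnostics above follow directly from their descriptions and can be sketched in a few lines of NumPy. The function names, the use of explicit topic weights p(topic), the direction of the KL divergence (topic relative to unigram), and the toy numbers are all assumptions; the package's own definitions may differ in detail.

```python
import numpy as np

def topic_entropy(components, topic_weights):
    """Entropy of p(topic | word) for each word, given p(word | topic)
    rows and p(topic) weights (Bayes' rule)."""
    joint = components * topic_weights[:, None]        # p(word, topic)
    post = joint / joint.sum(axis=0, keepdims=True)    # p(topic | word)
    logs = np.log(post, where=post > 0, out=np.zeros_like(post))
    return -(post * logs).sum(axis=0)

def specificity(components, unigram):
    """Mean over topics of KL(topic || unigram)."""
    ratio = np.where(components > 0, components / unigram, 1.0)
    return float((components * np.log(ratio)).sum(axis=1).mean())

def dissimilarity(components, n_words=2):
    """Per topic: fraction of its top words absent from every other
    topic's top words."""
    top = [set(np.argsort(row)[::-1][:n_words]) for row in components]
    scores = []
    for k, words in enumerate(top):
        others = set().union(*(t for j, t in enumerate(top) if j != k))
        scores.append(len(words - others) / n_words)
    return scores

# Toy model: two topics over four words; word 0 is shared equally.
components = np.array([[0.5, 0.4, 0.1, 0.0],
                       [0.5, 0.0, 0.1, 0.4]])
topic_weights = np.array([0.5, 0.5])
unigram = topic_weights @ components                   # corpus word distribution

entropy = topic_entropy(components, topic_weights)
spec = specificity(components, unigram)
dis = dissimilarity(components, n_words=2)
```

In this toy model, words 0 and 2 get the maximum entropy log 2 because both topics use them equally, while words 1 and 3 get zero entropy because each belongs to a single topic.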

These catch degenerate cases (redundant or generic topics, over-large K) rather than confirming topics are meaningful — they're a complement to, not a replacement for, reading the topics. (Not implemented: the paper's MST-Incoherence metric, which needs an NPMI graph over prominent + "characteristic" words per topic — a reasonable follow-up.)

Development

uv sync
uv run pytest
uv run python examples/synthetic_example.py
