Fast, low-memory spectral topic modeling via on-the-fly rectification

Maintainers

dmimno


Project description

anchortopics

Source code

A fast, low-memory Python implementation of On-the-Fly Rectification (OTFR) spectral topic modeling, as described in Moontae Lee, Sungjun Cho, Kun Dong, David Mimno, and David Bindel, On-the-Fly Rectification for Robust Large-Vocabulary Topic Inference (ICML 2021). It builds on the Joint-Stochastic Matrix Factorization / Anchor Word framework (Arora et al. 2013; Lee et al. 2015, JSMF / pyJSMF-RAW).

Spectral ("anchor word") topic models recover topics from low-order word co-occurrence statistics in one shot: no sampling, no EM, just an eigendecomposition and some linear algebra. The catch is that empirical co-occurrence matrices are noisy and indefinite, and they become dense once rectified, so the classical pipeline costs O(N^2) space and O(N^2 K) time for a vocabulary of size N and K topics. OTFR avoids ever forming the N x N matrix: it maintains the rectified co-occurrence as an implicit low-rank-plus-sparse operator (Y Y^T + E + r * 1 1^T) and recovers anchor words directly from the low-rank factor Y, bringing the cost down to O(N K) space and O(N K^2) time.
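To make the low-rank-plus-sparse idea concrete, here is a minimal sketch of a matrix-vector product with Y Y^T + E + r * 1 1^T that never forms the N x N matrix. The function name low_rank_plus_sparse, the toy sizes, and the choice of a symmetric E are assumptions made for illustration; this is not the package's internal code.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator


def low_rank_plus_sparse(Y, E, r):
    """Implicit operator for Y @ Y.T + E + r * ones((N, N)) (hypothetical helper)."""
    n = Y.shape[0]

    def matvec(v):
        v = np.asarray(v).ravel()
        # Y (Y^T v) costs O(N K), E v costs O(nnz(E)), r * 1 (1^T v) costs O(N).
        return Y @ (Y.T @ v) + E @ v + r * v.sum() * np.ones(n)

    # Assumes E is symmetric, so the transpose product can reuse matvec.
    return LinearOperator((n, n), matvec=matvec, rmatvec=matvec, dtype=np.float64)


rng = np.random.default_rng(0)
N, K = 500, 5
Y = rng.standard_normal((N, K))
E = sp.random(N, N, density=0.01, random_state=0, format="csr")
E = (E + E.T) / 2  # symmetrize the toy sparse correction
r = 1e-3

op = low_rank_plus_sparse(Y, E, r)
v = rng.standard_normal(N)

# Dense reference, built only to check the sketch on a toy problem.
dense = Y @ Y.T + E.toarray() + r * np.ones((N, N))
print(np.allclose(op.matvec(v), dense @ v))
```

Storage here is the N x K factor plus the nonzeros of E, which is where the O(N K) space figure comes from when E stays sparse.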

Install

uv add anchortopics   # or: pip install anchortopics

Quickstart

from sklearn.feature_extraction.text import CountVectorizer
from anchortopics import OTFR

# `documents` is assumed to be an iterable of raw text strings, one per document.
vectorizer = CountVectorizer(max_features=20000, stop_words="english")
X = vectorizer.fit_transform(documents)  # (n_documents, n_vocab) sparse counts

model = OTFR(n_topics=20).fit(X)

for k, words in enumerate(model.top_words(vectorizer.get_feature_names_out(), n_words=10)):
    print(k, words)

model.components_ is the (n_topics, n_vocab) word-topic distribution matrix (p(word | topic)), model.topic_correlation_ is the (n_topics, n_topics) topic-topic correlation matrix, and model.anchors_ holds the vocabulary indices selected as anchor words.

If you already have a co-occurrence matrix C (dense or sparse, N x N), use model.fit_cooccurrence(C) instead of fit(X).
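For reference, a co-occurrence matrix of this kind can be derived from document-word counts with the unbiased estimator of Arora et al., C = (1/M) * sum_d (x_d x_d^T - diag(x_d)) / (n_d (n_d - 1)), where x_d is the count vector of document d and n_d its length. The sketch below is an illustration under assumptions: the helper names are hypothetical, and whether fit_cooccurrence expects exactly this normalization is not stated here. It also shows how the same product C v can be computed from the sparse counts alone.

```python
import numpy as np
import scipy.sparse as sp


def cooccurrence_dense(X):
    """Dense N x N estimator; only sensible for small vocabularies."""
    X = np.asarray(X, dtype=np.float64)
    X = X[X.sum(axis=1) >= 2]            # documents with < 2 tokens have no word pairs
    n_d = X.sum(axis=1)                  # document lengths
    w = 1.0 / (n_d * (n_d - 1.0))        # per-document weights
    C = (X.T * w) @ X - np.diag(X.T @ w)
    return C / X.shape[0]


def cooccurrence_matvec(X, v):
    """Compute C @ v in O(nnz(X)) without forming C."""
    X = sp.csr_matrix(X, dtype=np.float64)
    n_d = np.asarray(X.sum(axis=1)).ravel()
    keep = n_d >= 2
    X, n_d = X[keep], n_d[keep]
    w = 1.0 / (n_d * (n_d - 1.0))
    return (X.T @ (w * (X @ v)) - (X.T @ w) * v) / X.shape[0]


rng = np.random.default_rng(0)
X = rng.poisson(0.5, size=(50, 30))      # toy counts: 50 documents, 30 words
C = cooccurrence_dense(X)
v = rng.standard_normal(30)

print(np.isclose(C.sum(), 1.0))                         # entries sum to one
print(np.allclose(cooccurrence_matvec(X, v), C @ v))    # implicit product matches
```

Each document contributes n_d (n_d - 1) ordered word pairs, so dividing by that count and by the number of documents makes the entries of C sum to one.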

For very large vocabularies/corpora, pass randomized_init=True to initialize the rectification with a one-pass randomized eigendecomposition computed directly from the word-document counts (Halko, Martinsson & Tropp 2011), rather than ARPACK.
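As background on the randomized approach, the following is a minimal sketch of the basic randomized eigendecomposition from Halko, Martinsson & Tropp, applied to a symmetric operator that is only available through matrix products. It uses the simple multi-pass scheme with a few power iterations, not the one-pass variant mentioned above, and the function name randomized_eigh and its parameters are hypothetical rather than part of anchortopics.

```python
import numpy as np


def randomized_eigh(matmat, n, k, oversample=10, n_iter=2, seed=0):
    """Approximate the top-k eigenpairs of a symmetric PSD operator.

    matmat(V) must return A @ V for an (n, m) block V.
    """
    rng = np.random.default_rng(seed)
    # Sample the range of A with a Gaussian test matrix, then orthonormalize.
    Q, _ = np.linalg.qr(matmat(rng.standard_normal((n, k + oversample))))
    for _ in range(n_iter):
        # Power iterations sharpen the captured subspace.
        Q, _ = np.linalg.qr(matmat(Q))
    # Project A onto the small subspace and solve the small eigenproblem.
    B = Q.T @ matmat(Q)
    B = (B + B.T) / 2
    evals, V = np.linalg.eigh(B)
    top = np.argsort(evals)[::-1][:k]
    return evals[top], Q @ V[:, top]


rng = np.random.default_rng(1)
M, N, K = 400, 120, 5
X = rng.random((M, K)) @ rng.random((K, N))   # toy rank-K document-word matrix


def gram(V):
    return X.T @ (X @ V)                      # applies X^T X without forming it


evals, U = randomized_eigh(gram, N, K)
exact = np.linalg.eigvalsh(X.T @ X)[::-1][:K]
print(np.allclose(evals, exact, rtol=1e-6))
```

Because the toy matrix has exact rank K, the sampled subspace contains the whole range and the eigenvalues match the dense computation; on real data the result is an approximation whose quality depends on the oversampling and the spectral decay.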

How it works

CooccurrenceOperator (anchortopics.cooccurrence) builds the unbiased joint-stochastic co-occurrence estimator of Arora et al. as an implicit linear operator over a sparse word-document count matrix, with O(nnz(X)) matrix-vector products — the dense matrix is never formed.
enn_rectify (anchortopics.enn) runs Epsilon Non-Negative (ENN) rectification: iteratively project toward the nearest joint-stochastic, rank-K, (epsilon-)non-negative matrix, representing the non-negativity correction as a sparse matrix E rather than a dense one.
law (anchortopics.law) runs the Low-rank Anchor Word algorithm: selects anchor words via column-pivoted QR performed in the K-dimensional compressed space (equivalent to pivoting on the full matrix, per Lemma 1 of the paper, but O(N K^2) instead of O(N^2 K)), then recovers the word-topic matrix B and topic correlation matrix A.
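The anchor-selection step can be illustrated with a small sketch of column-pivoted QR on an N x K factor. This is a simplified picture under assumptions: the helper select_anchors is hypothetical, the toy data are synthetic points that are convex combinations of K planted anchor rows, and the normalization used by the actual LAW algorithm is omitted.

```python
import numpy as np
from scipy.linalg import qr


def select_anchors(Y, k):
    """Pick k rows of the (N, K) matrix Y via column-pivoted QR on Y.T."""
    # Pivoting greedily picks the column with the largest residual norm,
    # i.e. the row of Y farthest from the span of the rows chosen so far.
    _, _, piv = qr(Y.T, mode="economic", pivoting=True)
    return piv[:k]


rng = np.random.default_rng(0)
N, K = 300, 4
anchors = rng.random((K, K))                          # planted anchor rows
weights = rng.dirichlet(np.ones(K), size=N - K)       # convex mixing weights
Y = np.vstack([anchors, weights @ anchors])           # rows 0..K-1 are the anchors

selected = select_anchors(Y, K)
print(sorted(selected.tolist()))
```

The QR factorization operates on a K x N matrix, so its cost is O(N K^2), in line with the complexity quoted for the compressed-space pivoting.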

anchortopics.model.OTFR wires these together behind a small, sklearn-style fit API.

Diagnostics

anchortopics.diagnostics computes everything below from the fitted model's own attributes — no held-out data, human judgments, or extra passes over the corpus required:

from anchortopics import diagnostics

vocab = vectorizer.get_feature_names_out()
report = diagnostics.summary(model, vocabulary=vocab)
print(report["specificity"], report["dissimilarity"])
for c in report["stopword_candidates"]:
    print(c["word"], c["topic_entropy"])

stopword_candidates: ranks frequent words whose posterior over topics is close to uniform (high entropy) as candidates to add to a stoplist. On a real abstracts corpus this reliably surfaces academic boilerplate ("furthermore", "state-of-the-art", "leveraging", "address") rather than topical words.
relative_approximation: ‖C − BAB^T‖_F / ‖C‖_F against the original (unrectified) co-occurrence, estimated in O(NK) via a Hutchinson trace estimator — never materializes the dense matrix.
relative_recovery: how well the selected anchors reconstruct the rest of the normalized co-occurrence space.
relative_dominancy: how concentrated the topic-correlation matrix A is on its diagonal (independence) vs. off-diagonal (correlation).
specificity: average KL divergence of each topic from the corpus unigram distribution; low values flag generic, high-frequency-word-driven topics.
dissimilarity: fraction of each topic's top words that don't recur in any other topic's top words; low values flag redundant topics.
eigengap: relative gap between the K-th and (K+1)-th eigenvalues of the rectified spectrum; a shrinking gap as K grows is a sign you're past the number of topics the data actually supports.
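The Hutchinson idea behind relative_approximation can be sketched as follows: for random +/-1 probe vectors z, the expectation of ‖R z‖^2 equals ‖R‖_F^2, so only products with R = C − BAB^T are needed. The code below is an illustration under assumptions, not the package's implementation: hutchinson_fro_norm is a hypothetical helper, and the toy C is stored densely only so the estimate can be compared with the exact value.

```python
import numpy as np


def hutchinson_fro_norm(matmat, n, n_probes=2000, seed=0):
    """Estimate ||R||_F from products R @ Z with random +/-1 probe columns."""
    rng = np.random.default_rng(seed)
    Z = rng.choice([-1.0, 1.0], size=(n, n_probes))
    RZ = matmat(Z)
    # Average squared column norm estimates trace(R^T R) = ||R||_F^2.
    return np.sqrt(np.mean(np.sum(RZ * RZ, axis=0)))


rng = np.random.default_rng(0)
N, K = 200, 4
B = rng.dirichlet(np.ones(N), size=K).T      # (N, K) word-topic matrix, columns sum to 1
A = rng.random((K, K))
A = (A + A.T) / 2
A /= A.sum()                                 # toy topic-topic matrix
noise = rng.standard_normal((N, N)) * 1e-5
C = B @ A @ B.T + (noise + noise.T) / 2      # toy co-occurrence with a small residual


def residual(Z):
    # Applies C - B A B^T to a block of vectors without forming B A B^T.
    return C @ Z - B @ (A @ (B.T @ Z))


estimate = hutchinson_fro_norm(residual, N) / hutchinson_fro_norm(lambda Z: C @ Z, N, seed=1)
exact = np.linalg.norm(C - B @ A @ B.T) / np.linalg.norm(C)
print(estimate, exact)
```

The estimate is stochastic: more probes reduce its variance, and matrices dominated by a few large eigenvalues need more probes than noise-like ones.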

These catch degenerate cases (redundant or generic topics, over-large K) rather than confirming topics are meaningful — they're a complement to, not a replacement for, reading the topics. (Not implemented: the paper's MST-Incoherence metric, which needs an NPMI graph over prominent + "characteristic" words per topic — a reasonable follow-up.)
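As an illustration of how two of these checks flag generic or redundant topics, here is a minimal sketch of specificity and dissimilarity computed from a topic-word matrix. The function names, the KL direction (topic relative to unigram), and the per-topic return value of dissimilarity are assumptions made for this example; the definitions in anchortopics.diagnostics may differ in detail.

```python
import numpy as np


def specificity(components, unigram, eps=1e-12):
    """Mean over topics of KL(p(word | topic) || unigram)."""
    P = np.clip(components, eps, None)
    q = np.clip(unigram, eps, None)
    return float(np.mean(np.sum(P * np.log(P / q), axis=1)))


def dissimilarity(components, n_top=2):
    """Per-topic fraction of top words absent from every other topic's top words."""
    top = [set(np.argsort(row)[::-1][:n_top].tolist()) for row in components]
    scores = []
    for k, words in enumerate(top):
        others = set().union(*(t for j, t in enumerate(top) if j != k))
        scores.append(len(words - others) / n_top)
    return np.array(scores)


# Toy (n_topics, n_vocab) matrix: topics 0 and 2 share their top words.
components = np.array([
    [0.50, 0.30, 0.10, 0.05, 0.03, 0.02],
    [0.02, 0.03, 0.05, 0.10, 0.30, 0.50],
    [0.45, 0.35, 0.08, 0.06, 0.04, 0.02],
])
unigram = components.mean(axis=0)            # stand-in for the corpus unigram distribution

print(specificity(components, unigram))
print(dissimilarity(components))             # topics 0 and 2 score 0, topic 1 scores 1
```

A topic identical to the unigram distribution has specificity zero, and two topics with the same top words drive each other's dissimilarity to zero, which is the kind of degenerate case these numbers are meant to surface.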

Development

uv sync
uv run pytest
uv run python examples/synthetic_example.py

Project details


Release history

This version

0.1.0

Jun 23, 2026

Download files


Source Distribution

anchortopics-0.1.0.tar.gz (13.9 kB view details)

Uploaded Jun 23, 2026 Source

Built Distribution


anchortopics-0.1.0-py3-none-any.whl (18.7 kB view details)

Uploaded Jun 23, 2026 Python 3

File details

Details for the file anchortopics-0.1.0.tar.gz.

File metadata

Download URL: anchortopics-0.1.0.tar.gz
Upload date: Jun 23, 2026
Size: 13.9 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for anchortopics-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`9d33932848ac09a8a0919660c606c0dae22ffeac688f0c71631faa3d9ccc1296`
MD5	`b6325b77e7467e32e54ae2fab24544f1`
BLAKE2b-256	`ae01cfa24c24c66589eb6bbd35abcc7437e27793b31c2379da2c7b97ea9501c3`


Provenance

The following attestation bundles were made for anchortopics-0.1.0.tar.gz:

Publisher: publish.yml on mimno/anchortopics

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: anchortopics-0.1.0.tar.gz
- Subject digest: 9d33932848ac09a8a0919660c606c0dae22ffeac688f0c71631faa3d9ccc1296
- Sigstore transparency entry: 1930012777
- Sigstore integration time: Jun 23, 2026
Source repository:
- Permalink: mimno/anchortopics@994fd2aff96c3bfaa621186653848ee9d6777929
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/mimno
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@994fd2aff96c3bfaa621186653848ee9d6777929
- Trigger Event: release

File details

Details for the file anchortopics-0.1.0-py3-none-any.whl.

File metadata

Download URL: anchortopics-0.1.0-py3-none-any.whl
Upload date: Jun 23, 2026
Size: 18.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for anchortopics-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2829186390d368ed2da599becd55e0e6ec103591d5c7e9e5e7e478e3806a3ad7`
MD5	`f46f2dcbb2e6ea8363a8a3843b666e05`
BLAKE2b-256	`7c8f1522f238ef5f55b1ab4ebfb68ed6792b00373032567427562ac5f08d2377`


Provenance

The following attestation bundles were made for anchortopics-0.1.0-py3-none-any.whl:

Publisher: publish.yml on mimno/anchortopics

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: anchortopics-0.1.0-py3-none-any.whl
- Subject digest: 2829186390d368ed2da599becd55e0e6ec103591d5c7e9e5e7e478e3806a3ad7
- Sigstore transparency entry: 1930012943
- Sigstore integration time: Jun 23, 2026
Source repository:
- Permalink: mimno/anchortopics@994fd2aff96c3bfaa621186653848ee9d6777929
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/mimno
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@994fd2aff96c3bfaa621186653848ee9d6777929
- Trigger Event: release
