Topica: fast, all-purpose topic modeling for Python — a Rust core for LDA, STM, and more

📖 Documentation: guides, a full API reference, worked examples, and a methodology track, "Publishing in a social science journal".

topica is a topic-modeling library with a Rust core and a numpy-native Python API. It covers a family of models (from classic LDA to the Structural Topic Model), fits them in native code (no JVM, no pure-Python inner loops), and keeps every fit deterministic for a given seed. It is built for working social scientists: each model is paired with the validation, covariate-effect, and reporting tools needed to meet the standards reviewers expect. Pass pre-tokenized documents as list[list[str]] (or a Corpus) and get back plain numpy arrays.

pip install topica            # once published; pre-built abi3 wheels, no Rust toolchain needed

from topica import LDA

model = LDA(num_topics=2, seed=42)
model.fit([["cat", "dog", "fish"]] * 15 + [["planet", "star", "moon"]] * 15, iterations=1000)

for i, words in enumerate(model.top_words(5)):
    print(f"Topic {i}:", " ".join(w for w, _ in words))

See the getting-started guide and the worked examples for end-to-end analyses.
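
The quick-start above passes documents that are already split into tokens. For raw strings, that list[list[str]] shape can be produced with a few lines of standard-library Python. This is only a sketch: `simple_tokenize` and the stopword set are hypothetical names made up for illustration, not topica's own `tokenize`, whose options are not shown here.

```python
import re

# Hypothetical helper for illustration; not part of topica.
_WORD = re.compile(r"[a-z]+(?:'[a-z]+)?")

def simple_tokenize(text, stopwords=frozenset()):
    """Lowercase a raw string and return its word tokens as a list[str]."""
    return [tok for tok in _WORD.findall(text.lower()) if tok not in stopwords]

raw_docs = [
    "The cat chased the dog.",
    "A star and a moon, seen from the planet.",
]
stop = frozenset({"the", "a", "and", "from"})

# docs has the list[list[str]] shape that the quick-start hands to fit().
docs = [simple_tokenize(text, stop) for text in raw_docs]
print(docs)
```

A real analysis would likely need more care (phrases, numbers, non-English text); the point here is only the shape of the input.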

Models

| Model | What it's for |
| --- | --- |
| LDA | Classic topics via fast collapsed-Gibbs (SparseLDA); optional multi-threaded and LightLDA alias samplers |
| DMR | Topics conditioned on document metadata (Dirichlet-multinomial regression) |
| LabeledLDA | Supervised topics tied to document labels |
| CTM | Correlated topics (logistic-normal) |
| STM | The Structural Topic Model: correlated topics with prevalence and content covariates |
| SAGE | Content-covariate topics: the same topic worded differently across groups |
| HDP | Nonparametric LDA that infers the number of topics |
| DTM | Dynamic topics that evolve across time slices |
| SupervisedLDA | Topics shaped to predict a per-document response |
| PT / GSDMM | Short-text models for tweets, survey answers, headlines |
| SeededLDA / KeyATM | Guided topics steered by seed words |
| PA / HLDA | Topic hierarchies (Pachinko, nested-CRP) |
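
For readers new to collapsed Gibbs sampling, the idea behind the LDA row can be sketched in pure Python. This is the plain textbook sampler, not SparseLDA and not topica's implementation; the function name `gibbs_lda` and the hyperparameter defaults are assumptions chosen for illustration. It also shows one way a fit can be made reproducible: all randomness comes from a single seeded generator, and the vocabulary is sorted so that word ids do not depend on set ordering.

```python
import random

def gibbs_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Textbook collapsed-Gibbs LDA; returns (doc_topic_counts, topic_word_counts, vocab)."""
    rng = random.Random(seed)                      # single seeded source of randomness
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    ndk = [[0] * num_topics for _ in docs]         # tokens in doc d assigned to topic k
    nkw = [[0] * V for _ in range(num_topics)]     # times word w is assigned to topic k
    nk = [0] * num_topics                          # total tokens assigned to topic k
    z = []                                         # topic assignment of every token

    # Random initial assignment.
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:
            k = rng.randrange(num_topics)
            zd.append(k)
            ndk[d][k] += 1
            nkw[k][word_id[w]] += 1
            nk[k] += 1
        z.append(zd)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid, k = word_id[w], z[d][i]
                # Remove the token's current assignment from the counts.
                ndk[d][k] -= 1
                nkw[k][wid] -= 1
                nk[k] -= 1
                # Conditional p(z = t | all other assignments), up to a constant.
                weights = [
                    (ndk[d][t] + alpha) * (nkw[t][wid] + beta) / (nk[t] + V * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1
                nkw[k][wid] += 1
                nk[k] += 1
    return ndk, nkw, vocab

docs = [["cat", "dog", "fish"]] * 15 + [["planet", "star", "moon"]] * 15
run_a = gibbs_lda(docs, num_topics=2, seed=42)
run_b = gibbs_lda(docs, num_topics=2, seed=42)
print(run_a == run_b)   # True: the same seed reproduces the same counts
```

The per-token loop above costs time proportional to the number of topics; schemes such as SparseLDA exist to avoid that cost.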

Every model exposes the same interface: fit(docs, …), then topic_word (φ), doc_topic (θ), top_words(n), transform(new_docs), and save/load. The variational models (CTM/STM/SupervisedLDA/DTM) parallelize across cores while staying bit-for-bit deterministic. For details, see the models guide.
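
To make the relationship between topic_word and top_words concrete, here is a small dependency-free sketch. It assumes topic_word is a topics × vocabulary matrix of probabilities and that a vocabulary list maps columns to words; `top_words_from_phi` is a hypothetical helper and the φ values are made up for illustration. The real arrays are numpy, but plain lists are enough to show the idea.

```python
def top_words_from_phi(phi, vocab, n):
    """For each topic row of phi, return the n highest-probability (word, prob) pairs."""
    result = []
    for row in phi:
        # Column indices sorted by probability, highest first.
        ranked = sorted(range(len(vocab)), key=lambda j: row[j], reverse=True)
        result.append([(vocab[j], row[j]) for j in ranked[:n]])
    return result

vocab = ["cat", "dog", "fish", "moon", "planet", "star"]
phi = [  # toy values; each row sums to 1
    [0.40, 0.30, 0.25, 0.02, 0.02, 0.01],
    [0.01, 0.02, 0.02, 0.30, 0.25, 0.40],
]

top = top_words_from_phi(phi, vocab, 2)
for i, words in enumerate(top):
    print(f"Topic {i}:", " ".join(w for w, _ in words))
# Topic 0: cat dog
# Topic 1: star moon
```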

Diagnostics & analysis

The diagnostics are model-agnostic: they work on any fitted model's topic_word and doc_topic arrays:

  • Quality: coherence (u_mass, c_v, c_uci, c_npmi; computed in the Rust core), exclusivity, topic_diversity, quality_frontier
  • Labeling: label_topics (prob / FREX / lift / score), frex, relevance, find_thoughts, topic_table, summary
  • Validation: word_intrusion, document_intrusion, bootstrap_stability, search_k
  • Comparison: fighting_words (weighted log-odds) for contrasting corpora
  • stm toolkit: estimate_effect (method of composition, cluster-robust SEs, GLM links), posterior_theta_samples, spline, interaction, one_hot, topic_correlation
  • Preprocessing: tokenize, learn_phrases / apply_phrases, split_documents, the Corpus class
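
As a rough illustration of what a coherence score measures, the UMass measure listed above can be sketched from document co-occurrence counts. The sketch follows the summed form from Mimno et al. (2011); implementations differ in smoothing and normalization, so topica's `coherence` may not return the same numbers. The name `u_mass` and its signature here are assumptions, and the function expects every top word to occur in at least one document.

```python
import math

def u_mass(top_words, docs):
    """UMass coherence: sum over ordered word pairs of log((D(w_m, w_l) + 1) / D(w_l))."""
    doc_sets = [set(doc) for doc in docs]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for s in doc_sets if all(w in s for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            joint = doc_freq(top_words[m], top_words[l])
            score += math.log((joint + 1) / doc_freq(top_words[l]))
    return score

docs = [["cat", "dog", "fish"]] * 3 + [["planet", "star", "moon"]] * 3

coherent = u_mass(["cat", "dog", "fish"], docs)   # words that always co-occur
mixed = u_mass(["cat", "star"], docs)             # words that never co-occur
print(coherent > mixed)   # True
```

Higher (less negative) values indicate top words that tend to appear in the same documents.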

See diagnostics and covariate effects.
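
The weighted log-odds comparison mentioned under Comparison can be sketched as follows. This is a minimal version of the Monroe, Colaresi & Quinn (2008) z-score with a small uniform prior; the function name `log_odds_z`, the `prior` parameter, and the toy corpora are assumptions for illustration, and topica's `fighting_words` may use a different prior or signature. The sketch expects a combined vocabulary of more than one word.

```python
import math
from collections import Counter

def log_odds_z(corpus_a, corpus_b, prior=0.01):
    """z-scored log-odds ratio per word; positive values lean toward corpus_a."""
    counts_a = Counter(w for doc in corpus_a for w in doc)
    counts_b = Counter(w for doc in corpus_b for w in doc)
    vocab = sorted(set(counts_a) | set(counts_b))
    n_a, n_b = sum(counts_a.values()), sum(counts_b.values())
    a0 = prior * len(vocab)                    # total prior mass
    scores = {}
    for w in vocab:
        ya = counts_a[w] + prior               # smoothed count in corpus A
        yb = counts_b[w] + prior               # smoothed count in corpus B
        delta = math.log(ya / (n_a + a0 - ya)) - math.log(yb / (n_b + a0 - yb))
        variance = 1 / ya + 1 / yb             # approximate variance of delta
        scores[w] = delta / math.sqrt(variance)
    return scores

corpus_a = [["tax", "cut", "jobs"], ["tax", "jobs"]]
corpus_b = [["tax", "health", "care"], ["health", "care"]]

scores = log_odds_z(corpus_a, corpus_b)
for word, z in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word:8s} {z:+.3f}")
```

Dividing by the standard deviation is what keeps rare words from dominating the ranking, which is the main reason to prefer this over a raw log-odds ratio.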

Install from source

pip install maturin
git clone https://github.com/nealcaren/topica && cd topica
python -m venv .venv && source .venv/bin/activate
maturin develop --release --features python

Requires numpy >= 1.21. Use --release (the debug build is much slower).

Acknowledgements

Topica stands on a generation of open topic-modeling research and code. The LDA core binds David Mimno's RustMallet and reproduces MALLET's train output bit-for-bit; the other models are Rust ports or reimplementations, validated against their reference implementations:

  • MALLET (McCallum): SparseLDA, DMR, hyperparameter optimization
  • stm (Roberts, Stewart & Tingley): the Structural Topic Model, estimateEffect, searchK, FREX, spectral initialization, method of composition
  • lda-c / ctm-c / dtm and hdp (Blei lab): the CTM, Dynamic Topic Model, and HDP samplers
  • gensim: coherence measures and the LdaSeqModel DTM reference
  • tomotopy (bab2min): API conventions (summary, short-text models)
  • keyATM (Eshima, Imai & Sasaki): keyword-assisted topic models
  • seededlda (Watanabe): seeded LDA
  • LightLDA (Yuan et al.): the alias-table Metropolis-Hastings sampler
  • GSDMM (Yin & Wang 2014): the movie-group-process mixture for short text

Underlying methods are credited to their authors in the documentation and the source. The SparseLDA scheme is Yao, Mimno & McCallum (KDD 2009).

License

Apache-2.0. Builds on RustMallet (Apache-2.0).
