Fast LDA topic modeling — Python bindings for RustMallet

pyrmallet

Python bindings for RustMallet — a fast Rust implementation of the sparse Gibbs sampling LDA algorithm from MALLET, following the SparseLDA scheme of Yao, Mimno and McCallum (KDD 2009).
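For readers unfamiliar with SparseLDA, the core idea is that the collapsed Gibbs sampling weight for each topic splits into three "buckets", two of which are sparse. The NumPy sketch below only checks that identity on made-up counts. It is not RustMallet's code, and every variable name in it is chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

num_topics, vocab_size = 4, 6
alpha = np.full(num_topics, 0.5)   # per-topic document-topic prior
beta = 0.01                        # per-word topic-word prior

# Toy count statistics, made up for illustration.
doc_topic = np.array([3, 0, 1, 0])                              # tokens per topic in this document
topic_word = rng.integers(0, 5, size=(num_topics, vocab_size))  # tokens per (topic, word)
topic_total = topic_word.sum(axis=1)                            # tokens per topic

w = 2  # word type being resampled
denom = beta * vocab_size + topic_total

# Standard collapsed Gibbs weight for each topic.
full = (alpha + doc_topic) * (beta + topic_word[:, w]) / denom

# SparseLDA buckets: smoothing-only, document-topic, topic-word.
s = alpha * beta / denom
r = doc_topic * beta / denom
q = (alpha + doc_topic) * topic_word[:, w] / denom

print(np.allclose(full, s + r + q))
```

The `r` bucket is nonzero only for topics already present in the document, and `q` only for topics in which the word has been seen, which is where the speedup over a dense sampler comes from.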

Built with PyO3 and maturin. The package has two layers: a sklearn-compatible LatentDirichletAllocation class and a lower-level _rust_mallet extension module.

Install

pip install pyrmallet

sklearn-compatible API

LatentDirichletAllocation follows the scikit-learn estimator interface. It takes a list of raw text strings — tokenization and vocabulary building happen inside Rust.

from pyrmallet import LatentDirichletAllocation

docs = ["the quick brown fox ...", "machine learning models ...", ...]

lda = LatentDirichletAllocation(n_components=20, max_iter=1000)
lda.fit(docs)

lda.components_               # ndarray [n_topics, n_vocab], rows sum to ~1
lda.doc_topic_distributions_  # ndarray [n_docs, n_topics]
lda.feature_names_in_         # vocabulary array
lda.n_features_in_            # vocabulary size

fit_transform() is also available and returns doc_topic_distributions_ directly.
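As an illustration of how the fitted attributes fit together, the sketch below lists the highest-weight words per topic. It assumes only what is stated above: components_ has one row per topic and its columns line up with feature_names_in_. The helper name top_words_per_topic is hypothetical, and the arrays are stand-ins so the snippet runs without a trained model.

```python
import numpy as np

def top_words_per_topic(components, feature_names, n=10):
    """Return the n highest-weight words for each topic row (hypothetical helper)."""
    components = np.asarray(components)
    feature_names = np.asarray(feature_names)
    # Sort each row ascending, reverse to descending, keep the first n columns.
    order = np.argsort(components, axis=1)[:, ::-1][:, :n]
    return [feature_names[row].tolist() for row in order]

# Stand-in arrays with the documented shapes; with a fitted model, pass
# lda.components_ and lda.feature_names_in_ instead.
components = np.array([
    [0.70, 0.20, 0.05, 0.05],
    [0.05, 0.05, 0.30, 0.60],
])
vocab = np.array(["fox", "quick", "learning", "models"])

for k, words in enumerate(top_words_per_topic(components, vocab, n=2)):
    print(k, words)
```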

Inferring topic distributions for new documents

After fit(), call transform() with any list of raw text strings. Tokens not seen during training are silently ignored.

new_docs = ["natural language processing tasks ...", "deep reinforcement learning ..."]
theta = lda.transform(new_docs)  # ndarray [n_new_docs, n_topics]

The number of Gibbs iterations used for inference is controlled by n_inference_iter (default 50).
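To give a feel for what those iterations do, here is a minimal NumPy sketch of Gibbs inference for a single document with the topic-word distributions held fixed. It is a sketch under stated assumptions, not pyrmallet's Rust implementation: the function infer_theta, its arguments, and the toy phi and alpha values are all hypothetical.

```python
import numpy as np

def infer_theta(token_ids, phi, alpha, n_iter=50, seed=0):
    """Estimate one document's topic distribution with phi held fixed.

    token_ids: word indices into the training vocabulary
    phi:       [n_topics, n_vocab] topic-word probabilities
    alpha:     [n_topics] document-topic prior
    """
    rng = np.random.default_rng(seed)
    n_topics = phi.shape[0]
    z = rng.integers(n_topics, size=len(token_ids))          # random initial assignments
    counts = np.bincount(z, minlength=n_topics).astype(float)
    for _ in range(n_iter):
        for i, w in enumerate(token_ids):
            counts[z[i]] -= 1                                # remove the token's current topic
            p = (counts + alpha) * phi[:, w]                 # unnormalized topic weights
            z[i] = rng.choice(n_topics, p=p / p.sum())       # resample its topic
            counts[z[i]] += 1
    theta = counts + alpha
    return theta / theta.sum()

# Toy model: topic 0 favors words 0 and 1, topic 1 favors words 2 and 3.
phi = np.array([
    [0.45, 0.45, 0.05, 0.05],
    [0.05, 0.05, 0.45, 0.45],
])
alpha = np.full(2, 0.1)
theta = infer_theta([0, 1, 0, 0, 1, 0, 1, 0], phi, alpha, n_iter=50)
print(theta)
```

More iterations give the sampler more time to settle at the cost of slower transform() calls, which is the trade-off n_inference_iter exposes.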

Constructor parameters

| Parameter | Default | Description |
|---|---|---|
| n_components | 10 | Number of topics |
| max_iter | 1000 | Gibbs sampling iterations |
| burn_in | 200 | Iterations before hyperparameter optimization |
| optimize_interval | 50 | Optimize alpha/beta every N iterations; 0 to disable |
| num_samples | 5 | Samples averaged for final estimates |
| sample_interval | 25 | Iterations between samples |
| doc_topic_prior | n_components | Initial symmetric alpha sum |
| topic_word_prior | 0.01 | Initial beta per word |
| random_state | 42 | Random seed |
| n_inference_iter | 50 | Gibbs iterations per document during transform() |
| stopwords | None | List of words to exclude, or path to a stoplist file |
| min_doc_freq | 1 | Drop words appearing in fewer than N documents |
| max_doc_fraction | 1.0 | Drop words appearing in more than this fraction of documents |
| verbose | False | Print log-likelihood progress during training |
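The two document-frequency parameters can be pictured with a small pure-Python sketch. It assumes whitespace tokenization purely for illustration (the real tokenizer lives in Rust and may differ), and the helper filter_vocabulary is hypothetical rather than part of pyrmallet.

```python
from collections import Counter

def filter_vocabulary(tokenized_docs, min_doc_freq=1, max_doc_fraction=1.0):
    """Keep words whose document frequency lies within the given bounds."""
    n_docs = len(tokenized_docs)
    # Count each word once per document it appears in.
    doc_freq = Counter(word for doc in tokenized_docs for word in set(doc))
    return sorted(
        word for word, df in doc_freq.items()
        if df >= min_doc_freq and df / n_docs <= max_doc_fraction
    )

docs = [
    "the quick brown fox".split(),
    "the lazy dog".split(),
    "the quick dog".split(),
]
kept = filter_vocabulary(docs, min_doc_freq=2, max_doc_fraction=0.9)
print(kept)  # ['dog', 'quick']
```

Here "the" is dropped because it appears in every document, while "brown", "fox" and "lazy" are dropped because each appears in only one.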

Low-level API

pyrmallet._rust_mallet exposes Corpus and TopicModel objects directly.

from pyrmallet import _rust_mallet as rm

# Build a corpus directly from strings (no file I/O)
stopwords = rm.load_stopwords("examples/english-stoplist.txt")
corpus = rm.Corpus.from_strings(
    docs,
    stopwords=stopwords,
    min_doc_freq=2,
)

# Or load from a file
corpus = rm.Corpus.from_text_file("docs.txt", stopwords=stopwords)
corpus = rm.Corpus.from_tsv_file(
    "docs.tsv", id_column=0, text_column=1,
    stopwords=stopwords,
)

# Save/load a preprocessed corpus
corpus.save("corpus.corp")
corpus = rm.Corpus.load("corpus.corp")

# Train
model = rm.train(corpus, num_topics=20, iterations=1000, verbose=True)

# Inspect results
model.top_words(n=10)       # List[List[str]], one word list per topic
model.topic_word_matrix()   # List[List[float]], shape [num_topics][num_types]
model.doc_topic_matrix()    # List[List[float]], shape [num_docs][num_topics]
model.log_likelihood(corpus)

# Infer topic distributions for new raw-text documents (fixed-phi Gibbs)
theta = model.infer_strings(new_docs, n_iter=50)  # List[List[float]], shape [n_docs][num_topics]

# Or infer from a pre-built count matrix (columns indexed by training vocabulary)
theta = model.infer(count_matrix, n_iter=50)      # List[List[float]]
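The count matrix passed to infer() needs its columns in training-vocabulary order. One way to build such a matrix from pre-tokenized documents is sketched below. The helper build_count_matrix and the stand-in vocabulary are hypothetical, and whether infer() expects a NumPy array or nested lists is not stated here, so convert with .tolist() if needed.

```python
import numpy as np

def build_count_matrix(tokenized_docs, vocabulary):
    """Count tokens per document, with columns ordered like `vocabulary`."""
    index = {word: j for j, word in enumerate(vocabulary)}
    counts = np.zeros((len(tokenized_docs), len(vocabulary)), dtype=np.int64)
    for i, doc in enumerate(tokenized_docs):
        for token in doc:
            j = index.get(token)
            if j is not None:  # tokens outside the training vocabulary are skipped
                counts[i, j] += 1
    return counts

vocabulary = ["fox", "learning", "models", "quick"]  # stand-in for the training vocabulary
new_docs = [
    "quick quick fox".split(),
    "deep learning models".split(),
]
count_matrix = build_count_matrix(new_docs, vocabulary)
print(count_matrix.tolist())  # [[1, 0, 0, 2], [0, 1, 1, 0]]
```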

Building from source

Building requires uv and a Rust toolchain. Run this from the repo root:

PATH="$HOME/.cargo/bin:$PATH" uv run --with maturin maturin develop

See the RustMallet README for the full project, including the standalone CLI tools.

Download files

Source distribution

pyrmallet-0.1.1.tar.gz (2.3 MB)

Built distributions

pyrmallet-0.1.1-cp39-abi3-win_amd64.whl (836.3 kB): CPython 3.9+, Windows x86-64

pyrmallet-0.1.1-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB): CPython 3.9+, manylinux (glibc 2.17+) x86-64

pyrmallet-0.1.1-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.1 MB): CPython 3.9+, manylinux (glibc 2.17+) ARM64

pyrmallet-0.1.1-cp39-abi3-macosx_11_0_arm64.whl (937.3 kB): CPython 3.9+, macOS 11.0+ ARM64

pyrmallet-0.1.1-cp39-abi3-macosx_10_12_x86_64.whl (973.8 kB): CPython 3.9+, macOS 10.12+ x86-64

Provenance

The following attestation bundles were made for pyrmallet-0.1.1.tar.gz:

Publisher: pypi.yml on mimno/RustMallet
