A modular topic modeling toolkit with classical and neural models.

topicnova (import: topicnova)

topicnova is both the PyPI distribution name and the Python import name.

topicnova provides:

  • Classical and embedding-based models: LDA, BERTopic
  • Neural models: configurable VAE and SCHOLAR variants
  • Data utilities: dataset loading, tokenization, dataloaders
  • Evaluation utilities: coherence and topic diversity

Install

From PyPI (once published)

pip install topicnova

Local development with uv

uv sync --group dev
uv run python -m spacy download en_core_web_sm

Local-only mode (no wandb)

Set wandb=False (default). In this mode:

  • no wandb session is initialized
  • no network logging is attempted
  • all artifacts stay under your local exp_path

Quickstart

from topicnova.run import run

model, summary = run(
    wandb=False,
    project_name="topic-models",
    wandb_path="./runs",
    dataset_name="20ng",
    data_path="./datasets",
    remove_labels=False,
    tpl=1,
    min_df=30,
    max_df=0.85,
    exp_path="./runs/exp1",
    device="cpu",
    model_name="lda",
    num_topics=20,
    sentence_transformer_name="all-MiniLM-L6-v2",
    doc_emb_dim=384,
    eps=1e-8,
    beta=2.0,
    alpha=0.01,
    batch_size=64,
    lr=1e-3,
    epochs=10,
    random_state=0,
    saved_data=None,
)
print(summary)

Auto mode (minimal input)

You can call run() with only dataset information. The library will:

  • infer preprocessing thresholds (min_df, max_df) from dataset stats
  • default to VAE-ECRTM (vae-...-ecrtm-lin-dir_rsvi-etm) and include labels/authors conditioning when metadata is available
  • choose defaults for num_topics, epochs, batch_size, lr
  • select best device automatically (cuda:0 > mps > cpu) when device is omitted or set to "auto"

from topicnova.run import run

model, summary = run(
    dataset_name="20ng",
    data_path="./datasets",
    exp_path="./runs/auto_20ng",
)
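The cuda:0 > mps > cpu preference order described above can be sketched in plain Python. This helper is illustrative only; the name pick_device and its available argument are not part of the topicnova API:

```python
def pick_device(available):
    """Return the preferred device string, following the
    cuda:0 > mps > cpu order used by auto mode.

    `available` is a set of device strings detected on the machine;
    this is a sketch, not the library's internal implementation.
    """
    for dev in ("cuda:0", "mps", "cpu"):
        if dev in available:
            return dev
    return "cpu"  # safe fallback when nothing was detected
```

For example, pick_device({"mps", "cpu"}) yields "mps" on an Apple Silicon machine without CUDA.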

Config-based runs

uv run python experiments/run_from_template.py config/template.yaml

Important config keys:

  • dataset_name: e.g. 20ng, ag_news, dbpedia, self, arxiv_cs
  • data_path: directory where cached/loaded datasets are stored
  • model_name: model selector (examples below)
  • num_topics: set to null to infer from labels where applicable
  • Performance:
    • amp, compile_model, compile_mode
    • num_workers, pin_memory, persistent_workers, prefetch_factor
    • matmul_precision, cudnn_benchmark
    • early_stopping, early_stopping_patience, early_stopping_min_delta
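Putting the keys above together, a template might look like the following. This fragment is a sketch using only the keys listed in this README; consult config/template.yaml for the authoritative schema and defaults:

```yaml
# Illustrative config sketch -- values are examples, not recommendations.
dataset_name: 20ng
data_path: ./datasets
exp_path: ./runs/exp1
model_name: vae-lin-dir_rsvi-etm
num_topics: null          # null infers from labels where applicable

# Performance
amp: true
compile_model: false
num_workers: 4
pin_memory: true
persistent_workers: true
early_stopping: true
early_stopping_patience: 5
```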

Hyperparameter tuning (random search)

uv run python experiments/tune_random_search.py config/template.yaml --trials 12 --metric auto

Outputs:

  • per-trial logs in <exp_path>/trial_*
  • consolidated results in <exp_path>/tuning_results.json
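A small helper can pull the best trial out of tuning_results.json after a search finishes. The schema assumed here (a JSON list of per-trial dicts with a numeric "metric" field) is an illustration, not a documented format; adapt the field names to what your file actually contains:

```python
import json
from pathlib import Path

def best_trial(results_path, maximize=True):
    """Return the best trial record from a tuning_results.json file.

    Assumes the file holds a JSON list of per-trial dicts, each with a
    numeric "metric" entry (field names are assumptions, not the
    library's documented schema).
    """
    trials = json.loads(Path(results_path).read_text())
    pick = max if maximize else min
    return pick(trials, key=lambda t: t["metric"])
```

Pass maximize=False for metrics where lower is better.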

Interactive topic visualization

Option A: generate during training

Set visualize: true in config (or pass visualize=True to run). This writes:

  • <exp_path>/topics_interactive.html

If model outputs include authors/labels, the graph links:

  • topic -> words
  • topic -> authors (author-aware models)
  • topic -> labels (SCHOLAR)
  • label -> authors (when both are available)

Clicking a topic node opens a details panel with:

  • top words
  • linked label/authors
  • per-topic quantitative metrics (c_v, c_npmi) when available

Option B: load an existing experiment

from topicnova import visualize_experiment

fig = visualize_experiment("./runs/exp1", notebook=True)

Supported model names

  • LDA: lda
  • BERTopic: bertopic
  • VAE family: vae-<flags>-<encoder>-<sampler>-<decoder>
  • SCHOLAR family: scholar-<flags>-lin-<sampler>-<decoder>

Common tokens:

  • Flags: labels, authors, ecrtm (optional)
  • Encoder: lin, context, llm (VAE); lin (SCHOLAR)
  • Sampler: dir_pathwise, dir_rsvi
  • Decoder: lin, etm

Examples:

  • vae-lin-dir_pathwise-etm
  • vae-context-dir_rsvi-lin
  • scholar-labels-lin-dir_rsvi-lin
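The naming grammar above can be checked mechanically. The parser below is a sketch built solely from the token lists in this README; it is not part of the topicnova API:

```python
# Token sets copied from the lists above.
FLAGS = {"labels", "authors", "ecrtm"}
ENCODERS = {"lin", "context", "llm"}
SAMPLERS = {"dir_pathwise", "dir_rsvi"}
DECODERS = {"lin", "etm"}

def parse_model_name(name):
    """Split a vae-/scholar- style model name into its components."""
    parts = name.split("-")
    family, rest = parts[0], parts[1:]
    if family not in {"vae", "scholar"}:
        raise ValueError(f"unknown family: {family}")
    # Optional flags come right after the family.
    flags = []
    while rest and rest[0] in FLAGS:
        flags.append(rest.pop(0))
    if len(rest) != 3:
        raise ValueError(f"expected encoder-sampler-decoder, got: {rest}")
    encoder, sampler, decoder = rest
    if encoder not in ENCODERS or sampler not in SAMPLERS or decoder not in DECODERS:
        raise ValueError(f"unrecognized token in: {name}")
    return {"family": family, "flags": flags,
            "encoder": encoder, "sampler": sampler, "decoder": decoder}
```

For instance, parse_model_name("scholar-labels-lin-dir_rsvi-lin") reports the labels flag, a lin encoder, a dir_rsvi sampler, and a lin decoder.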

Custom datasets (dataset_name: self)

Place files in data_path:

  • train.csv
  • val.csv
  • test.csv

Required columns:

  • text
  • label (optional, list-like string)

Optional:

  • author
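A minimal self dataset can be produced with the standard library. The write_split helper and the sample rows below are illustrative; only the file names and column names come from this README:

```python
import csv
from pathlib import Path

def write_split(path, rows):
    """Write one split (train/val/test) with the columns listed above.

    "label" is stored as a list-like string, and "author" is optional;
    the sample rows here are made up for illustration.
    """
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["text", "label", "author"])
        writer.writeheader()
        writer.writerows(rows)

data_dir = Path("./datasets")  # matches data_path in the examples above
data_dir.mkdir(parents=True, exist_ok=True)
write_split(data_dir / "train.csv", [
    {"text": "Quantum computing basics.", "label": "['science']", "author": "alice"},
    {"text": "Playoff results and trades.", "label": "['sports']", "author": "bob"},
])
```

Repeat for val.csv and test.csv, then set dataset_name: self and point data_path at the directory.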

Development

uv sync --group dev
uv run pytest -q
uv run ruff check .

Build and publish

uv sync --group dev
uv run python -m build
uv run twine check dist/*
uv run twine upload dist/*

Recommended first release command sequence:

uv sync --group dev
uv run pytest -q
rm -rf dist
uv run python -m build
uv run twine check dist/*
uv run twine upload dist/topicnova-*.whl dist/topicnova-*.tar.gz

Note: as of February 9, 2026, the name tomo is already taken on PyPI, which is why this project publishes as topicnova.

See RELEASING.md for a release checklist.
