SeededPF is a seed guided topic model based on Poisson factorization.
SeededPF
What is seededPF
seededPF is an easy-to-use implementation of the Seeded Poisson Factorization (SPF) topic model, introduced in the research paper listed under Citing. SPF is a guided topic modeling approach that lets users pre-specify topics of interest by providing sets of seed words. It is built on Poisson factorization and uses variational inference for efficient and scalable estimation.
Traditional unsupervised topic models often struggle to align with predefined conceptual domains and typically require significant post-processing efforts, such as topic merging or manual labeling, to ensure topic coherence. seededPF overcomes this limitation by enabling the pre-specification of topics, which leads to improved topic interpretability and reduces the need for manual post-processing. Additionally, it supports the estimation of unsupervised topics when no seed words are provided.
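To make the idea of seeding concrete, here is a minimal sketch of how a Poisson factorization rate is formed and how seed words could shift it. It assumes that seed words receive an additive boost on top of a neutral topic-term intensity, which is our reading of the SPF model rather than a copy of the library's internals. All names (`beta_base`, `seed_boost`, `theta_doc`) and all numbers are made up for illustration.

```python
# Toy illustration of a seeded Poisson factorization rate (all numbers are made up).
vocab = ["phone", "battery", "laptop", "keyboard"]
topics = ["smartphone", "pc"]
seeds = {"smartphone": {"phone"}, "pc": {"laptop", "keyboard"}}

# Hypothetical neutral topic-term intensities, identical for every topic and term.
beta_base = {k: {v: 0.5 for v in vocab} for k in topics}

# Assumption: a term that is a seed word of a topic gets an extra additive intensity.
seed_boost = 2.0
beta = {
    k: {v: beta_base[k][v] + (seed_boost if v in seeds[k] else 0.0) for v in vocab}
    for k in topics
}

# Hypothetical document-topic intensities for one smartphone-heavy document.
theta_doc = {"smartphone": 1.5, "pc": 0.1}

# Poisson rate of each term count: sum over topics of theta_dk * beta_kv.
rates = {v: sum(theta_doc[k] * beta[k][v] for k in topics) for v in vocab}

for term, rate in rates.items():
    print(f"{term:10s} expected count = {rate:.2f}")
```

In this toy setup the seed word of the dominant topic ends up with the highest expected count, which is the mechanism that ties a topic to its seed words.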
Consider using seededPF if:
- You need to fit a topic model with a specific topic schema.
- You wish to estimate a topic model that is partially or fully unsupervised (i.e., providing no seed words means fitting a standard Poisson factorization topic model without predefined topics).
- You require a fast and scalable topic modeling solution.
seededPF offers a high-performance, scalable interface for guided topic modeling, providing a reliable alternative to keyATM and SeededLDA, while minimizing the need for manual intervention and enhancing topic interpretability.
Installation
seededPF works with Python 3.10 or Python 3.11. The main dependencies are TensorFlow 2.18 and TensorFlow Probability (tensorflow_probability) 0.25.
If you want to use GPU acceleration, adjust these dependencies to match your GPU setup.
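The version requirements above can be checked before installing. The following sketch uses only the standard library; the helper names `is_supported_python` and `installed_version` are hypothetical and not part of seededPF.

```python
import sys
from importlib import metadata

# Interpreter versions named in the installation notes above.
SUPPORTED_PYTHON = {(3, 10), (3, 11)}


def is_supported_python(version=None):
    """Return True if the (major, minor) part of `version` is supported."""
    major, minor = (version or sys.version_info)[:2]
    return (major, minor) in SUPPORTED_PYTHON


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


print("Python supported:", is_supported_python())
for package in ("tensorflow", "tensorflow-probability"):
    print(package, "->", installed_version(package) or "not installed")
```

Running this inside the target environment shows at a glance whether the interpreter and the two main dependencies match the versions listed above.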
Via pip
The easiest way to install seededPF is via pip.
pip install seededpf
From source
One can also install the package from GitHub. Configure a virtual environment using Python 3.10 or Python 3.11. Inside the virtual environment, use pip to install the required packages:
(venv)$ pip install -r requirements.txt
Training the Seeded Poisson Factorization model
seededPF is an easy-to-use library for topic modeling. We quickly walk through the most essential steps below:
- Imports and data preparation
- Initialization
- Reading documents
- Training the model
- Post-hoc analysis
The following minimal example is available on GitHub.
Step 1: Imports and data preparation
Once the package is installed, import the SPF class from the seededPF library. Only two things are required to fit the SPF topic model:
- Text documents
- A seed word (i.e., keyword) dictionary for each topic to be estimated.
# Imports
from seededpf import SPF
from sklearn.feature_extraction.text import CountVectorizer
# Example documents - customer reviews about either smartphones or computers
documents = [
    "My smartphone's battery life is fantastic, lasts all day!",
    "The camera on my phone is incredible, takes crystal-clear photos.",
    "Love the smooth performance, but it overheats with heavy apps.",
    "This phone charges super fast, very convenient.",
    "Upgraded my PC and it boots in seconds!",
    "Great for gaming, but gets hot after long sessions.",
    "My computer sometimes freezes, but a restart fixes it.",
    "Best laptop I’ve owned, powerful and reliable!"
]
# Define topic-specific seed words
smartphone = {"smartphone", "iphone", "phone", "touch", "app"}
pc = {"laptop", "keyboard", "desktop", "pc"}
keywords = {"smartphone": smartphone, "pc": pc}
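Before fitting, it can help to check which seed words actually occur in the documents. The sketch below is not part of seededPF. It assumes that seed words are matched against the vocabulary as exact tokens, and it mimics the default token pattern of CountVectorizer (lowercased words of two or more characters) with a regular expression, without stop-word removal.

```python
import re

documents = [
    "My smartphone's battery life is fantastic, lasts all day!",
    "The camera on my phone is incredible, takes crystal-clear photos.",
    "Love the smooth performance, but it overheats with heavy apps.",
    "This phone charges super fast, very convenient.",
    "Upgraded my PC and it boots in seconds!",
    "Great for gaming, but gets hot after long sessions.",
    "My computer sometimes freezes, but a restart fixes it.",
    "Best laptop I’ve owned, powerful and reliable!",
]
keywords = {
    "smartphone": {"smartphone", "iphone", "phone", "touch", "app"},
    "pc": {"laptop", "keyboard", "desktop", "pc"},
}

# Same shape as CountVectorizer's default token pattern: words of 2+ characters.
TOKEN = re.compile(r"\b\w\w+\b")
vocabulary = {token for doc in documents for token in TOKEN.findall(doc.lower())}

# Seed words per topic that are present in the corpus vocabulary.
coverage = {topic: seeds & vocabulary for topic, seeds in keywords.items()}

for topic, found in coverage.items():
    missing = keywords[topic] - found
    print(f"{topic}: found {sorted(found)}, missing {sorted(missing)}")
```

Note that "app" is reported as missing even though one review mentions "apps": with exact token matching, inflected forms have to be listed as separate seed words or normalized beforehand.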
Step 2: Initialization
Now that we have both the documents and the pre-specification of topics to be estimated, we can initialize the SPF topic model.
spf = SPF(keywords = keywords, residual_topics = 0) # Fits 2 seeded topics and 0 unsupervised topics
Step 3: Reading documents
We tokenize the documents and create all data required for model training automatically.
spf.read_docs(documents,
              count_vectorizer=CountVectorizer(stop_words="english", min_df = 0),
              batch_size = 1024)
Step 4: Training the model
For model training, we have to set the learning rate and the number of epochs.
spf.model_train(lr = 0.1, epochs = 150)
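To give a feel for what the learning rate and the number of epochs control, here is a toy sketch that fits a single Poisson rate by plain gradient descent, using the same `lr` and `epochs` values as the call above. This is not the library's training procedure: seededPF optimizes a variational objective (the negative ELBO), whereas this example minimizes a simple negative log-likelihood on made-up counts.

```python
import math

counts = [2, 4, 3, 3]   # made-up word counts; their mean is 3.0
lr, epochs = 0.1, 150   # same values as in the call above

log_rate = 0.0          # optimize on the log scale so the rate stays positive
history = []

for _ in range(epochs):
    rate = math.exp(log_rate)
    # Mean negative Poisson log-likelihood, dropping the constant log(y!) term.
    loss = sum(rate - y * log_rate for y in counts) / len(counts)
    history.append(loss)
    # Gradient of that loss with respect to log_rate.
    grad = rate - sum(counts) / len(counts)
    log_rate -= lr * grad

print(f"fitted rate = {math.exp(log_rate):.4f}")
print(f"loss: first = {history[0]:.4f}, last = {history[-1]:.4f}")
```

A larger learning rate takes bigger steps per epoch, and more epochs allow more steps. The recorded loss history plays the same role as the curve shown by SPF.plot_model_loss(): it should flatten out once the fit has converged.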
Step 5: Analysis of the results
Several methods are available for analyzing the fitted topic model. The minimal example and the advanced example demonstrate the post-hoc analysis methods.
The seededPF package offers several methods, including:
- SPF.plot_model_loss(): Checks convergence of the negative ELBO.
- SPF.return_topics(): Returns a tuple (categories, E_theta), with categories being the most probable topic for each document and E_theta being the approximate posterior mean estimates per document and topic.
- SPF.calculate_topic_word_distributions(): Returns a pandas dataframe containing the approximate topic-term mean intensities.
- SPF.print_topics(): Returns a dictionary with the highest intensity words per topic.
- SPF.plot_seeded_topic_distribution(): Plots the variational topic word distribution of all seed words belonging to the topic parameter.
- SPF.plot_word_distribution(): Shows the fitted variational distributions q(\tilde{\beta})_{topic,word} and q(\beta^*)_{topic,word}.
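The relation between categories and E_theta described for SPF.return_topics() can be sketched as follows. This is a minimal illustration, assuming that the most probable topic of a document is the one with the largest posterior mean intensity. The values in `E_theta` and the helper `most_probable_topics` are hypothetical and do not come from a fitted model.

```python
topics = ["smartphone", "pc"]

# Hypothetical posterior mean intensities: one row per document, one column per topic.
E_theta = [
    [1.8, 0.2],
    [0.3, 1.1],
    [0.9, 0.4],
]


def most_probable_topics(theta, topic_names):
    """For every document, pick the topic with the largest mean intensity."""
    return [
        topic_names[max(range(len(row)), key=row.__getitem__)]
        for row in theta
    ]


categories = most_probable_topics(E_theta, topics)
print(categories)
```

Keeping the full E_theta matrix alongside the categories is useful when documents mix several topics, since the hard assignment discards that information.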
Contribution
If you encounter any bugs or would like to suggest new features for the library, please feel free to contact us or create an issue.
Citing
When citing seededPF, please use this BibTeX entry:
@misc{prostmaier2025seededpoissonfactorizationleveraging,
  title={Seeded Poisson Factorization: Leveraging domain knowledge to fit topic models},
  author={Bernd Prostmaier and Jan Vávra and Bettina Grün and Paul Hofmarcher},
  year={2025},
  eprint={2503.02741},
  archivePrefix={arXiv},
  primaryClass={stat.ME},
  url={https://arxiv.org/abs/2503.02741},
}
License
Code licensed under MIT.