Skip to main content

Semantic document clustering and topic labeling

Project description

topicmodel

topicmodel lets you discover what topics are covered in a bunch of documents. You can also classify documents into topics and find the similarity of each document with each topic.

Usage

To categorize each line in docs.txt into topics, run:

export OPENAI_API_KEY=...
uvx topicmodel docs.txt --output topicmodel.txt

Discover Topics

For example, if docs.txt has:

Mars has a thin atmosphere.
The moon orbits Earth.
Stars shine at night.
Bread needs yeast.
Basil smells fresh.

Run:

uvx topicmodel docs.txt --ntopics=2

It groups each line into 2 auto-discovered topics and print something like:

1: Space and Astronomy 2: Food and Ingredients

text best_match best_score Space and Astronomy Food and Ingredients
Mars has a thin atmosphere. Space and Astronomy 0.28224 0.28224 0.06313
The moon orbits Earth. Space and Astronomy 0.26560 0.26560 0.00546
Stars shine at night. Space and Astronomy 0.32462 0.32462 0.04896
Bread needs yeast. Food and Ingredients 0.28357 0.02198 0.28357
Basil smells fresh. Food and Ingredients 0.20560 0.06859 0.20560

The best_match column is the closest topic to the text. The rest of the columns are the similarity between the text and each topic.

Use Existing Topics

Create this topics.txt:

Astronomy
Cooking

Run:

uvx topicmodel docs.txt --topics topics.txt

This groups each line into the 2 topics in topics.txt along with the similarities:

text best_match Astronomy Cooking
Mars has a thin atmosphere. Astronomy 0.17034 0.03036
The moon orbits Earth. Astronomy 0.29521 0.01998
Stars shine at night. Astronomy 0.28186 0.12287
Bread needs yeast. Cooking 0.03838 0.18655
Basil smells fresh. Cooking 0.05344 0.16860

Options

  • --docs: File containing documents. Required. Can be .txt, .csv or .json file or a JSON string
    • .txt: Each line is treated as a document.
    • .csv: Each row is treated as a document. Only the first column is used.
    • .json: This should have an array of objects. Only the first key is used. Example: [{"text": "Apples are great"}, {"text": "Bananas are yellow"}]
    • JSON string: You can pass the the JSON directly as input. Example: uvx topicmodel '[{"text": "Apple"}, {"text": "Banana"}]' --ntopics 2
  • --topics: Optional file with existing topics you want to match with. Can be .txt, .csv or .json
  • --output: Path to save results. Can be .csv, .json or .txt.
  • --model: Default: text-embedding-3-small. OpenAI embedding model. Use text-embedding-3-large for higher quality.
  • --name_model: Default: gpt-4.1-mini. Model to name clusters.
  • --ntopics: Default: 20. Approx. number of topics to auto-discover. Increase for more granular clusters.
  • --nsamples: Default: 5. Documents to show the naming model from each cluster. Higher values may improve topic names but increase cost.
  • --truncate: Default: 200. Characters from each document to send to the naming model. Adjust based on document length; shorter saves tokens.
  • --prompt: Prompt sent to the naming model. Modify to control naming style.

The default --prompt is:

Here are clusters of documents. Suggest 2-4 word topic names for each cluster. Capture the spirit of each cluster. Differentiate from other clusters.

Environment variables:

# Use a different OpenAI compatible provider, e.g. openrouter:
export OPENAI_BASE_URL=https://openrouter.ai/api/v1

# Embeddings are cached in this path. You can change it. The default is:
export TOPICMODEL_CACHE=~/.cache/topicmodel/embeddings.db

Development

git clone https://github.com/gramener/topicmodel.git
cd topicmodel
uvx ruff --line-length 100 .
uvx --with pytest-asyncio,httpx,pandas,numpy,scikit-learn,tiktoken,tqdm pytest

Deployment

Modify the pyproject.toml file to change the version number.

uv build
uv publish

This is deployed to pypi as Anand.S

Change log

  • 0.1.2: 07 Aug 2025. Include best_score in output
  • 0.1.1: 07 Aug 2025. Help shows defaults. Informative errors. More tests
  • 0.1.0: 25 Jul 2025. Initial release

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

topicmodel-0.1.2.tar.gz (8.4 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

topicmodel-0.1.2-py3-none-any.whl (9.7 kB view details)

Uploaded Python 3

File details

Details for the file topicmodel-0.1.2.tar.gz.

File metadata

  • Download URL: topicmodel-0.1.2.tar.gz
  • Upload date:
  • Size: 8.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.7.3

File hashes

Hashes for topicmodel-0.1.2.tar.gz
Algorithm Hash digest
SHA256 f8521874d9faa609e53dd99f61ee7d54b9f36466596ff1aedbf6eeb0680884e6
MD5 688eb9fe70e75e470d2801104e40dc6a
BLAKE2b-256 726d6f27d9dae5f7905e9cdbdb4a522df4371fb88b65237e371af861050a2e20

See more details on using hashes here.

File details

Details for the file topicmodel-0.1.2-py3-none-any.whl.

File metadata

File hashes

Hashes for topicmodel-0.1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 cbfd77bb7073c196506d5a73b44bcdfff79ab009e36a7315ca65d51a6a18ed3e
MD5 11bf372f314af832fd83764f1aedbf00
BLAKE2b-256 d29807dccefd1f7added420fef6c76664c8a351da0c6a4b69d1ad865c3c16469

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page