BERTopic Easy

Polishing BERTopic output using OpenAI's o3-mini.

Motivations

  • OpenAI's o3-mini names clusters well.
  • OpenAI's o3-mini reduces outliers better than BERTopic's default method.

Example usage

import os

from dotenv import load_dotenv
from openai import AsyncOpenAI, OpenAI
from rich import print

from bertopic_easy.main import bertopic_easy

load_dotenv()

texts = [
    "16/8 fasting",
    "16:8 fasting",
    "24-hour fasting",
    "24-hour one meal a day (OMAD) eating pattern",
    "2:1 ketogenic diet, low-glycemic-index diet",
    "30-day nutrition plan",
    "36-hour fast",
    "4-day fast",
    "40 hour fast, low carb meals",
    "4:3 fasting",
    "5-day fasting-mimicking diet (FMD) program",
    "7 day fast",
    "84-hour fast",
    "90/10 diet",
    "Adjusting macro and micro nutrient intake",
    "Adjusting target macros",
    "Macro and micro nutrient intake",
    "AllerPro formula",
    "Alternate Day Fasting (ADF), One Meal A Day (OMAD)",
    "American cheese",
    "Atkin's diet",
    "Atkins diet",
    "Avoid seed oils",
    "Avoiding seed oils",
    "Limiting seed oils",
    "Limited seed oils and processed foods",
    "Avoiding seed oils and processed foods",
]

clusters = bertopic_easy(
    texts=texts,
    openai=OpenAI(api_key=os.environ.get("OPENAI_API_KEY")),
    async_openai=AsyncOpenAI(api_key=os.environ.get("OPENAI_API_KEY")),
    reasoning_effort="low",
    subject="personal diet intervention outcomes",
)
print(clusters)

Example output

[Image: pytest output]

What's happening under the hood? The three steps...

This is an opinionated hybrid approach to topic modeling that combines embeddings with LLM completions. The embeddings are used for clustering; the LLM completions are used for naming clusters and classifying outliers.

graph TD;
    A[Start] -->|sentences| B{1.Run Bertopic};
    B -->|clusters| C[2.Name clusters];
    C -->|target classifications| D;
    B -->|outliers| D[3.Classify and merge outliers];

Step 1 - Cluster sentences

The BERTopic library clusters the sentences using embeddings from OpenAI's text-embedding-3-large model.
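As a rough illustration of what Step 1 hands to the later steps, the sketch below groups sentences by the topic id assigned to each one. BERTopic marks outliers with topic id -1. The `split_clusters` helper and its return shape are assumptions made for this example, not part of the bertopic_easy API, and the topic ids are made up rather than real model output.

```python
from collections import defaultdict

OUTLIER_TOPIC = -1  # BERTopic assigns topic id -1 to outliers


def split_clusters(texts, topics):
    """Group texts by topic id and separate the outliers.

    `texts` and `topics` are parallel lists: one topic id per input
    text, like the topic list BERTopic returns after fitting.
    Hypothetical helper for illustration only.
    """
    if len(texts) != len(topics):
        raise ValueError("texts and topics must have the same length")
    clusters = defaultdict(list)
    outliers = []
    for text, topic in zip(texts, topics):
        if topic == OUTLIER_TOPIC:
            outliers.append(text)
        else:
            clusters[topic].append(text)
    return dict(clusters), outliers


# Topic ids here are invented for the example, not real model output.
clusters, outliers = split_clusters(
    ["16/8 fasting", "Atkins diet", "16:8 fasting", "American cheese"],
    [0, 1, 0, -1],
)
```

The clusters go on to Step 2 for naming, while the outliers are held back for Step 3.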

Step 2 - Name clusters

An o3-mini model generates a name for each cluster produced in Step 1.
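A minimal sketch of how cluster naming could be wired up, assuming one LLM call per cluster. The prompt wording, `build_naming_prompt`, `name_clusters`, and the `complete` callable are all hypothetical and are not taken from the bertopic_easy source. A stub replaces the o3-mini call so the example runs offline.

```python
def build_naming_prompt(sentences, subject):
    """Build a prompt asking an LLM for a short name for one cluster."""
    bullet_list = "\n".join(f"- {s}" for s in sentences)
    return (
        f"The following phrases are about {subject} and belong to one cluster.\n"
        f"{bullet_list}\n"
        "Reply with a short, descriptive name for the cluster."
    )


def name_clusters(clusters, subject, complete):
    """Return {cluster_id: name}, calling `complete(prompt)` once per cluster.

    `complete` is any callable that sends a prompt to an LLM and returns
    its text reply (hypothetical; for example a thin wrapper around an
    o3-mini chat completion).
    """
    return {
        cluster_id: complete(build_naming_prompt(sentences, subject)).strip()
        for cluster_id, sentences in clusters.items()
    }


# Stub standing in for the LLM so the sketch runs without an API key.
def fake_complete(prompt):
    return " Intermittent fasting " if "fasting" in prompt else "Other"


names = name_clusters(
    {0: ["16/8 fasting", "16:8 fasting"], 1: ["Atkins diet", "Atkin's diet"]},
    subject="personal diet intervention outcomes",
    complete=fake_complete,
)
```

Passing the LLM call in as a plain callable keeps the prompt logic testable on its own. In real use, `complete` would wrap the OpenAI client.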

Step 3 - Re-group outliers (not implemented yet)

Outlier sentences (those that did not fit into any of the BERTopic clusters from Step 1) are classified by the o3-mini LLM against the cluster names produced in Step 2.
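Since this step is marked as not implemented yet, the following is only one way it might look, not the project's actual code. It assumes a hypothetical `classify(text, names)` callable that returns one of the cluster names, or `None` when nothing fits. A keyword stub stands in for the o3-mini call.

```python
def merge_outliers(named_clusters, outliers, classify):
    """Assign each outlier to a named cluster, or leave it unassigned.

    `classify(text, names)` is a hypothetical callable that returns one of
    `names`, or None when nothing fits (for example an o3-mini call).
    """
    # Copy the member lists so the caller's clusters are not mutated.
    merged = {name: list(members) for name, members in named_clusters.items()}
    remaining = []
    names = list(merged)
    for text in outliers:
        label = classify(text, names)
        if label in merged:
            merged[label].append(text)
        else:
            # Unknown or missing labels stay outliers instead of being guessed.
            remaining.append(text)
    return merged, remaining


# Keyword stub standing in for the LLM classifier.
def fake_classify(text, names):
    return "Fasting" if "fast" in text.lower() else None


merged, remaining = merge_outliers(
    {"Fasting": ["16/8 fasting", "36-hour fast"], "Seed oils": ["Avoid seed oils"]},
    ["7 day fast", "American cheese"],
    classify=fake_classify,
)
```

Any label the classifier returns that is not an existing cluster name is treated as "no match", which guards against a model inventing new categories.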

Install

Pre-requisites

  • python = ">=3.11,<3.13"

pip install bertopic-easy

Run smoke test

poetry run pytest tests/test_main.py::test_bertopic_easy
