
BERTopic Easy

Polishing BERTopic output using OpenAI's o3-mini.

Motivations

  • OpenAI's o3-mini names clusters well.
  • OpenAI's o3-mini reduces outliers better than BERTopic's default method.

Example usage

import os

from openai import AsyncOpenAI, OpenAI

from bertopic_easy.main import bertopic_easy

openai = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
async_openai = AsyncOpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

texts = [
    "16/8 fasting",
    "16:8 fasting",
    "24-hour fasting",
    "24-hour one meal a day (OMAD) eating pattern",
    "2:1 ketogenic diet, low-glycemic-index diet",
    "30-day nutrition plan",
    "36-hour fast",
    "4-day fast",
    "40 hour fast, low carb meals",
    "4:3 fasting",
    "5-day fasting-mimicking diet (FMD) program",
    "7 day fast",
    "84-hour fast",
    "90/10 diet",
    "Adjusting macro and micro nutrient intake",
    "Adjusting target macros",
    "Macro and micro nutrient intake",
    "AllerPro formula",
    "Alternate Day Fasting (ADF), One Meal A Day (OMAD)",
    "American cheese",
    "Atkin's diet",
    "Atkins diet",
    "Avoid seed oils",
    "Avoiding seed oils",
    "Limiting seed oils",
    "Limited seed oils and processed foods",
    "Avoiding seed oils and processed foods",
]


clusters = bertopic_easy(
    texts=texts,
    openai=openai,
    async_openai=async_openai,
    reasoning_effort="low",
    subject="personal diet intervention outcomes",
)
print(clusters)

Example output

[pytest output screenshot]

What's happening under the hood? The three steps...

This is an opinionated hybrid approach to topic modeling that combines embeddings with LLM completions: the embeddings drive clustering, and the LLM completions handle naming and outlier classification.

graph TD;
    A[Start] -->|sentences| B{1. Run BERTopic};
    B -->|clusters| C[2. Name clusters];
    C -->|target classifications| D;
    B -->|outliers| D[3. Classify and merge outliers];
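The data flow above can be sketched as three plain Python functions. These are illustrative stubs only; the function names and data shapes are assumptions for this sketch, not bertopic_easy's actual API.

```python
# Illustrative stubs for the three steps; not bertopic_easy's real internals.

def run_bertopic(sentences):
    """Step 1 (stub): pretend the first two sentences form one cluster
    and everything else is an outlier."""
    return {0: sentences[:2]}, sentences[2:]

def name_clusters(clusters):
    """Step 2 (stub): give each cluster id a placeholder name."""
    return {cid: f"cluster {cid}" for cid in clusters}

def classify_outliers(outliers, names):
    """Step 3 (stub): assign every outlier to the first named cluster."""
    first = next(iter(names.values()))
    return {sentence: first for sentence in outliers}

clusters, outliers = run_bertopic(["16/8 fasting", "16:8 fasting", "90/10 diet"])
names = name_clusters(clusters)
assigned = classify_outliers(outliers, names)
print(names)     # {0: 'cluster 0'}
print(assigned)  # {'90/10 diet': 'cluster 0'}
```

In the real pipeline, Step 1 calls BERTopic, and Steps 2 and 3 call o3-mini.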

Step 1 - Cluster sentences

The BERTopic library clusters the sentences using embeddings from OpenAI's text-embedding-3-large model.
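To illustrate the idea of grouping by embedding similarity (BERTopic itself uses a more sophisticated UMAP + HDBSCAN pipeline, not this), here is a minimal greedy cosine-similarity grouping over toy 2-D vectors:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

def cluster(vectors, threshold=0.8):
    """Greedy single-pass clustering: join the first cluster whose seed
    vector is similar enough, else start a new cluster.
    Returns a list of index lists."""
    clusters = []
    for i, v in enumerate(vectors):
        for members in clusters:
            if cosine(vectors[members[0]], v) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters

# Toy "embeddings": two obvious groups along different axes.
vecs = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
print(cluster(vecs))  # [[0, 1], [2, 3]]
```

Real sentence embeddings have thousands of dimensions, but the same similarity intuition applies.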

Step 2 - Name clusters

The o3-mini model generates a name for each cluster produced in Step 1.
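A naming call like this boils down to building a prompt from a cluster's sentences plus the `subject` argument. The prompt format below is a hypothetical sketch, not the prompt bertopic_easy actually sends:

```python
def naming_prompt(subject, sentences):
    """Build a cluster-naming prompt (hypothetical format)."""
    bullet_list = "\n".join(f"- {s}" for s in sentences)
    return (
        f"The following sentences describe {subject}.\n"
        "Reply with a short name (2-5 words) for this cluster.\n\n"
        f"{bullet_list}"
    )

prompt = naming_prompt(
    "personal diet intervention outcomes",
    ["16/8 fasting", "16:8 fasting", "24-hour fasting"],
)
print(prompt)
```

The returned string would then be sent to o3-mini via the chat completions API, one request per cluster.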

Step 3 - Re-group outliers (not implemented yet)

Outlier sentences, i.e. those that did not fit into any of the BERTopic clusters from Step 1, are classified by o3-mini against the cluster names generated in Step 2, then merged into those clusters.
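Since this step is not yet implemented, here is one possible stand-in for the classification call: a trivial word-overlap match between an outlier and the candidate cluster names. The real step would ask o3-mini to pick a cluster (or decline), not use this heuristic.

```python
def classify_outlier(outlier, cluster_names):
    """Toy stand-in for the LLM classification: pick the cluster name
    sharing the most words with the outlier sentence."""
    words = set(outlier.lower().split())
    return max(
        cluster_names,
        key=lambda name: len(words & set(name.lower().split())),
    )

names = ["Intermittent fasting", "Seed oil avoidance", "Ketogenic diets"]
print(classify_outlier("avoid seed oils in cooking", names))  # Seed oil avoidance
```

An LLM handles paraphrases and synonyms that word overlap misses, which is the motivation for using o3-mini here.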

Install

  • git clone this repo
  • cd into the repo root
  • set OPENAI_API_KEY as an environment variable or in a local .env file
  • poetry install
  • poetry shell # activates the virtual environment, if needed
  • poetry run python demo.py

Run smoke test

poetry run pytest tests/test_main.py::test_bertopic_easy
