# BERTopic Easy

Polishing BERTopic output using OpenAI's o3-mini.
## Motivations

- OpenAI's `o3-mini` names clusters well.
- OpenAI's `o3-mini` reduces outliers better than BERTopic's default method.
## Example usage

```python
import os

from openai import AsyncOpenAI, OpenAI

from bertopic_easy.main import bertopic_easy

openai = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
async_openai = AsyncOpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

texts = [
    "16/8 fasting",
    "16:8 fasting",
    "24-hour fasting",
    "24-hour one meal a day (OMAD) eating pattern",
    "2:1 ketogenic diet, low-glycemic-index diet",
    "30-day nutrition plan",
    "36-hour fast",
    "4-day fast",
    "40 hour fast, low carb meals",
    "4:3 fasting",
    "5-day fasting-mimicking diet (FMD) program",
    "7 day fast",
    "84-hour fast",
    "90/10 diet",
    "Adjusting macro and micro nutrient intake",
    "Adjusting target macros",
    "Macro and micro nutrient intake",
    "AllerPro formula",
    "Alternate Day Fasting (ADF), One Meal A Day (OMAD)",
    "American cheese",
    "Atkin's diet",
    "Atkins diet",
    "Avoid seed oils",
    "Avoiding seed oils",
    "Limiting seed oils",
    "Limited seed oils and processed foods",
    "Avoiding seed oils and processed foods",
]

clusters = bertopic_easy(
    texts=texts,
    openai=openai,
    async_openai=async_openai,
    reasoning_effort="low",
    subject="personal diet intervention outcomes",
)
print(clusters)
```
## Example output
## What's happening under the hood? The three steps...

This is an opinionated hybrid approach to topic modeling that combines embeddings with LLM completions: the embeddings are used for clustering, and the LLM completions are used for naming clusters and classifying outliers.
```mermaid
graph TD;
    A[Start] -->|sentences| B{1. Run BERTopic};
    B -->|clusters| C[2. Name clusters];
    C -->|target classifications| D;
    B -->|outliers| D[3. Classify and merge outliers];
```
### Step 1 - Cluster sentences

The BERTopic library clusters the sentences using embeddings from OpenAI's text-embedding-3-large embedding model.
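The core idea behind embedding-based clustering is that sentences with similar meanings map to nearby vectors. A minimal sketch of that idea, using made-up 3-dimensional toy vectors in place of real text-embedding-3-large output (this is an illustration, not BERTopic's actual internals):

```python
from math import sqrt

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings (illustrative values only).
embeddings = {
    "16/8 fasting": [0.90, 0.10, 0.01],
    "16:8 fasting": [0.88, 0.12, 0.02],
    "American cheese": [0.05, 0.10, 0.95],
}

# Sentences about the same concept land close together...
print(cosine_similarity(embeddings["16/8 fasting"], embeddings["16:8 fasting"]))
# ...while unrelated sentences do not.
print(cosine_similarity(embeddings["16/8 fasting"], embeddings["American cheese"]))
```

A clustering algorithm (HDBSCAN, in BERTopic's default configuration) then groups vectors that are close to each other and leaves the rest as outliers.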
### Step 2 - Name clusters

Names for the clusters produced in Step 1 are generated by the o3-mini model.
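The exact prompt the library sends is not shown on this page; the following is a hypothetical sketch of how a cluster-naming prompt could be assembled (the function name and wording are illustrative, not the library's actual prompt):

```python
def build_naming_prompt(subject: str, cluster_sentences: list[str]) -> str:
    """Assemble a hypothetical prompt asking an LLM to name one cluster."""
    bullet_list = "\n".join(f"- {s}" for s in cluster_sentences)
    return (
        f"The sentences below form one cluster of {subject}.\n"
        "Reply with a short, specific name for the cluster.\n\n"
        f"{bullet_list}"
    )

prompt = build_naming_prompt(
    subject="personal diet intervention outcomes",
    cluster_sentences=["16/8 fasting", "16:8 fasting", "4:3 fasting"],
)
print(prompt)
```

A prompt like this would be sent once per cluster, which is why the library takes an `AsyncOpenAI` client: the naming calls can run concurrently.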
### Step 3 - Re-group outliers (not implemented yet)

Outlier sentences, those that did not fit into any of the BERTopic clusters from Step 1, are classified by o3-mini against the cluster names generated in Step 2.
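Since this step is not implemented yet, here is a hypothetical sketch of the merge logic: each outlier is classified against the known cluster names and, if a match is found, appended to that cluster. The `merge_outliers` function and the `fake_classify` stub (standing in for the o3-mini call) are illustrative, not the library's API:

```python
from typing import Callable, Optional

def merge_outliers(
    clusters: dict[str, list[str]],
    outliers: list[str],
    classify: Callable[[str, list[str]], Optional[str]],
) -> dict[str, list[str]]:
    """Merge each classifiable outlier into its assigned cluster.

    Outliers the classifier cannot place (it returns None) stay unmerged.
    """
    merged = {name: list(members) for name, members in clusters.items()}
    for outlier in outliers:
        name = classify(outlier, list(merged))
        if name is not None:
            merged[name].append(outlier)
    return merged

# Stub classifier standing in for the o3-mini call (illustrative only).
def fake_classify(outlier: str, names: list[str]) -> Optional[str]:
    return "Intermittent fasting" if "fast" in outlier else None

merged = merge_outliers(
    clusters={
        "Intermittent fasting": ["16/8 fasting"],
        "Avoiding seed oils": ["Avoid seed oils"],
    },
    outliers=["36-hour fast", "American cheese"],
    classify=fake_classify,
)
print(merged)
```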
## Install

1. `git clone` this repo
2. `cd` to the root of the repo
3. Set `OPENAI_API_KEY` as an environment variable or in a local `.env` file
4. `poetry install`
5. `poetry shell` (to activate the virtual environment, if needed)
6. `poetry run python demo.py`
## Run smoke test

```shell
poetry run pytest tests/test_main.py::test_bertopic_easy
```