
nobs-canonicalize

This library reduces the development time needed to cluster documents into topics, at least for a prototype.

[!CAUTION] This library is in early development. It is not ready for production use.

The library has been tested on 2,500 sentences. A smell test on 10,000 sentences appears to pass, but of course the topic quality at that scale is unknown, so be cautious and evaluate carefully.

The approach here is to use the HDBSCAN clustering algorithm from BERTopic along with OpenAI's o3-mini LLM model to name the clusters and classify outliers.

Motivations

  • Topic modeling is a time-consuming development task. I did not find any tools that helped me quickly produce quality topics for my prototype. BERTopic is a great library, but its many configuration options make it hard to use.
  • OpenAI's cutting-edge o3-mini names clusters well and reduces outliers better than BERTopic's default method.

Example usage

OpenAI

import os

from dotenv import load_dotenv
from rich import print

from nobs_canonicalize import nobs_canonicalize

load_dotenv()
openai_api_key = os.environ["OPENAI_API_KEY"]

texts = [
    "16/8 fasting",
    "16:8 fasting",
    "24-hour fasting",
    "24-hour one meal a day (OMAD) eating pattern",
    "2:1 ketogenic diet, low-glycemic-index diet",
    "30-day nutrition plan",
    "36-hour fast",
    "4-day fast",
    "40 hour fast, low carb meals",
    "4:3 fasting",
    "5-day fasting-mimicking diet (FMD) program",
    "7 day fast",
    "84-hour fast",
    "90/10 diet",
    "Adjusting macro and micro nutrient intake",
    "Adjusting target macros",
    "Macro and micro nutrient intake",
    "AllerPro formula",
    "Alternate Day Fasting (ADF), One Meal A Day (OMAD)",
    "American cheese",
    "Atkin's diet",
    "Atkins diet",
    "Avoid seed oils",
    "Avoiding seed oils",
    "Limiting seed oils",
    "Limited seed oils and processed foods",
    "Avoiding seed oils and processed foods",
]

clusters = nobs_canonicalize(
    texts=texts,
    openai_api_key=openai_api_key,
    reasoning_effort="low",  # low, medium, high ... slow, slower, slowest
    subject="personal diet intervention outcomes",
)
print(clusters)

Azure OpenAI

import json
import os

from dotenv import load_dotenv

from nobs_canonicalize import nobs_canonicalize_azure, AzureConfig

load_dotenv()

azure_config = AzureConfig(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-12-01-preview",
    azure_endpoint="https://your-resource.openai.azure.com/",
    embedding_deployment="text-embedding-3-large",  # default
    llm_deployment="o3-mini",                        # default
)

clusters = nobs_canonicalize_azure(
    texts=texts,
    reasoning_effort="low",
    subject="personal diet intervention outcomes",
    azure_config=azure_config,
)
print(clusters)

Example output

pytest output

What's happening under the hood? The three steps...

This is an opinionated hybrid approach to topic modeling that combines embeddings and LLM completions. The embeddings are for clustering; the LLM completions are for naming and outlier classification.

graph TD;
    A[Start] -->|sentences| B{1. Run BERTopic};
    B -->|clusters| C[2. Name clusters];
    C -->|target classifications| D;
    B -->|outliers| D[3. Classify and merge outliers];
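The three-step flow above can be sketched in plain Python. The function and parameter names here are hypothetical stand-ins, not the library's actual internals; the real pipeline delegates each step to BERTopic or o3-mini.

```python
from typing import Callable

# Illustrative skeleton of the three steps -- all names here are hypothetical.
def canonicalize_sketch(
    texts: list[str],
    embed: Callable[[list[str]], list[list[float]]],
    cluster: Callable[[list[list[float]]], list[int]],  # label -1 marks an outlier
    name_cluster: Callable[[list[str]], str],
    classify_outlier: Callable[[str, list[str]], str],
) -> dict[str, list[str]]:
    labels = cluster(embed(texts))  # step 1: embed and cluster
    groups: dict[int, list[str]] = {}
    outliers: list[str] = []
    for text, label in zip(texts, labels):
        (outliers if label == -1 else groups.setdefault(label, [])).append(text)
    # step 2: name each cluster
    named = {name_cluster(members): members for members in groups.values()}
    # step 3: assign each outlier to one of the named clusters
    for text in outliers:
        named[classify_outlier(text, list(named))].append(text)
    return named
```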

Step 1 - Cluster sentences

The BERTopic library clusters sentences using embeddings from the text-embedding-3-large model.
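The intuition behind this step: near-duplicate phrasings like "16/8 fasting" and "16:8 fasting" embed to nearby vectors, so a density-based clusterer groups them. A toy illustration with made-up three-dimensional vectors standing in for real embeddings:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for text-embedding-3-large output.
fasting_a = [0.90, 0.10, 0.00]  # "16/8 fasting"
fasting_b = [0.88, 0.12, 0.01]  # "16:8 fasting"
cheese = [0.00, 0.20, 0.95]     # "American cheese"

print(cosine(fasting_a, fasting_b))  # close to 1.0: same cluster
print(cosine(fasting_a, cheese))     # much lower: different cluster
```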

Step 2 - Name clusters

Names are generated by the o3-mini LLM model for the clusters produced in Step 1.
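A hedged sketch of the kind of naming prompt this step might send to o3-mini; the library's real prompt will differ, and the `subject` parameter mirrors the one in the usage example above:

```python
def naming_prompt(cluster_members: list[str], subject: str) -> str:
    """Hypothetical prompt builder: ask the LLM for a canonical cluster name."""
    bullet_list = "\n".join(f"- {text}" for text in cluster_members)
    return (
        f"The following sentences are about {subject} and belong to one topic.\n"
        f"{bullet_list}\n"
        "Reply with a short canonical name for this topic."
    )

print(naming_prompt(["Atkins diet", "Atkin's diet"], "personal diet intervention outcomes"))
```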

Step 3 - Re-group outliers

Outlier sentences, those that did not fit into any of the BERTopic clusters from Step 1, are classified by o3-mini into the cluster names produced in Step 2.
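This step reduces to a multiple-choice question per outlier. A minimal sketch, assuming the LLM is passed in as a callable (the real library's prompt and fallback behavior may differ):

```python
from typing import Callable

def classify_outlier(
    outlier: str,
    cluster_names: list[str],
    llm: Callable[[str], str],
) -> str:
    """Hypothetical sketch: ask the LLM to pick one of the Step-2 cluster names
    for an outlier sentence; fall back to the first name on a non-matching reply."""
    prompt = (
        f"Sentence: {outlier}\n"
        "Pick the single best topic from this list, replying with the name only:\n"
        + "\n".join(cluster_names)
    )
    reply = llm(prompt).strip()
    return reply if reply in cluster_names else cluster_names[0]
```

With a stub LLM that replies "Intermittent fasting", the outlier "84-hour fast" merges into that cluster.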

Install

Prerequisites

  • python = ">=3.11,<3.15"

pip install nobs-canonicalize

Some BERTopic FAQs

Why does it take so long to import BERTopic?

Pointers for contributing developers

Run a smoke test

git clone git@github.com:borisdev/nobs-canonicalize.git
cd nobs-canonicalize
pip install -e .
# set OPENAI_API_KEY in the code or as an environment variable
poetry run pytest tests/test_models.py -v  # unit tests, no API key needed
poetry run pytest tests/test_main.py::test_nobs_canonicalize -v  # integration test
# remember it takes a while to import the bertopic library

  • Make a tiny PR so I can see how I can help you get started.
