Using AI to sort lists of unstructured text through iterative batches.

assort

Text clustering and sorting with an LLM that discovers categories, classifies items, optionally merges overlapping themes, and cleans up labels. Designed for quick drop-in use with a single function call and clear cost tracking.

Highlights

  • Discovers category names from your data
  • Sorts each item with calibrated confidence per category
  • Merges overlapping themes when the model judges a high likelihood of overlap
  • Refines the miscellaneous bucket when it is too large
  • Optionally renames categories to be clearer and more specific
  • Tracks tokens and estimated cost in USD
  • Simple one-function API that returns results and rich stats

Install

pip install assort

You also need an OpenAI API key available to the runtime, for example:

export OPENAI_API_KEY=sk_your_key_here

Quick start

from assort import assort

texts = [
    "Build a responsive landing page in React",
    "How to index a Postgres table",
    "Cognitive behavioral therapy exercises",
    "Vector search with Azure AI Search",
    "Tailwind utility classes for layouts",
    "Managing anxiety before a big presentation",
]

results, stats = assort(
    texts,
    min_clusters=3,
    max_clusters=6,
    description="Short notes that mix software topics and mental health topics",
)

print(results["sorted_results"])
print(round(stats["cost_usd"], 4), "USD")

Example shape of sorted_results

{
    "Front End Engineering": [
        "Build a responsive landing page in React",
        "Tailwind utility classes for layouts"
    ],
    "Data and Search": [
        "Vector search with Azure AI Search",
        "How to index a Postgres table"
    ],
    "Anxiety and CBT": [
        "Cognitive behavioral therapy exercises",
        "Managing anxiety before a big presentation"
    ],
    "Miscellaneous": []
}
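
Because sorted_results is a plain dict of category names to lists, it is easy to invert into an item-to-category lookup for export or joins. The sketch below works on a literal dict that mirrors the shape above rather than on a live call; invert_sorted_results is a hypothetical helper and not part of assort.

```python
def invert_sorted_results(sorted_results):
    """Map each item to its category name (the last one wins if an item repeats)."""
    item_to_category = {}
    for category, items in sorted_results.items():
        for item in items:
            item_to_category[item] = category
    return item_to_category


# Literal data mirroring the documented shape; not produced by a real run.
example = {
    "Front End Engineering": [
        "Build a responsive landing page in React",
        "Tailwind utility classes for layouts",
    ],
    "Data and Search": [
        "Vector search with Azure AI Search",
        "How to index a Postgres table",
    ],
    "Anxiety and CBT": [
        "Cognitive behavioral therapy exercises",
        "Managing anxiety before a big presentation",
    ],
    "Miscellaneous": [],
}

lookup = invert_sorted_results(example)
print(lookup["How to index a Postgres table"])
```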

API

assort

results, stats = assort(
    batch,
    min_clusters=2,
    max_clusters=5,
    policy=None,
    description="",
    print_estimate=False,
    confirm=False,
    max_budget=None,
    model=None,
    rename_final=True,
)

Parameters

  • batch: List of strings to categorize. Empty or blank strings are ignored.

  • min_clusters and max_clusters: Bounds for initial category discovery.

  • policy: Policy.fuzzy or Policy.exhaustive. This affects internal cost estimation. Both modes perform miscellaneous refinement.

  • description: Optional corpus context. Helps the model choose better category boundaries and names.

  • print_estimate: If true, a cost estimate is computed before any model calls. The estimate is also used when confirm or max_budget is set.

  • confirm: If true, the function prompts in the console before running. Useful for scripts that are run interactively.

  • max_budget: Float in USD. If the estimate exceeds this amount, the function returns an empty result without calling the model.

  • model: Optional model name to override the default. If omitted, a capable multimodal GPT model is used.

  • rename_final: If true, the library proposes clearer category names at the end based on samples from each group.
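
When max_budget is set, callers may want to detect the case where the run was skipped. The README does not specify the exact shape of the "empty result", so the sketch below makes a loud assumption: it treats None, an empty dict, or a sorted_results with no items as empty. is_empty_result is a hypothetical helper, not part of assort.

```python
def is_empty_result(results):
    """Return True when no item landed in any category.

    Assumption: an "empty result" may be None, an empty dict, or a dict
    whose sorted_results holds no items. The exact shape is not documented.
    """
    if not results:
        return True
    sorted_results = results.get("sorted_results") or {}
    # Any non-empty list of items means the run produced output.
    return not any(sorted_results.values())


# Intended use (requires an API key, so it is left commented out):
# results, stats = assort(texts, max_budget=0.10)
# if is_empty_result(results):
#     print("Estimate exceeded the budget; nothing was sorted.")
```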

Returns

  • results: Dict with key sorted_results. Values are lists of the original items per category. A Miscellaneous bucket is always present.

  • stats: Dict with detailed run information

    • model
    • items_total
    • initial_categories_count
    • final_categories_count
    • miscellaneous_count
    • calls with counts for internal steps
    • retries for API retries with backoff
    • tokens with input and output counts
    • combination_attempts and combined_merges
    • elapsed_seconds
    • cost_usd estimated from token counts and the internal price table
    • category_sizes mapping category to item count
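
As a small illustration of reading these fields, the sketch below computes the share of items that ended up in Miscellaneous. It runs on a hand-written dict that uses only the key names listed above; the values are made up for the example and miscellaneous_share is a hypothetical helper.

```python
def miscellaneous_share(stats):
    """Fraction of all items that landed in the Miscellaneous bucket."""
    total = stats["items_total"]
    if total == 0:
        return 0.0  # avoid dividing by zero on an empty run
    return stats["miscellaneous_count"] / total


# Made-up values; only the key names come from the list above.
example_stats = {
    "items_total": 40,
    "miscellaneous_count": 6,
    "category_sizes": {"Billing": 20, "Onboarding": 14, "Miscellaneous": 6},
}

share = miscellaneous_share(example_stats)
print(f"{share:.0%} of items are in Miscellaneous")
```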

How it works

  • Category discovery: The model reads a sample of your corpus and proposes a set of category names, with the number of categories kept within your bounds.

  • Sorting: Each item is scored against every discovered category with a confidence of high, medium, or low. High-confidence categories win. Ties are broken by simple rules.

  • Combining overlapping themes: When items frequently score high for the same pairs or sets of categories, the library asks the model whether they should be combined. If the model judges the likelihood of overlap to be high, a single concise name is requested and the merge proceeds.

  • Refining miscellaneous: If Miscellaneous grows larger than a data-guided threshold, the same discovery and sorting routine runs on that subset. Items are pulled out into new focused categories when possible.

  • Renaming for clarity: At the end, the library proposes clearer names that preserve meaning, using a small sample from each category. Names are deduplicated.
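
The sorting and combining steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the library's actual code: the tie-break (first category in dict order), the fallback to Miscellaneous when nothing scores high, the min_shared threshold, and the names pick_category and merge_candidates are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def pick_category(scores, fallback="Miscellaneous"):
    """Choose a category for one item from its confidence labels.

    Assumptions: the first 'high' category in dict order wins a tie, and
    items with no 'high' label fall back to Miscellaneous.
    """
    highs = [name for name, label in scores.items() if label == "high"]
    return highs[0] if highs else fallback


def merge_candidates(all_scores, min_shared=2):
    """Return category pairs that are both 'high' for at least min_shared items."""
    pair_counts = Counter()
    for scores in all_scores:
        highs = sorted(name for name, label in scores.items() if label == "high")
        pair_counts.update(combinations(highs, 2))
    return [pair for pair, count in pair_counts.items() if count >= min_shared]


# Hand-written confidence labels standing in for model output.
scored = [
    {"Front End": "high", "Web Design": "high", "Databases": "low"},
    {"Front End": "high", "Web Design": "high", "Databases": "low"},
    {"Front End": "low", "Web Design": "low", "Databases": "high"},
    {"Front End": "medium", "Web Design": "low", "Databases": "low"},
]

assigned = [pick_category(s) for s in scored]
candidates = merge_candidates(scored)
print(assigned)
print(candidates)
```

In this sketch the candidate pairs would then be passed to the model for a merge decision; that call is omitted here.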

Cost, tokens, and budgets

  • Token accounting uses tiktoken with an encoder chosen for the active model.
  • The estimate and the final cost_usd are derived from token counts and an internal price table. Treat these as helpful approximations.
  • Use max_budget to enforce an upper bound on the estimated cost before any calls are made.
  • Use print_estimate or confirm when running in scripts where you want an explicit checkpoint.
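
To make the arithmetic concrete, here is a minimal sketch of deriving a USD figure from token counts and comparing it to a budget. The model name and the per-token prices are placeholders invented for the example and are not assort's internal price table; estimate_cost_usd and within_budget are hypothetical helpers.

```python
# Placeholder prices in USD per 1,000,000 tokens. These are NOT real prices
# and NOT the library's internal table.
PRICE_PER_MILLION = {"example-model": {"input": 1.00, "output": 4.00}}


def estimate_cost_usd(model, input_tokens, output_tokens, prices=PRICE_PER_MILLION):
    """Convert token counts into an approximate USD cost."""
    rates = prices[model]
    total = input_tokens * rates["input"] + output_tokens * rates["output"]
    return total / 1_000_000


def within_budget(estimate, max_budget):
    """A budget of None means no limit."""
    return max_budget is None or estimate <= max_budget


estimate = estimate_cost_usd("example-model", input_tokens=120_000, output_tokens=8_000)
print(round(estimate, 4), "USD")
print(within_budget(estimate, 0.75))
```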

Advanced examples

Run with a budget and keep original names

from assort import assort, Policy

texts = [...]
results, stats = assort(
    texts,
    min_clusters=4,
    max_clusters=8,
    description="Product feedback notes",
    policy=Policy.fuzzy,
    max_budget=0.75,
    rename_final=False,
)

Inspect stats for simple analytics

results, stats = assort(texts)

sizes = stats["category_sizes"]
by_size = sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)
for name, count in by_size:
    print(name, count)

Behavior notes

  • Nondeterministic sampling is used during corpus selection, so runs can vary.
  • The module keeps a single OpenAI client and encoder in module scope. In-process concurrency is not recommended. Use separate processes for parallel work.
  • The console prompt only appears when confirm=True. Avoid this in non-interactive environments.
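
One way to follow the last note is to enable confirm only when standard input is attached to a terminal. The sketch below uses the standard library's isatty check; interactive_confirm is a hypothetical helper, and passing its result to assort is shown only as a comment.

```python
import io
import sys


def interactive_confirm(stream=None):
    """Return True only when the input stream is attached to a terminal."""
    stream = sys.stdin if stream is None else stream
    # sys.stdin can be None in some embedded or GUI launches.
    return stream is not None and stream.isatty()


# A piped or in-memory stream is not a terminal, so confirm stays off.
piped = interactive_confirm(io.StringIO("piped input"))
print(piped)

# Intended use (requires an API key, so it is left commented out):
# results, stats = assort(texts, confirm=interactive_confirm())
```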
