Using AI to sort lists of unstructured text through iterative batches.
Project description
assort
Text clustering and sorting with an LLM that discovers categories, classifies items, optionally merges overlapping themes, and cleans up labels. Designed for quick drop-in use with a single function call and clear cost tracking.
Highlights
- Discovers category names from your data
- Sorts each item with calibrated confidence per category
- Merges overlapping themes when the model judges a high likelihood of overlap
- Refines the miscellaneous bucket when it is too large
- Optionally renames categories to be clearer and more specific
- Tracks tokens and estimated cost in USD
- Simple one-function API that returns results and rich stats
Install
pip install assort
You also need an OpenAI API key available to the runtime, for example:
export OPENAI_API_KEY=sk_your_key_here
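A script can fail early with a readable message when the key is missing, instead of failing on the first model call. The following is a minimal sketch, assuming the key is read from the OPENAI_API_KEY environment variable as described above. The helper name `require_api_key` is hypothetical and not part of assort.

```python
import os


def require_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key from the environment or raise a clear error.

    Hypothetical helper for illustration; it is not part of assort.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before calling assort()")
    return key
```

Calling `require_api_key()` at the top of a script, before any call to assort, keeps configuration errors separate from API errors.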
Quick start
from assort import assort

texts = [
    "Build a responsive landing page in React",
    "How to index a Postgres table",
    "Cognitive behavioral therapy exercises",
    "Vector search with Azure AI Search",
    "Tailwind utility classes for layouts",
    "Managing anxiety before a big presentation",
]

results, stats = assort(
    texts,
    min_clusters=3,
    max_clusters=6,
    description="Short notes that mix software topics and mental health topics",
)

print(results["sorted_results"])
print(round(stats["cost_usd"], 4), "USD")
Example shape of sorted_results
{
    "Front End Engineering": [
        "Build a responsive landing page in React",
        "Tailwind utility classes for layouts"
    ],
    "Data and Search": [
        "Vector search with Azure AI Search",
        "How to index a Postgres table"
    ],
    "Anxiety and CBT": [
        "Cognitive behavioral therapy exercises",
        "Managing anxiety before a big presentation"
    ],
    "Miscellaneous": []
}
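Because sorted_results maps each category to a list of items, looking up the category of a given item takes one pass to invert the mapping. A small sketch, using a trimmed copy of the example shape above; `invert_sorted_results` is a hypothetical helper, not a function exported by assort.

```python
def invert_sorted_results(sorted_results):
    """Map each item to the category it was placed in."""
    item_to_category = {}
    for category, items in sorted_results.items():
        for item in items:
            item_to_category[item] = category
    return item_to_category


# Trimmed copy of the example shape shown above.
example = {
    "Front End Engineering": [
        "Build a responsive landing page in React",
        "Tailwind utility classes for layouts",
    ],
    "Miscellaneous": [],
}

lookup = invert_sorted_results(example)
print(lookup["Tailwind utility classes for layouts"])  # Front End Engineering
```

If the input list contains duplicate strings, they collapse to a single key in the inverted mapping.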
API
assort
results, stats = assort(
    batch,
    min_clusters=2,
    max_clusters=5,
    policy=None,
    description="",
    print_estimate=False,
    confirm=False,
    max_budget=None,
    model=None,
    rename_final=True,
)
Parameters
- batch: List of strings to categorize. Empty or blank strings are ignored.
- min_clusters and max_clusters: Bounds for initial category discovery.
- policy: Policy.fuzzy or Policy.exhaustive. This affects internal cost estimation. Both modes perform miscellaneous refinement.
- description: Optional corpus context. Helps the model choose better category boundaries and names.
- print_estimate: If true, a cost estimate is computed before any model calls. The estimate is also used when confirm or max_budget is set.
- confirm: If true, the function prompts in the console before running. Useful for scripts.
- max_budget: Float in USD. If the estimate exceeds this amount, the function returns an empty result without calling the model.
- model: Optional model name to override the default. If omitted, a capable multimodal GPT model is used.
- rename_final: If true, the library proposes clearer category names at the end based on samples from each group.
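The parameter notes above say that empty or blank strings are ignored. To know in advance how many items will actually be sorted, the same filter can be applied before the call. This is a sketch under assumptions: `clean_batch` and `check_bounds` are hypothetical helpers, and the bounds rule in `check_bounds` is a guess at a sensible precondition, not documented library behavior.

```python
def clean_batch(batch):
    """Drop empty and whitespace-only strings, mirroring the documented behavior."""
    return [text for text in batch if isinstance(text, str) and text.strip()]


def check_bounds(min_clusters, max_clusters):
    """Raise early on bounds that cannot describe a valid range (assumed rule)."""
    if min_clusters < 1 or max_clusters < min_clusters:
        raise ValueError("expected 1 <= min_clusters <= max_clusters")


raw = ["Reset my password", "", "   ", "Invoice is wrong"]
texts = clean_batch(raw)
check_bounds(2, 5)
print(len(texts))  # 2
```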
Returns
- results: Dict with key sorted_results. Values are lists of the original items per category. A Miscellaneous bucket is always present.
- stats: Dict with detailed run information:
  - model
  - items_total
  - initial_categories_count
  - final_categories_count
  - miscellaneous_count
  - calls, with counts for internal steps
  - retries, for API retries with backoff
  - tokens, with input and output counts
  - combination_attempts and combined_merges
  - elapsed_seconds
  - cost_usd, estimated from token counts and the internal price table
  - category_sizes, mapping category to item count
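A few derived figures make the stats easier to compare between runs. The sketch below assumes that stats["tokens"] is a dict with "input" and "output" keys, which is one reading of the description above; the sample values are invented for illustration and `summarize_stats` is a hypothetical helper.

```python
def summarize_stats(stats):
    """Derive a few ratios from a stats dict (key layout is an assumption)."""
    total = stats["items_total"]
    tokens = stats["tokens"]
    return {
        "misc_share": stats["miscellaneous_count"] / total if total else 0.0,
        "total_tokens": tokens["input"] + tokens["output"],
        "cost_per_item_usd": stats["cost_usd"] / total if total else 0.0,
    }


# Invented sample values, not output from a real run.
sample_stats = {
    "items_total": 8,
    "miscellaneous_count": 2,
    "tokens": {"input": 1200, "output": 300},
    "cost_usd": 0.004,
}

summary = summarize_stats(sample_stats)
print(summary["misc_share"], summary["total_tokens"])
```

A high misc_share can be a hint that the cluster bounds are too tight or that a description would help.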
How it works
- Category discovery: The model reads a sample of your corpus and proposes a set of category names, with the count kept between your bounds.
- Sorting: Each item is scored against every discovered category with a confidence of high, medium, or low. High confidence categories win. Ties are broken by simple rules.
- Combining overlapping themes: When items frequently score high for the same pairs or sets of categories, the library asks the model whether they should be combined. If the model judges the likelihood of overlap to be high, a single concise name is requested and the merge proceeds.
- Refining miscellaneous: If Miscellaneous grows larger than a data-guided threshold, the same discovery and sorting routine runs on that subset. Items are pulled out into new focused categories when possible.
- Renaming for clarity: At the end, the library proposes clearer names that preserve meaning, using a small sample from each category. Names are deduplicated.
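The sorting and combining steps can be pictured with a small sketch. It is not the library's implementation. It assumes that only a high confidence can win, that ties go to the alphabetically first name, that items with no high score fall into Miscellaneous, and that a pair of categories becomes a merge candidate once enough items score high on both. All function names and the `min_shared` threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pick_category(scores, fallback="Miscellaneous"):
    """Choose one category from a {category: confidence} mapping.

    Assumed rule: only "high" wins; ties go to the alphabetically first name.
    """
    winners = sorted(name for name, conf in scores.items() if conf == "high")
    return winners[0] if winners else fallback


def overlap_candidates(all_scores, min_shared=2):
    """Return category pairs that are both "high" for at least min_shared items."""
    pair_counts = Counter()
    for scores in all_scores:
        highs = sorted(name for name, conf in scores.items() if conf == "high")
        pair_counts.update(combinations(highs, 2))
    return [pair for pair, count in pair_counts.items() if count >= min_shared]


# Invented per-item scores for three items.
item_scores = [
    {"Web": "high", "Frontend": "high", "Data": "low"},
    {"Web": "high", "Frontend": "high", "Data": "medium"},
    {"Web": "low", "Frontend": "low", "Data": "high"},
]

print([pick_category(scores) for scores in item_scores])
print(overlap_candidates(item_scores))
```

In this sketch the pair ("Frontend", "Web") is the only merge candidate; the decision to merge and the new name would still come from the model.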
Cost, tokens, and budgets
- Token accounting uses tiktoken with an encoder chosen for the active model.
- The estimate and the final cost_usd are derived from token counts and an internal price table. Treat these as helpful approximations.
- Use max_budget to stop a run before any calls are made when the estimate exceeds your limit.
- Use print_estimate or confirm when running in scripts where you want an explicit checkpoint.
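The arithmetic behind a token-based estimate is simple enough to sketch. The price table below is hypothetical (made-up model name and per-million-token prices) and is not the library's internal table; the budget check follows the rule stated for max_budget, where a run is blocked only when the estimate exceeds the limit.

```python
# Hypothetical prices in USD per one million tokens; not the library's table.
PRICES = {"example-model": {"input": 2.50, "output": 10.00}}


def estimate_cost_usd(model, input_tokens, output_tokens, prices=PRICES):
    """Convert token counts into an approximate USD cost."""
    price = prices[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


def within_budget(estimate_usd, max_budget):
    """A run proceeds when no budget is set or the estimate does not exceed it."""
    return max_budget is None or estimate_usd <= max_budget


estimate = estimate_cost_usd("example-model", 200_000, 20_000)
print(round(estimate, 2), within_budget(estimate, 0.75))  # 0.7 True
```

Since the estimate is computed before any model call, output token counts can only be guessed at that point, which is one reason to treat the figure as an approximation.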
Advanced examples
Run with a budget and keep original names
from assort import assort, Policy

texts = [...]

results, stats = assort(
    texts,
    min_clusters=4,
    max_clusters=8,
    description="Product feedback notes",
    policy=Policy.fuzzy,
    max_budget=0.75,
    rename_final=False,
)
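When max_budget blocks a run, the function returns an empty result, so a caller should check for that before using the output. The exact shape of the empty result is not documented here. The sketch below assumes it is either a falsy value or a dict whose "sorted_results" holds no items; `has_sorted_items` is a hypothetical helper.

```python
def has_sorted_items(results):
    """Return True when at least one category contains an item.

    Assumption: a budget-blocked run yields a falsy `results` or one whose
    "sorted_results" holds no items; the exact empty shape is not documented.
    """
    if not results:
        return False
    sorted_results = results.get("sorted_results") or {}
    return any(items for items in sorted_results.values())


print(has_sorted_items({}))                                 # False
print(has_sorted_items({"sorted_results": {"A": ["x"]}}))  # True
```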
Inspect stats for simple analytics
results, stats = assort(texts)

sizes = stats["category_sizes"]
by_size = sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)
for name, count in by_size:
    print(name, count)
Behavior notes
- Non-deterministic sampling is used during corpus selection, so runs can vary.
- The module keeps a single OpenAI client and encoder in module scope. In-process concurrency is not recommended. Use separate processes for parallel work.
- The console prompt only appears when confirm=True. Avoid this in non-interactive environments.
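One way to keep confirm out of non-interactive runs is to enable it only when standard input is a terminal. A minimal sketch using the standard library; `should_confirm` is a hypothetical helper, not part of assort.

```python
import sys


def should_confirm(stream=None):
    """Return True only when the stream is an interactive terminal."""
    stream = sys.stdin if stream is None else stream
    isatty = getattr(stream, "isatty", None)
    return bool(isatty and isatty())
```

The result can then be passed as `confirm=should_confirm()`, so the same script prompts at a terminal and runs unattended in a scheduler or CI job.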
File details
Details for the file assort-1.0.0.tar.gz.
File metadata
- Download URL: assort-1.0.0.tar.gz
- Upload date:
- Size: 10.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2763588dcb44fb64b02e96ca3b8ac763b8fda54358f0736464721dfb15b31e27 |
| MD5 | e79975e5c4cfd443ebb5c005d2e8f07b |
| BLAKE2b-256 | 7995a86cbf4bda375dff46289ab7c63c534f99850774c2105f6adfb37d4d546d |
File details
Details for the file assort-1.0.0-py3-none-any.whl.
File metadata
- Download URL: assort-1.0.0-py3-none-any.whl
- Upload date:
- Size: 8.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f4ea07a4cb496c47769128c5f297505c99d503550dbc7c60e3b6b727505fb2b0 |
| MD5 | 4152f59fc4e82934f9ed527342a30f99 |
| BLAKE2b-256 | a972bcec10310a9be2d67857380ec80effe8b8fd4d9314e34e44556e700de375 |