Using AI to sort lists of unstructured text through iterative batches.

These details have not been verified by PyPI

Project description

assort

Text clustering and sorting with an LLM: it discovers categories, classifies items, optionally merges overlapping themes, and cleans up labels. Designed for drop-in use through a single function call, with clear cost tracking.

Highlights

Discovers category names from your data
Sorts each item with calibrated confidence per category
Merges overlapping themes when the model judges a high likelihood of overlap
Refines the miscellaneous bucket when it is too large
Optionally renames categories to be clearer and more specific
Tracks tokens and final cost in USD
Simple one-function API that returns results and rich stats

Install

pip install assort

You also need an OpenAI API key available to the runtime, for example:

export OPENAI_API_KEY=sk_your_key_here
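
A minimal sketch for failing fast when the key is missing, assuming the library reads OPENAI_API_KEY from the environment as described above. The helper name require_api_key is hypothetical and is not part of assort.

```python
import os


def require_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key from the environment, or raise a clear error.

    Hypothetical helper for illustration; not part of assort.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before calling assort()")
    return key
```

Calling require_api_key() once at startup surfaces a missing key before any items are sent for sorting.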

Quick start

from assort import assort

texts = [
    "Build a responsive landing page in React",
    "How to index a Postgres table",
    "Cognitive behavioral therapy exercises",
    "Vector search with Azure AI Search",
    "Tailwind utility classes for layouts",
    "Managing anxiety before a big presentation",
]

results, stats = assort(
    texts,
    min_clusters=3,
    max_clusters=6,
    description="Short notes that mix software topics and mental health topics",
)

print(results["sorted_results"])

Example shape of sorted_results

{
    "Front End Engineering": [
        "Build a responsive landing page in React",
        "Tailwind utility classes for layouts"
    ],
    "Data and Search": [
        "Vector search with Azure AI Search",
        "How to index a Postgres table"
    ],
    "Anxiety and CBT": [
        "Cognitive behavioral therapy exercises",
        "Managing anxiety before a big presentation"
    ],
    "Miscellaneous": []
}
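
Because sorted_results maps each category to a list of items, it is sometimes handy to invert it into an item-to-category lookup. The sketch below uses the example shape above; item_to_category is a hypothetical helper name, not part of assort.

```python
def item_to_category(sorted_results):
    """Invert {category: [items]} into {item: category}.

    Hypothetical helper. If the same text appears more than once,
    the last category seen wins.
    """
    lookup = {}
    for category, items in sorted_results.items():
        for item in items:
            lookup[item] = category
    return lookup


# Same shape as the example above.
sorted_results = {
    "Front End Engineering": [
        "Build a responsive landing page in React",
        "Tailwind utility classes for layouts",
    ],
    "Data and Search": [
        "Vector search with Azure AI Search",
        "How to index a Postgres table",
    ],
    "Anxiety and CBT": [
        "Cognitive behavioral therapy exercises",
        "Managing anxiety before a big presentation",
    ],
    "Miscellaneous": [],
}

lookup = item_to_category(sorted_results)
print(lookup["How to index a Postgres table"])  # Data and Search
```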

API

assort

results, stats = assort(
    batch,
    min_clusters=2,
    max_clusters=5,
    description="",
    model=None,
    rename_final=True,
    show_progress=False,
)

Parameters

batch: List of strings to categorize. Empty or blank strings are ignored.
min_clusters and max_clusters: Lower and upper bounds on the number of categories proposed during initial discovery.
description: Optional context about the corpus. Helps the model choose better category boundaries and names.
model: Optional model name. If omitted, the library falls back to its default, a capable multimodal GPT model.
rename_final: If true, the library proposes clearer category names at the end, based on samples from each group.
show_progress: If true, displays progress while items are sorted.
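
Since empty or blank strings are ignored, you may want to know up front how many items will remain. The sketch below shows an equivalent pre-filter under the assumption that "blank" means whitespace only; drop_blank is a hypothetical helper, and the library's own filtering may differ in detail.

```python
def drop_blank(batch):
    """Keep only strings that contain at least one non-whitespace character.

    Hypothetical helper for illustration; not part of assort.
    """
    return [text for text in batch if isinstance(text, str) and text.strip()]


raw = ["Fix login bug", "", "   ", "Add dark mode\n"]
clean = drop_blank(raw)
print(len(raw), len(clean))  # 4 2
```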

Returns

results: Dict with the key sorted_results, which maps each category name to the list of original items assigned to it. A Miscellaneous bucket is always present.
stats: Dict with detailed run information:
- model
- items_total
- initial_categories_count
- final_categories_count
- miscellaneous_count
- calls with counts for internal steps
- retries for API retries with backoff
- tokens with input and output counts
- combination_attempts and combined_merges
- elapsed_seconds
- cost_usd calculated from token counts and the internal price table
- category_sizes mapping category to item count
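
A few derived figures can be read straight off the stats dict. The sketch below uses only keys listed above; the summarize helper and the numbers in example_stats are made up for illustration and are not output from a real run.

```python
def summarize(stats):
    """Derive simple ratios from a stats dict (hypothetical helper)."""
    total = stats["items_total"]
    return {
        # Share of items that ended up in the Miscellaneous bucket.
        "misc_share": stats["miscellaneous_count"] / total if total else 0.0,
        # Average spend per input item.
        "cost_per_item_usd": stats["cost_usd"] / total if total else 0.0,
        # Negative when merges removed more categories than refinement added.
        "net_category_change": (
            stats["final_categories_count"] - stats["initial_categories_count"]
        ),
    }


# Illustrative values only.
example_stats = {
    "items_total": 200,
    "miscellaneous_count": 10,
    "cost_usd": 0.04,
    "initial_categories_count": 6,
    "final_categories_count": 5,
}
summary = summarize(example_stats)
print(summary["misc_share"])  # 0.05
```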

How it works

Category discovery: The model reads a sample of your corpus and proposes a set of category names, with the count kept between your bounds.
Sorting: Each item is scored against every discovered category with a confidence of high, medium, or low. Categories scored high win; ties are broken by simple rules.
Combining overlapping themes: When items frequently score high for the same pairs or sets of categories, the library asks the model whether those categories should be combined. If the model judges the likelihood of overlap to be high, the library requests a single concise name and performs the merge.
Refining miscellaneous: If Miscellaneous grows larger than a data-guided threshold, the same discovery and sorting routine runs on that subset, and items are moved into new, more focused categories where possible.
Renaming for clarity: At the end, the library uses a small sample from each category to propose clearer names that preserve meaning. Names are deduplicated.
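
The sorting and combining steps can be illustrated with a small sketch. This is not the library's implementation: the tie-break (alphabetical order), the fallback to Miscellaneous when nothing scores high, the min_count threshold, and both function names are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def pick_category(scores, fallback="Miscellaneous"):
    """Pick a winner from {category: "high" | "medium" | "low"}.

    Assumed rules: only categories scored "high" can win, ties go to the
    alphabetically first name, and items with no "high" go to the fallback.
    """
    winners = sorted(name for name, label in scores.items() if label == "high")
    return winners[0] if winners else fallback


def overlapping_pairs(all_scores, min_count=2):
    """Return category pairs scored "high" together on at least min_count items."""
    pairs = Counter()
    for scores in all_scores:
        high = sorted(name for name, label in scores.items() if label == "high")
        pairs.update(combinations(high, 2))
    return [pair for pair, count in pairs.items() if count >= min_count]


# Made-up per-item scores for three items.
all_scores = [
    {"Front End": "high", "Web Design": "high", "Databases": "low"},
    {"Front End": "high", "Web Design": "high", "Databases": "low"},
    {"Front End": "low", "Web Design": "low", "Databases": "high"},
]
print(pick_category(all_scores[2]))   # Databases
print(overlapping_pairs(all_scores))  # [('Front End', 'Web Design')]
```

In this sketch, the pairs returned by overlapping_pairs are the candidates that would be sent to the model for a merge decision.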

Cost and tokens

Token accounting uses tiktoken with an encoder chosen for the active model.
The final cost_usd is calculated from token counts and an internal price table.
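
The cost calculation amounts to multiplying token counts by per-token rates. A minimal sketch follows, assuming prices are quoted per million tokens; the model name, the prices, and the function signature are hypothetical and do not reflect the library's internal price table.

```python
# Hypothetical prices in USD per 1 million tokens; not real price data.
PRICES_PER_MILLION = {
    "example-model": {"input": 1.00, "output": 4.00},
}


def cost_usd(model, input_tokens, output_tokens, prices=PRICES_PER_MILLION):
    """Compute a run's cost from token counts and a price table."""
    rate = prices[model]
    total = input_tokens * rate["input"] + output_tokens * rate["output"]
    return total / 1_000_000


# 50,000 input tokens and 10,000 output tokens at the made-up rates above.
print(round(cost_usd("example-model", 50_000, 10_000), 4))  # 0.09
```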

Advanced examples

Run and keep original names

from assort import assort

texts = [...]  # replace with your own list of strings
results, stats = assort(
    texts,
    min_clusters=4,
    max_clusters=8,
    description="Product feedback notes",
    rename_final=False,
)

Inspect stats for simple analytics

results, stats = assort(texts)

sizes = stats["category_sizes"]
by_size = sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)
for name, count in by_size:
    print(name, count)

Behavior notes

Sampling during corpus selection is non-deterministic, so results can vary between runs.
The module keeps a single OpenAI client and encoder in module scope. In-process concurrency is not recommended. Use separate processes for parallel work.
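
One way to follow the separate-processes advice is to split the input into chunks and sort each chunk in its own worker process. The sketch below is an assumption-laden outline, not a recommended recipe: chunk, sort_chunk, and run_in_processes are hypothetical names, and because each chunk discovers its categories independently, category names may not line up across chunks.

```python
from concurrent.futures import ProcessPoolExecutor


def chunk(items, n_chunks):
    """Split items into at most n_chunks contiguous, near-equal slices."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        if end > start:  # skip empty slices when there are few items
            chunks.append(items[start:end])
        start = end
    return chunks


def sort_chunk(texts):
    """Worker: import assort inside the function so the import runs in the worker."""
    from assort import assort

    results, stats = assort(texts)
    return results["sorted_results"], stats["cost_usd"]


def run_in_processes(texts, workers=2):
    """Sort each chunk in a separate process; returns one result per chunk."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(sort_chunk, chunk(texts, workers)))
```

Call run_in_processes from under an `if __name__ == "__main__":` guard so that worker processes can be started safely on platforms that spawn a fresh interpreter.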

Project details

Development Status
- 2 - Pre-Alpha
License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Release history

- 2.0.0 (this version): Jul 22, 2026
- 1.0.0: Aug 25, 2025
- 0.2.1: Jul 15, 2025
- 0.1.0: Jul 12, 2025
- 0.0.1: Jul 12, 2025

Download files

Source Distribution

assort-2.0.0.tar.gz (9.6 kB view details)

Uploaded Jul 22, 2026 Source

Built Distribution

assort-2.0.0-py3-none-any.whl (7.8 kB view details)

Uploaded Jul 22, 2026 Python 3

File details

Details for the file assort-2.0.0.tar.gz.

File metadata

Download URL: assort-2.0.0.tar.gz
Upload date: Jul 22, 2026
Size: 9.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.0

File hashes

Hashes for assort-2.0.0.tar.gz
Algorithm	Hash digest
SHA256	`ddd73a2611a1a29b90b9a05ec23559f386a8e4633edc9655a40203d4935783bf`
MD5	`f69660ed1e5789771baa36f42f1825f2`
BLAKE2b-256	`34a8cd124c66788dbc4b5836cd62ff88701d509dd87d0f3e11dad8f1b5a37bb7`

File details

Details for the file assort-2.0.0-py3-none-any.whl.

File metadata

Download URL: assort-2.0.0-py3-none-any.whl
Upload date: Jul 22, 2026
Size: 7.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.0

File hashes

Hashes for assort-2.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`4c2f771acbf6c02e3c392426a7bd39d9f68bc5bf0970f9239760a211da99925d`
MD5	`02f794b158a08b629bce97233f0735a3`
BLAKE2b-256	`23a73810c3d394d8b0007ce27141d1e4ec74dcf3e76afc09c38b75b41d591575`
