LLM-based text categorization tool
Categorizer
Categorizer is a simple Python package for categorizing string records into predefined, nested categories using LLMs.
Features
- Hierarchical Categories
  Define multi-level category trees in categories.yaml. Parent/child relationships are enforced bidirectionally.
- LLM-Driven Categorization
  Use LLMs to semantically categorize any string into predefined categories, one level at a time.
- Meta-pattern based Categorization (optional)
  Some groups of records follow a fixed pattern and can be categorized with regular expressions. The package lets you define such regex-based "classification_patterns" in bank_patterns.yaml (or your own file) for instant, rule-based tagging.
- Keyword Trigger based Categorization (optional)
  Naive but fast auto-trigger on keywords defined per category ("auto_trigger_keyword").
- Flexible Record I/O
  Load records from a Pandas DataFrame, Python list, or single string. Outputs a DataFrame with selected categories, rationale, and more.
- Prompt Templating & Pipelines
  Customize prompt order and post-processing via pipeline stages (e.g., "SemanticIsolation").
- Extensible & Async-Ready
  Easily extend MyLLMService for new operations. Supports async translation and classification endpoints.
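The level-by-level idea can be pictured with a short sketch. This is an illustration only, not the package's implementation: `TREE`, `categorize_by_level`, and `fake_llm` are hypothetical names, and the LLM call is replaced by a simple substring check so the snippet runs offline.

```python
from typing import Callable, Dict, List

# Hypothetical in-memory tree: category name -> nested subcategories.
TREE: Dict[str, dict] = {
    "Finance": {"Revenue": {}, "Expense": {}},
    "Travel": {"Flights": {}, "Hotels": {}},
}


def categorize_by_level(
    text: str,
    tree: Dict[str, dict],
    choose: Callable[[str, List[str]], str],
) -> List[str]:
    """Descend the tree one level at a time, asking `choose` at each level."""
    path: List[str] = []
    node = tree
    while node:
        options = sorted(node)
        picked = choose(text, options)
        if picked not in node:
            # Reject answers that are not children of the current parent.
            raise ValueError(f"{picked!r} is not one of {options}")
        path.append(picked)
        node = node[picked]
    return path


def fake_llm(text: str, options: List[str]) -> str:
    """Stand-in for an LLM call (assumption): pick the first option named in the text."""
    lowered = text.lower()
    for option in options:
        if option.lower() in lowered:
            return option
    return options[0]


result = categorize_by_level("Finance: expense for office chairs", TREE, fake_llm)
print(result)
```

Restricting each step to the children of the category chosen one level up keeps the candidate list short and makes it impossible to return a child that does not belong to its parent.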
Installation
pip install categorizer
Or clone & install locally:
git clone https://github.com/karaposu/categorizer.git
cd categorizer
pip install .
Usage
Initial Configuration
- **Define your Categories**
  Edit categorizer/categories.yaml to define your category tree. The keyword_identifier field is used for the keyword trigger.

    - Finance:
        helpers:
          keyword_identifier: ["invoice", "payment"]
          text_rules_for_llm: []
        description: "All finance‑related records"
        subcategories:
          - Revenue:
              helpers:
                keyword_identifier: ["sale", "subscription"]
                text_rules_for_llm: []
              description: ""
          - Expense:
              helpers:
                keyword_identifier: ["purchase", "refund"]
                text_rules_for_llm: []
              description: ""

- (Optional) Define your Meta-Patterns
  Edit categorizer/bank_patterns.yaml under meta_patterns.<owner>.classification_patterns to add regex rules for pattern clusters you already know exist in your dataset.

    meta_patterns:
      default:
        classification_patterns:
          - pattern: "(?i)refund"
            lvl1: Expense
            lvl2: Refund

- Load your records and run the categorization
import pandas as pd
from categorizer.record_manager import RecordManager

# Sample DataFrame
df = pd.DataFrame([
    {"text": "Dinner at Gray House café", "record_id": 1},
    {"text": "Electricity bill from VVC", "record_id": 2},
])

# Initialize manager
rm = RecordManager(debug=True)

# Load & categorize
rm.load_records(df, categories_yaml_path="categorizer/categories.yaml")
result_df = rm.categorize_records()
print(result_df)
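As a rough picture of how the optional meta-pattern rules from the configuration steps behave, the sketch below applies the sample `(?i)refund` rule with Python's `re` module. It assumes the YAML has already been loaded into a plain dict with the same shape; `META_PATTERNS` and `match_meta_pattern` are hypothetical names and are not part of the categorizer API.

```python
import re
from typing import Optional, Tuple

# Assumption: the meta-pattern YAML parsed into a dict of the same shape.
META_PATTERNS = {
    "default": {
        "classification_patterns": [
            {"pattern": "(?i)refund", "lvl1": "Expense", "lvl2": "Refund"},
        ]
    }
}


def match_meta_pattern(
    text: str,
    owner: str = "default",
    patterns: dict = META_PATTERNS,
) -> Optional[Tuple[str, str]]:
    """Return (lvl1, lvl2) of the first rule whose regex matches, else None."""
    rules = patterns.get(owner, {}).get("classification_patterns", [])
    for rule in rules:
        if re.search(rule["pattern"], text):
            return rule["lvl1"], rule["lvl2"]
    return None


hit = match_meta_pattern("REFUND for order 42")
miss = match_meta_pattern("Dinner at Gray House café")
print(hit, miss)
```

Rules are tried in file order and the first match wins, so more specific patterns would need to be listed before broader ones under this sketch.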
Quick Start to Internals
1. CategorizationEngine (Standalone)
from categorizer.categorization_engine import CategorizationEngine
from categorizer.record import Record

# Initialize engine
engine = CategorizationEngine(subcategory_level=2, debug=True)

# Create a Record
rec = Record.from_string(
    text="Subscription payment to Netflix",
    record_id=123,
    categories="categorizer/categories.yaml"
)

# Run regex & keyword first, then LLM fallback
engine.categorize_record(rec, use_metapattern=True, use_keyword=True)

print("Level 1:", rec.lvl1.name)
print("Level 2:", rec.lvl2.name)
print("By:", rec.categorized_by)
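The "rule-based stages first, then LLM fallback" ordering can be sketched as a simple cascade that records which stage produced the answer. Everything here is an assumption made for illustration: `keyword_stage`, `llm_stage`, and `run_cascade` are hypothetical names, the keyword table is hard-coded, and the LLM stage is a placeholder that makes no network call.

```python
from typing import Callable, List, Optional, Tuple

Path = Tuple[str, ...]
Stage = Tuple[str, Callable[[str], Optional[Path]]]

# Hypothetical keyword table, keyed by category path.
KEYWORDS = {
    ("Finance", "Revenue"): ["sale", "subscription"],
    ("Finance", "Expense"): ["purchase", "refund"],
}


def keyword_stage(text: str) -> Optional[Path]:
    """Naive keyword trigger: first category with a keyword found in the text."""
    lowered = text.lower()
    for path, words in KEYWORDS.items():
        if any(word in lowered for word in words):
            return path
    return None


def llm_stage(text: str) -> Optional[Path]:
    """Placeholder for an LLM call (assumption): always returns a fallback label."""
    return ("Uncategorized",)


def run_cascade(text: str, stages: List[Stage]) -> Tuple[Optional[Path], Optional[str]]:
    """Try each stage in order; return the first result and the stage name."""
    for name, stage in stages:
        path = stage(text)
        if path is not None:
            return path, name
    return None, None


stages = [("keyword", keyword_stage), ("llm", llm_stage)]
path, by = run_cascade("Subscription payment to Netflix", stages)
print(path, by)
```

A regex stage would slot in ahead of the keyword stage in the same list. Putting the cheap, deterministic stages first means the LLM is only consulted for records the rules cannot resolve.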
Download files
File details
Details for the file categorizer-0.0.2.tar.gz.
File metadata
- Download URL: categorizer-0.0.2.tar.gz
- Upload date:
- Size: 18.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.9.22
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 446c1873eb03c2dc2747815767af88ffbeda918f7986c1b93cab56f507aef9b1 |
| MD5 | 155aba970059da33e2ca0bfad6c8e429 |
| BLAKE2b-256 | 77325bb18d5d4697ba0ecaa577c283564c196ad5f92b1b9d1d28e7b4e6a70798 |
File details
Details for the file categorizer-0.0.2-py3-none-any.whl.
File metadata
- Download URL: categorizer-0.0.2-py3-none-any.whl
- Upload date:
- Size: 19.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.9.22
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 853a82dc210c25c418f8ed1de2eeaccf51c0a0a9c7dad4cc72a35feac7d67159 |
| MD5 | ff8e8ce87fee7689eb2b944b1162337a |
| BLAKE2b-256 | 3542cd70263a5126ffdac656c26824d21127b83496a49668cee1b784b10d638f |