LLM-based text categorization tool

Categorizer

Categorizer is a simple Python package that categorizes string records into predefined, nested categories using LLMs.


Features

  • Hierarchical Categories
    Define multi‑level category trees in categories.yaml. Parent/child relationships are enforced bidirectionally.

  • LLM‑Driven Categorization
    Use the power of LLMs to semantically categorize any string into predefined categories in a level‑by‑level fashion.

  • Meta‑pattern based Categorization (optional)
    When categorizing records, it is common to encounter groups that follow a specific pattern and can therefore be categorized with regular expressions. This package lets you set up such regex‑based “classification_patterns” in bank_patterns.yaml (or your own file) for instant, rule‑based tagging.

  • Keyword Trigger based Categorization (optional)
    Naïve but fast auto‑trigger on keywords defined per category (“auto_trigger_keyword”).

  • Flexible Record I/O
    Load records from a Pandas DataFrame, Python list, or single string. Outputs a DataFrame with selected categories, rationale, and more.

  • Prompt Templating & Pipelines
    Customize prompt order and post‑processing via pipeline stages (e.g., “SemanticIsolation”).

  • Extensible & Async‑Ready
    Easily extend MyLLMService for new operations. Supports async translation & classification endpoints.
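
As an illustration of the level‑by‑level idea described above, here is a minimal sketch that does not use the package itself. The tree contents, the `categorize_by_level` function, and the `stub_choose` stand‑in for an LLM call are all hypothetical names chosen for this example; the real package may organize this differently.

```python
from typing import Callable, Dict, List

# Hypothetical in-memory category tree: each key is a category name and
# each value is a dict of its subcategories (an empty dict marks a leaf).
CATEGORY_TREE: Dict[str, dict] = {
    "Finance": {"Revenue": {}, "Expense": {}},
    "Utilities": {"Electricity": {}, "Water": {}},
}


def categorize_by_level(
    text: str,
    tree: Dict[str, dict],
    choose: Callable[[str, List[str]], str],
) -> List[str]:
    """Walk the tree one level at a time, asking `choose` to pick a child."""
    path: List[str] = []
    node = tree
    while node:  # an empty dict means we reached a leaf
        options = list(node)
        picked = choose(text, options)
        if picked not in node:
            # Only children of the current parent are valid answers.
            raise ValueError(f"{picked!r} is not one of {options}")
        path.append(picked)
        node = node[picked]
    return path


def stub_choose(text: str, options: List[str]) -> str:
    """Stand-in for an LLM call: pick the first option named in the text."""
    lowered = text.lower()
    for option in options:
        if option.lower() in lowered:
            return option
    return options[0]


path = categorize_by_level(
    "Utilities: electricity bill from VVC", CATEGORY_TREE, stub_choose
)
print(path)  # ['Utilities', 'Electricity']
```

Restricting each step to the children of the previously chosen category keeps the option list short and makes an invalid parent/child combination impossible by construction.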


Installation

pip install categorizer

Or clone & install locally:

git clone https://github.com/karaposu/categorizer.git
cd categorizer
pip install .

Usage

Initial Configuration

  1. Define your Categories
    Edit categorizer/categories.yaml to define your category tree. The keyword_identifier field lists the keywords used by the keyword trigger.

    - Finance:
        helpers:
          keyword_identifier: ["invoice", "payment"]
          text_rules_for_llm: []
          description: "All finance‑related records"
        subcategories:
          - Revenue:
              helpers:
                keyword_identifier: ["sale", "subscription"]
                text_rules_for_llm: []
                description: ""
          - Expense:
              helpers:
                keyword_identifier: ["purchase", "refund"]
                text_rules_for_llm: []
                description: ""
    
  2. (Optional) Define your Meta‑Patterns
    Edit categorizer/bank_patterns.yaml under meta_patterns.<owner>.classification_patterns to add regex rules for already‑known pattern clusters in your dataset.

    meta_patterns:
      default:
        classification_patterns:
          - pattern: "(?i)refund"
            lvl1: Expense
            lvl2: Refund
    
  3. Load your records and run the categorization

import pandas as pd
from categorizer.record_manager import RecordManager

# Sample DataFrame
df = pd.DataFrame([
    {"text": "Dinner at Gray House café", "record_id": 1},
    {"text": "Electricity bill from VVC",   "record_id": 2},
])

# Initialize manager
rm = RecordManager(debug=True)

# Load & categorize
rm.load_records(df, categories_yaml_path="categorizer/categories.yaml")
result_df = rm.categorize_records()

print(result_df)
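
To make the two optional rule‑based stages from steps 1 and 2 more concrete, the following sketch mirrors the YAML examples as plain Python data and applies them with the standard `re` module. It is not the package's implementation: `CLASSIFICATION_PATTERNS`, `KEYWORDS`, `match_meta_pattern`, and `match_keyword` are hypothetical names, and first‑match‑wins ordering is an assumption.

```python
import re
from typing import Dict, List, Optional

# Mirrors the bank_patterns.yaml example as plain Python data.
CLASSIFICATION_PATTERNS: List[Dict[str, str]] = [
    {"pattern": "(?i)refund", "lvl1": "Expense", "lvl2": "Refund"},
]

# Mirrors the keyword_identifier lists of the two subcategories
# in the categories.yaml example.
KEYWORDS: Dict[str, List[str]] = {
    "Revenue": ["sale", "subscription"],
    "Expense": ["purchase", "refund"],
}


def match_meta_pattern(text: str) -> Optional[Dict[str, str]]:
    """Return the levels of the first regex rule that matches, else None."""
    for rule in CLASSIFICATION_PATTERNS:
        if re.search(rule["pattern"], text):
            return {"lvl1": rule["lvl1"], "lvl2": rule["lvl2"]}
    return None


def match_keyword(text: str) -> Optional[str]:
    """Return the first category with a keyword contained in the text."""
    lowered = text.lower()
    for category, keywords in KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return category
    return None


print(match_meta_pattern("REFUND for order 42"))
# {'lvl1': 'Expense', 'lvl2': 'Refund'}
print(match_keyword("Subscription payment to Netflix"))  # Revenue
print(match_keyword("Dinner at Gray House café"))        # None
```

Plain substring matching is deliberately naive: a keyword such as "sale" would also fire on "wholesale", which is why regex rules with explicit boundaries can be the safer choice for known patterns.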

Quick Start to Internals

1. CategorizationEngine (Standalone)

from categorizer.categorization_engine import CategorizationEngine
from categorizer.record import Record

# Initialize engine
engine = CategorizationEngine(subcategory_level=2, debug=True)

# Create a Record
rec = Record.from_string(
    text="Subscription payment to Netflix",
    record_id=123,
    categories="categorizer/categories.yaml"
)

# Run regex & keyword first, then LLM fallback
engine.categorize_record(rec, use_metapattern=True, use_keyword=True)

print("Level 1:", rec.lvl1.name)
print("Level 2:", rec.lvl2.name)
print("By:", rec.categorized_by)
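
The "regex and keyword first, then LLM fallback" ordering in the comment above can be sketched as a simple cascade. This is a standalone illustration under stated assumptions, not the engine's actual code: `categorize_with_fallback`, the stage functions, `fake_llm`, and the labels "metapattern", "keyword", and "llm" are hypothetical, and the real values of `categorized_by` may differ.

```python
import re
from typing import Callable, List, Optional, Tuple

# A stage is a (label, function) pair; the function returns a category
# name or None when it has no opinion about the text.
Stage = Tuple[str, Callable[[str], Optional[str]]]


def categorize_with_fallback(
    text: str,
    stages: List[Stage],
    fallback: Callable[[str], str],
) -> Tuple[str, str]:
    """Try cheap stages in order; call the fallback only if all return None."""
    for label, stage in stages:
        category = stage(text)
        if category is not None:
            return category, label
    return fallback(text), "llm"


def regex_stage(text: str) -> Optional[str]:
    # Hypothetical rule: anything mentioning a refund is an Expense.
    return "Expense" if re.search(r"(?i)refund", text) else None


def keyword_stage(text: str) -> Optional[str]:
    # Hypothetical rule: the keyword "subscription" triggers Revenue.
    return "Revenue" if "subscription" in text.lower() else None


def fake_llm(text: str) -> str:
    # Stand-in for a real LLM call, which is out of scope for this sketch.
    return "Uncategorized"


STAGES: List[Stage] = [("metapattern", regex_stage), ("keyword", keyword_stage)]

print(categorize_with_fallback("Subscription payment to Netflix", STAGES, fake_llm))
# ('Revenue', 'keyword')
print(categorize_with_fallback("Refund of subscription fee", STAGES, fake_llm))
# ('Expense', 'metapattern')
print(categorize_with_fallback("Dinner at Gray House café", STAGES, fake_llm))
# ('Uncategorized', 'llm')
```

Running the cheap deterministic stages first means the LLM is only consulted for records that no rule recognizes, and returning the stage label alongside the category records which mechanism made the decision.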
