LLM-based text categorization tool

Project description

Categorizer

Categorizer is a simple tool for sorting string records into predefined, nested categories using LLMs.

  • Load your categories and subcategories (read them from a YAML file or create them on the fly).
  • Initialize the LEC (LLMEnhancedClassifier). You can choose between different modes and supply your own task-specific prompts.
  • Run the LEC. It outputs a dataframe with the selected categories and subcategories, along with the LLM's reasoning for each choice.
  • You can attach a note to each category to help the LLM categorize more accurately.
  • You can also use the included naive classification method, which supports regex-based or keyword matching, to reduce the number of LLM calls.
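The naive pre-classification step in the last bullet can be sketched as follows. This is a minimal illustration, not the package's implementation: the `RULES` layout and the `naive_classify` function are hypothetical names chosen for the example, and the real pattern file format may differ. Records that no rule matches are the ones that would still need an LLM call.

```python
import re
from typing import Optional

# Hypothetical rule table: category name -> keywords and regex patterns.
# The real tool reads its patterns from a YAML file; this layout is assumed.
RULES = {
    "Groceries": {"keywords": ["supermarket", "grocery"], "patterns": []},
    "Transport": {"keywords": ["taxi"], "patterns": [r"\bbus\s+ticket\b"]},
}

def naive_classify(text: str, rules: dict = RULES) -> Optional[str]:
    """Return the first category whose keyword or regex matches, else None."""
    lowered = text.lower()
    for category, rule in rules.items():
        # Cheap substring check on the lowercased record first.
        if any(keyword in lowered for keyword in rule["keywords"]):
            return category
        # Fall back to case-insensitive regex patterns.
        if any(re.search(p, text, flags=re.IGNORECASE) for p in rule["patterns"]):
            return category
    return None

records = ["City Supermarket 42", "Bus  ticket zone A", "Unknown merchant"]
# Only records the rules cannot place would be forwarded to the LLM.
needs_llm = [r for r in records if naive_classify(r) is None]
```

Checking keywords before regexes keeps the common case cheap; the order of entries in the rule table decides which category wins when several match.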

Here are some benchmark results to give a better sense of performance.

category depth category combination size allowed retry

Number of Records | Main Model | Refiner Model | Categorization Mode | Batch Prompting | Accuracy | Total Time | Avg Token | CPU Type | GPU Type
1000 | Model A | Refiner X | Mode 1 | Yes | 92.5% | 10 mins | 512 | Intel Xeon E5-2670 | NVIDIA Tesla K80
2000 | Model B | Refiner Y | Mode 2 | No | 89.0% | 20 mins | 1024 | Intel Xeon E5-2680 | NVIDIA Tesla V100
5000 | Model C | Refiner Z | Mode 3 | Yes | 94.7% | 50 mins | 768 | AMD EPYC 7742 | NVIDIA A100
10000 | Model D | Refiner W | Mode 4 | No | 88.3% | 1 hr 40 mins | 2048 | Intel Xeon E5-2690 | NVIDIA RTX 3090

Usage

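# LLMEnhancedClassifier must be imported from the package first.
# llm_model, llm_refiner, and df (the records to classify) must be defined beforehand.
lec = LLMEnhancedClassifier(
    llm_model=llm_model,
    llm_refiner_model=llm_refiner,
    categories_yaml_path='categories.yaml',
    meta_patterns_yaml_path='bank_patterns.yaml',
    subcategory_level=2,  # Change this value to set the number of subcategories (max 4)
)

# Load the records, then classify them one category level at a time.
lec.load_records(df)
df = lec.classify_lvl_by_lvl()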
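
To illustrate what classifying "level by level" over nested categories could look like, here is a minimal, self-contained sketch. It is an assumption about the general idea, not the package's code: the `CATEGORIES` tree, the `note` field layout, `classify_level_by_level`, and `name_chooser` are all hypothetical, and `name_chooser` stands in for the LLM call that would normally pick a category at each level.

```python
from typing import Callable, Dict, List

# Hypothetical nested category tree created on the fly. The "note" field mirrors
# the per-category notes described above; the real YAML schema may differ.
CATEGORIES: Dict[str, dict] = {
    "Food": {
        "note": "Anything edible",
        "children": {
            "Groceries": {"note": "Supermarkets", "children": {}},
            "Restaurants": {"note": "Eating out", "children": {}},
        },
    },
    "Transport": {"note": "Moving around", "children": {}},
}

def classify_level_by_level(
    text: str,
    tree: Dict[str, dict],
    choose: Callable[[str, Dict[str, dict]], str],
    max_depth: int = 2,
) -> List[str]:
    """Descend the tree one level at a time, asking `choose` at each level."""
    path: List[str] = []
    level = tree
    while level and len(path) < max_depth:
        picked = choose(text, level)
        if picked not in level:  # chooser gave no valid answer: stop early
            break
        path.append(picked)
        level = level[picked]["children"]
    return path

def name_chooser(text: str, options: Dict[str, dict]) -> str:
    """Stand-in for an LLM call: pick the first option whose name occurs in the text."""
    lowered = text.lower()
    for name in options:
        if name.lower() in lowered:
            return name
    return ""  # no match; the caller treats this as "stop here"

path = classify_level_by_level("food groceries at market", CATEGORIES, name_chooser)
```

Restricting each step to the children of the previously chosen category keeps every prompt small, which is one plausible reason to classify by level rather than over all category combinations at once.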

Download files

Source Distribution

dataimputer-0.0.2.tar.gz (2.3 kB)

Uploaded Source

Built Distribution

dataimputer-0.0.2-py3-none-any.whl (2.1 kB)

Uploaded Python 3

File details

Details for the file dataimputer-0.0.2.tar.gz.

File metadata

  • Download URL: dataimputer-0.0.2.tar.gz
  • Upload date:
  • Size: 2.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.9.20

File hashes

Hashes for dataimputer-0.0.2.tar.gz
Algorithm Hash digest
SHA256 3a379e4c90137b06b8df85ed0a0f208f347fd774e7b5831e986190b5254c1b71
MD5 036084f73db8eef54733ee2c188ca71d
BLAKE2b-256 5f276fc2a582ef36dacaf1d7debbd20fa35998d284d1a17b6e32934721a7f52b

File details

Details for the file dataimputer-0.0.2-py3-none-any.whl.

File metadata

  • Download URL: dataimputer-0.0.2-py3-none-any.whl
  • Upload date:
  • Size: 2.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.9.20

File hashes

Hashes for dataimputer-0.0.2-py3-none-any.whl
Algorithm Hash digest
SHA256 75f84fc0411247829d6426f30b25478717340712e41b1872989616b825ff3008
MD5 e28246507efbff8036e336e50426dad3
BLAKE2b-256 2516456f7f6e3ae9e2365d79a99437fdfbf8e633be3b4eb4bc1bf9472073da65
