A tool for categorizing text data and images using LLMs and vision models

catllm

Installation
Quick Start
Configuration
Supported Models
API Reference
Academic Research
License

Installation

pip install cat-llm

Quick Start

The explore_corpus function extracts a list of all categories present in the corpus as identified by the model.

import catllm as cat

categories = cat.explore_corpus(
    survey_question="What motivates you most at work?",
    survey_input=["flexible schedule", "good pay", "interesting projects"],
    api_key="OPENAI_API_KEY",  # placeholder: replace with your actual key
    cat_num=5,
    divisions=10
)
print(categories)

Configuration

Get Your OpenAI API Key

Create an OpenAI Developer Account:
- Go to platform.openai.com (separate from regular ChatGPT)
- Sign up with email, Google, Microsoft, or Apple
Generate an API Key:
- Log into your account and click your name in the top right corner
- Click "View API keys" or navigate to the "API keys" section
- Click "Create new secret key"
- Give your key a descriptive name
- Set permissions (choose "All" for full access)
Add Payment Details:
- Add a payment method to your OpenAI account
- Purchase credits (start with $5 - it lasts a long time for most research use)
- Important: Your API key won't work without credits
Save Your Key Securely:
- Copy the key immediately (you won't be able to see it again)
- Store it safely and never share it publicly
Finally, pass your key to catllm through the api_key parameter.
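
Because the key should never be shared publicly, one option is to keep it out of scripts and read it from an environment variable. The sketch below assumes the key was exported under the name OPENAI_API_KEY; both that variable name and the helper `load_api_key` are illustrative and not part of catllm.

```python
import os


def load_api_key(var_name="OPENAI_API_KEY"):
    """Return the API key stored in an environment variable.

    Hypothetical helper: neither this function nor the default
    variable name comes from catllm.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return key
```

The returned string can then be passed as `api_key=load_api_key()` to any of the functions below.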

Supported Models

OpenAI: GPT-4o, GPT-4, GPT-3.5-turbo, etc.
Anthropic: Claude Sonnet 3.7, Claude Haiku, etc.
Perplexity: Sonar Large, Sonar Small, etc.
Mistral: Mistral Large, Mistral Small, etc.

API Reference

`explore_corpus()`

Extracts categories from a corpus of text responses and returns frequency counts.

Methodology: The function divides the corpus into random chunks to address the probabilistic nature of LLM outputs. By processing multiple chunks and averaging results across many API calls rather than relying on a single call, this approach significantly improves reproducibility and provides more stable categorical frequency estimates.
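
The chunk-and-aggregate idea can be illustrated without any API calls. This is a minimal sketch, not catllm's implementation: `split_into_chunks`, `count_categories`, and the keyword-based `fake_extract` stand-in for a model call are all hypothetical names, and the tally used here (how many chunks each category appears in) is only one plausible way to combine results across chunks.

```python
import random
from collections import Counter


def split_into_chunks(responses, divisions, seed=0):
    """Shuffle the responses and deal them into `divisions` chunks."""
    shuffled = list(responses)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::divisions] for i in range(divisions)]


def count_categories(responses, divisions, extract_categories, seed=0):
    """Tally how many chunks each category was extracted from."""
    counts = Counter()
    for chunk in split_into_chunks(responses, divisions, seed):
        if chunk:  # chunks are empty when divisions > len(responses)
            counts.update(set(extract_categories(chunk)))
    return counts


def fake_extract(chunk):
    """Stand-in for a model call: label each response by keyword."""
    return ["pay" if "pay" in text else "other" for text in chunk]


responses = ["good pay", "fair pay", "flexible schedule", "nice team"]
print(count_categories(responses, divisions=2, extract_categories=fake_extract))
```

A category that shows up in most chunks is less likely to be an artifact of a single model call, which is the intuition behind processing the corpus in several pieces.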

Parameters:

survey_question (str): The survey question being analyzed
survey_input (list): List of text responses to categorize
api_key (str): API key for the LLM service
cat_num (int, default=10): Number of categories to extract in each iteration
divisions (int, default=5): Number of chunks to divide the data into (larger corpora might require larger divisions)
specificity (str, default="broad"): Category precision level (e.g., "broad", "narrow")
model_source (str, default="OpenAI"): Model provider ("OpenAI", "Anthropic", "Perplexity", "Mistral")
user_model (str, default="gpt-4o"): Specific model (e.g., "gpt-4o", "claude-opus-4-20250514")
creativity (float, default=0): Temperature/randomness setting (0.0-1.0)
filename (str, optional): Output file path for saving results

Returns:

pandas.DataFrame: Two-column dataset with category names and frequencies

Example:

import catllm as cat

categories = cat.explore_corpus(
    survey_question="What motivates you most at work?",
    survey_input=["flexible schedule", "good pay", "interesting projects"],
    api_key="OPENAI_API_KEY",
    cat_num=5,
    divisions=10
)

`explore_common_categories()`

Identifies the most frequently occurring categories across a text corpus and returns the top N categories by frequency count.

Methodology: Divides the corpus into random chunks and averages results across multiple API calls to improve reproducibility and provide stable frequency estimates for the most prevalent categories, addressing the probabilistic nature of LLM outputs.

Parameters:

survey_question (str): Survey question being analyzed
survey_input (list): Text responses to categorize
api_key (str): API key for the LLM service
top_n (int, default=10): Number of top categories to return by frequency
cat_num (int, default=10): Number of categories to extract per iteration
divisions (int, default=5): Number of data chunks (increase for larger corpora)
user_model (str, default="gpt-4o"): Specific model to use
creativity (float, default=0): Temperature/randomness setting (0.0-1.0)
specificity (str, default="broad"): Category precision level ("broad", "narrow")
research_question (str, optional): Contextual research question to guide categorization
filename (str, optional): File path to save output dataset
model_source (str, default="OpenAI"): Model provider ("OpenAI", "Anthropic", "Perplexity", "Mistral")

Returns:

pandas.DataFrame: Dataset with category names and frequencies, limited to top N most common categories
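
The "top N by frequency" step can be pictured with the standard library alone. The sketch below is an illustration under assumptions: `top_categories` is a hypothetical helper, the counts are made up, and catllm itself returns a pandas.DataFrame rather than a list of tuples.

```python
from collections import Counter


def top_categories(category_counts, top_n=10):
    """Return the `top_n` (category, count) pairs, most frequent first."""
    return Counter(category_counts).most_common(top_n)


# Made-up frequency counts for illustration only.
counts = {"pay": 12, "flexibility": 9, "team": 4, "commute": 1}
print(top_categories(counts, top_n=2))
# [('pay', 12), ('flexibility', 9)]
```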

Example:

import catllm as cat

top_10_categories = cat.explore_common_categories(
    survey_question="What motivates you most at work?",
    survey_input=["flexible schedule", "good pay", "interesting projects"],
    api_key="OPENAI_API_KEY",
    top_n=10,
    cat_num=5,
    divisions=10
)
print(top_10_categories)

`multi_class()`

Performs multi-label classification of text responses into user-defined categories, returning structured results with optional CSV export.

Methodology: Processes each text response individually, assigning one or more categories from the provided list. Supports flexible output formatting and optional saving of results to CSV for easy integration with data analysis workflows.

Parameters:

survey_question (str): The survey question being analyzed
survey_input (list): List of text responses to classify
categories (list): List of predefined categories for classification
api_key (str): API key for the LLM service
user_model (str, default="gpt-4o"): Specific model to use
creativity (float, default=0): Temperature/randomness setting (0.0-1.0)
safety (bool, default=False): Enable safety checks on responses and save results to CSV at each API call step
filename (str, default="categorized_data.csv"): Filename for CSV output
save_directory (str, optional): Directory path to save the CSV file
model_source (str, default="OpenAI"): Model provider ("OpenAI", "Anthropic", "Perplexity", "Mistral")

Returns:

pandas.DataFrame: DataFrame with classification results, columns formatted as specified
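
One common way to lay out multi-label results is a row per response with a 0/1 indicator for each category. The sketch below shows that layout under assumptions: the helper `to_indicator_rows`, the column names, and the assigned labels are invented for illustration, and the actual columns produced by multi_class may differ.

```python
def to_indicator_rows(responses, assigned, categories):
    """Build one row per response with a 0/1 flag for each category."""
    rows = []
    for text, labels in zip(responses, assigned):
        row = {"response": text}
        for category in categories:
            row[category] = int(category in labels)
        rows.append(row)
    return rows


categories = ["financial reasons", "job change"]
responses = ["rent went up and I got a new job", "wanted a change"]
assigned = [["financial reasons", "job change"], []]  # made-up labels

for row in to_indicator_rows(responses, assigned, categories):
    print(row)
```

A response can carry several flags at once, or none, which is what distinguishes multi-label from single-label classification.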

Example:

import catllm as cat

user_categories = ["to start living with or to stay with partner/spouse",
                   "relationship change (divorce, breakup, etc)",
                   "the person had a job or school or career change, including transferred and retired",
                   "the person's partner's job or school or career change, including transferred and retired",
                   "financial reasons (rent is too expensive, pay raise, etc)",
                   "related specifically to features of the home, such as a bigger or smaller yard"]

question = "Why did you move?"

# df is assumed to be an existing pandas DataFrame whose "column1" column holds the responses
move_reasons = cat.multi_class(
    survey_question=question,
    survey_input=df["column1"],
    user_model="gpt-4o",
    creativity=0,
    categories=user_categories,
    safety=True,
    api_key="OPENAI_API_KEY")

`image_multi_class()`

Performs multi-label image classification into user-defined categories, returning structured results with optional CSV export.

Methodology: Processes each image individually, assigning one or more categories from the provided list. Supports flexible output formatting and optional saving of results to CSV for easy integration with data analysis workflows.

Parameters:

image_description (str): A description of what the model should expect to see
image_input (list): List of file paths or a folder to pull file paths from
categories (list): List of predefined categories for classification
api_key (str): API key for the LLM service
user_model (str, default="gpt-4o"): Specific model to use
creativity (float, default=0): Temperature/randomness setting (0.0-1.0)
safety (bool, default=False): Enable safety checks on responses and save results to CSV at each API call step
filename (str, default="categorized_data.csv"): Filename for CSV output
save_directory (str, optional): Directory path to save the CSV file
model_source (str, default="OpenAI"): Model provider ("OpenAI", "Anthropic", "Perplexity", "Mistral")
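
Since image_input accepts either a list of file paths or a folder, it can help to see what expanding a folder into paths might look like. This is a sketch under assumptions, not catllm's own logic: `resolve_image_paths` and the set of accepted extensions are hypothetical.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # assumed set of extensions


def resolve_image_paths(image_input):
    """Expand a folder into sorted image paths; pass lists through as-is."""
    if isinstance(image_input, (str, Path)):
        folder = Path(image_input)
        return sorted(
            str(p) for p in folder.iterdir()
            if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
        )
    return [str(p) for p in image_input]
```

Building the list yourself this way makes it easy to check which files will be sent to the model before any API calls are made.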

Returns:

pandas.DataFrame: DataFrame with classification results, columns formatted as specified

Example:

import catllm as cat

user_categories = ["has a cat somewhere in it",
                   "looks cartoonish",
                   "Adrien Brody is in it"]

description = "Should be an image of a child's drawing"

image_categories = cat.image_multi_class(
    image_description=description,
    image_input=['desktop/image1.jpg', 'desktop/image2.jpg', 'desktop/image3.jpg'],
    user_model="gpt-4o",
    creativity=0,
    categories=user_categories,
    safety=True,
    api_key="OPENAI_API_KEY")

`image_score()`

Performs quality scoring of images against a reference description, returning structured results with optional CSV export.

Methodology: Processes each image individually, assigning a quality score on a 5-point scale based on similarity to the expected description:

1: No meaningful similarity (fundamentally different)
2: Barely recognizable similarity (25% match)
3: Partial match (50% key features)
4: Strong alignment (75% features)
5: Near-perfect match (90%+ similarity)

Supports flexible output formatting and optional saving of results to CSV for easy integration with data analysis workflows.
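
When reading the results, it can be convenient to translate numeric scores back into the rubric wording. The lookup below copies the labels from the 5-point scale above; `SCORE_LABELS` and `describe_score` are hypothetical helpers and not part of catllm.

```python
# Labels copied from the 5-point scale described above.
SCORE_LABELS = {
    1: "No meaningful similarity",
    2: "Barely recognizable similarity",
    3: "Partial match",
    4: "Strong alignment",
    5: "Near-perfect match",
}


def describe_score(score):
    """Return the rubric label for a score from 1 to 5."""
    if score not in SCORE_LABELS:
        raise ValueError(f"score must be an integer from 1 to 5, got {score!r}")
    return SCORE_LABELS[score]


print(describe_score(4))  # Strong alignment
```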

Parameters:

reference_image_description (str): A description of what the model should expect to see
image_input (list): List of image file paths or folder path containing images
reference_image (str): A file path to the reference image
api_key (str): API key for the LLM service
user_model (str, default="gpt-4o"): Specific vision model to use
creativity (float, default=0): Temperature/randomness setting (0.0-1.0)
safety (bool, default=False): Enable safety checks and save results at each API call step
filename (str, default="image_scores.csv"): Filename for CSV output
save_directory (str, optional): Directory path to save the CSV file
model_source (str, default="OpenAI"): Model provider ("OpenAI", "Anthropic", "Perplexity", "Mistral")

Returns:

pandas.DataFrame: DataFrame with image paths, quality scores, and analysis details

Example:

import catllm as cat          

image_scores = cat.image_score(
    reference_image_description='Adrien Brody sitting in a lawn chair',
    image_input=['desktop/image1.jpg', 'desktop/image2.jpg', 'desktop/image3.jpg'],
    user_model="gpt-4o",
    creativity=0,
    safety=True,
    api_key="OPENAI_API_KEY")

`image_features()`

Extracts specific features and attributes from images, returning exact answers to user-defined questions (e.g., counts, colors, presence of objects).

Methodology: Processes each image individually using vision models to extract precise information about specified features. Unlike scoring and multi-class functions, this returns factual data such as object counts, color identification, or presence/absence of specific elements. Supports flexible output formatting and optional CSV export for quantitative analysis workflows.

Parameters:

image_description (str): A description of what the model should expect to see
image_input (list): List of image file paths or folder path containing images
features_to_extract (list): List of specific features to extract (e.g., ["number of people", "primary color", "contains text"])
api_key (str): API key for the LLM service
user_model (str, default="gpt-4o"): Specific vision model to use
creativity (float, default=0): Temperature/randomness setting (0.0-1.0)
to_csv (bool, default=False): Whether to save the output to a CSV file
safety (bool, default=False): Enable safety checks and save results at each API call step
filename (str, default="categorized_data.csv"): Filename for CSV output
save_directory (str, optional): Directory path to save the CSV file
model_source (str, default="OpenAI"): Model provider ("OpenAI", "Anthropic", "Perplexity", "Mistral")

Returns:

pandas.DataFrame: DataFrame with image paths and extracted feature values for each specified attribute

Example:

import catllm as cat          

image_scores = cat.image_features(
    image_description='An AI generated image of Spongebob dancing with Patrick',
    features_to_extract=['Spongebob is yellow', 'Both are smiling', 'Patrick is chunky'],
    image_input=['desktop/image1.jpg', 'desktop/image2.jpg', 'desktop/image3.jpg'],
    model_source='OpenAI',
    user_model="gpt-4o",
    creativity=0,
    safety=True,
    api_key="OPENAI_API_KEY")

`cerad_drawn_score()`

Automatically scores drawings of circles, diamonds, overlapping rectangles, and cubes according to the official Consortium to Establish a Registry for Alzheimer's Disease (CERAD) scoring system, returning structured results with optional CSV export. Works even with images that contain other drawings or writing.

Methodology: Processes each image individually, evaluating the drawn shapes based on CERAD criteria. Supports optional inclusion of reference shapes within images and can provide reference examples if requested. The function outputs standardized scores facilitating reproducible analysis and integrates optional safety checks and CSV export for research workflows.
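
The idea behind saving results at each API call step is that a crash or rate-limit error partway through a long run should not discard work already paid for. The sketch below illustrates that pattern with the standard library; `score_with_checkpoints`, its column names, and the `score_item` callback are hypothetical and do not reflect how catllm writes its files.

```python
import csv


def score_with_checkpoints(items, score_item, csv_path):
    """Score items one by one, rewriting the CSV after every step."""
    rows = []
    for item in items:
        rows.append({"item": item, "score": score_item(item)})
        # Rewrite the whole file so a crash loses at most the current item.
        with open(csv_path, "w", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=["item", "score"])
            writer.writeheader()
            writer.writerows(rows)
    return rows
```

Rewriting the full file on every step is simple but grows costly for very large inputs; appending one row at a time is a possible alternative.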

Parameters:

shape (str): The type of shape to score (e.g., "circle", "diamond", "overlapping rectangles", "cube")
image_input (list): List of image file paths or folder path containing images
api_key (str): API key for the LLM service
user_model (str, default="gpt-4o"): Specific model to use
creativity (float, default=0): Temperature/randomness setting (0.0-1.0)
reference_in_image (bool, default=False): Whether a reference shape is present in the image for comparison
provide_reference (bool, default=False): Whether to supply the model with the built-in reference example image
safety (bool, default=False): Enable safety checks and save results at each API call step
filename (str, default="categorized_data.csv"): Filename for CSV output
model_source (str, default="OpenAI"): Model provider ("OpenAI", "Anthropic", "Mistral")

Returns:

pandas.DataFrame: DataFrame with image paths, CERAD scores, and analysis details

Example:

import catllm as cat  

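# df is assumed to be an existing pandas DataFrame with image paths in "diamond_pic_path";
# open_ai_key is assumed to be a variable holding your API key
diamond_scores = cat.cerad_drawn_score(
    shape="diamond",
    image_input=df['diamond_pic_path'],
    api_key=open_ai_key,
    safety=True,
    filename="diamond_gpt_score.csv",
)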

Academic Research

This package implements methodology from research on LLM performance in social science applications, including the UC Berkeley Social Networks Study. The package addresses reproducibility challenges in LLM-assisted research by providing standardized interfaces and consistent output formatting.

License

cat-llm is distributed under the terms of the GNU license.


cat-llm 0.0.38
