Using LLMs for rating items on a scale, making comparative judgments, and answering multiple-choice questions.
Project description
ChoiceLLM: A Python package for ratings and multiple choice answers from LLMs
Purpose
This tool makes it easier to prompt an LLM for:
- 'scalar' judgments (e.g., "On a scale from 1 to 7, how abstract is this word?", "How suggestive is this question?").
- 'comparative' judgments (e.g., "Of these 4 words, which is the most emotional?")
- 'categorical' judgments, i.e., to classify words, phrases or paragraphs into a number of fixed, predetermined categories (e.g., "What type of speech act is this utterance?").
and to do so for a large number of 'stimuli', using different LLMs (both locally running models and models accessed via the OpenAI API, if you have an OPENAI_API_KEY).
The scores are deterministic: they are obtained by accessing the underlying logits computed by the model, rather than by sampling a concrete response.
Installation
From PyPI:
pip install choicellm
Installation makes the main program choicellm available, as well as the helper programs choicellm-template and choicellm-aggregate.
First create a suitable prompt JSON file (choicellm-template)
When running choicellm, you need to provide it with a 'prompt template' file in JSON format, like choicellm --prompt my_first_prompt.json. The exact format of this file depends on what you want to use choicellm for (scalar judgments, and if so on what kind of scale; multiple-choice questions; or comparing items) and on whether you plan to use a 'base' model (plain language generation) or an 'instruct' model (a chatbot-style model like ChatGPT).
To generate an example prompt template (which you can then modify to suit your needs), you can do, for instance:
choicellm-template --scalar > my_first_prompt.json
Alternative modes are --comparative and --categorical (see below for some explanation, but you'll understand most from the generated prompt templates themselves). If you want to use an 'instruct' model, additionally specify --chat. (Note that the examples were written for a study on concreteness vs. abstractness; you can of course modify all of this.)
When modifying a prompt, make sure to keep all the parts in curly braces, like {scale}; these will be replaced by the program according to the information you provide.
The prompt templates contain some additional comments under the JSON key _comments.
Basic usage
Now, assuming you have some items (words, phrases) to be rated, compared, or categorized in a plain text file items.txt, one per line, you can do:
choicellm items.txt --prompt my_first_prompt.json
Results will simply be printed to the command line (along with some logging info). To save them, use output redirection:
choicellm items.txt --prompt my_first_prompt.json > results.csv
The default model won't be very big, or very good. If you have a more powerful computer, you can try something larger with the --model option, which accepts any model identifier from the Hugging Face Hub:
choicellm items.txt --model 'unsloth/llama-3-70b-bnb-4bit' --prompt my_first_prompt.json > results.csv
To use a proprietary OpenAI model, you first need to set the OPENAI_API_KEY environment variable to your secret key:
export OPENAI_API_KEY=yoursecretkey123
Then you can specify an OpenAI model (and include the --openai flag):
choicellm items.txt --model 'gpt-4o' --openai --prompt my_first_prompt.json > results.csv
Since GPT-4o is an 'instruct' (i.e., chat) model, make sure you use it with a prompt template created with the --chat option, e.g., choicellm-template --scalar --chat.
On the whole, although choicellm may not be entirely model-agnostic in ways we haven't foreseen, it should work with both Hugging Face models and OpenAI models. Reasonable local options that fit on a 48GB GPU are unsloth/llama-3-70b-bnb-4bit and unsloth/Llama-3.3-70B-Instruct-bnb-4bit. Since the latter is an 'instruct' model, again make sure to use a prompt template generated with the --chat option.
A bit more info about the three 'modes': scalar, comparative, categorical
First, --scalar prompts ask the LLM to give ratings on a numerical scale. Beware that negative scale values (like a scale -1, 0, 1, which could make sense for sentiment) are currently supported only for local models with transformers, not through the OpenAI client. (This applies more generally to labels that do not map onto single token ids for the given tokenizer.) The results.csv file contains the items, with additional columns pred (the model's choice), prob (the probability the model assigned to this choice, computed from the softmaxed logits), rating (a weighted sum of the scale values, with the model's probabilities for all choices as weights), and probs (the probabilities for all choices, separated by semicolons; these sum to 1.0).
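As a quick sanity check on these columns, a minimal pandas sketch along the following lines (not part of choicellm itself; it assumes a 1-to-7 scale as in the example templates, and that the columns are named as just described) should approximately reproduce the rating column from probs. Parsing probs works the same way for --categorical output.

import pandas as pd

df = pd.read_csv("results.csv")
scale = [1, 2, 3, 4, 5, 6, 7]  # assumption: the scale defined in your prompt template

def expected_rating(probs_field):
    # 'probs' holds semicolon-separated probabilities, one per scale value, summing to 1.0
    probs = [float(p) for p in str(probs_field).split(";")]
    return sum(value * p for value, p in zip(scale, probs))

df["rating_check"] = df["probs"].apply(expected_rating)
print(df[["pred", "prob", "rating", "rating_check"]].head())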
Next, --categorical prompts are most suited for single-label categorization: the category probabilities sum to one (under the hood, the LLM is not given the option of listing several categories). (For multi-label categorization, see below.) For this mode, the results.csv file contains pred (the 'chosen' category), prob (its probability) and probs (probabilities for all categories, separated by semicolons).
Third, --comparative prompts ask the model to compare each item to a large number of random other items, which can result in more reliable per-item scores (once aggregated over all comparisons). Applying choicellm to such a prompt comes with some additional options; see choicellm --help for those. (A publication with more explanation is pending.) Note that, for a --comparative prompt, the results.csv file will not contain a single score per item, but rather a separate row for each of the many comparisons performed. The auxiliary command choicellm-aggregate can be used to aggregate these comparisons into a single score per item (here resulting in values on a scale from 1 to 5):
choicellm-aggregate results.csv --scale 1,5 > results_aggregated.csv
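Purely as an illustration of the underlying idea (and not necessarily what choicellm-aggregate actually does under the hood), one could also aggregate such comparisons by hand: compute each item's 'win rate' over all its comparisons and rescale it linearly to the requested range. The column names item and chosen in the sketch below are hypothetical and will need to be adapted to the actual columns in your results.csv.

import pandas as pd

# Hypothetical column names: 'item' (the item under comparison) and 'chosen'
# (1 if the model picked this item over its competitor, 0 otherwise).
df = pd.read_csv("results.csv")
win_rate = df.groupby("item")["chosen"].mean()  # fraction of comparisons won, in [0, 1]
scores = 1 + 4 * win_rate                       # linear rescaling to the 1-5 range
scores.rename("score").to_csv("results_aggregated_manual.csv")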
Lastly, to achieve multi-label classification (where the same instance may fit several categories at once), the recommended approach is to implement it as multiple, separate --scalar (or --comparative) prompts, one for each category, each fed separately into choicellm. After creating the separate prompt files (each named <category>.json), feeding them into choicellm can easily be done in Bash by looping over the prompt files, e.g.:
for category in games movies music sports; do
choicellm items.txt --model 'unsloth/Meta-Llama-3.1-70B-Instruct-bnb-4bit' --prompt "$category.json" > "$category.csv"
done
Having a separate prompt template per category (with a designated system prompt and few-shot examples for each category) is probably a good idea regardless, as it encourages fine-tuning and evaluating the system for each category separately. The .csv results files (one per category) then need to be merged, which is fairly straightforward using (for instance) the pandas library.
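For example, a minimal pandas sketch along these lines could merge the per-category ratings into one wide table. It assumes, hypothetically, that each <category>.csv has an item column named item plus the rating column described above for --scalar output; adjust the column names to your actual files.

import pandas as pd

categories = ["games", "movies", "music", "sports"]
merged = None
for category in categories:
    # Keep only the item identifier and the per-category rating
    # (column names are assumptions; adjust to your actual output).
    df = pd.read_csv(f"{category}.csv")[["item", "rating"]]
    df = df.rename(columns={"rating": f"rating_{category}"})
    merged = df if merged is None else merged.merge(df, on="item")

merged.to_csv("results_multilabel.csv", index=False)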
File details
Details for the file choicellm-0.2.1-py3-none-any.whl.
File metadata
- Download URL: choicellm-0.2.1-py3-none-any.whl
- Upload date:
- Size: 22.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ba956ae7d5e50aaa3cd7b3b443f1acf3a6647f26070d063a6b13b4648ce18a36 |
| MD5 | c4eb8ae5e2acdd4ccb38952d255b9a07 |
| BLAKE2b-256 | 6c64d407e0997231187febe5a696c13622d58f3b9807a0d844e4c52e327636c2 |