Using LLMs for rating items on a scale, comparative judgments, and multiple choice questions.

ChoiceLLM: A Python package for ratings and multiple choice answers from LLMs

Purpose

This tool makes it easier to prompt an LLM for:

  • 'scalar' judgments (e.g., "On a scale from 1 to 7, how abstract is this word?", "How suggestive is this question?");
  • 'comparative' judgments (e.g., "Of these 4 words, which is the most emotional?");
  • 'categorical' judgments, i.e., classifying words, phrases or paragraphs into a fixed set of predetermined categories (e.g., "What type of speech act is this utterance?");

and to do so for a large number of 'stimuli', using different LLMs (both locally running LLMs and, if you have an OPENAI_API_KEY, models via the OpenAI API).

The scores are deterministic: they are computed from the underlying logits produced by the model, rather than from a sampled response.
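As a rough illustration of why logit-based scores are deterministic, the sketch below turns a set of logits into probabilities with a softmax restricted to the answer options. The logit values and option labels are made up, and the code is not taken from ChoiceLLM's source; it only shows the general idea.

```python
import math


def softmax_over_choices(logits):
    """Normalize the logits of the answer options into probabilities."""
    peak = max(logits.values())  # subtract the maximum for numerical stability
    exps = {label: math.exp(value - peak) for label, value in logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


# Hypothetical next-token logits for the labels of a 1-5 scale.
choice_logits = {"1": -1.2, "2": 0.3, "3": 2.1, "4": 1.7, "5": -0.5}

probs = softmax_over_choices(choice_logits)
best = max(probs, key=probs.get)
print(best, round(probs[best], 3))
```

Because no random sampling is involved, running this twice on the same logits yields identical probabilities.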

Installation

From PyPI:

pip install choicellm

Installation makes available the main program choicellm, and helper programs choicellm-template and choicellm-aggregate.

First create a suitable prompt JSON file (choicellm-template)

When running choicellm, you need to provide it with a 'prompt template' file in JSON format, like choicellm --prompt my_first_prompt.json. The exact format of this file will depend on what you want to use choicellm for, i.e., for scalar judgments (and if so, on what kind of scale), for multiple-choice questions, or for comparing items, and whether you plan to use a 'base' model (i.e., plain language generation) or an 'instruct' model (i.e., chatbot like GPT).

To generate an example prompt template (which you can then modify to suit your needs), you can do, for instance:

choicellm-template --scalar > my_first_prompt.json

Alternative modes are --comparative and --categorical (see below for some explanation, but you'll understand most from the generated prompt templates themselves). If you want to use an 'instruct' model, additionally specify --chat. (Note that the examples were written for a study on concreteness vs. abstractness; you can of course modify all of this.)

When modifying a prompt, make sure to use all the parts in curly braces, like {scale}. These will be replaced by the program, according to the information you provide.
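One way to guard against accidentally dropping a placeholder is to compare the edited template with the generated one. The sketch below does this for any JSON structure; the key "prompt" and the placeholder {item} are hypothetical stand-ins, since the actual keys and placeholders are whatever choicellm-template writes out.

```python
import json
import re

PLACEHOLDER = re.compile(r"\{(\w+)\}")


def placeholders(node):
    """Collect every {name} placeholder found in the string values of parsed JSON."""
    if isinstance(node, str):
        return set(PLACEHOLDER.findall(node))
    if isinstance(node, dict):
        node = list(node.values())
    if isinstance(node, list):
        return set().union(*(placeholders(child) for child in node))
    return set()


# Hypothetical templates; real ones come from choicellm-template.
generated = json.loads('{"prompt": "On a scale of {scale}, how abstract is {item}?"}')
modified = json.loads('{"prompt": "How abstract is the word {item}?"}')

missing = placeholders(generated) - placeholders(modified)
print(sorted(missing))
```

In practice you would load both files with json.load and check that the difference is empty before running choicellm.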

The prompt templates contain some additional comments under the JSON key _comments.

Basic usage

Assuming you have some items (words, phrases) to be rated or categorized in a plain text file items.txt, one per line, you can run:

choicellm items.txt --prompt my_first_prompt.json

Results will simply be printed to the command-line (along with some logging info). To save them, use output redirection:

choicellm items.txt --prompt my_first_prompt.json > results.csv

The default model is neither very big nor very good. If you have a more powerful computer, you can try something larger with the --model option, which accepts any model identifier from the Hugging Face hub:

choicellm items.txt --model 'unsloth/llama-3-70b-bnb-4bit' --prompt my_first_prompt.json > results.csv

To use a proprietary OpenAI model, first, you need to set the OPENAI_API_KEY environment variable to your secret key:

export OPENAI_API_KEY=yoursecretkey123

Then you can specify an OpenAI model (and include the --openai flag):

choicellm items.txt --model 'gpt-4o' --openai --prompt my_first_prompt.json > results.csv

Since GPT-4o is an 'instruct' (i.e., chat) model, make sure you use it with a prompt template created with the --chat option, e.g., choicellm-template --scalar --chat.

On the whole, choicellm should work with Hugging Face models and OpenAI models, although it may turn out not to be entirely model-agnostic in ways not yet foreseen. Reasonable local options that fit on a 48 GB GPU are unsloth/llama-3-70b-bnb-4bit and unsloth/Llama-3.3-70B-Instruct-bnb-4bit. Since the latter is an 'instruct' model, again make sure to use a prompt template generated with the --chat option.

A bit more info about the three 'modes': scalar, comparative, categorical

First, --scalar prompts ask the LLM to give ratings on a numerical scale. Beware that negative scale values (like a scale -1, 0, 1, which could make sense for sentiment) are currently supported only for local models with transformers, not through the OpenAI client. (This applies more generally to labels that do not map onto single token ids for the given tokenizer.)

For this mode, the results.csv file contains the items, with additional columns pred (the model's choice), prob (the probability the model assigned to this choice, i.e., the softmaxed logits), rating (a weighted sum taking the model's probabilities for all choices into account), and probs (the probabilities for all choices, separated by semicolons; these sum to 1.0).
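To make the rating column more concrete, here is a minimal sketch of a probability-weighted sum over the scale values. It assumes that this is how the weighted sum is formed (each scale value multiplied by its probability); the exact computation in ChoiceLLM may differ, and the probabilities below are invented.

```python
def expected_rating(scale, probs_field):
    """Probability-weighted sum of the scale values."""
    probs = [float(p) for p in probs_field.split(";")]
    if len(probs) != len(scale):
        raise ValueError("need exactly one probability per scale value")
    return sum(value * p for value, p in zip(scale, probs))


scale = [1, 2, 3, 4, 5]
probs_field = "0.05;0.10;0.50;0.30;0.05"  # made-up content of one 'probs' cell

rating = expected_rating(scale, probs_field)
print(round(rating, 2))
```

Unlike pred, which would simply be 3 here, such a rating also reflects how much probability mass falls on the neighbouring scale values.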

Next, --categorical prompts are most suited for single-label categorization: the category probabilities sum to one (under the hood, the LLM is not given the option of listing several categories). (For multi-label categorization, see below.) For this mode, the results.csv file contains pred (the 'chosen' category), prob (its probability) and probs (probabilities for all categories, separated by semicolons).

Third, --comparative prompts ask the model to compare each item to a large number of random other items, which can result in more reliable per-item scores (once aggregated over all comparisons). Applying choicellm to such a prompt comes with some additional options; see choicellm --help for details. (A publication with more explanation is pending.) Note that, for a --comparative prompt, the results.csv file will not contain a single score per item, but rather a separate row for each comparison that was made. The auxiliary command choicellm-aggregate can be used to aggregate these comparisons into a single score per item (in this example, values on the scale given as 1,5):

choicellm-aggregate results.csv --scale 1,5 > results_aggregated.csv

Lastly, to achieve multi-label classification (where the same instance may fit several categories at once), the recommended approach is to use multiple, separate --scalar (or --comparative) prompts, one for each category, each fed separately into choicellm. After creating separate prompt files (each named <category>.json), you can feed them into choicellm in Bash by looping over the prompt files, for example:

for category in games movies music sports; do
    choicellm items.txt --model 'unsloth/Meta-Llama-3.1-70B-Instruct-bnb-4bit' --prompt "$category.json" > "$category.csv"
done

Having a separate prompt template per category (with designated system prompt and few-shot examples for each category) is probably a good idea regardless, as it encourages fine-tuning and evaluating the system for each category separately. The .csv results files (one per category) need to be merged afterward, which is fairly straightforward using (for instance) the pandas library.
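A minimal sketch of such a merge is shown below, using only Python's standard csv module (pandas would work just as well). The column names "item" and "rating" are assumptions made for illustration; check the header of your own results files. In-memory strings stand in for games.csv and music.csv so that the example is self-contained.

```python
import csv
import io


def merge_ratings(files, key="item", value="rating"):
    """Combine per-category results into one entry per item, one score per category."""
    merged = {}
    for category, handle in files.items():
        for row in csv.DictReader(handle):
            merged.setdefault(row[key], {})[category] = float(row[value])
    return merged


# Stand-ins for games.csv and music.csv; the column names are assumed.
files = {
    "games": io.StringIO("item,rating\nchess,4.8\nguitar,1.2\n"),
    "music": io.StringIO("item,rating\nchess,1.1\nguitar,4.9\n"),
}

merged = merge_ratings(files)
print(merged["chess"])
```

With real files, replace each io.StringIO(...) with an open file handle, e.g. open("games.csv", newline="").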
