Inverse Constitutional AI

This repository contains the official implementation of the Inverse Constitutional AI (ICAI) algorithm [paper]. ICAI compresses pairwise preference datasets into a readable list of principles (constitution) that the annotations appear to follow (e.g. "select the friendlier response"). ICAI principles provide an interpretable overview of a feedback dataset, enabling users to discover problematic annotation biases or better understand differences between datasets, user groups or models.

Installation

  1. Pip install the package (for a development installation, see the Development section below)
    pip install git+https://github.com/rdnfn/icai.git
    
  2. Set up API secrets: inside the main directory of the cloned repo (or another location of your choice), create a secrets.toml file like the one below. You only need to include keys for the APIs you want to use.
    OPENAI_API_KEY="<YOURKEY>"
    ANTHROPIC_API_KEY="<YOURKEY>"
    

Quickstart

You can run your first Inverse Constitutional AI (ICAI) experiment using the icai-exp command:

icai-exp data_path="data/processed/example/example.csv"

This will run the ICAI algorithm on the toy example.csv pairwise feedback dataset and generate a constitution for this dataset.

To get the available experiment parameters and instructions on how to adjust them, run

icai-exp --help

[!NOTE] If you want more control: the icai-exp command executes the run function inside ./src/inverse_cai/experiment/core.py. Edit that file, and the other parts of the inverse_cai Python package it uses, to fully adapt this code to your needs.

Inspecting results

By default all experiment results are saved in the ./outputs/<DATE>_<TIME> directory. The exact result file location is also printed to the console at the end of any (completed) icai-exp call. These outputs contain a full record of API calls, as well as intermediate states of the algorithm (proposed principles, clusters, distilled principles, etc.). Each result output follows the structure below:

./outputs
└── 2024-03-14
    └── 16-09-43
        ├── api_calls.jsonl
        ├── core.log
        ├── main.log
        └── results
            ├── 010_principles_per_comparison.csv
            ├── 011_principles_list.json
            ├── 020_principle_clusters.json
            ├── 030_distilled_principles_per_cluster.json
            ├── 040_votes_per_comparison.csv
            ├── 041_votes_per_cluster.json
            ├── 050_filtered_principles.json
            ├── 060_constitution.json
            ├── 092_results_training.csv
            └── 093_results_testset.json
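
The JSON and CSV result files can be opened with standard tooling for a quick look. The snippet below is only a minimal sketch: the file names match the listing above, but the exact schema of each file is not documented here and may differ between versions, so adapt as needed.

import json
from pathlib import Path

import pandas as pd

# Point this at the results folder of a finished run (path from the listing above).
results = Path("outputs/2024-03-14/16-09-43/results")

# Final constitution and the distilled principles it was selected from.
constitution = json.loads((results / "060_constitution.json").read_text())
distilled = json.loads((results / "030_distilled_principles_per_cluster.json").read_text())

# Per-comparison votes used to evaluate each principle.
votes = pd.read_csv(results / "040_votes_per_comparison.csv")

print(constitution)
print(votes.head())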

Run experiment with your own data

To run ICAI on your own dataset, first convert it to a CSV file with the following three columns: text_a, text_b, preferred_text. The first two should be strings. Note that the ICAI implementation currently uses no separate "prompt" column; if such a column exists in your dataset, you likely want to prepend the prompt to each response column (text_a, text_b) so that the ICAI algorithm can understand the full context of the preference label. Entries in the preferred_text column should take one of two values: "text_a" or "text_b". Ties or other annotation values are currently not used by the algorithm. To run ICAI on your prepared dataset, simply use:

icai-exp data_path="<path/to/your-data.csv>"
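
If your raw data keeps the prompt in a separate column, a small conversion script along the following lines produces the expected format. This is only a sketch: the input file name and column names (prompt, response_a, response_b, choice) are hypothetical placeholders to adapt to your dataset.

import pandas as pd

# Hypothetical input file and column names; adjust to your own data.
raw = pd.read_csv("my_raw_preferences.csv")

converted = pd.DataFrame(
    {
        # Prepend the prompt to each response so ICAI sees the full context.
        "text_a": raw["prompt"] + "\n\n" + raw["response_a"],
        "text_b": raw["prompt"] + "\n\n" + raw["response_b"],
        # Map your annotation values onto "text_a" / "text_b".
        "preferred_text": raw["choice"].map({"a": "text_a", "b": "text_b"}),
    }
)

# Ties or other annotation values are not used by the algorithm; drop them.
converted = converted.dropna(subset=["preferred_text"])
converted.to_csv("my-data.csv", index=False)

The resulting my-data.csv can then be passed to icai-exp via data_path as shown above.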

Run experiment from config file

The exp/configs folder contains a number of configs to recreate experiments. You can run these experiments using the command:

icai-exp -cd ./exp/configs/<EXPERIMENT-DIR>

For example:

icai-exp -cd ./exp/configs/001_synthetic_orthogonal

[!NOTE] To re-run paper experiments: Look at the README file inside the exp/configs. This file gives detailed instructions on which configurations to run, and how to generate the corresponding plots.

Using cache to continue aborted experiment

[Experimental feature] Sometimes long-running (expensive) ICAI experiments get interrupted. Instead of requiring a full re-run, the ICAI package supports continuing certain experiments after they were interrupted. This feature is only available for the voting stage, i.e. only for experiments that do not generate principles but instead test a pre-existing principle list.

To re-start an experiment with log dir exp/outputs/prior-experiment and config dir exp/configs/exp-config, use the following command:

icai-exp -cd exp/configs/exp-config prior_cache_path='exp/outputs/prior-experiment'

[!WARNING] There is no strict config consistency check between the cache and the new experiment. Use this feature with caution, and only with caches from prior experiments with identical configs.

Development

Dev installation

Clone the repo locally, e.g.

git clone git@github.com:rdnfn/icai.git

Then (inside repo folder) install package in editable mode:

pip install -e .

Running test cases

Tests are included as part of the package. Run them using:

pytest ./src

Simplest way to run experiment script

This doesn't do any meaningful experimental work but allows running the experiment script for testing purposes.

icai-exp generate_constitution=false annotator.constitution=null annotator.other_annotator_configs="[]"

Background

Motivation

Feedback data plays an important role in fine-tuning and evaluating state-of-the-art AI models. Often pairwise text preferences are used: given two texts, human (or AI) annotators select the “better” one. Such feedback data is widely used to align models to human preferences (e.g., reinforcement learning from human feedback) or to rank models according to human preferences (e.g., Chatbot Arena). Despite its widespread use, prior work has demonstrated that human-annotated pairwise text preference data often exhibits unintended biases. For example, human annotators have been shown to prefer assertive over truthful texts in certain contexts. Models trained or evaluated on this data may implicitly encode these biases in ways that are hard to identify. To better understand existing pairwise text preference data, we formulate its interpretation as a compression task: the Inverse Constitutional AI problem. Read the full paper for more background.

Method overview

The figure below provides an overview of the Inverse Constitutional AI (ICAI) problem we introduce: starting from a set of pairwise preference feedback, we derive a set of natural language principles (a constitution) that explain the preference data. For validation, we reconstruct the original preferences with an LLM judging according to the generated constitution. The constitution represents a (highly compact) compression of the preferences.

Algorithm

We introduce a first Inverse Constitutional AI (ICAI) algorithm that generates a set of principles based on a feedback dataset. See the figure below for an overview of the algorithm. Given a dataset of pairwise rankings, in Step 1 candidate principles are generated using an LLM. In Step 2, these principles are clustered using an embedding model. In Step 3, similar principles are “de-duplicated” by sampling one principle per cluster. In Step 4, each principle is tested to evaluate its ability to help an LLM reconstruct the original annotations. Finally, in Step 5, the principles are filtered according to the testing results, and the set of filtered principles is returned as the final constitution. Optionally, this last step is augmented with additional clustering and subsampling steps to ensure diverse principles. The implementation is provided in this repository.
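
To make the five steps concrete, here is a deliberately simplified, self-contained sketch of the pipeline. It is not the repository's implementation: llm and embed are hypothetical caller-supplied callables (llm(prompt) returns a string, embed(text) returns an embedding vector), clustering is done with plain k-means, and filtering simply keeps the top-scoring principles.

from collections import defaultdict

import numpy as np
from sklearn.cluster import KMeans


def icai_sketch(comparisons, llm, embed, n_clusters=5, keep_top=3):
    """Simplified ICAI pipeline. comparisons: dicts with text_a, text_b, preferred_text."""
    # Step 1: propose one candidate principle per comparison with an LLM.
    candidates = [
        llm(
            "State one short principle explaining why the preferred response was chosen.\n"
            f"A: {c['text_a']}\nB: {c['text_b']}\nPreferred: {c['preferred_text']}"
        )
        for c in comparisons
    ]

    # Step 2: cluster the candidate principles via their embeddings.
    vectors = np.array([embed(p) for p in candidates])
    labels = KMeans(n_clusters=n_clusters, random_state=0).fit_predict(vectors)

    # Step 3: "de-duplicate" by keeping one principle per cluster.
    clusters = defaultdict(list)
    for principle, label in zip(candidates, labels):
        clusters[label].append(principle)
    distilled = [members[0] for members in clusters.values()]

    # Step 4: test each principle's ability to reconstruct the annotations.
    def accuracy(principle):
        hits = 0
        for c in comparisons:
            vote = llm(
                f"Principle: {principle}\nA: {c['text_a']}\nB: {c['text_b']}\n"
                "Which response does the principle prefer? Answer text_a or text_b."
            )
            hits += vote.strip() == c["preferred_text"]
        return hits / len(comparisons)

    # Step 5: filter to the best-performing principles -> the constitution.
    return sorted(distilled, key=accuracy, reverse=True)[:keep_top]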

License

All code in this repo is licensed under Apache-2.0.

