An LLM annotation experiment pipeline for computational social science.

Maintainers

LorcanMcLaren

CodeBook Lab

CodeBook Lab is an LLM annotation experiment pipeline for computational social science. It takes a codebook and labelled dataset from CodeBook Studio and runs structured experiments across the dimensions that matter for text-as-data research: model choice, model size, prompt style, zero-shot versus few-shot learning, and sampling hyperparameters. Every run is benchmarked against human labels.

Experiments are controlled through Python objects rather than by editing pipeline code. Because the codebook and labelled data stay constant across runs, each dimension can be isolated and compared against the same human labels.

For a step-by-step walkthrough covering both tools, see the CodeBook Studio & Lab Tutorial.

How It Fits With CodeBook Studio
Package Overview
Quickstart
Experiment Configuration
Create Your Own Task
Human Reliability And Adjudication
Advanced Customization
License
Citation

How It Fits With CodeBook Studio

CodeBook Studio defines the task. CodeBook Lab runs and evaluates the experiment.

CodeBook Studio → CodeBook Lab

CodeBook Studio:
Define the annotation task
Annotate texts with humans
Export `codebook.json`
Save labeled data as `ground-truth.csv`

CodeBook Lab:
Strip label columns automatically
Run LLM annotation experiments
Sweep over models, prompts, and hyperparameters
Evaluate outputs against human labels

Package Overview

The package is organized around a small set of importable modules:

codebook_lab.experiments: high-level functions for single experiments and multi-run comparisons
codebook_lab.annotate: lower-level annotation functions
codebook_lab.metrics: evaluation and metrics functions
codebook_lab.human_reliability: human coder validation, ICR, disagreement, and ground-truth helpers
codebook_lab.prompts: prompt wrapper registry for built-in and custom prompt styles
codebook_lab.examples: helpers for bundled example tasks
codebook_lab.types: dataclasses for experiment specifications and result objects

The package also ships with a bundled example task, policy-sentiment, so you can start experimenting immediately after installation.

Quickstart

1. Create a Python environment

python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install codebook-lab

This installs CodeBook Lab from PyPI so you can import it in your own scripts, notebooks, or analysis workflows.
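To confirm that the install landed in the active environment, a check along these lines may help. It is a minimal sketch that uses only the standard library; the helper name `installed_version` is made up for illustration, and the distribution name `codebook-lab` is taken from the pip command above.

```python
from importlib import metadata


def installed_version(distribution: str):
    """Return the installed version string, or None if the distribution is missing."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("codebook-lab")
    print(version if version else "codebook-lab is not installed in this environment")
```

If this prints a "not installed" message, the most likely cause is that the virtual environment from the previous commands is not activated.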

If you plan to generate or score textbox annotations, install the optional textbox dependencies as well:

python -m pip install "codebook-lab[textbox]"

2. Install and start Ollama

Install Ollama on your machine, then make sure the local server is running:

ollama serve

If the default local Ollama server is not already running, CodeBook Lab will try to start it automatically when you run an experiment. It will also pull the requested Ollama model automatically if it is not already available locally.
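If you would rather check the server yourself before starting a long run, the sketch below probes Ollama's HTTP API. It assumes the default local address `http://localhost:11434` and uses the `/api/tags` endpoint, which lists locally available models. The function name `ollama_is_running` is hypothetical and is not part of CodeBook Lab.

```python
import urllib.error
import urllib.request


def ollama_is_running(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if an Ollama server answers at base_url."""
    try:
        # /api/tags lists local models; a 200 response means the server is up.
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or DNS failure: treat as "not running".
        return False


if __name__ == "__main__":
    print("Ollama reachable:", ollama_is_running())
```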

3. Choose a model and task

The package ships with a bundled example task called policy-sentiment. Any Ollama model available on your machine can be used.

task = "policy-sentiment"
model = "gemma3:270m"

You can inspect or copy bundled example tasks from Python:

from codebook_lab import copy_example_task, list_example_tasks

print(list_example_tasks())
copy_example_task("policy-sentiment", "./my_tasks", overwrite=True)

Set country_iso_code to the country where the compute is physically running. CodeCarbon uses it to convert energy use into estimated emissions. It should be an ISO 3166-1 alpha-3 (three-letter) code such as USA, IRL, or DEU.
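A common slip is passing a two-letter code such as IE. The sketch below checks only the shape of the value (three uppercase ASCII letters); it does not confirm that the code is actually assigned to a country. The helper `looks_like_alpha3` is illustrative and not part of the package.

```python
import re

# Three uppercase ASCII letters, the shape of an ISO 3166-1 alpha-3 code.
ALPHA3_PATTERN = re.compile(r"[A-Z]{3}")


def looks_like_alpha3(code: str) -> bool:
    """Return True if code has the shape of an alpha-3 country code."""
    return ALPHA3_PATTERN.fullmatch(code) is not None


if __name__ == "__main__":
    for candidate in ["IRL", "IE", "deu"]:
        print(candidate, looks_like_alpha3(candidate))
```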

4. Run experiments from Python

Single experiment:

from codebook_lab import ExperimentSpec, run_experiment

result = run_experiment(
    ExperimentSpec(
        task="policy-sentiment",
        model="gemma3:270m",
        chat_mode="per_text",
        reasoning=None,
        process_textbox=True,
        country_iso_code="IRL",
    ),
    output_root="outputs",
)

print(result.experiment_directory)
print(result.metrics.summary_text)

If process_textbox=True, CodeBook Lab will calculate textbox similarity metrics such as ROUGE, cosine similarity, and BERTScore when the optional textbox dependencies are installed. Without them, the run still completes, but textbox metrics that rely on those packages will be reported as unavailable and the warning will tell you how to install them.

Parameter sweep:

from codebook_lab import run_experiment_grid

results = run_experiment_grid(
    param_grid={
        "country_iso_code": "IRL",
        "tasks": ["policy-sentiment"],
        "models": ["gemma3:270m", "llama3.2:3b"],
        "use_examples": ["true"],
        "prompt_types": ["standard", "persona"],
        "temperatures": ["0", "0.2"],
        "top_ps": ["None"],
        "chat_modes": ["per_text", "per_query"],
        "reasoning": ["None", "false"],
        "process_textboxes": ["true"],
        "process_spans": ["false"],
    },
    output_root="outputs",
)

print(f"Completed {len(results)} runs")

Custom prompt wrapper:

from codebook_lab import ExperimentSpec, PromptContext, register_prompt_wrapper, run_experiment

def concise_wrapper(context: PromptContext) -> str:
    return (
        "Annotate the text as carefully as possible.\n\n"
        f"{context.core_prompt}\n\n"
        f'Text:\n"{context.text}"\n\n'
        "Response:\n"
    )

register_prompt_wrapper("concise", concise_wrapper)

result = run_experiment(
    ExperimentSpec(
        task="policy-sentiment",
        model="gemma3:270m",
        prompt_type="concise",
        country_iso_code="IRL",
    )
)
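Before spending compute on a run, it can be useful to preview what a wrapper produces. The sketch below repeats the wrapper from above and calls it with a stand-in object instead of a real PromptContext. The stand-in mimics only the two attributes the wrapper reads (`core_prompt` and `text`); the real class may carry more fields, and the example strings are invented.

```python
from types import SimpleNamespace


def concise_wrapper(context) -> str:
    return (
        "Annotate the text as carefully as possible.\n\n"
        f"{context.core_prompt}\n\n"
        f'Text:\n"{context.text}"\n\n'
        "Response:\n"
    )


# Stand-in for PromptContext: only the attributes used by the wrapper are provided.
fake_context = SimpleNamespace(
    core_prompt="Is the sentiment positive, negative, or neutral?",
    text="The new housing policy was widely welcomed.",
)

preview = concise_wrapper(fake_context)
print(preview)
```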

5. Inspect the outputs

Each run creates its own experiment directory at outputs/<task>/<run_id>/ containing:

output.csv: row-level model annotations
config.json: the run configuration
classification_reports.txt: per-label evaluation summaries
emissions.csv: CodeCarbon output
timing_data.json: inference timing summary
char_counts.json: prompt and response character counts
reasoning_traces.jsonl: per-query reasoning content when the model returns it
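A quick way to see which of these files a given run produced is sketched below. The file names come from the list above; the helper name `summarize_run_directory` is hypothetical. A missing reasoning_traces.jsonl is not necessarily a problem, since that file depends on the model returning reasoning content.

```python
from pathlib import Path

EXPECTED_FILES = [
    "output.csv",
    "config.json",
    "classification_reports.txt",
    "emissions.csv",
    "timing_data.json",
    "char_counts.json",
    "reasoning_traces.jsonl",
]


def summarize_run_directory(run_dir):
    """Map each expected file name to True if it exists in run_dir."""
    run_path = Path(run_dir)
    return {name: (run_path / name).is_file() for name in EXPECTED_FILES}


if __name__ == "__main__":
    # Replace the path with a real outputs/<task>/<run_id>/ directory.
    for name, present in summarize_run_directory("outputs").items():
        print(f"{name}: {'found' if present else 'missing'}")
```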

Aggregate metrics are written to outputs/metrics/<task>_metrics_log.csv, with run-level metadata in outputs/metrics/<task>_metrics_log_runs.csv. Use run_id to connect a run folder, its config.json, and its rows in the aggregate metrics tables.
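The join on run_id can be done with the standard library alone, as in the sketch below. It assumes both CSV files contain a column literally named `run_id`; the function name and any other column names you see in your own files are not guaranteed by this example.

```python
import csv
from pathlib import Path


def join_metrics_with_runs(metrics_csv, runs_csv, key="run_id"):
    """Attach run-level metadata to each metrics row that shares the same key."""
    with Path(runs_csv).open(newline="", encoding="utf-8") as handle:
        runs_by_id = {row[key]: row for row in csv.DictReader(handle)}

    joined = []
    with Path(metrics_csv).open(newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            # Run metadata goes first so metric columns win on any name clash.
            joined.append({**runs_by_id.get(row[key], {}), **row})
    return joined
```

For the bundled example task, the two inputs would be outputs/metrics/policy-sentiment_metrics_log.csv and outputs/metrics/policy-sentiment_metrics_log_runs.csv.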

That metrics log stores both annotation-quality metrics and run metadata. Depending on the annotation type, it can include:

classification metrics such as accuracy, precision, recall, F1, and percentage agreement
inter-rater style agreement metrics such as Cohen's kappa and Krippendorff's alpha
ordinal metrics for Likert labels such as Spearman correlation and quadratic weighted kappa
textbox metrics such as normalized Levenshtein similarity, BLEU, ROUGE, cosine similarity, and BERTScore
run metadata such as prompt type, example use, chat mode, reasoning mode, CPU model, GPU model, total inference time, average inference time, total input characters, total output characters, energy consumed in kWh, and emissions in kg CO2eq

This makes it easy to compare not just which model is most accurate, but also which setup is fastest, cheapest to run, and least energy intensive.
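For readers less familiar with the agreement metrics listed above, here is a minimal sketch of Cohen's kappa for two label sequences, for example human labels and model labels. It is a textbook formulation written for illustration and is not taken from CodeBook Lab's source, so edge-case handling in the package may differ.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both label lists must be non-empty and the same length.")
    n = len(labels_a)
    # Observed agreement: share of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence, from each rater's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return float("nan")  # kappa is undefined when both raters use one identical label
    return (observed - expected) / (1.0 - expected)


human = ["pos", "pos", "neg", "neg"]
model = ["pos", "neg", "neg", "neg"]
print(cohens_kappa(human, model))  # observed 0.75, expected 0.5, kappa 0.5
```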

Textbox note: normalized Levenshtein and BLEU work with the base install, but ROUGE, embedding-based cosine similarity, and BERTScore require the optional textbox extras. Install them with python -m pip install "codebook-lab[textbox]".
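As an illustration of the simplest textbox metric, the sketch below computes a normalized Levenshtein similarity in pure Python. Dividing the edit distance by the length of the longer string is one common convention and is an assumption here; CodeBook Lab's own normalization may differ.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def normalized_levenshtein_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein_distance(a, b) / longest


print(levenshtein_distance("kitten", "sitting"))  # 3
```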

Experiment Configuration

Most multi-run setup happens through the parameter grid dictionary you pass into run_experiment_grid(...).

tasks: which task folders to run
models: which Ollama models to evaluate (e.g. gemma3:270m, llama3.2:3b, qwen3.5:latest)
use_examples: whether to include worked examples from the codebook in the LLM prompt (zero-shot vs. few-shot)
prompt_types: which prompt wrapper to use (standard, persona, or CoT)
temperatures: sampling temperature values (0 is the default for classification)
top_ps: nucleus sampling values (leave empty for model default)
chat_modes: whether model calls use a fresh chat per query (per_query), one chat per text row (per_text, the default), or one continuous chat for the whole run (continuous)
reasoning: Ollama reasoning mode (true, false, or None for the model default)
process_textboxes: whether textbox-style annotations should be generated and scored
process_spans: whether span annotations should be generated and scored

When process_textboxes is enabled, install the optional textbox extras first if you want the full textbox metric suite:

python -m pip install "codebook-lab[textbox]"

Add multiple values to any field and the package sweeps them automatically. For a single quick run, keep one value in each field.
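To estimate how many runs a grid will trigger before launching it, you can expand it yourself. The sketch below assumes the sweep is a full Cartesian product over the list-valued fields, which is the natural reading of the description above but is not confirmed from the package source. The `expand_grid` helper is hypothetical, and the scalar `country_iso_code` entry is left out because it does not vary.

```python
from itertools import product

param_grid = {
    "tasks": ["policy-sentiment"],
    "models": ["gemma3:270m", "llama3.2:3b"],
    "use_examples": ["true"],
    "prompt_types": ["standard", "persona"],
    "temperatures": ["0", "0.2"],
    "top_ps": ["None"],
    "chat_modes": ["per_text", "per_query"],
    "reasoning": ["None", "false"],
    "process_textboxes": ["true"],
    "process_spans": ["false"],
}


def expand_grid(grid):
    """Return one dict per combination of values in the grid."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[key] for key in keys))]


combinations = expand_grid(param_grid)
print(len(combinations))  # 32 under the full-Cartesian-product assumption
```

Five fields carry two values each and the rest carry one, so the count is 2 × 2 × 2 × 2 × 2 = 32. Run counts grow multiplicatively, which is worth checking before adding another model or temperature.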

Create Your Own Task

1. Create a local folder such as my_tasks/my-task/.
2. Annotate your data in CodeBook Studio and save the labeled file as my_tasks/my-task/ground-truth.csv.
3. Download the codebook JSON from Studio and save it as my_tasks/my-task/codebook.json.
4. Pass task_root="my_tasks" and task="my-task" into ExperimentSpec(...) when you run experiments.

If you are still designing a task and do not yet have human-coded labels, you can run annotation with codebook_lab.run_annotation(...) on an unlabeled CSV and add ground-truth.csv later when you want to score model performance with codebook_lab.run_metrics(...).
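Before pointing an experiment at a new task folder, a small sanity check such as the following can catch misplaced files. It is a sketch using only the standard library: it verifies that codebook.json parses as JSON and reports the header row of ground-truth.csv if that file exists. The helper name `inspect_task_folder` is hypothetical, and the check says nothing about whether the codebook matches the schema CodeBook Lab expects.

```python
import csv
import json
from pathlib import Path


def inspect_task_folder(task_root, task):
    """Report whether the codebook parses and which columns the ground truth has."""
    task_dir = Path(task_root) / task
    report = {"codebook_is_valid_json": False, "ground_truth_columns": None}

    codebook_path = task_dir / "codebook.json"
    if codebook_path.is_file():
        try:
            json.loads(codebook_path.read_text(encoding="utf-8"))
            report["codebook_is_valid_json"] = True
        except json.JSONDecodeError:
            pass  # leave the flag False so the caller can see the problem

    ground_truth_path = task_dir / "ground-truth.csv"
    if ground_truth_path.is_file():
        with ground_truth_path.open(newline="", encoding="utf-8") as handle:
            # Only the header row is read; None means the file is empty.
            report["ground_truth_columns"] = next(csv.reader(handle), None)

    return report


if __name__ == "__main__":
    print(inspect_task_folder("my_tasks", "my-task"))
```

A `ground_truth_columns` value of None is expected while you are still designing the task and have no human labels yet.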

Human Reliability And Adjudication

When multiple human coders annotate the same items, CodeBook Lab can validate the coder CSVs, calculate inter-coder reliability, find disagreements, and build a consensus ground-truth.csv.

from codebook_lab import build_human_ground_truth, calculate_human_reliability

coder_csvs = {
    "coder1": "annotations/coder1.csv",
    "coder2": "annotations/coder2.csv",
    "coder3": "annotations/coder3.csv",
}

reliability = calculate_human_reliability(
    codebook_path="codebook.json",
    coder_csvs=coder_csvs,
    output_dir="outputs/human_reliability",
)

ground_truth = build_human_ground_truth(
    codebook_path="codebook.json",
    coder_csvs=coder_csvs,
    output_dir="outputs/ground_truth",
)

Each coder CSV must contain a stable item identifier column. The default is sample_id; pass id_column="..." to use a different column. By default, coder assignments are inferred from the submitted files. To validate expected coverage, pass an optional assignment CSV in either long format (sample_id,coder_id) or wide format (sample_id,ra_1,ra_2,...).
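To get a feel for what coverage validation involves, the sketch below finds items that were not coded by every coder. It assumes full overlap is intended, which will not hold for designs where each coder sees only a subset; that is the case the assignment CSV is meant for. The function `find_coverage_gaps` is illustrative and is not CodeBook Lab's validator.

```python
import csv
from collections import defaultdict


def find_coverage_gaps(coder_csvs, id_column="sample_id"):
    """Return {item_id: [coders missing that item]} for items not coded by everyone."""
    seen_by = defaultdict(set)
    for coder, path in coder_csvs.items():
        with open(path, newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle)
            if id_column not in (reader.fieldnames or []):
                raise ValueError(f"{path} has no '{id_column}' column")
            for row in reader:
                seen_by[row[id_column]].add(coder)

    all_coders = set(coder_csvs)
    return {
        item_id: sorted(all_coders - coders)
        for item_id, coders in seen_by.items()
        if coders != all_coders
    }
```

Calling it with the same `coder_csvs` mapping used above returns an empty dict when every coder has a row for every item.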

Reliability outputs include validation_issues.csv, pairwise_icr.csv, multirater_icr.csv, disagreements.csv, and summary.md. Ground-truth outputs include ground-truth.csv, adjudication_queue.csv, and validation_issues.csv.

Rows without a strict majority are written to adjudication_queue.csv. Open that queue in CodeBook Studio's adjudication mode, fill the unresolved blanks, export the completed queue, then rebuild:

resolved = build_human_ground_truth(
    codebook_path="codebook.json",
    coder_csvs=coder_csvs,
    adjudications_csv="adjudication_queue.csv",
    output_dir="outputs/ground_truth_resolved",
)
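The "strict majority" rule mentioned above can be pictured with a few lines of Python. This is a sketch of the general idea, more than half of the coders choosing the same label, and not the package's implementation; how CodeBook Lab treats missing values or multi-label fields is not covered here.

```python
from collections import Counter


def strict_majority(labels):
    """Return the label chosen by more than half of the coders, or None if there is none."""
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None


print(strict_majority(["pos", "pos", "neg"]))  # pos
print(strict_majority(["pos", "neg"]))         # None, so the row would need adjudication
```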

Advanced Customization

If you want to go beyond the default wrappers and hyperparameters, codebook_lab/annotate.py and codebook_lab/prompts.py are the main extension points.

To add new prompt wrappers beyond standard, persona, and CoT, register them from Python with register_prompt_wrapper(...) or extend the built-in registry in codebook_lab/prompts.py.
To expose additional model hyperparameters such as top_k, add them to setup_model(), thread them through run_annotation(...) and run_experiment(...), and add the corresponding field to the grid you pass into run_experiment_grid(...).

License

This project is licensed under the GNU Affero General Public License v3.0.

Citation

If you use CodeBook Lab in research, please cite both:

this software package
the associated arXiv preprint

Citation metadata is also available in the project's CITATION.cff.

Software Citation

APSR style:

McLaren, Lorcan. 2026. CodeBook Lab (Version v1.4.0) [Computer software]. Zenodo. https://doi.org/10.5281/zenodo.19185921.

BibTeX:

@software{mclaren_codebook_lab_2026,
  author = {McLaren, Lorcan},
  title = {CodeBook Lab},
  year = {2026},
  version = {v1.4.0},
  doi = {10.5281/zenodo.19185921},
  url = {https://doi.org/10.5281/zenodo.19185921}
}

Preprint Citation

APSR style:

McLaren, Lorcan, James P. Cross, Zuzanna Krakowska, Robin Rauner, and Martijn Schoonvelde. 2026. Magic Words or Methodical Work? Challenging Conventional Wisdom in LLM-Based Political Text Annotation. arXiv preprint arXiv:2603.26898. https://arxiv.org/abs/2603.26898.

BibTeX:

@misc{mclaren_magic_words_2026,
  author = {McLaren, Lorcan and Cross, James P. and Krakowska, Zuzanna and Rauner, Robin and Schoonvelde, Martijn},
  title = {Magic Words or Methodical Work? Challenging Conventional Wisdom in LLM-Based Political Text Annotation},
  year = {2026},
  eprint = {2603.26898},
  archivePrefix = {arXiv},
  primaryClass = {cs.CL},
  doi = {10.48550/arXiv.2603.26898},
  url = {https://arxiv.org/abs/2603.26898}
}

Release history

1.4.0 (this version): Jun 25, 2026
1.3.0: Jun 24, 2026
1.2.0: Jun 24, 2026
1.1.1: Jun 8, 2026
1.1.0: Mar 23, 2026
1.0.0: Mar 23, 2026
