CodeBook Lab
CodeBook Lab is an LLM annotation experiment pipeline for computational social science. It takes a codebook and labelled dataset from CodeBook Studio (source) and runs structured experiments across the dimensions that matter for text-as-data research: model choice, model size, prompt style, zero-shot versus few-shot learning, and sampling hyperparameters — all benchmarked against human labels.
Experiments are controlled through Python objects rather than by editing pipeline code. Because the codebook and labelled data stay constant across runs, each dimension can be isolated and compared against the same human labels.
For a step-by-step walkthrough covering both tools, see the CodeBook Studio & Lab Tutorial.
Contents
- How It Fits With CodeBook Studio
- Package Overview
- Quickstart
- Experiment Configuration
- Create Your Own Task
- Advanced Customization
- License
- Citation
How It Fits With CodeBook Studio
CodeBook Studio defines the task. CodeBook Lab runs and evaluates the experiment.
| CodeBook Studio | CodeBook Lab |
| --- | --- |
| Define the annotation task<br>Annotate texts with humans<br>Export codebook.json<br>Save labeled data as ground-truth.csv | Strip label columns automatically<br>Run LLM annotation experiments<br>Sweep over models, prompts, and hyperparameters<br>Evaluate outputs against human labels |
Package Overview
The package is organized around a small set of importable modules:
- codebook_lab.experiments: high-level functions for single experiments and multi-run comparisons
- codebook_lab.annotate: lower-level annotation functions
- codebook_lab.metrics: evaluation and metrics functions
- codebook_lab.prompts: prompt wrapper registry for built-in and custom prompt styles
- codebook_lab.examples: helpers for bundled example tasks
- codebook_lab.types: dataclasses for experiment specifications and result objects
The package also ships with a bundled example task, policy-sentiment, so you can start experimenting immediately after installation.
Quickstart
1. Create a Python environment
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install codebook-lab
This installs CodeBook Lab from PyPI so you can import it in your own scripts, notebooks, or analysis workflows.
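To confirm the installation, you can run the imports used in the examples below (a quick sanity check, nothing more):

```python
# Sanity check: these are the names used throughout this README.
from codebook_lab import ExperimentSpec, run_experiment, run_experiment_grid

print("codebook-lab imported successfully")
```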
If you plan to generate or score textbox annotations, install the optional textbox dependencies as well:
python -m pip install "codebook-lab[textbox]"
2. Install and start Ollama
Install Ollama on your machine, then make sure the local server is running:
ollama serve
If the default local Ollama server is not already running, CodeBook Lab will try to start it automatically when you run an experiment. It will also pull the requested Ollama model automatically if it is not already available locally.
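If you want to verify the server yourself before launching a run, a minimal standard-library check against Ollama's default local endpoint (port 11434) looks like this; it is independent of CodeBook Lab:

```python
# Check whether a local Ollama server is reachable on its default port.
# GET /api/tags lists the models already pulled on this machine.
import json
import urllib.request

try:
    with urllib.request.urlopen("http://localhost:11434/api/tags", timeout=5) as resp:
        models = [m["name"] for m in json.load(resp)["models"]]
        print("Ollama is running; local models:", models)
except OSError as exc:
    print("Ollama server not reachable:", exc)
```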
3. Choose a model and task
The package ships with a bundled example task called policy-sentiment. Any Ollama model available on your machine can be used.
task = "policy-sentiment"
model = "gemma3:270m"
You can inspect or copy bundled example tasks from Python:
from codebook_lab import copy_example_task, list_example_tasks
print(list_example_tasks())
copy_example_task("policy-sentiment", "./my_tasks", overwrite=True)
Set country_iso_code to the country where the compute is physically running. CodeCarbon uses it to look up the local grid's emissions factor when converting energy use into emissions estimates, and it should be a 3-letter ISO 3166-1 alpha-3 code such as USA, IRL, or DEU.
4. Run experiments from Python
Single experiment:
from codebook_lab import ExperimentSpec, run_experiment
result = run_experiment(
ExperimentSpec(
task="policy-sentiment",
model="gemma3:270m",
use_examples=False,
prompt_type="standard",
temperature=None,
top_p=None,
process_textbox=True,
country_iso_code="IRL",
),
output_root="outputs",
)
print(result.experiment_directory)
print(result.metrics.summary_text)
If process_textbox=True, CodeBook Lab will calculate textbox similarity metrics such as ROUGE, cosine similarity, and BERTScore when the optional textbox dependencies are installed. Without them, the run still completes, but textbox metrics that rely on those packages are reported as unavailable, with a warning explaining how to install them.
Parameter sweep:
from codebook_lab import run_experiment_grid
results = run_experiment_grid(
param_grid={
"country_iso_code": "IRL",
"tasks": ["policy-sentiment"],
"models": ["gemma3:270m", "llama3.2:3b"],
"use_examples": ["false", "true"],
"prompt_types": ["standard", "persona"],
"temperatures": ["None", "0.2"],
"top_ps": ["None"],
"process_textboxes": ["true"],
},
output_root="outputs",
)
print(f"Completed {len(results)} runs")
Custom prompt wrapper:
from codebook_lab import ExperimentSpec, PromptContext, register_prompt_wrapper, run_experiment
def concise_wrapper(context: PromptContext) -> str:
return (
"Annotate the text as carefully as possible.\n\n"
f"{context.core_prompt}\n\n"
f'Text:\n"{context.text}"\n\n'
"Response:\n"
)
register_prompt_wrapper("concise", concise_wrapper)
result = run_experiment(
ExperimentSpec(
task="policy-sentiment",
model="gemma3:270m",
prompt_type="concise",
country_iso_code="IRL",
)
)
5. Inspect the outputs
Each run creates a timestamped experiment directory under outputs/<task>/ containing:
- output.csv: row-level model annotations
- config.json: the run configuration
- classification_reports.txt: per-label evaluation summaries
- emissions.csv: CodeCarbon output
- timing_data.json: inference timing summary
- char_counts.json: prompt and response character counts
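Given that layout, a few lines of standard-library Python can locate the most recent run for a task. This sketch relies only on the directory structure described above:

```python
# Find the newest experiment directory for a task and list its artifacts.
from pathlib import Path

task_dir = Path("outputs") / "policy-sentiment"
latest = max(task_dir.iterdir(), key=lambda p: p.stat().st_mtime)
print("Latest run:", latest)
for artifact in sorted(latest.iterdir()):
    print(" ", artifact.name)
```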
Aggregate metrics are written to outputs/metrics/<task>_metrics_log.csv.
That metrics log stores both annotation-quality metrics and run metadata. Depending on the annotation type, it can include:
- classification metrics such as accuracy, precision, recall, F1, and percentage agreement
- inter-rater style agreement metrics such as Cohen's kappa and Krippendorff's alpha
- ordinal metrics for Likert labels such as Spearman correlation and quadratic weighted kappa
- textbox metrics such as normalized Levenshtein similarity, BLEU, ROUGE, cosine similarity, and BERTScore
- resource and run metadata such as CPU model, GPU model, total inference time, average inference time, total input characters, total output characters, energy consumed in kWh, and emissions in kg CO2eq
This makes it easy to compare not just which model is most accurate, but also which setup is fastest, cheapest to run, and least energy-intensive.
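Because the log is a plain CSV, it loads directly into pandas. A minimal sketch; the exact columns depend on the annotation type, so inspect them before filtering:

```python
# Load the aggregate metrics log and see which columns this task produced.
import pandas as pd

log = pd.read_csv("outputs/metrics/policy-sentiment_metrics_log.csv")
print(log.columns.tolist())
print(log.head())
```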
Textbox note: normalized Levenshtein and BLEU work with the base install, but ROUGE, embedding-based cosine similarity, and BERTScore require the optional textbox extras. Install them with python -m pip install "codebook-lab[textbox]".
Experiment Configuration
Most multi-run setup happens through the parameter grid dictionary you pass into run_experiment_grid(...).
- tasks: which task folders to run
- models: which Ollama models to evaluate (e.g. gemma3:270m, llama3.2:3b, qwen3.5:latest)
- use_examples: whether to include worked examples from the codebook in the LLM prompt (zero-shot vs. few-shot)
- prompt_types: which prompt wrapper to use (standard, persona, or CoT)
- temperatures: sampling temperature values (leave empty for model default)
- top_ps: nucleus sampling values (leave empty for model default)
- process_textboxes: whether textbox-style annotations should be generated and scored
When process_textboxes is enabled, install the optional textbox extras first if you want the full textbox metric suite:
python -m pip install "codebook-lab[textbox]"
Add multiple values to any field and the package sweeps them automatically. For a single quick run, keep one value in each field.
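Note that grids grow quickly. Assuming run_experiment_grid expands the list-valued fields into their full cross-product (implied by the word grid, though not stated explicitly here), the quickstart sweep above amounts to 2 × 2 × 2 × 2 = 16 runs:

```python
# Estimate sweep size under the full cross-product assumption.
from math import prod

param_grid = {
    "country_iso_code": "IRL",  # scalar, shared by all runs
    "tasks": ["policy-sentiment"],
    "models": ["gemma3:270m", "llama3.2:3b"],
    "use_examples": ["false", "true"],
    "prompt_types": ["standard", "persona"],
    "temperatures": ["None", "0.2"],
    "top_ps": ["None"],
    "process_textboxes": ["true"],
}
n_runs = prod(len(v) for v in param_grid.values() if isinstance(v, list))
print(n_runs)  # 16
```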
Create Your Own Task
1. Create a local folder such as my_tasks/my-task/.
2. Annotate your data in CodeBook Studio and save the labeled file as my_tasks/my-task/ground-truth.csv.
3. Download the codebook JSON from Studio and save it as my_tasks/my-task/codebook.json.
4. Pass task_root="my_tasks" and task="my-task" into ExperimentSpec(...) when you run experiments, as in the sketch below.
If you are still designing a task and do not yet have human-coded labels, you can run annotation with codebook_lab.run_annotation(...) on an unlabeled CSV and add ground-truth.csv later when you want to score model performance with codebook_lab.run_metrics(...).
Advanced Customization
If you want to go beyond the default wrappers and hyperparameters, codebook_lab/annotate.py and codebook_lab/prompts.py are the main extension points.
- To add new prompt wrappers beyond standard, persona, and CoT, register them from Python with register_prompt_wrapper(...) or extend the built-in registry in codebook_lab/prompts.py.
- To expose additional model hyperparameters such as top_k, add them to setup_model(), thread them through run_annotation(...) and run_experiment(...), and add the corresponding field to the grid you pass into run_experiment_grid(...).
License
This project is licensed under the GNU Affero General Public License v3.0.
Citation
If you use CodeBook Lab in research, please cite both:
- this software package
- the associated preprint
Citation metadata is also available in the project's CITATION.cff.
Software Citation
APSR style:
McLaren, Lorcan. 2026. CodeBook Lab (Version v1.0.0) [Computer software]. Zenodo. https://doi.org/10.5281/zenodo.19185921.
BibTeX:
@software{mclaren_codebook_lab_2026,
author = {McLaren, Lorcan},
title = {CodeBook Lab},
year = {2026},
version = {v1.0.0},
doi = {10.5281/zenodo.19185921},
url = {https://doi.org/10.5281/zenodo.19185921}
}
Preprint Citation
APSR style:
McLaren, Lorcan, James P. Cross, Zuzanna Krakowska, Robin Rauner, and Martijn Schoonvelde. 2026. Magic Words or Methodical Work? Challenging Conventional Wisdom in LLM-Based Political Text Annotation. Preprint.
BibTeX:
@misc{mclaren_magic_words_2026,
author = {McLaren, Lorcan and Cross, James P. and Krakowska, Zuzanna and Rauner, Robin and Schoonvelde, Martijn},
title = {Magic Words or Methodical Work? Challenging Conventional Wisdom in LLM-Based Political Text Annotation},
year = {2026},
note = {Preprint}
}