soak: graph-based pipelines and tools for LLM-assisted qualitative text analysis

Project description

Get to saturation faster!

soak is a tool that lets qualitative researchers rapidly define and run LLM-assisted text analysis pipelines, including thematic analysis.

The easiest way to see what soak does is to see sample outputs from the system.

The Zero-shot pipeline diagram shows the various stages the analysis involves:

an analysis pipeline

Input text from patient interviews:

raw data

Sample theme extracted:

themes extracted

Matching LLM extracted quotes to source text to detect hallucinations:

[Image: LLM-extracted quotes matched against the source text]
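The idea of checking extracted quotes against the source can be sketched in a few lines. This is a minimal illustration using Python's standard `difflib`, not soak's own matching code: the function names, the sliding-window approach, and the 0.8 threshold are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def best_match_ratio(quote: str, source: str) -> float:
    """Best similarity (0 to 1) between the quote and any quote-sized window of the source."""
    quote, source = normalise(quote), normalise(source)
    if not quote:
        return 0.0
    if quote in source:
        return 1.0  # exact match after normalisation
    window = len(quote)
    step = max(1, window // 4)  # overlapping windows; a coarse but cheap scan
    best = 0.0
    for start in range(0, max(1, len(source) - window + 1), step):
        candidate = source[start:start + window]
        best = max(best, SequenceMatcher(None, quote, candidate).ratio())
    return best


def is_supported(quote: str, source: str, threshold: float = 0.8) -> bool:
    """Treat a quote as supported if it closely matches some part of the source.

    The threshold is a hypothetical value chosen for illustration.
    """
    return best_match_ratio(quote, source) >= threshold


# Made-up example text, not taken from the sample data
source_text = "I was exhausted all the time.   Even small tasks felt impossible."
print(is_supported("even small tasks felt impossible", source_text))
print(is_supported("my doctor never listened to me", source_text))
```

Quotes that fall below the threshold are candidates for hallucination and worth checking by hand.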

A classification prompt that extracts structured data from transcripts. The green element is the templated input. Blue elements like [[this]] mark the LLM completions. Prompts are written in struckdown, a simple text-based format used to constrain the LLM output to a specific data type or structure.

A struckdown prompt

Inter-rater agreement and ground truth validation statistics, calculated for structured data extracted from transcripts:

IRR statistics
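As an illustration of what an agreement statistic measures, here is a small sketch of Cohen's kappa for two sets of labels. Whether soak reports this particular statistic is an assumption; the function name and the example labels are made up for the sketch.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same, non-empty set of items")
    n = len(rater_a)
    # Proportion of items on which the two raters agree
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's label frequencies
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two runs of the same classifier
run_1 = ["yes", "yes", "no", "no", "yes", "no"]
run_2 = ["yes", "no", "no", "no", "yes", "yes"]
print(round(cohens_kappa(run_1, run_2), 3))
```

A kappa of 1 means perfect agreement and 0 means agreement no better than chance.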

Ground truth validation: Classifier nodes can automatically validate LLM outputs against ground truth labels, calculating precision, recall, F1, and confusion matrices:

ground_truths:
  reflection:
    existing: reflection_exists  # Ground truth column
    mapping: {yes: 1, no: 0}     # Map LLM outputs to GT values

See Ground Truth Validation for details.
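To make the metrics concrete, the sketch below maps LLM outputs to ground truth values using the same `{yes: 1, no: 0}` mapping and computes precision, recall, F1, and the confusion matrix counts by hand. It is an illustration of the arithmetic only; the `validate` function and the example data are hypothetical, not part of soak.

```python
def validate(predictions, truths, mapping, positive=1):
    """Compare mapped LLM outputs with ground truth labels for a binary field."""
    mapped = [mapping[p] for p in predictions]  # e.g. "yes" -> 1, "no" -> 0
    pairs = list(zip(mapped, truths))
    tp = sum(1 for m, t in pairs if m == positive and t == positive)
    fp = sum(1 for m, t in pairs if m == positive and t != positive)
    fn = sum(1 for m, t in pairs if m != positive and t == positive)
    tn = sum(1 for m, t in pairs if m != positive and t != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall, "f1": f1}


# Made-up example: LLM answers and the reflection_exists ground truth column
llm_outputs = ["yes", "no", "yes", "yes", "no"]
reflection_exists = [1, 0, 0, 1, 1]
result = validate(llm_outputs, reflection_exists, mapping={"yes": 1, "no": 0})
print(result)
```

The four counts `tp`, `fp`, `fn`, and `tn` are the cells of the 2x2 confusion matrix.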

Plots and similarity statistics quantify the similarity between sets of themes created by different analyses. For example, we might compare different LLMs, different datasets (patients vs doctors), or different prompts (amending the research question posed to the LLM). The heatmap reveals themes shared between analyses or datasets:

heatmap

Similarity statistics summarise the same comparison between sets of themes numerically:

similarity statistics
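The kind of matrix behind such a heatmap can be sketched with a simple stand-in measure. The example below uses Jaccard overlap of word tokens, which is an assumption chosen to keep the sketch dependency-free; the source does not say which similarity measure soak uses, and the theme names are invented.

```python
import re


def tokens(text: str) -> set:
    """Lower-cased word tokens of a theme name."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Share of word tokens the two theme names have in common."""
    ta, tb = tokens(a), tokens(b)
    if not ta | tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def similarity_matrix(themes_a, themes_b):
    """Rows are themes from analysis A, columns are themes from analysis B."""
    return [[jaccard(a, b) for b in themes_b] for a in themes_a]


# Invented theme names for two hypothetical analyses
themes_a = ["loss of identity", "fatigue and exhaustion"]
themes_b = ["exhaustion and fatigue", "distrust of doctors"]
matrix = similarity_matrix(themes_a, themes_b)
for row in matrix:
    print(row)
```

High values in a cell suggest that the two analyses found a common theme; each row or column with no high value points to a theme unique to one analysis.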

Sample outputs

cfs1_simple.html shows a thematic analysis of transcripts of 8 patients with ME/CFS or Long COVID.
cfs2_pipeline.html shows the same analysis run with a different LLM, in extended HTML format.
comparison.html shows the comparison of these two analyses.
20251008_085446_5db6_pipeline.html shows the result of a different pipeline that extracts structured data from the transcripts (results are also available as JSON and CSV).

Example pipeline specifications

soak/pipelines/zs.soak is the Zero-shot pipeline used in the sample outputs above.
classifier.soak is the classifier pipeline used in the sample output above.

Quick Start

# install
git clone https://github.com/benwhalley/soak
cd soak
uv tool install .

# set credentials, using openai for simplicity
export LLM_API_KEY=your_api_key
export LLM_API_BASE=https://api.openai.com/v1

# Run analysis
soak zs soak/data/cfs/*.txt -t simple -o cfs-simple-1

# Open results in a browser
open cfs-simple-1_simple.html

# Re-run with a different/better model
soak zs -o cfs-simple-2 --model-name="openai/gpt-4o" soak/data/cfs/*.txt

# Compare results
soak compare cfs-simple-1.json cfs-simple-2.json -o comparison.html

More usage

# Basic pattern
uv run soak <pipeline> <files> --output <name>

# Run demo pipeline on sample text files
uv run soak demo --output demo_analysis soak/data/cfs/*.txt

# Use the 'simple' html output template
uv run soak zs -t simple --output analysis_simple soak/data/cfs/*.txt

Working with CSV/XLSX Spreadsheets

CSV and XLSX files are fully supported. Each row becomes a separate document, with column values accessible in templates as {{column_name}}.

Example data (soak/data/test_data.csv):

participant_id,age,condition,response
P001,25,control,I felt very relaxed during the session
P002,32,treatment,The intervention helped me focus better

Run classifier on CSV:

uv run soak classifier_tabular --output csv_analysis soak/data/test_data.csv

Pipeline template accessing columns:

# pipeline.soak
nodes:
  - name: analyze
    type: Map
    inputs: [documents]
---#analyze
Participant {{participant_id}} (age {{age}}, {{condition}} group):
{{response}}

Summarize the response: [[summary:str]]
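The row-to-document behaviour can be pictured with a short sketch. It reads the example CSV with Python's standard `csv` module and fills `{{column_name}}` placeholders with a regular expression. This is a simplified stand-in for illustration; soak's actual loading and templating code is not shown in this document, and the `render` helper is hypothetical.

```python
import csv
import io
import re

CSV_TEXT = """participant_id,age,condition,response
P001,25,control,I felt very relaxed during the session
P002,32,treatment,The intervention helped me focus better
"""

TEMPLATE = (
    "Participant {{participant_id}} (age {{age}}, {{condition}} group):\n"
    "{{response}}"
)


def render(template: str, row: dict) -> str:
    """Replace each {{column_name}} with the value from this row.

    Raises KeyError if the template names a column the row lacks.
    """
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", lambda m: row[m.group(1)], template)


# Each CSV row becomes one document (a dict keyed by column name)
documents = list(csv.DictReader(io.StringIO(CSV_TEXT)))
prompts = [render(TEMPLATE, row) for row in documents]
print(prompts[0])
```

Each rendered prompt would then be followed by the completion slot, such as `[[summary:str]]`, for the LLM to fill.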

Sampling options:

# Process first 10 rows only (useful for testing)
uv run soak classifier_tabular --head 10 --output test_run survey.csv

# Randomly sample 50 rows
uv run soak classifier_tabular --sample 50 --output pilot survey.csv
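Conceptually, the two sampling options correspond to taking a prefix of the rows or a random subset. The sketch below shows that distinction in plain Python; the function names and the optional seed are assumptions for illustration, and whether soak's `--sample` is seeded is not stated here.

```python
import random


def head(rows: list, n: int) -> list:
    """First n rows, in their original order (like --head)."""
    return rows[:n]


def sample(rows: list, n: int, seed=None) -> list:
    """Up to n rows drawn at random without replacement (like --sample).

    The seed parameter is hypothetical; it makes the draw repeatable.
    """
    rng = random.Random(seed)
    return rng.sample(rows, min(n, len(rows)))


rows = list(range(100))  # stand-in for 100 spreadsheet rows
print(head(rows, 10))
print(len(sample(rows, 50, seed=0)))
```

Taking the head is deterministic and good for quick tests; a random sample gives a fairer picture of the whole dataset for a pilot run.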

See Working with Spreadsheet Data for more details.

Common Options:

--output, -o: Output filename (generates .json dump file and .html)
--model-name: LLM model (default: gpt-4o-mini)
-c, --context: Pipeline context variables (e.g., -c research_question="Experiences of patients with COVID-19")

Documentation

See CLAUDE.md for architecture details.

License

AGPL v3 or later

Please cite: Ben Whalley. (2025). benwhalley/soak: Initial public release (v0.3.0). Zenodo. https://doi.org/10.5281/zenodo.17293023

This version: 0.3.5 (Jan 26, 2026)

soaking 0.3.5
