soak: graph-based pipelines and tools for LLM-assisted qualitative text analysis
Get to saturation faster!
soak is a tool that lets qualitative researchers rapidly define and run LLM-assisted text analysis pipelines, including thematic analysis.
The easiest way to see what soak does is to see sample outputs from the system.
The Zero-shot pipeline diagram shows the various stages the analysis involves:
Input text from patient interviews:
Sample theme extracted:
Matching LLM-extracted quotes to the source text to detect hallucinations:
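The idea behind quote matching can be illustrated with a minimal sketch. This is not soak's implementation: the function names, the sliding-window approach, and the 0.9 threshold are all assumptions chosen for illustration. It slides a window the length of the quote over the source text and reports the best fuzzy-match score, so a quote that does not appear (even approximately) in the transcript scores low.

```python
import re
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    no_punct = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", no_punct).strip()


def best_match_ratio(quote: str, source: str) -> float:
    """Best similarity (0..1) between the quote and any same-length word window."""
    q_words = normalize(quote).split()
    s_words = normalize(source).split()
    if not q_words or not s_words:
        return 0.0
    q = " ".join(q_words)
    n = len(q_words)
    best = 0.0
    # Compare the quote with every window of n consecutive source words.
    for i in range(max(1, len(s_words) - n + 1)):
        window = " ".join(s_words[i:i + n])
        ratio = SequenceMatcher(None, q, window, autojunk=False).ratio()
        best = max(best, ratio)
    return best


def is_supported(quote: str, source: str, threshold: float = 0.9) -> bool:
    """Hypothetical check: treat quotes below the threshold as possible hallucinations."""
    return best_match_ratio(quote, source) >= threshold


source_text = "I felt exhausted all the time, and nobody believed me."
print(is_supported("felt exhausted all the time", source_text))
print(is_supported("my doctor cured me completely", source_text))
```

A fuzzy ratio rather than exact substring matching tolerates small differences in punctuation or transcription, at the cost of needing a threshold that has to be tuned on real data.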
A classification prompt, extracting structured data from transcripts. The green element is the templated input. The blue elements like [[this]] indicate the LLM completions. Prompts are written in struckdown, a simple text-based format used to constrain the LLM output to a specific data type or structure.
Inter-rater agreement and ground truth validation statistics, calculated for structured data extracted from transcripts:
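As a rough illustration of what an inter-rater agreement statistic measures, here is a minimal sketch of Cohen's kappa for two raters (for example, two LLM runs, or an LLM and a human coder). The README does not say which agreement statistics soak reports, so the choice of kappa and the function name are assumptions.

```python
from collections import Counter


def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Proportion of items on which the raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


run_1 = ["yes", "yes", "no", "no"]
run_2 = ["yes", "no", "no", "no"]
print(cohens_kappa(run_1, run_2))
```

Kappa is 1 for perfect agreement and about 0 when agreement is no better than chance, which makes it more informative than raw percent agreement when one label dominates.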
Ground truth validation: Classifier nodes can automatically validate LLM outputs against ground truth labels, calculating precision, recall, F1, and confusion matrices:
ground_truths:
  reflection:
    existing: reflection_exists  # Ground truth column
    mapping: {yes: 1, no: 0}     # Map LLM outputs to GT values
See Ground Truth Validation for details.
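To make the validation metrics concrete, the following sketch computes a confusion matrix, precision, recall, and F1 for a binary label after applying a mapping like the one in the configuration above. It is an illustration only, not soak's code: the function name `validate` and the sample data are hypothetical.

```python
from collections import Counter


def validate(llm_outputs, ground_truth, mapping, positive=1):
    """Compare mapped LLM outputs with ground truth labels for one binary field."""
    predicted = [mapping[output] for output in llm_outputs]
    # Confusion matrix keyed by (actual, predicted).
    confusion = Counter(zip(ground_truth, predicted))
    tp = confusion[(positive, positive)]
    fp = sum(c for (actual, pred), c in confusion.items()
             if pred == positive and actual != positive)
    fn = sum(c for (actual, pred), c in confusion.items()
             if actual == positive and pred != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"confusion": confusion, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical data: LLM answers and the ground truth column values.
llm_outputs = ["yes", "no", "yes", "yes", "no"]
reflection_exists = [1, 0, 0, 1, 1]
result = validate(llm_outputs, reflection_exists, mapping={"yes": 1, "no": 0})
print(result["precision"], result["recall"], result["f1"])
```

The mapping step matters because the LLM emits text labels while the ground truth column may hold numeric codes; the metrics are only meaningful once both are in the same value space.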
Plots and similarity statistics quantify the similarity between sets of themes created by different analyses. For example, we might compare different LLMs, different datasets (patients vs. doctors), or different prompts (amending the research question posed to the LLM). The heatmap reveals common themes between different analyses or datasets:
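A pairwise similarity matrix of the kind a heatmap visualises can be sketched as follows. This example uses word-overlap (Jaccard) similarity purely as a simple stand-in; the README does not describe the similarity measure soak uses, and the theme names here are invented for illustration.

```python
import re


def tokens(theme: str) -> set:
    """Lowercased word set for a theme label."""
    return set(re.findall(r"[a-z]+", theme.lower()))


def jaccard(a: str, b: str) -> float:
    """Share of distinct words the two theme labels have in common."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def similarity_matrix(themes_a: list, themes_b: list) -> list:
    """Rows are themes from analysis A, columns are themes from analysis B."""
    return [[jaccard(a, b) for b in themes_b] for a in themes_a]


# Hypothetical theme labels from two analyses.
analysis_a = ["Fatigue and exhaustion", "Not being believed"]
analysis_b = ["Exhaustion and fatigue", "Support from family"]
matrix = similarity_matrix(analysis_a, analysis_b)
for row in matrix:
    print(row)
```

Word overlap misses paraphrases ("tiredness" vs. "fatigue"), so a real comparison would more plausibly use a semantic measure; the matrix shape and its reading are the same either way.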
Sample outputs
- cfs1_simple.html shows a thematic analysis of transcripts of 8 patients with ME/CFS or Long COVID.
- cfs2_pipeline.html shows the same analysis using a different LLM, in the extended HTML format.
- comparison.html shows the comparison of these two analyses.
- 20251008_085446_5db6_pipeline.html shows the result of a different pipeline extracting structured data from the transcripts (results are also available as JSON and CSV).
Example pipeline specifications
- soak/pipelines/zs.soak is the Zero-shot pipeline used in the sample outputs above.
- classifier.soak is the classifier pipeline used in the sample output above.
Quick Start
# install
git clone https://github.com/benwhalley/soak
cd soak
uv tool install .
# set credentials, using openai for simplicity
export LLM_API_KEY=your_api_key
export LLM_API_BASE=https://api.openai.com/v1
# Run analysis
soak zs soak/data/cfs/*.txt -t simple -o cfs-simple-1
# Open results in a browser
open cfs-simple-1_simple.html
# Re-run with a different/better model
soak zs -o cfs-simple-2 --model-name="openai/gpt-4o" soak/data/cfs/*.txt
# Compare results
soak compare cfs-simple-1.json cfs-simple-2.json -o comparison.html
More usage
# Basic pattern
uv run soak <pipeline> <files> --output <name>
# Run demo pipeline on sample text files
uv run soak demo --output demo_analysis soak/data/cfs/*.txt
# Use the 'simple' html output template
uv run soak zs -t simple --output analysis_simple soak/data/cfs/*.txt
Working with CSV/XLSX Spreadsheets
CSV and XLSX files are fully supported. Each row becomes a separate document, with column values accessible in templates as {{column_name}}.
Example data (soak/data/test_data.csv):
participant_id,age,condition,response
P001,25,control,I felt very relaxed during the session
P002,32,treatment,The intervention helped me focus better
Run classifier on CSV:
uv run soak classifier_tabular --output csv_analysis soak/data/test_data.csv
Pipeline template accessing columns:
# pipeline.soak
nodes:
  - name: analyze
    type: Map
    inputs: [documents]
---#analyze
Participant {{participant_id}} (age {{age}}, {{condition}} group):
{{response}}
Summarize the response: [[summary:str]]
Sampling options:
# Process first 10 rows only (useful for testing)
uv run soak classifier_tabular --head 10 --output test_run survey.csv
# Randomly sample 50 rows
uv run soak classifier_tabular --sample 50 --output pilot survey.csv
See Working with Spreadsheet Data for more details.
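The row-to-document behaviour described in this section can be sketched in plain Python. This is an illustration of the concept, not soak's loader: `render` and `load_documents` are hypothetical helpers, the `head` and `sample` parameters only mimic the `--head` and `--sample` options, and the `[[summary:str]]` completion slot is left out because it is filled by the LLM.

```python
import csv
import io
import random
import re

CSV_TEXT = (
    "participant_id,age,condition,response\n"
    "P001,25,control,I felt very relaxed during the session\n"
    "P002,32,treatment,The intervention helped me focus better\n"
)

TEMPLATE = (
    "Participant {{participant_id}} (age {{age}}, {{condition}} group):\n"
    "{{response}}"
)


def render(template: str, row: dict) -> str:
    """Replace each {{column_name}} placeholder with the row's value."""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", lambda m: row[m.group(1)], template)


def load_documents(csv_text, head=None, sample=None, seed=0):
    """Turn each CSV row into one rendered document, optionally subsetting rows."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if head is not None:
        rows = rows[:head]  # first N rows, like --head
    if sample is not None:
        # Random subset, like --sample; seeded here so runs are repeatable.
        rows = random.Random(seed).sample(rows, min(sample, len(rows)))
    return [render(TEMPLATE, row) for row in rows]


documents = load_documents(CSV_TEXT)
print(documents[0])
```

Running on the first few rows before a full run is a cheap way to check that every placeholder in the template matches a column header, since a misspelled name fails immediately here with a `KeyError`.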
Common Options:
- --output, -o: Output filename (generates a .json dump file and an .html file)
- --model-name: LLM model (default: gpt-4o-mini)
- -c, --context: Pipeline context variables (e.g., -c research_question="Experiences of patients with COVID-19")
Documentation
See CLAUDE.md for architecture details.
License
AGPL v3 or later
Please cite: Ben Whalley. (2025). benwhalley/soak: Initial public release (v0.3.0). Zenodo. https://doi.org/10.5281/zenodo.17293023