
soak: graph-based pipelines and tools for LLM-assisted qualitative text analysis

Get to saturation faster!

soak is a tool that lets qualitative researchers rapidly define and run LLM-assisted text analysis pipelines, including thematic analysis.

The easiest way to see what soak does is to see sample outputs from the system.

The Zero-shot pipeline diagram shows the various stages the analysis involves:

an analysis pipeline

Input text from patient interviews:

raw data

Sample theme extracted:

themes extracted

Matching LLM extracted quotes to source text to detect hallucinations:

alt text
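The idea behind quote checking can be sketched in a few lines of Python. This is an illustration only, not soak's implementation: the function names, the sliding-window approach, and the 0.8 threshold are all assumptions made for the example. It uses difflib from the standard library to score how closely a quote matches any window of the source text.

```python
from difflib import SequenceMatcher


def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def best_match_ratio(quote: str, source: str) -> float:
    """Best similarity (0..1) between the quote and any same-length source window."""
    q, s = normalise(quote), normalise(source)
    if not q:
        return 0.0
    if q in s:
        return 1.0  # verbatim match after normalisation
    if len(s) <= len(q):
        return SequenceMatcher(None, q, s).ratio()
    step = max(1, len(q) // 4)  # overlapping windows keep the scan cheap
    best = 0.0
    for start in range(0, len(s) - len(q) + 1, step):
        window = s[start:start + len(q)]
        best = max(best, SequenceMatcher(None, q, window).ratio())
    return best


def is_supported(quote: str, source: str, threshold: float = 0.8) -> bool:
    """Treat a quote as supported when some source window is close enough.

    The threshold is an arbitrary choice for this sketch.
    """
    return best_match_ratio(quote, source) >= threshold


source = "I felt exhausted all the time, and nobody at the clinic believed me."
print(is_supported("nobody at the clinic believed me", source))
print(is_supported("the doctors were always very supportive", source))
```

Quotes that fall below the threshold are candidates for manual review rather than proof of fabrication, since an LLM may paraphrase lightly.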

A classification prompt that extracts structured data from transcripts. The green element is the templated input. Blue elements such as [[this]] mark the LLM completions. Prompts are written in struckdown, a simple text-based format that constrains the LLM output to a specific data type or structure.

A struckdown prompt

Inter-rater agreement and ground truth validation statistics, calculated for structured data extracted from transcripts:

IRR statistics
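As an illustration of what an inter-rater agreement statistic measures, here is a minimal Cohen's kappa for two raters (or two LLM runs) labelling the same items. Which statistics soak actually reports is not stated here, so treat the choice of kappa and the function name as assumptions.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Proportion of items on which the raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


run_1 = ["yes", "yes", "no", "no"]
run_2 = ["yes", "no", "no", "no"]
print(cohens_kappa(run_1, run_2))  # observed 0.75, expected 0.5, kappa 0.5
```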

Ground truth validation: Classifier nodes can automatically validate LLM outputs against ground truth labels, calculating precision, recall, F1, and confusion matrices:

ground_truths:
  reflection:
    existing: reflection_exists  # Ground truth column
    mapping: {yes: 1, no: 0}     # Map LLM outputs to GT values

See Ground Truth Validation for details.
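The statistics named above can be computed as in the following sketch. It mirrors the mapping from the configuration ({yes: 1, no: 0}) but is otherwise a stand-alone illustration: the function name and the returned dictionary layout are hypothetical and do not describe soak's output format.

```python
def validate(predictions, ground_truth, mapping, positive=1):
    """Compare mapped LLM outputs with ground truth labels for a binary task."""
    mapped = [mapping[p] for p in predictions]  # e.g. "yes" -> 1, "no" -> 0
    pairs = list(zip(mapped, ground_truth))
    tp = sum(1 for m, g in pairs if m == positive and g == positive)
    fp = sum(1 for m, g in pairs if m == positive and g != positive)
    fn = sum(1 for m, g in pairs if m != positive and g == positive)
    tn = sum(1 for m, g in pairs if m != positive and g != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "confusion": {"tp": tp, "fp": fp, "fn": fn, "tn": tn},
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


llm_outputs = ["yes", "yes", "no", "no", "yes"]
reflection_exists = [1, 0, 0, 1, 1]  # ground truth column
stats = validate(llm_outputs, reflection_exists, mapping={"yes": 1, "no": 0})
print(stats)
```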

Plots and similarity statistics quantify the similarity between sets of themes created by different analyses. For example, we might compare different LLMs, different datasets (patients vs. doctors), or different prompts (such as an amended research question posed to the LLM). The heatmap reveals common themes between different analyses or datasets:

heatmap

Similarity statistics quantify the similarity between sets of themes created by different analyses.

similarity statistics
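A pairwise similarity matrix of this kind can be sketched as follows. The example uses bag-of-words cosine similarity purely as a dependency-free stand-in; the similarity measure soak uses is not described here and may differ (for instance, it could be embedding-based). All names and the example themes are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Very rough text representation: lower-cased word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def similarity_matrix(themes_a: list[str], themes_b: list[str]) -> list[list[float]]:
    """Rows are themes from analysis A, columns are themes from analysis B."""
    vecs_a = [bag_of_words(t) for t in themes_a]
    vecs_b = [bag_of_words(t) for t in themes_b]
    return [[cosine(a, b) for b in vecs_b] for a in vecs_a]


analysis_1 = ["fatigue and exhaustion", "not being believed by clinicians"]
analysis_2 = ["fatigue and exhaustion", "financial strain"]
for row in similarity_matrix(analysis_1, analysis_2):
    print([round(value, 2) for value in row])
```

Each cell of the matrix corresponds to one cell of a heatmap like the one above: high values suggest the two analyses found a shared theme.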

Sample outputs

  • cfs1_simple.html shows a thematic analysis of transcripts of 8 patients with ME/CFS or Long COVID.

  • cfs2_pipeline.html shows the same analysis run with a different LLM and rendered in the extended HTML format.

  • comparison.html shows the comparison of these two analyses.

  • 20251008_085446_5db6_pipeline.html shows the result of a different pipeline that extracts structured data from the transcripts (results are also available as JSON and CSV).

Example pipeline specifications

Quick Start

# install
git clone https://github.com/benwhalley/soak
cd soak
uv tool install .

# set credentials, using openai for simplicity
export LLM_API_KEY=your_api_key
export LLM_API_BASE=https://api.openai.com/v1

# Run analysis
soak zs soak/data/cfs/*.txt -t simple -o cfs-simple-1

# Open results in a browser
open cfs-simple-1_simple.html

# Re-run with a different/better model
soak zs -o cfs-simple-2 --model-name="openai/gpt-4o" soak/data/cfs/*.txt

# Compare results
soak compare cfs-simple-1.json cfs-simple-2.json -o comparison.html

More usage

# Basic pattern
uv run soak <pipeline> <files> --output <name>

# Run demo pipeline on sample text files
uv run soak demo --output demo_analysis soak/data/cfs/*.txt

# Use the 'simple' html output template
uv run soak zs -t simple --output analysis_simple soak/data/cfs/*.txt

Working with CSV/XLSX Spreadsheets

CSV and XLSX files are fully supported. Each row becomes a separate document, with column values accessible in templates as {{column_name}}.

Example data (soak/data/test_data.csv):

participant_id,age,condition,response
P001,25,control,I felt very relaxed during the session
P002,32,treatment,The intervention helped me focus better

Run classifier on CSV:

uv run soak classifier_tabular --output csv_analysis soak/data/test_data.csv

Pipeline template accessing columns:

# pipeline.soak
nodes:
  - name: analyze
    type: Map
    inputs: [documents]
---#analyze
Participant {{participant_id}} (age {{age}}, {{condition}} group):
{{response}}

Summarize the response: [[summary:str]]
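The row-to-document behaviour can be pictured with the sketch below, which reads the example CSV and fills the {{column_name}} placeholders for each row. It is only an illustration: the regex substitution stands in for whatever templating engine soak uses, and the render function is hypothetical.

```python
import csv
import io
import re

CSV_TEXT = """participant_id,age,condition,response
P001,25,control,I felt very relaxed during the session
P002,32,treatment,The intervention helped me focus better
"""

TEMPLATE = (
    "Participant {{participant_id}} (age {{age}}, {{condition}} group):\n"
    "{{response}}"
)


def render(template: str, row: dict[str, str]) -> str:
    """Replace each {{column}} placeholder with the row's value for that column."""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", lambda m: row[m.group(1)], template)


# Each CSV row becomes one document.
rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
documents = [render(TEMPLATE, row) for row in rows]
print(documents[0])
```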

Sampling options:

# Process first 10 rows only (useful for testing)
uv run soak classifier_tabular --head 10 --output test_run survey.csv

# Randomly sample 50 rows
uv run soak classifier_tabular --sample 50 --output pilot survey.csv

See Working with Spreadsheet Data for more details.

Common Options:

  • --output, -o: Output filename (generates .json dump file and .html)
  • --model-name: LLM model (default: gpt-4o-mini)
  • -c, --context: Pipeline context variables (e.g., -c research_question="Experiences of patients with COVID-19")

Documentation

See CLAUDE.md for architecture details.

License

AGPL v3 or later

Please cite: Ben Whalley. (2025). benwhalley/soak: Initial public release (v0.3.0). Zenodo. https://doi.org/10.5281/zenodo.17293023
