thematic-analyser

Corpus-level inductive thematic analysis via multi-LLM consensus labelling — a member of the lens analyser family.

Most family members read one artefact for fixed signals. This one is the family's first corpus-level, inductive member: it takes a whole corpus and discovers a codebook. Like cite-sight, it sets auto_routable=False, because a file extension alone does not imply a corpus.

The method

Harvested from a parked research project (Unveiling Risks in AI Systems, Borck & Thompson 2024 — see docs/method/). The novelty is not the topic model; it's what happens to its output:

  1. Topics — a pluggable, optional topic model proposes candidate themes (BERTopic via the [topics] extra, or bring your own precomputed topics). Mirrors BERTopic's clustering/representation split.
  2. Independent — two or more coders (different LLMs) label each topic blind, no peeking.
  3. Critique — coders see each other's labels and argue over N rounds, revising toward the most defensible shared label.
  4. Resolve — converged label if they agree; otherwise the majority of the final round, flagged agreed=False for a human to settle.
  5. Reliability — Krippendorff's α (the [irr] extra) over the blind labels — the defensibility number. Percent-agreement fallback otherwise.
  6. Codebook — a flat set of themes the human groups into a hierarchy (apply_hierarchy), exportable to REFI-QDA for QualCoder/NVivo/ATLAS.ti.
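
The Resolve rule in step 4 can be sketched as follows. This is a minimal illustration, not the package's implementation: the `Consensus` record and `resolve` function are hypothetical names, the case and whitespace normalisation is an assumption, and ties (for example two coders who still disagree) are broken here by taking the first label encountered.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Consensus:
    """Hypothetical result record; fields mirror the README's `label` and `agreed`."""
    label: str
    agreed: bool


def resolve(final_round_labels: list[str]) -> Consensus:
    """Return the converged label, or the final-round majority flagged agreed=False."""
    if not final_round_labels:
        raise ValueError("need at least one coder label")
    # Assumption: trivial case/whitespace differences should not count as disagreement.
    normalised = [label.strip().lower() for label in final_round_labels]
    counts = Counter(normalised)
    # most_common keeps first-encountered order among equal counts, so ties go to the earliest coder.
    top_label, _ = counts.most_common(1)[0]
    return Consensus(label=top_label, agreed=len(counts) == 1)


print(resolve(["Role-play framing", "role-play framing "]))
print(resolve(["obfuscation", "encoding tricks", "encoding tricks"]))
```

Anything that comes back with `agreed=False` is what a human reviewer would be asked to settle.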

The human sets the hierarchy; the machine does the labelling and the bookkeeping.
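
To make the reliability number in step 5 concrete, here is a hand-rolled sketch of Krippendorff's α for nominal labels next to the percent-agreement fallback. It is only an illustration under stated assumptions: the function names are made up, each unit is a tuple of the blind labels given to one topic, and the package itself presumably delegates to whatever the [irr] extra installs rather than to code like this.

```python
import math
from collections import Counter
from itertools import permutations


def percent_agreement(units: list[tuple[str, ...]]) -> float:
    """Share of units on which every coder gave the same label."""
    return sum(len(set(labels)) == 1 for labels in units) / len(units)


def krippendorff_alpha_nominal(units: list[tuple[str, ...]]) -> float:
    """Krippendorff's alpha for nominal data, computed from the coincidence matrix."""
    coincidences: Counter = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a unit with a single label carries no agreement information
        # Every ordered pair of labels from different coders adds 1 / (m - 1).
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    totals: Counter = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b] for a in totals for b in totals if a != b)
    if expected == 0:
        return math.nan  # only one label was ever used, so alpha is undefined
    return 1 - (n - 1) * observed / expected


blind = [("a", "a"), ("b", "b"), ("a", "b"), ("b", "b")]
print(percent_agreement(blind))
print(krippendorff_alpha_nominal(blind))
```

Unlike percent agreement, α corrects for the agreement expected by chance given how often each label is used, which is why the two numbers differ on the same labels.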

Install

uv venv && uv pip install -e '../lens-contract' -e '.[dev]'
uv run pytest                       # offline smoke (stub coders, no API key)

uv pip install -e '.[topics]'       # + fit topics from raw text (BERTopic)
uv pip install -e '.[llm]'          # + real LLM coders (anthropic)
uv pip install -e '.[irr]'          # + Krippendorff's alpha
uv pip install -e '.[documents]'    # + .pdf/.docx ingestion via document-analyser

CLI

thematic-analyser corpus.txt                      # fit topics, stub coders, human summary
thematic-analyser corpus.txt --topics topics.json # skip fitting; use precomputed topics
thematic-analyser corpus/ --rounds 3 --json       # directory of docs; JSON to stdout
thematic-analyser serve --port 8017               # HTTP API
thematic-analyser manifest                        # capability manifest

A bare positional argument means analyse. With --json, stdout carries only the ThematicAnalysis model; diagnostics go to stderr.
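
Because --json keeps stdout clean, the output can be piped straight into a script. The sketch below assumes the JSON mirrors the Python attribute names used elsewhere in this README (`reliability`, and `consensus` entries with `label` and `agreed`); that shape is an assumption, the helper name is hypothetical, and the sample payload is invented purely for illustration.

```python
import json


def labels_needing_review(json_text: str) -> list[str]:
    """List the consensus labels that were not agreed and so need a human decision."""
    analysis = json.loads(json_text)
    # Assumed shape: {"reliability": float, "consensus": [{"label": str, "agreed": bool}, ...]}
    return [
        item["label"]
        for item in analysis.get("consensus", [])
        if not item.get("agreed", False)
    ]


# Invented sample standing in for real CLI output.
sample = (
    '{"reliability": 0.71, "consensus": ['
    '{"label": "role-play framing", "agreed": true}, '
    '{"label": "payload splitting", "agreed": false}]}'
)
print(labels_needing_review(sample))
```

With the CLI installed, the text could come from `subprocess.run(["thematic-analyser", "corpus/", "--json"], capture_output=True, text=True).stdout` instead of the sample string.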

Python

from thematic_analyser import ThematicAnalyser, LLMCoder

# Real two-model consensus (needs the [llm] extra + ANTHROPIC_API_KEY):
coders = [
    LLMCoder("claude", "claude-opus-4-8", context="jailbreak prompts"),
    LLMCoder("haiku",  "claude-haiku-4-5-20251001", context="jailbreak prompts"),
]
result = ThematicAnalyser(coders, rounds=3).analyse("corpus.txt", topics="topics.json")
print(result.reliability)            # Krippendorff's alpha on the blind labels
print([(c.label, c.agreed) for c in result.consensus])

Without coders it defaults to two offline stub coders so everything runs with no API key — that's what the test suite uses.

HTTP

thematic-analyser serve --port 8017
curl -F file=@corpus.txt -F rounds=3 'http://127.0.0.1:8017/analyse'
curl http://127.0.0.1:8017/health

Endpoints: GET /health, GET /manifest, POST /analyse (multipart corpus upload). The HTTP face runs the cheap stub-coder default; the LLM tier and human-in-the-loop curation live in the desktop app, which calls the Python surface directly.
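
For callers that prefer Python to curl, the same multipart upload can be assembled with the standard library alone. This is a sketch under assumptions: the form field names `file` and `rounds` are taken from the curl example above, `build_analyse_request` is a hypothetical helper, and the shape of the response is not specified here.

```python
import urllib.request
import uuid


def build_analyse_request(
    corpus: bytes,
    filename: str = "corpus.txt",
    rounds: int = 3,
    base_url: str = "http://127.0.0.1:8017",
) -> urllib.request.Request:
    """Build (but do not send) a multipart POST to /analyse."""
    boundary = uuid.uuid4().hex
    # Text parts: the rounds field, then the headers of the file part.
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="rounds"\r\n\r\n'
        f"{rounds}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: text/plain\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/analyse",
        data=head + corpus + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


request = build_analyse_request(b"first document\nsecond document\n")
print(request.get_method(), request.full_url)
```

With the server from `thematic-analyser serve --port 8017` running, `urllib.request.urlopen(request)` would send it; the request is built separately here so it can be inspected without a live server.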

Status

v0.1 scaffold. Working offline path (corpus → topics → consensus → reliability → codebook → REFI-QDA export). Seams still to flesh out: BERTopic fitting ([topics]), real provider wiring beyond Anthropic, a full .qdpx writer, and the local desktop curation app (forked from the debrief/insight-lens shell).
