thematic-analyser

Corpus-level inductive thematic analysis via multi-LLM consensus labelling — a member of the lens analyser family.

Most family members read one artefact for fixed signals. This one is the family's first corpus-level, inductive member: it takes a whole corpus and discovers a codebook. Like cite-sight, it sets auto_routable=False, because a file extension does not imply a corpus.

The method

Harvested from a parked research project (Unveiling Risks in AI Systems, Borck & Thompson 2024 — see docs/method/). The novelty is not the topic model; it's what happens to its output:

  1. Topics: a pluggable, optional topic model proposes candidate themes (BERTopic via the [topics] extra, or bring your own precomputed topics). The design mirrors BERTopic's clustering/representation split.
  2. Independent: two or more coders (different LLMs) label each topic blind, without seeing each other's labels.
  3. Critique: coders see each other's labels and argue over N rounds, revising toward the most defensible shared label.
  4. Resolve: the converged label if they agree; otherwise the majority label of the final round, flagged agreed=False for a human to settle.
  5. Reliability: Krippendorff's α (the [irr] extra) over the blind labels is the defensibility number. Without the extra, it falls back to percent agreement.
  6. Codebook: a flat set of themes that the human groups into a hierarchy (apply_hierarchy), exportable to REFI-QDA for QualCoder/NVivo/ATLAS.ti.
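
Steps 2 to 4 can be pictured with a small standalone sketch. It does not use the package's classes: a coder is modelled as a plain callable, and label_topic, stubborn and flexible are hypothetical names invented for illustration. Tie-breaking in particular is an assumption, since the description above only says "majority of the final round".

```python
from collections import Counter
from typing import Callable, Sequence

# Hypothetical coder interface: takes the topic keywords plus the labels its
# peers gave in the previous round (empty in the blind round), returns a label.
Coder = Callable[[Sequence[str], Sequence[str]], str]


def label_topic(keywords: Sequence[str], coders: Sequence[Coder], rounds: int = 3) -> dict:
    # Step 2, Independent: every coder labels blind.
    blind = [coder(keywords, []) for coder in coders]
    labels = list(blind)

    # Step 3, Critique: each coder sees the others' previous labels and may revise.
    for _ in range(rounds):
        if len(set(labels)) == 1:
            break  # converged, no further rounds needed
        labels = [
            coder(keywords, labels[:i] + labels[i + 1:])
            for i, coder in enumerate(coders)
        ]

    # Step 4, Resolve: converged label, else the most common final-round label.
    # Assumption: on a tie, Counter keeps the label of the earliest coder.
    winner, _count = Counter(labels).most_common(1)[0]
    return {"label": winner, "agreed": len(set(labels)) == 1, "blind": blind}


def stubborn(keywords, peers):
    # Offline stand-in coder that never changes its mind.
    return "prompt injection"


def flexible(keywords, peers):
    # Offline stand-in coder that adopts a peer's label once it sees one.
    return peers[0] if peers else "role-play jailbreak"


result = label_topic(["ignore", "previous", "instructions"], [stubborn, flexible])
print(result["label"], result["agreed"])
```

The blind labels are returned alongside the consensus because the reliability step is computed over them, not over the revised labels.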

The human sets the hierarchy; the machine does the labelling and the bookkeeping.
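
As a rough illustration of that division of labour, the sketch below nests a flat list of theme labels under parents chosen by a human and does the bookkeeping checks. It is not the package's apply_hierarchy, whose signature is not documented here; group_themes and its return shape are assumptions made for this example.

```python
def group_themes(themes, hierarchy):
    """Nest flat theme labels under human-chosen parent themes.

    themes: flat list of labels from the codebook.
    hierarchy: {parent: [child labels]} supplied by the human.
    (Hypothetical helper, not part of thematic-analyser.)
    """
    assigned = [child for children in hierarchy.values() for child in children]

    # Bookkeeping: every assigned label must exist in the flat codebook.
    unknown = sorted(set(assigned) - set(themes))
    if unknown:
        raise ValueError(f"labels not in the codebook: {unknown}")

    # Bookkeeping: a label may sit under one parent only.
    duplicates = sorted({c for c in assigned if assigned.count(c) > 1})
    if duplicates:
        raise ValueError(f"labels assigned more than once: {duplicates}")

    # Anything the human has not placed yet is reported, not dropped.
    placed = set(assigned)
    ungrouped = [t for t in themes if t not in placed]
    return {
        "groups": {parent: list(children) for parent, children in hierarchy.items()},
        "ungrouped": ungrouped,
    }


flat = ["prompt injection", "role-play jailbreak", "data exfiltration"]
tree = group_themes(flat, {"Jailbreaks": ["prompt injection", "role-play jailbreak"]})
print(tree["ungrouped"])
```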

Install

uv venv && uv pip install -e '../lens-contract' -e '.[dev]'
uv run pytest                       # offline smoke (stub coders, no API key)

uv pip install -e '.[topics]'       # + fit topics from raw text (BERTopic)
uv pip install -e '.[llm]'          # + real LLM coders (anthropic)
uv pip install -e '.[irr]'          # + Krippendorff's alpha
uv pip install -e '.[documents]'    # + .pdf/.docx ingestion via document-analyser

CLI

thematic-analyser corpus.txt                      # fit topics, stub coders, human summary
thematic-analyser corpus.txt --topics topics.json # skip fitting; use precomputed topics
thematic-analyser corpus/ --rounds 3 --json       # directory of docs; JSON to stdout
thematic-analyser serve --port 8017               # HTTP API
thematic-analyser manifest                        # capability manifest

A bare positional argument means analyse. --json prints the ThematicAnalysis model to stdout and nothing else; diagnostics go to stderr.
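
Because stdout carries only JSON, a calling script can parse it directly and keep the diagnostics separate. The sketch below shows one way to do that with the standard library. run_json_command is a hypothetical helper, and the "reliability" key is an assumption borrowed from the Python attribute of the same name; a stand-in child process is used so the example runs without the package installed.

```python
import json
import subprocess
import sys


def run_json_command(argv):
    """Run a command, parse its stdout as JSON, return (data, stderr_text)."""
    proc = subprocess.run(argv, capture_output=True, text=True, check=True)
    return json.loads(proc.stdout), proc.stderr


# Real use, assuming thematic-analyser is installed and on PATH:
#   analysis, diagnostics = run_json_command(
#       ["thematic-analyser", "corpus/", "--rounds", "3", "--json"])

# Stand-in child process: JSON on stdout, a diagnostic line on stderr.
fake_cli = [
    sys.executable,
    "-c",
    "import json, sys; "
    "print(json.dumps({'reliability': 0.5})); "
    "print('fitting topics', file=sys.stderr)",
]
analysis, diagnostics = run_json_command(fake_cli)
print(analysis["reliability"])
```

check=True makes a non-zero exit status raise CalledProcessError, so a failed run is not mistaken for an empty result.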

Python

from thematic_analyser import ThematicAnalyser, LLMCoder

# Real two-model consensus (needs the [llm] extra + ANTHROPIC_API_KEY):
coders = [
    LLMCoder("claude", "claude-opus-4-8", context="jailbreak prompts"),
    LLMCoder("haiku",  "claude-haiku-4-5-20251001", context="jailbreak prompts"),
]
result = ThematicAnalyser(coders, rounds=3).analyse("corpus.txt", topics="topics.json")
print(result.reliability)            # Krippendorff's alpha on the blind labels
print([(c.label, c.agreed) for c in result.consensus])

Without coders it defaults to two offline stub coders so everything runs with no API key — that's what the test suite uses.
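
To make the reliability number concrete, here is a standalone sketch of both statistics for nominal labels, written from the textbook coincidence-matrix definition. It is not the package's implementation (that goes through the [irr] extra), it does not handle missing labels, and the function names are invented for this example.

```python
from collections import Counter
from itertools import permutations


def percent_agreement(units):
    """Share of units on which every coder gave the same label."""
    return sum(len(set(labels)) == 1 for labels in units) / len(units)


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels, no missing values.

    units: one tuple of labels per topic, one label per coder.
    """
    # Coincidence matrix: each ordered pair of labels within a unit
    # contributes 1 / (m - 1), where m is the number of labels in the unit.
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a single label says nothing about agreement
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b) / n
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n * (n - 1))
    if expected == 0:
        return float("nan")  # only one label was ever used; alpha is undefined
    return 1 - observed / expected


blind = [
    ("injection", "injection"),
    ("injection", "injection"),
    ("role-play", "role-play"),
    ("role-play", "injection"),
]
print(percent_agreement(blind))                     # 0.75
print(round(krippendorff_alpha_nominal(blind), 4))  # 0.5333
```

Alpha comes out lower than percent agreement here because it discounts the agreement two coders would reach by chance given how often each label is used.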

HTTP

thematic-analyser serve --port 8017
curl -F file=@corpus.txt -F rounds=3 'http://127.0.0.1:8017/analyse'
curl http://127.0.0.1:8017/health

Endpoints: GET /health, GET /manifest, POST /analyse (multipart corpus upload). The HTTP interface runs the cheap stub-coder default; the LLM tier and human-in-the-loop curation live in the desktop app, which calls the Python surface directly.
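
The curl upload can also be made from Python with only the standard library. The sketch below builds the multipart body by hand and posts it to /analyse. The field names file and rounds come from the curl example above; build_multipart and analyse are hypothetical helpers, and treating the response as JSON is an assumption.

```python
import json
import urllib.request
import uuid


def build_multipart(fields, file_field, filename, content):
    """Return (body, content_type) for a multipart/form-data request."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts += [
            f"--{boundary}".encode(),
            f'Content-Disposition: form-data; name="{name}"'.encode(),
            b"",
            str(value).encode(),
        ]
    parts += [
        f"--{boundary}".encode(),
        (
            f'Content-Disposition: form-data; name="{file_field}"; '
            f'filename="{filename}"'
        ).encode(),
        b"Content-Type: text/plain",
        b"",
        content,
        f"--{boundary}--".encode(),
        b"",
    ]
    return b"\r\n".join(parts), f"multipart/form-data; boundary={boundary}"


def analyse(base_url, corpus_path, rounds=3):
    """POST a corpus file to /analyse (needs a running `thematic-analyser serve`)."""
    with open(corpus_path, "rb") as handle:
        body, content_type = build_multipart(
            {"rounds": rounds}, "file", "corpus.txt", handle.read()
        )
    request = urllib.request.Request(
        f"{base_url}/analyse",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)  # assumption: the endpoint answers with JSON


# Example call, once the server from the section above is running:
#   result = analyse("http://127.0.0.1:8017", "corpus.txt", rounds=3)
```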

Status

v0.1 scaffold. Working offline path (corpus → topics → consensus → reliability → codebook → REFI-QDA export). Seams still to flesh out: BERTopic fitting ([topics]), real provider wiring beyond Anthropic, a full .qdpx writer, and the local desktop curation app (forked from the debrief/insight-lens shell).
