Watchdog
Investigative journalism document intelligence: drop records, find connections.
Watchdog is a Claude Code tool for journalists who accumulate large sets of public records. Drop documents into a folder. Watchdog reads every page, extracts every person, company, address, and relationship it finds, stores them as linked notes in an Obsidian vault, and proactively surfaces connections you might have missed.
Alpha. Core pipeline works. Tested on macOS with real investigation documents. Not yet battle-hardened for production use. Feedback and contributions welcome.
⚠️ Public records only
Watchdog is designed exclusively for publicly available documents: court filings, corporate registrations, government contracts, regulatory filings, land registry records, and similar public-interest material.
Do not use Watchdog with:
- Confidential source communications
- Unpublished tips or leaked documents
- Private correspondence
- Any material that could identify a confidential source
- Documents obtained under a promise of confidentiality
Every document Watchdog processes is read by an AI. There is no way to take that back. If you are unsure whether a document is safe to process, do not process it.
What it does
- Ingests anything: PDFs (scanned or text), Word documents, spreadsheets, images, court documents, corporate filings, financial statements, and more, powered by Docling
- Extracts entities: people, companies, addresses, properties, court cases, and transactions, with page-level citations and confidence levels on every fact
- Builds timelines: datable events are extracted per entity and assembled into a global chronological view across the entire investigation
- Finds connections: shared addresses, overlapping directors, unusual role combinations, entities appearing across unrelated documents
- Flags contradictions: when a new document conflicts with a known fact (different address, conflicting date, mismatched role), Watchdog adds a [!contradiction] callout to the entity note with both sources cited
- Tracks session state: hot.md is rewritten after every ingest with a current-state summary so Claude can orient itself instantly at the start of a new session without re-reading the vault
- Logs every ingest: log.md is a human-readable, append-only record of every ingest session, visible in Obsidian
- Seeds investigation context: drop prior published stories into _CONTEXT/ and Watchdog interviews you to build a rich context.md that orients every subsequent ingest
- Handles large documents: 400+ page PDFs are split and processed in parallel; no truncation
- Auto-OCRs scanned documents: detects missing or garbled text layers and applies OCR automatically; falls back to encrypted/malformed PDF repair
- Preserves provenance: every extracted fact, timeline event, and relationship links to the source document and page; every vault note is directly linked to the original file
- Domain knowledge built in: dedicated extraction skills for corporate filings, court documents, real estate records, financial statements, bankruptcy filings, and government contracts
- Stores everything in Obsidian: your vault is yours; Watchdog writes to it, you query and annotate it
How it works
Drop file into _INCOMING/
↓
watchdog preprocess: SHA-256 dedup · OCR detection · Docling extraction · near-duplicate check
↓
Claude extracts entities, relationships, timeline events, and key facts
↓
watchdog write-vault writes everything atomically:
entity notes · document notes · global timeline · registries · morgue move
↓
Post-ingest briefing: new entities · connections · leads · anomalies
The ingest pipeline is a Claude Code skill: Claude reads the document, applies domain knowledge, and produces a structured extraction JSON. The Python pipeline handles the mechanical work (OCR, hashing, similarity detection, vault writes). You keep the Obsidian vault and every original file.
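As a rough illustration of the SHA-256 dedup step in the preprocessing stage, the sketch below hashes each incoming file and skips any whose digest has been seen before. The function names and the in-memory set of known hashes are hypothetical; Watchdog's actual registry format and dedup logic may differ.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, block_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size blocks so large PDFs are never loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def is_exact_duplicate(path: Path, known_hashes: set[str]) -> bool:
    """Return True if this file's digest was already seen; otherwise record it.

    `known_hashes` is a hypothetical stand-in for whatever the real
    document registry stores.
    """
    file_hash = sha256_of_file(path)
    if file_hash in known_hashes:
        return True
    known_hashes.add(file_hash)
    return False
```

Hashing file bytes only catches byte-identical copies; a rescanned or re-exported version of the same record has a different digest, which is why the pipeline also runs a separate near-duplicate check.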
Docling
Watchdog uses Docling for all document conversion. Docling is an open-source document understanding library from IBM Research that extracts text, tables, and layout from PDFs, Word documents, spreadsheets, HTML, and images.
Why Docling matters for investigative work:
- Table extraction: financial statements and creditor lists are full of tables. Docling reconstructs them as structured data rather than garbled text, so Claude can reason about rows and columns correctly.
- Layout awareness: multi-column layouts, footnotes, headers, and sidebars are handled correctly. A court document's header fields don't bleed into the body text.
- OCR integration: when text extraction fails or produces garbled output, Docling falls back to OCR automatically. On macOS, Apple Vision is used (fast, hardware-accelerated); on other platforms, Tesseract is the default (install via brew install tesseract or apt install tesseract-ocr). The engine is configurable; see Configuration.
- Large document handling: 400+ page PDFs are chunked into 40-page segments, processed in parallel, and reassembled in order with correct page numbers throughout.
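The large-document handling described above can be pictured with a short sketch: compute 1-based page ranges of at most 40 pages, hand each range to a worker, and collect results in the original order. This is an illustration only. The helper names are hypothetical, and a thread pool stands in for the parallel subprocesses Watchdog actually uses.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def chunk_page_ranges(total_pages: int, chunk_size: int = 40) -> list[tuple[int, int]]:
    """Return inclusive, 1-based (first_page, last_page) ranges covering the document."""
    if total_pages < 0 or chunk_size < 1:
        raise ValueError("total_pages must be >= 0 and chunk_size must be >= 1")
    return [
        (start, min(start + chunk_size - 1, total_pages))
        for start in range(1, total_pages + 1, chunk_size)
    ]


def process_in_order(
    ranges: list[tuple[int, int]],
    worker: Callable[[tuple[int, int]], str],
    max_workers: int = 4,
) -> list[str]:
    """Run `worker` on each range in parallel and return results in input order.

    Executor.map yields results in the order the inputs were submitted,
    so reassembly needs no extra sorting step.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, ranges))
```

For a 95-page PDF, `chunk_page_ranges(95)` gives `[(1, 40), (41, 80), (81, 95)]`; keeping the original page numbers in each range is what lets citations point at the right page after reassembly.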
Docling runs locally. Your documents never leave your machine during preprocessing.
Requirements
- macOS 12+: Linux supported; Windows via WSL2
- Obsidian v1.6+: free
- Claude Code: free to install
- Claude.ai Pro or Max subscription: required (Pro ~$20/month; Max from $100/month)
- Python 3.10+
- qpdf + Ghostscript: PDF decryption and repair
- Tesseract OCR: Linux/Windows only (macOS uses Apple Vision)
A Claude.ai Pro subscription is the recommended starting point. No API key setup, no per-token billing.
Installation
pipx install watchdog-intel
watchdog setup
watchdog setup installs the Claude Code skills, verifies system dependencies (qpdf, Ghostscript, Tesseract on Linux), and configures shell completions. It takes 5 to 10 minutes on first run, because Docling downloads its ML models and fastembed downloads the embedding model (~50 MB, one-time).
For step-by-step instructions written for journalists who have never used a terminal, see INSTALL.md.
Quick start
# Create a new investigation vault
watchdog new "Shell Company Investigation"
# Open the vault in Claude Code
watchdog open shell-company-investigation
Optional but recommended: before ingesting records, seed your investigation context from prior published stories or notes:
- Drop background files (clips, notes, screenshots) into _CONTEXT/
- Run /watchdog-context: Watchdog reads the material, asks you questions, and writes context.md
Then drop public records into _INCOMING/. At the start of every Claude Code session, Watchdog automatically checks for new files and ingests them. You can also trigger ingest manually:
/watchdog-ingest
/watchdog-ingest specific-document.pdf
Commands
| Command | What it does |
|---|---|
| /watchdog-context | Seed context.md from background files in _CONTEXT/ |
| /watchdog-ingest | Process all files in _INCOMING/ |
| /watchdog-ingest [file] | Process a specific file |
| /watchdog-entity [id ...] | Refresh entity Summary and Timeline from all source documents |
| /watchdog-query [question] | Answer a question from your vault |
| /watchdog-surface | Find connections and anomalies across the full vault |
| /watchdog-wiki | Create or update investigation thread pages |
| /watchdog-health | Check vault integrity: orphaned notes, broken links, registry mismatches |
CLI search (outside the vault, from any terminal):
watchdog search <investigation> "<query>"
watchdog search my-investigation "offshore account transfers"
watchdog search my-investigation "shell company director" --top 10
Returns the top matching results, both raw document pages (with file name and page number) and vault notes (entity and document summaries), ranked by semantic similarity. The index is built automatically during /watchdog-ingest; no separate step is required.
The first ingest after installation triggers a one-time ~50 MB model download (the BAAI/bge-small-en-v1.5 embedding model via fastembed). Subsequent runs use the cached model.
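To make "ranked by semantic similarity" concrete, here is a minimal sketch of cosine-similarity ranking over precomputed vectors. It is not Watchdog's search code: the function names are hypothetical, and the tiny hand-written vectors stand in for embeddings that would really come from the bge-small-en-v1.5 model.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query_vec: list[float],
    indexed: dict[str, list[float]],
    top: int,
) -> list[tuple[str, float]]:
    """Return the `top` (label, score) pairs most similar to the query vector."""
    scored = [(label, cosine_similarity(query_vec, vec)) for label, vec in indexed.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top]


# Toy index: labels mimic "file name, page number" results (illustrative only).
toy_index = {
    "filing.pdf p. 2": [1.0, 0.0],
    "contract.pdf p. 9": [0.0, 1.0],
    "affidavit.pdf p. 4": [1.0, 1.0],
}
best = rank_by_similarity([1.0, 0.0], toy_index, top=2)
```

Here `best` lists "filing.pdf p. 2" first and "affidavit.pdf p. 4" second, because their vectors point closest to the query direction.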
Query examples:
/watchdog-query Who are the directors of Shell Co Ltd?
/watchdog-query Which companies share the address 123 Main St?
/watchdog-query What happened in 2019 involving Alice Smith?
/watchdog-surface
Vault structure
Each investigation is an independent Obsidian vault:
my-investigation/
├── _INCOMING/                ← Drop public records here
│   └── _FAILED/              ← Files that could not be processed
├── _CONTEXT/                 ← Background material (prior stories, notes)
├── morgue/                   ← Original files after successful ingest
│   └── <entity>/
│       └── <doc-type>/
├── .watchdog/
│   └── Registry/             ← Internal state; do not edit manually
│       ├── entities.json
│       ├── documents.json
│       ├── manifest.json     ← Lightweight entity lookup index
│       ├── registry.json
│       └── ingest.log
├── entities/
│   ├── person/               ← One note per person
│   ├── company/              ← One note per company
│   └── address/              ← One note per address
├── documents/                ← One note per ingested document
├── briefings/                ← Post-ingest briefing notes
├── wiki/                     ← Investigation thread pages
├── timeline.md               ← Global chronological view across all entities
├── hot.md                    ← Current session state; rewritten after every ingest
├── log.md                    ← Append-only human-readable ingest history
├── context.md                ← Your investigation intent and key questions
└── index.md                  ← Dataview index
Entity notes
Each entity note has a consistent structure:
- ## Summary: synthesized overview of who this entity is and their significance; replaced on each ingest
- ## Analysis: accumulated investigative observations, dated and linked to source documents; never overwritten
- ## Timeline: chronological list of datable events involving this entity, linked to source pages
- ## Relationships: connections to other entities, with source citations
- ## Notes: reserved for journalist annotations; never touched by Watchdog
Every link to a source document includes a direct page link into the original file ([[morgue/.../file.pdf#page=3|p. 3]]), so you can jump from any fact straight to the page it came from.
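The page-link format shown above is easy to reproduce by hand when annotating notes. The helper below is a hypothetical sketch of how such a link could be assembled; the example path is invented for illustration and does not reflect Watchdog's real morgue layout beyond what the tree above shows.

```python
def page_link(morgue_path: str, page: int) -> str:
    """Build an Obsidian wikilink that opens `morgue_path` at a given PDF page."""
    if page < 1:
        raise ValueError("PDF page numbers start at 1")
    # "#page=N" targets the page; the text after "|" is the visible label.
    return f"[[{morgue_path}#page={page}|p. {page}]]"


# Hypothetical path, for illustration only.
link = page_link("morgue/shell-co-ltd/corporate-filings/annual-return.pdf", 3)
```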
Domain knowledge skills
Watchdog ships with extraction skills for 34 document types. When Claude identifies a matching document, it loads the relevant skill before extracting, applying journalist expertise about what to look for, what constitutes a red flag, and what fields matter.
Skills are jurisdiction-agnostic by default: universal principles come first, with specific jurisdictions (Canada, US, UK, Australia, EU) treated as examples, not as defaults.
Financial and corporate
| Skill | Covers |
|---|---|
| records/corporate-filings | Annual reports, registrations, director filings, beneficial ownership |
| records/financial-statements | Audited statements, MD&A, auditor opinions, related-party disclosures |
| records/regulatory-filings | Securities disclosures, insider trading reports, SEDAR+/EDGAR filings |
| records/bankruptcy | Bankruptcy filings, creditor lists, trustee reports, restructuring proceedings |
| records/insurance-filings | Regulatory returns, actuarial reports, reinsurance treaties, market conduct reviews |
| records/tax-documents | Charity information returns (T3010, Form 990), nonprofit filings, trust returns |
Legal and regulatory
| Skill | Covers |
|---|---|
| records/court-documents | Civil claims, affidavits, judgments, orders, injunctions |
| records/criminal-proceedings | Charging documents, bail decisions, trial decisions, sentencing, forfeiture orders |
| records/administrative-tribunals | Quasi-judicial administrative bodies: human rights, competition, environmental review, privacy, utility regulation |
| records/labour-arbitration | Grievance awards, labour board decisions, unfair labour practices, collective agreements |
| records/immigration-refugee | Asylum decisions, detention reviews, deportation orders, judicial reviews |
| records/healthcare-licensing | Discipline decisions, fitness to practise, facility inspections (medicine, nursing, pharmacy) |
| records/professional-licensing | Discipline decisions for lawyers, accountants, engineers, financial advisers, real estate agents |
| records/legislation | Statutes, regulations, orders-in-council, policy directives, white papers |
Government and public records
| Skill | Covers |
|---|---|
| records/government-contracts | RFPs, sole-source justifications, contract award notices |
| records/procurement-records | Post-award contracts, amendments, vendor performance, standing offer call-ups |
| records/audit-reports | Auditor general reports, performance audits, inspector general reports |
| records/government-reports | Royal commissions, public inquiries, parliamentary committee reports |
| records/foi-responses | FOI/ATI response packages, exemption indexes, redaction logs |
| records/legislature-transcripts | Hansard, committee transcripts, question period, congressional hearings |
| records/lobbying-records | Lobbyist registrations, communication reports, revolving door disclosures |
| records/election-filings | Campaign finance returns, donor lists, third-party advertising disclosures |
| records/municipal-records | Council minutes, zoning decisions, conflict-of-interest declarations |
| records/police-records | Occurrence reports, use-of-force records, public complaint decisions, coroner's inquests |
| records/corrections-records | Parole board decisions, probation orders, prison inspection reports, correctional oversight |
| records/environmental-filings | Pollutant release inventories, environmental assessments, compliance orders |
Property
| Skill | Covers |
|---|---|
| records/real-estate | Title transfers, mortgages, liens, assessments, market transactions |
| records/land-registries | Land registry and title systems (common law and civil law); deeds, charges, caveats |
| records/vehicle-registrations | Motor vehicle and vessel registrations, title transfers, liens, fleet records |
Specialized
| Skill | Covers |
|---|---|
| records/academic-research | Grant applications, ethics decisions, conflict-of-interest disclosures, retraction notices |
| records/aircraft-logs | Aircraft registrations, ADS-B flight tracks, safety investigation reports |
| records/dns-whois | WHOIS records, DNS data, IP allocation, SSL certificate transparency logs |
| records/news-clippings | News articles, press releases, wire stories, corrections, retractions |
| records/audio-video | YouTube transcripts, podcast transcripts, earnings calls, press conference recordings |
These skills encode real investigative knowledge: what fields are always present, what patterns are anomalous, what investigators typically miss. See src/watchdog/skills/records/ to read them or contribute new ones. A contributor template is at src/watchdog/skills/records/_template.md.
Multiple investigations
Watchdog is installed once. Each investigation is a separate vault:
watchdog new "Municipal Contracts Investigation"
watchdog new "Healthcare Funding Investigation"
watchdog list
watchdog status municipal-contracts-investigation
watchdog open municipal-contracts-investigation
Project names tab-complete in zsh, bash, and fish after installation.
Configuration
watchdog configure reads and writes ~/.watchdog/config.json. Run it with no arguments to see current values:
watchdog configure
To set a value:
watchdog configure <key> <value>
| Key | Default | Description |
|---|---|---|
| projects_dir | ~/Investigations | Where new investigation vaults are created. Set during watchdog setup, change here afterwards. |
| ocr_languages | (auto-detect) | Language codes for Apple Vision OCR, comma-separated (e.g. en-US,fr-FR). Only applies when using Apple Vision (auto mode on macOS or ocr_engine=apple_vision). Leave unset to let macOS 13+ detect the language automatically. Codes use BCP 47 format. |
| ocr_engine | auto | OCR engine for scanned documents. auto uses Apple Vision on macOS (if ocrmac is installed) and Tesseract elsewhere. Options: auto, apple_vision, tesseract, easyocr, rapidocr. Tesseract must be installed at the system level: brew install tesseract on macOS or sudo apt install tesseract-ocr on Debian/Ubuntu. |
| table_structure | true | Whether Docling runs its table detection model on PDFs. Set to false to speed up ingestion of text-only documents (court decisions, contracts, regulatory filings) that contain no meaningful tables. |
| garbled_threshold | 0.75 | Fraction of alphanumeric characters below which a PDF text layer is considered garbled and OCR is triggered automatically. Range: 0.0 to 1.0. |
| chunk_size | 40 | Pages per chunk when splitting large PDFs for parallel processing. |
| chunk_workers | (half CPU cores) | Parallel subprocesses for large-PDF processing. Set automatically at watchdog setup based on your machine. |
| chunk_timeout | 300 | Seconds before a chunk subprocess is killed. Increase for very large or complex PDFs on slow machines. |
| dup_threshold | 0.85 | Jaccard similarity score at which two documents are flagged as near-duplicates. Range: 0.0 to 1.0. |
| shingle_size | 3 | Word n-gram size for near-duplicate fingerprinting. Changing this invalidates existing shingle data; re-ingest to rebuild. |
| embed_images | false | Embed figures and images as base64 in the markdown output so Claude can read charts and image-based tables. Significantly increases token usage and processing time. Only enable for document sets where visual content carries investigative value. |
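To give a feel for what garbled_threshold controls, the sketch below computes the share of alphanumeric characters in a text layer and flags the page for OCR when that share falls below the threshold. It is an assumption-laden illustration: the function names are hypothetical, and ignoring whitespace in the denominator is a guess, not a documented detail of Watchdog's check.

```python
def alphanumeric_fraction(text: str) -> float:
    """Share of non-whitespace characters that are letters or digits.

    Excluding whitespace from the count is an assumption of this sketch.
    """
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0  # An empty text layer counts as fully garbled.
    return sum(ch.isalnum() for ch in visible) / len(visible)


def needs_ocr(text: str, garbled_threshold: float = 0.75) -> bool:
    """Return True when the text layer looks missing or garbled."""
    return alphanumeric_fraction(text) < garbled_threshold
```

Under these assumptions, an ordinary sentence scores well above 0.75 and passes through untouched, while a page of stray symbols or an empty text layer scores near zero and is sent to OCR. Raising the threshold makes OCR trigger more readily.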
Examples:
# Switch to Tesseract on a non-Mac machine where it is already installed
watchdog configure ocr_engine tesseract
# Disable table detection for a project that is all court decisions
watchdog configure table_structure false
# Override OCR languages for a collection of French and Arabic documents
watchdog configure ocr_languages "fr-FR,ar-SA"
# Move investigation storage to an external drive
watchdog configure projects_dir /Volumes/SecureDrive/Investigations
Aliases: config, setting, settings all resolve to configure.
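If you want to inspect the configuration from a script, something like the following sketch would work, assuming ~/.watchdog/config.json is a flat JSON object keyed by the names in the table above. That file layout is an assumption, as are the function and constant names; the defaults are copied from the table, and keys without a fixed default (ocr_languages, chunk_workers) are left out.

```python
import json
from pathlib import Path

# Defaults taken from the configuration table; layout of the file is assumed.
DEFAULTS = {
    "projects_dir": "~/Investigations",
    "ocr_engine": "auto",
    "table_structure": True,
    "garbled_threshold": 0.75,
    "chunk_size": 40,
    "chunk_timeout": 300,
    "dup_threshold": 0.85,
    "shingle_size": 3,
    "embed_images": False,
}


def load_config(path: Path) -> dict:
    """Return defaults overlaid with any values found in the JSON file at `path`."""
    config = dict(DEFAULTS)
    if path.exists():
        config.update(json.loads(path.read_text(encoding="utf-8")))
    return config


def default_config_path() -> Path:
    """Location named in this README: ~/.watchdog/config.json."""
    return Path.home() / ".watchdog" / "config.json"
```

This only reads the file. For changes, prefer watchdog configure <key> <value>, since the CLI may validate values in ways this sketch does not.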
A note on AI and hallucination
Watchdog uses Claude to read documents and extract facts. AI can make mistakes: it may confabulate specificity, misread names, or draw incorrect inferences.
A few safeguards are built in:
- Every extracted fact carries a confidence level (high, medium, low, disputed)
- Every claim links to the source document and page so you can verify it directly
- Low-confidence facts are leads, not findings: they belong in the vault but must not be treated as established
- /watchdog-entity lets you refresh an entity's Summary and Timeline at any time, re-synthesizing from all source documents rather than relying on a chain of incremental updates
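One way to picture the first three safeguards is as a record that cannot exist without a confidence level and a source page. The dataclass below is a hypothetical model for illustration; the field names are assumptions and do not describe Watchdog's extraction JSON.

```python
from dataclasses import dataclass

# The four levels named in the list above.
CONFIDENCE_LEVELS = ("high", "medium", "low", "disputed")


@dataclass(frozen=True)
class Fact:
    """A single extracted claim with its provenance (hypothetical shape)."""

    claim: str
    confidence: str
    source_document: str
    page: int

    def __post_init__(self) -> None:
        if self.confidence not in CONFIDENCE_LEVELS:
            raise ValueError(f"unknown confidence level: {self.confidence!r}")
        if self.page < 1:
            raise ValueError("page numbers start at 1")


def low_confidence_leads(facts: list[Fact]) -> list[Fact]:
    """Return the facts that should be treated as leads rather than findings."""
    return [fact for fact in facts if fact.confidence == "low"]
```

Filtering like this makes it straightforward to build a verification checklist: each lead already carries the document and page you need to open to confirm or discard it.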
Treat everything Watchdog produces as a structured first read, not a finished product. The vault is a tool for your reporting, not a replacement for it.
Alpha limitations
- macOS only for the scripted installer. Linux and Windows (WSL2) work but require manual setup; see INSTALL.md.
- Domain skills are v1. The extraction skills are well-researched but have not yet been validated in a live investigation. Expect rough edges, and please contribute improvements.
- No global entity registry. Entities are scoped to a single vault. Cross-investigation entity matching is planned for a future release.
- Audio/video requires extra setup. Speech-to-text (--with-asr) adds significant install time and disk space.
Contributing
Contributions are most welcome in three areas:
Domain knowledge skills: if you have deep expertise reading a document type that isn't covered, open an issue or submit a pull request to src/watchdog/skills/records/. The format is plain markdown, so no code is required. Copy _template.md as your starting point; it includes the standard structure and authoring notes.
Pipeline fixes: src/watchdog/pipeline/ contains the Python preprocessing code. Bug reports with a sample document (redacted if needed) are especially useful.
Installation and documentation: INSTALL.md is written for non-technical journalists. Corrections, clarifications, and translations are welcome.
Please open an issue before starting significant work so we can discuss approach first.
Architecture notes
- Docling handles all document conversion: layout analysis, table extraction, OCR. Structured output (not raw text) is important for table-heavy documents like financial statements and creditor lists.
- Large PDFs are split into 40-page chunks and processed in parallel via watchdog preprocess-batch. Page numbers are preserved and the chunks are reassembled in order.
- OCR engine: Apple Vision on macOS (fast, hardware-accelerated); Tesseract on Linux/Windows (requires system install). Configurable via watchdog configure ocr_engine.
- Near-duplicate detection uses Jaccard similarity on word 3-gram shingles; it has no ML dependencies and runs locally.
- Registries (.watchdog/Registry/documents.json, entities.json, manifest.json) are the source of truth. Obsidian notes are generated outputs, so deleting a note doesn't lose data. manifest.json is a lightweight id/name/type/aliases index used for entity lookup without loading full registry data.
- Vault writes are atomic: watchdog write-vault handles entity notes, document notes, timeline, registries, and the morgue move in a single operation behind an ingest lock.
- Single CLI entry point: watchdog is the only command installed on your PATH. Pipeline utilities (watchdog preprocess, watchdog write-vault, etc.) are subcommands, not separate binaries.
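The near-duplicate check mentioned above can be sketched in a few lines: break each document into overlapping word 3-grams (shingles), then compare the two sets with Jaccard similarity. The defaults mirror shingle_size and dup_threshold from the Configuration table, but the function names, the lowercasing, and the whitespace tokenization are assumptions of this sketch rather than Watchdog's actual implementation.

```python
def word_shingles(text: str, shingle_size: int = 3) -> set[tuple[str, ...]]:
    """Return the set of overlapping word n-grams in `text`.

    Lowercasing and whitespace splitting are simplifying assumptions.
    """
    words = text.lower().split()
    if len(words) < shingle_size:
        # Very short texts yield a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {
        tuple(words[i : i + shingle_size])
        for i in range(len(words) - shingle_size + 1)
    }


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 0.0  # Two empty documents are not treated as duplicates here.
    return len(a & b) / len(a | b)


def is_near_duplicate(
    text_a: str,
    text_b: str,
    dup_threshold: float = 0.85,
    shingle_size: int = 3,
) -> bool:
    """Flag two texts whose shingle sets overlap at or above the threshold."""
    score = jaccard(
        word_shingles(text_a, shingle_size),
        word_shingles(text_b, shingle_size),
    )
    return score >= dup_threshold
```

Because shingles depend on shingle_size, scores computed with different sizes are not comparable, which is consistent with the note that changing shingle_size invalidates existing shingle data.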
Acknowledgements
Watchdog's vault structure and session-context approach were partly inspired by claude-obsidian by Daniel Agrici, a PKM framework built on Claude Code that demonstrated how to make an AI assistant genuinely vault-aware across sessions. The hot.md session state file and the general principle of teaching Claude to orient itself from structured vault context both draw on ideas in that project.
The semantic search index uses fastembed (by Qdrant), a lightweight ONNX-based embedding library that avoids the PyTorch dependency footprint while matching the quality of heavier alternatives, with the BAAI/bge-small-en-v1.5 model. The idea of embedding raw document pages for retroactive search across a large corpus, separate from the extracted knowledge graph, was partly informed by obsidian-smart-connections by Brian Petro. The pattern of using a structured vault index for entity lookup, rather than embedding everything, was informed by obsidian-claude-code.
License
MIT. See LICENSE.