A reusable framework for creating and storing FollowTheMoney entities, used by OpenLobbying.
Warning: This is a work in progress. Expect breaking changes and incomplete features.
Muckrake
Muckrake is the data pipeline. It is partially inspired by zavod and other tools in the FollowTheMoney ecosystem.
Run uv run muckrake --help for a full list of available commands.
Install Python dependencies with uv sync. This now includes the external org-id package used for structured organization identifiers.
OpenLobbying-specific code now lives in the sibling ../openlobbying/ repository. That repo owns:
- the OpenLobbying dataset crawlers
- the OpenLobbying FastAPI application
- the OpenLobbying Svelte frontend
- any project-specific FtM schema extensions
- deployment assets for the public site
Crawlers
Muckrake discovers crawler configs from ./datasets/ in the current working directory and any paths listed in MUCKRAKE_DATASET_PATHS. At a minimum, each dataset consists of a config.yml with metadata and a crawler.py script that outputs FollowTheMoney statements in CSV format.
To crawl a dataset, run uv run muckrake crawl {dataset_name}. Run uv run muckrake list to see available datasets.
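The discovery rule described above can be sketched roughly as follows. This is an illustration only, not Muckrake's actual implementation: `discover_datasets` is a hypothetical helper, and both the `os.pathsep` separator for `MUCKRAKE_DATASET_PATHS` and the first-root-wins rule for duplicate names are assumptions.

```python
import os
from pathlib import Path
from typing import Mapping


def discover_datasets(cwd: Path, env: Mapping[str, str]) -> dict[str, Path]:
    """Return {dataset_name: directory} for every directory that holds both
    config.yml and crawler.py (hypothetical helper, not Muckrake's API)."""
    roots = [cwd / "datasets"]
    # Assumption: extra roots are separated by os.pathsep, like PATH.
    extra = env.get("MUCKRAKE_DATASET_PATHS", "")
    roots += [Path(p) for p in extra.split(os.pathsep) if p]

    found: dict[str, Path] = {}
    for root in roots:
        if not root.is_dir():
            continue
        for child in sorted(root.iterdir()):
            has_config = (child / "config.yml").is_file()
            has_crawler = (child / "crawler.py").is_file()
            if has_config and has_crawler:
                # Assumption: the first root wins if a name appears twice.
                found.setdefault(child.name, child)
    return found
```

Calling `discover_datasets(Path.cwd(), os.environ)` would then list roughly what `uv run muckrake list` is described as showing, under the assumptions above.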
Each crawl now creates a dataset_runs record in Postgres and stores immutable artifacts under MUCKRAKE_ARTIFACT_PATH (defaults to data/artifacts). The latest successful run remains mirrored into data/datasets/{name}/statements.pack.csv for local compatibility.
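As a rough illustration of the two locations mentioned above, the sketch below resolves the artifact root and the local mirror path. The helper names are hypothetical, the layout inside the artifact directory is not shown because it is not documented here, and the sketch assumes the `data/` prefix is left at its default.

```python
from pathlib import Path
from typing import Mapping


def artifact_root(env: Mapping[str, str]) -> Path:
    """Directory for immutable crawl artifacts, using the documented default."""
    return Path(env.get("MUCKRAKE_ARTIFACT_PATH", "data/artifacts"))


def local_mirror(dataset_name: str) -> Path:
    """Workspace copy of the latest successful run's statements.

    Assumption: the data directory has not been relocated.
    """
    return Path("data/datasets") / dataset_name / "statements.pack.csv"
```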
AI-based NER
Many data sources have composite fields that contain multiple entities. We use LLMs to extract unique entities and relationships from these fields, and store them as candidates in the database for review and approval. See NER docs for details.
# Create extraction candidates for one dataset
uv run muckrake ner-extract open_access --extractor llm --limit 50
# Review candidates in a terminal UI
uv run muckrake ner-review
Dedupe
Our goal is to link entities across datasets to provide a unified view of lobbying and political finance for any given person, company, or organisation.
# Create dedupe candidates across all datasets
uv run muckrake xref
# Review candidates in a terminal UI
uv run muckrake dedupe
We also want to collapse duplicate relationship edges across datasets, especially for ORCL and PRCA. This happens automatically; no review step is required.
uv run muckrake dedupe-edges
Loading
Statements are loaded into Postgres with uv run muckrake load. This reads the statements CSV files and applies any approved NER candidates before materialising entities and relationships.
To load from a specific immutable crawl snapshot instead of the local workspace copy:
uv run muckrake load gb_political_finance --run-id 123
For the published site, prefer the release workflow instead of loading directly into the serving database:
uv run muckrake release-build
uv run muckrake release-publish 1
OpenLobbying
The primary user of Muckrake data is OpenLobbying, an open database of lobbying and political finance data. See ../openlobbying/README.md for app setup, API serving, and frontend development.
Environment setup
- Copy .env.example to .env in the repo root.
- By default muckrake loads the nearest .env from the current working directory upward. Override that with MUCKRAKE_ENV_FILE if needed.
- Default local database setup:
  - working DB: sqlite:///data/muckrake.db
  - published DB: same as the working DB unless MUCKRAKE_PUBLISHED_DATABASE_URL is set
- Required only if you want Postgres:
  - MUCKRAKE_DATABASE_URL
- Common local settings:
  - MUCKRAKE_PUBLISHED_DATABASE_URL for a separate published API database
- Optional local overrides:
  - MUCKRAKE_DATA_PATH
  - MUCKRAKE_ARTIFACT_PATH
  - MUCKRAKE_DATASET_PATHS
  - MUCKRAKE_FTM_SCHEMA_PATHS
  - MUCKRAKE_ENV_FILE
  - FTM_MODEL_PATH if you need to override the merged FollowTheMoney model entirely
  - OPENROUTER_API_KEY, LLM_MODEL, NER_LLM_PROMPT_FILE, LOGFIRE_TOKEN
- Example:
  cp .env.example .env
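The lookup rules in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not Muckrake's own code: `find_env_file` and `database_urls` are hypothetical helpers, and the sketch assumes `MUCKRAKE_DATABASE_URL` is the variable that replaces the default SQLite working database.

```python
from pathlib import Path
from typing import Mapping, Optional


def find_env_file(start: Path, env: Mapping[str, str]) -> Optional[Path]:
    """An explicit MUCKRAKE_ENV_FILE wins; otherwise return the nearest .env
    found while walking from `start` up to the filesystem root."""
    override = env.get("MUCKRAKE_ENV_FILE")
    if override:
        return Path(override)
    for directory in [start, *start.parents]:
        candidate = directory / ".env"
        if candidate.is_file():
            return candidate
    return None


def database_urls(env: Mapping[str, str]) -> tuple[str, str]:
    """Return (working_url, published_url) using the documented defaults."""
    # Assumption: MUCKRAKE_DATABASE_URL overrides the SQLite default.
    working = env.get("MUCKRAKE_DATABASE_URL", "sqlite:///data/muckrake.db")
    # The published DB falls back to the working DB unless set separately.
    published = env.get("MUCKRAKE_PUBLISHED_DATABASE_URL", working)
    return working, published
```

With no variables set, both URLs resolve to the same SQLite file, which matches the default local setup described above.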
Consumers
- ../openlobbying/: OpenLobbying application repo built on top of muckrake
- ../us-congress-lobbying/: project-specific investigative sandbox