
Local-first file scanner and labeler for AI agent governance. Dutch-first, GDPR-aware.


Filenthropist

A local-first tool that scans your files for Dutch PII, labels them for AI agent access, and stores only redacted previews, so the governance tool itself does not become a data breach.

Filenthropist is a file scanner and labeler built for Dutch SMEs navigating GDPR. It runs entirely on your machine, never sends data to the cloud, and gives AI agents a single API endpoint to check before touching any file.


Why Filenthropist?

  • Local-only processing — files never leave your machine. No cloud APIs, no data uploads, no telemetry.
  • Dutch-first PII detection — built for BSN, IBAN, Dutch addresses, KvK numbers, and NL-specific identifiers. Not a US-centric tool with a language pack bolted on.
  • Redacted previews only — the database stores redacted text, not raw PII. Even if someone accesses the SQLite file, they get [NAAM] and [BSN], never the personal data.
  • AI agent gateway — one API call (/api/can-access) tells any agent whether a file is safe to use.
  • GDPR Article 30 RoPA export — generates a verwerkingsregister directly from your scanned files, with Dutch legal bases pre-filled per document type.
  • Advisory-only — Filenthropist labels and advises. It never deletes, modifies, or moves your files.
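To make the redacted-preview idea concrete, here is a minimal sketch of replacing a BSN with a type label before anything is stored. It is not Filenthropist's actual detector: the names `BSN_CANDIDATE`, `is_valid_bsn`, and `redact_bsn` are hypothetical. The only outside fact it relies on is the standard BSN eleven-test (weights 9 down to 2, then -1, with the sum divisible by 11).

```python
import re

# Candidate BSNs: exactly nine digits, not embedded in a longer digit run.
BSN_CANDIDATE = re.compile(r"(?<!\d)\d{9}(?!\d)")


def is_valid_bsn(digits: str) -> bool:
    """Return True if a nine-digit string passes the BSN eleven-test."""
    if len(digits) != 9 or not digits.isdigit():
        return False
    weights = (9, 8, 7, 6, 5, 4, 3, 2, -1)
    total = sum(int(d) * w for d, w in zip(digits, weights))
    return total % 11 == 0


def redact_bsn(text: str) -> str:
    """Replace every checksum-valid BSN in text with the label [BSN]."""
    def _swap(match: re.Match) -> str:
        return "[BSN]" if is_valid_bsn(match.group()) else match.group()

    return BSN_CANDIDATE.sub(_swap, text)


# Illustrative input: the first number passes the eleven-test, the second does not.
preview = redact_bsn("BSN 111222333, factuur 123456789")
print(preview)  # BSN [BSN], factuur 123456789
```

The checksum step keeps arbitrary nine-digit numbers, such as invoice numbers, from being labeled as BSNs.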

System requirements

  • Python 3.11 or newer
  • Tesseract 5 with the Dutch language pack (needed for OCR on scanned PDFs and images)
  • ~1 GB of free disk space for a PII NER model (downloaded on first run)

Install Tesseract

macOS (Homebrew):

brew install tesseract tesseract-lang

Debian / Ubuntu:

sudo apt install tesseract-ocr tesseract-ocr-nld

For a step-by-step guide covering Python, Tesseract, Windows, and common errors, see docs/INSTALL.md.
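If you want to confirm the Dutch language pack yourself before installing Filenthropist, the sketch below checks the output of `tesseract --list-langs` for the `nld` code. It is an independent illustration, not the implementation of `filenthropist doctor`; the helper names `parse_langs` and `has_dutch_pack` are hypothetical.

```python
import shutil
import subprocess


def parse_langs(output: str) -> set[str]:
    """Extract language codes from `tesseract --list-langs` output."""
    lines = [line.strip() for line in output.splitlines()]
    # Skip blank lines and the header line ("List of available languages ...").
    return {line for line in lines if line and not line.lower().startswith("list of")}


def has_dutch_pack() -> bool:
    """Return True if tesseract is on PATH and reports the 'nld' language."""
    exe = shutil.which("tesseract")
    if exe is None:
        return False
    result = subprocess.run(
        [exe, "--list-langs"], capture_output=True, text=True, check=False
    )
    # Some Tesseract builds print the list to stderr, so inspect both streams.
    return "nld" in parse_langs(result.stdout + "\n" + result.stderr)


if __name__ == "__main__":
    print("Dutch pack found" if has_dutch_pack() else "Dutch pack missing")
```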


Install

We recommend pipx so Filenthropist lives in its own isolated environment:

pipx install "filenthropist[all]"

The base install is intentionally minimal. Pick the extras that match what you need:

pipx install filenthropist              # CLI only, no OCR/NER/web
pipx install "filenthropist[ocr]"       # + Tesseract OCR
pipx install "filenthropist[ner]"       # + Dutch/multilingual PII NER
pipx install "filenthropist[web]"       # + local dashboard
pipx install "filenthropist[all]"       # everything (recommended)

Quick start

Four commands take you from a fresh install to a usable dashboard:

filenthropist doctor                    # verify environment
filenthropist init                      # interactive model picker
filenthropist scan ~/Documents          # run first scan
filenthropist serve                     # open web dashboard

  • doctor — checks Python version, Tesseract install (with Dutch pack), writable config directory, and reports anything missing so you can fix it before scanning.
  • init — walks you through selecting a PII NER model based on your language and priority (speed, balanced, accuracy). Downloads the model and writes your config.
  • scan <path> — walks the directory, extracts text (including OCR for scanned PDFs), classifies each document, detects PII, and stores redacted labels in ~/.filenthropist/filenthropist.db.
  • serve — starts the local dashboard at http://localhost:8080 for reviewing labels and making retention decisions.

For three persona-based walkthroughs (freelancer, SME, multilingual org), see docs/QUICKSTART.md.


Choosing a PII model

Filenthropist supports multiple PII NER models so you can trade off language coverage, accuracy, and speed.

  • Recommended for Dutch documents: LokaalHub/nl-lokaal-middel (F1 0.84 on the ai4privacy Dutch validation set). It detects Dutch names, addresses, BSN-in-context, IBAN, phone, email, and more. A faster, smaller sibling, LokaalHub/nl-lokaal-klein (F1 0.78, ~180 MB), is the default for laptops and for the combined provider.
  • Multilingual and English-only options are available for mixed-language or English-only corpora.
  • Speed vs. accuracy — each model is tagged with a priority tier (fast, balanced, accuracy) so you can match it to your hardware.

Browse and pick interactively:

filenthropist init                      # wizard: language + priority
filenthropist models list               # show every model in the registry
filenthropist models info <model-id>    # details for one model

Full decision tree and registry docs: docs/MODELS.md.


Scanning & querying

# Show all files with sensitive PII
filenthropist query --access-level sensitive_restricted

# Export all labels as JSON
filenthropist export --format json --output labels.json

# Export a GDPR Article 30 verwerkingsregister
filenthropist ropa --format csv --output verwerkingsregister.csv

The web dashboard (filenthropist serve) exposes the same data plus a review workflow for non-technical users.

AI agents integrate via a local HTTP API — GET /api/can-access?path=... returns an allow/deny decision, and GET /api/redacted?path=... returns the document text with PII replaced by type labels ([NAAM], [BSN], [IBAN]). See docs/AGENT_INTEGRATION.md for the full endpoint list and integration patterns.
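As a rough illustration of the agent-side check, the sketch below builds the `/api/can-access` request and fails closed when the answer is anything other than an explicit allow. Two things here are assumptions, not documented behaviour: that the API is served on the same `http://localhost:8080` address as the dashboard, and that the response is JSON with a boolean `allowed` field. Check docs/AGENT_INTEGRATION.md for the real response shape. The names `can_access_url` and `may_use` are hypothetical.

```python
import json
from typing import Callable
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8080"  # assumption: API shares the dashboard address


def can_access_url(path: str, base_url: str = BASE_URL) -> str:
    """Build the can-access URL with the file path safely query-encoded."""
    return f"{base_url}/api/can-access?{urlencode({'path': path})}"


def _http_get(url: str) -> str:
    """Fetch a URL and return the response body as text."""
    with urlopen(url, timeout=5) as response:
        return response.read().decode("utf-8")


def may_use(path: str, fetch: Callable[[str], str] = _http_get) -> bool:
    """Return True only on an explicit allow; any error or odd reply means deny."""
    try:
        payload = json.loads(fetch(can_access_url(path)))
    except (OSError, ValueError):
        # Network failure, HTTP error, or a body that is not valid JSON.
        return False
    # Hypothetical response shape: {"allowed": true}
    return isinstance(payload, dict) and payload.get("allowed") is True
```

An agent would call `may_use(path)` before opening a file and skip the file on `False`. The `fetch` parameter exists so the decision logic can be exercised without a running server.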


Configuration

On first run, Filenthropist writes ~/.filenthropist/config.yaml. Edit it to tune scan behaviour, PII provider, and retention:

scan:
  ignore_patterns: [".git", "node_modules", "__pycache__", ".venv"]
  max_file_size_mb: 100

pii:
  provider: "combined"         # "regex", "ner", "combined", "http", or "stub"
  ner_model_id: "LokaalHub/nl-lokaal-middel"

labeling:
  retention_policy_years: 2

classification:
  zeroshot_enabled: true

The combined provider runs regex detectors (BSN, IBAN, phone, email, postcode) alongside the NER model — structured PII that NER often misses is still caught.
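One way to picture how the two detectors could be merged is shown below. This is a sketch of a plausible strategy (regex spans win on overlap, and NER fills in the rest), not Filenthropist's actual merge logic. The `Span` type, the `combine` and `redact` functions, and the sample spans are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int   # inclusive character offset
    end: int     # exclusive character offset
    label: str   # e.g. "BSN", "IBAN", "NAAM"
    source: str  # "regex" or "ner"


def overlaps(a: Span, b: Span) -> bool:
    """Return True if two half-open character ranges share any position."""
    return a.start < b.end and b.start < a.end


def combine(regex_spans: list[Span], ner_spans: list[Span]) -> list[Span]:
    """Keep all regex spans, plus NER spans that do not overlap any of them."""
    kept = list(regex_spans)
    for span in ner_spans:
        if not any(overlaps(span, r) for r in regex_spans):
            kept.append(span)
    return sorted(kept, key=lambda s: s.start)


def redact(text: str, spans: list[Span]) -> str:
    """Replace each span with its bracketed label, working right to left.

    Assumes the spans do not overlap each other.
    """
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        text = text[:span.start] + f"[{span.label}]" + text[span.end:]
    return text


text = "Jan Jansen, BSN 111222333"
regex_hits = [Span(16, 25, "BSN", "regex")]
ner_hits = [Span(0, 10, "NAAM", "ner"), Span(12, 25, "ID", "ner")]
print(redact(text, combine(regex_hits, ner_hits)))  # [NAAM], BSN [BSN]
```

Replacing from right to left keeps earlier offsets valid while the text changes length.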


Privacy & security

  • Fully local. No network calls except model downloads, which you can pre-cache for offline use.
  • Advisory-only. Filenthropist never deletes, moves, or modifies your files.
  • Redacted previews. The database stores redacted text, so compromising the DB does not leak raw PII.

For the threat model and hardening recommendations, see SECURITY.md.


License

MIT

