Local-first file scanner and labeler that lets you work safely with autonomous AI agents. Multilingual, currently focused on Dutch PII and GDPR.

Filenthropist

The local-first file scanner and labeler that lets you work safely with autonomous AI agents. Multilingual by design, currently focused on Dutch PII and GDPR — so your governance tool never becomes a data breach.

Filenthropist scans documents, detects PII, and labels each file with an access level that autonomous AI agents check before touching it. It runs entirely on your machine, never sends data to the cloud, and exposes a local HTTP API for agents to consult. The architecture is language-agnostic; the current release ships with Dutch-focused detection and GDPR-aware reporting, with more languages planned.


Why Filenthropist?

  • Local-only processing — files never leave your machine. No cloud APIs, no data uploads, no telemetry.
  • Dutch-focused PII detection today — built for BSN, IBAN, Dutch addresses, KvK numbers, and NL-specific identifiers. Not a US-centric tool with a language pack bolted on. The architecture is language-agnostic and a multilingual model is already available in the registry — see docs/MODELS.md.
  • Redacted previews only — the database stores redacted text, not raw PII. Even if someone accesses the SQLite file, they get [NAAM] and [BSN], never the personal data.
  • AI agent gateway — one API call (/api/can-access) tells any agent whether a file is safe to use.
  • GDPR Article 30 RoPA export — generates a verwerkingsregister directly from your scanned files, with Dutch legal bases pre-filled per document type.
  • Advisory-only — Filenthropist labels and advises. It never deletes, modifies, or moves your files.

System requirements

  • Python 3.11 or newer
  • Tesseract 5 with the Dutch language pack (needed for OCR on scanned PDFs and images)
  • ~1 GB free disk for a PII NER model (downloaded on first run)

Install Tesseract

macOS (Homebrew):

brew install tesseract tesseract-lang

Debian / Ubuntu:

sudo apt install tesseract-ocr tesseract-ocr-nld

For a step-by-step guide covering Python, Tesseract, Windows, and common errors, see docs/INSTALL.md.


Install

We recommend pipx so Filenthropist lives in its own isolated environment:

pipx install "filenthropist[all]"

The base install is intentionally minimal. Pick the extras that match what you need:

pipx install filenthropist              # CLI only, no OCR/NER/web
pipx install "filenthropist[ocr]"       # + Tesseract OCR
pipx install "filenthropist[ner]"       # + Dutch/multilingual PII NER
pipx install "filenthropist[web]"       # + local dashboard
pipx install "filenthropist[all]"       # everything (recommended)

Quick start

Four commands take you from a fresh install to a usable dashboard:

filenthropist doctor                    # verify environment
filenthropist init                      # interactive model picker
filenthropist scan ~/Documents          # run first scan
filenthropist serve                     # open web dashboard
  • doctor — checks Python version, Tesseract install (with Dutch pack), writable config directory, and reports anything missing so you can fix it before scanning.
  • init — walks you through selecting a PII NER model based on your language and priority (speed, balanced, accuracy). Downloads the model and writes your config.
  • scan <path> — walks the directory, extracts text (including OCR for scanned PDFs), classifies each document, detects PII, and stores redacted labels in ~/.filenthropist/filenthropist.db.
  • serve — starts the local dashboard at http://localhost:8080 for reviewing labels and making retention decisions.
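
The two non-interactive steps above (doctor, then scan) can also be chained from a script, for example in a scheduled job. The following is a minimal sketch, not part of Filenthropist itself: the helper names build_scan_pipeline and run_pipeline are hypothetical, and it assumes the CLI signals failure with a non-zero exit code, which this page does not confirm.

```python
import subprocess
from pathlib import Path


def build_scan_pipeline(target: Path) -> list[list[str]]:
    """Return the non-interactive quick-start commands, in order."""
    return [
        ["filenthropist", "doctor"],            # verify environment first
        ["filenthropist", "scan", str(target)], # then scan the directory
    ]


def run_pipeline(commands, runner=subprocess.run) -> int:
    """Run each command in order and stop at the first failure.

    Assumption: a non-zero exit code means the step failed.
    `runner` is injectable so the logic can be exercised without the CLI.
    """
    for cmd in commands:
        result = runner(cmd)
        if result.returncode != 0:
            return result.returncode
    return 0


# Usage (requires filenthropist on PATH):
#   run_pipeline(build_scan_pipeline(Path.home() / "Documents"))
```

Stopping at the first failing step keeps a broken environment (for example a missing Tesseract language pack) from producing a partial scan.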

For three persona-based walkthroughs (freelancer, SME, multilingual org), see docs/QUICKSTART.md.


Choosing a PII model

Filenthropist supports multiple PII NER models so you can trade off language coverage, accuracy, and speed.

  • Recommended for Dutch documents: LokaalHub/nl-lokaal-middel, with F1 0.84 on the ai4privacy Dutch validation set. It detects Dutch names, addresses, BSN-in-context, IBAN, phone, email, and more. A faster, smaller sibling, LokaalHub/nl-lokaal-klein (F1 0.78, ~180 MB), is the default for laptops and the combined provider.
  • Multilingual and English-only options are available for mixed-language or English-only corpora.
  • Speed vs. accuracy — each model is tagged with a priority tier (fast, balanced, accuracy) so you can match it to your hardware.

Browse and pick interactively:

filenthropist init                      # wizard: language + priority
filenthropist models list               # show every model in the registry
filenthropist models info <model-id>    # details for one model

Full decision tree and registry docs: docs/MODELS.md.


Scanning & querying

# Show all files with sensitive PII
filenthropist query --access-level sensitive_restricted

# Export all labels as JSON
filenthropist export --format json --output labels.json

# Export a GDPR Article 30 verwerkingsregister
filenthropist ropa --format csv --output verwerkingsregister.csv

The web dashboard (filenthropist serve) exposes the same data plus a review workflow for non-technical users.
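
The JSON export shown above can be post-processed in a script. The sketch below filters entries by access level. The export schema is not documented on this page, so the shape used here (a JSON array of objects with "path" and "access_level" keys) is an assumption, and the sample data, including the "public" level, is made up for illustration. Inspect a real labels.json and adjust the key names before relying on it.

```python
import json


def paths_with_level(labels_json: str, level: str) -> list[str]:
    """Return the paths of all entries whose access level equals `level`.

    Assumption: the export is a JSON array of objects carrying
    "path" and "access_level" keys (hypothetical schema).
    """
    entries = json.loads(labels_json)
    return [e["path"] for e in entries if e.get("access_level") == level]


# Hypothetical sample in the assumed shape, not real Filenthropist output.
sample = json.dumps([
    {"path": "/docs/loonstrook.pdf", "access_level": "sensitive_restricted"},
    {"path": "/docs/readme.txt", "access_level": "public"},
])
print(paths_with_level(sample, "sensitive_restricted"))
```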

AI agents integrate via a local HTTP API — GET /api/can-access?path=... returns an allow/deny decision, and GET /api/redacted?path=... returns the document text with PII replaced by type labels ([NAAM], [BSN], [IBAN]). See docs/AGENT_INTEGRATION.md for the full endpoint list and integration patterns.
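
As an illustration of the agent-side check, here is a minimal sketch using only the Python standard library. Several details are assumptions rather than documented behaviour: that the API is served at the same address as the dashboard (http://localhost:8080), and that the response is a JSON object with a boolean "allowed" field. The function names are hypothetical. See docs/AGENT_INTEGRATION.md for the actual response format.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8080"  # assumption: API shares the dashboard address


def can_access_url(path: str, base_url: str = BASE_URL) -> str:
    """Build the GET /api/can-access URL with the file path URL-encoded."""
    query = urllib.parse.urlencode({"path": path})
    return f"{base_url}/api/can-access?{query}"


def parse_decision(body: bytes) -> bool:
    """Interpret the response body; anything unexpected counts as deny.

    Assumption: a JSON object with a boolean "allowed" field (hypothetical).
    """
    try:
        payload = json.loads(body)
    except ValueError:
        return False
    return isinstance(payload, dict) and payload.get("allowed") is True


def agent_may_read(path: str, timeout: float = 2.0) -> bool:
    """Ask the local gateway; fail closed on network or HTTP errors."""
    try:
        with urllib.request.urlopen(can_access_url(path), timeout=timeout) as resp:
            return parse_decision(resp.read())
    except (urllib.error.URLError, OSError):
        return False
```

The sketch fails closed: if the service is not running or the answer cannot be parsed, the agent treats the file as off limits, which matches the intent of checking before touching a file.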


Configuration

On first run, Filenthropist writes ~/.filenthropist/config.yaml. Edit it to tune scan behaviour, PII provider, and retention:

scan:
  ignore_patterns: [".git", "node_modules", "__pycache__", ".venv"]
  max_file_size_mb: 100

pii:
  provider: "combined"         # "regex", "ner", "combined", "http", or "stub"
  ner_model_id: "LokaalHub/nl-lokaal-middel"

labeling:
  retention_policy_years: 2

classification:
  zeroshot_enabled: true

The combined provider runs regex detectors (BSN, IBAN, phone, email, postcode) alongside the NER model, so structured PII that NER often misses is still caught.
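
To show why a rule-based detector suits structured identifiers, here is a minimal sketch of BSN detection and redaction. It is not Filenthropist's actual implementation. It pairs a nine-digit pattern with the standard BSN checksum (the "elfproef": weights 9 down to 2 for the first eight digits, -1 for the last, and a sum divisible by 11), then replaces matches with the [BSN] label. The function names are hypothetical.

```python
import re

BSN_CANDIDATE = re.compile(r"\b[0-9]{9}\b")  # any standalone 9-digit run
WEIGHTS = (9, 8, 7, 6, 5, 4, 3, 2, -1)       # elfproef weights for a BSN


def is_valid_bsn(digits: str) -> bool:
    """Return True if `digits` is nine digits that pass the elfproef."""
    if re.fullmatch(r"[0-9]{9}", digits) is None:
        return False
    total = sum(w * int(d) for w, d in zip(WEIGHTS, digits))
    # An all-zero number sums to 0 and is rejected explicitly.
    return total > 0 and total % 11 == 0


def redact_bsn(text: str) -> str:
    """Replace every checksum-valid BSN candidate with the [BSN] label."""
    return BSN_CANDIDATE.sub(
        lambda m: "[BSN]" if is_valid_bsn(m.group()) else m.group(),
        text,
    )


print(redact_bsn("BSN 111222333, ordernr 123456789"))
# prints: BSN [BSN], ordernr 123456789
```

The checksum filters out most arbitrary nine-digit numbers such as order numbers, which is the kind of precision a pattern match alone cannot provide.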


Privacy & security

  • Fully local. No network calls except model downloads, which you can pre-cache for offline use.
  • Advisory-only. Filenthropist never deletes, moves, or modifies your files.
  • Redacted previews. The database stores redacted text, so compromising the DB does not leak raw PII.

For the threat model and hardening recommendations, see SECURITY.md.


License

MIT
