Japanese-first PII scanner for source repositories with gitleaks/trufflehog-like UX

pleno-pii-scanner

CLI that detects Japanese PII in repository contents, commit history, and staged hunks.

Setup

Run straight from PyPI — no clone, no uv sync:

uvx pleno-pii-scanner --help

The ja_ner_ja spaCy model is downloaded into the uvx-managed environment on first NER invocation. To pin a persistent install, use uv tool install pleno-pii-scanner instead. Workspace contributors get the model wheel preinstalled via uv sync (it lives in the dev dependency group).

Higher-precision HF backend (opt-in)

For DLP-grade workloads where false-positive <ORGANIZATION> masks are unacceptable, opt into the Hugging Face token-classification backend (model release model/v0.13.0, HF Hub repo 0xhikae/ja-ner-onnx@v0.13.0). It applies a per-label confidence floor (default ORGANIZATION=0.99), which raises overall F1 from 0.452 for the spaCy baseline to 0.701 on the v0.12.0/ja adversarial corpus.

PLENO_PII_SCANNER_BACKEND=hf \
  uvx --with 'pleno-pii-scanner[hf]' pleno-pii-scanner dir <path>

Tunables:

  • PLENO_PII_SCANNER_THRESHOLDS=ORGANIZATION=0.99,PERSON=0.0: per-label confidence floor (default ORGANIZATION=0.99).
  • PLENO_PII_SCANNER_HF_MODEL / PLENO_PII_SCANNER_HF_REVISION: pin to a custom HF Hub repo / revision (default 0xhikae/ja-ner-onnx@v0.13.0).
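
To make the per-label floor concrete, here is a minimal Python sketch of how a value such as ORGANIZATION=0.99,PERSON=0.0 could be parsed and applied. The function names and the prediction dict shape are hypothetical, and treating labels without an entry as having no floor is an assumption, not the scanner's documented behavior.

```python
def parse_thresholds(spec):
    """Parse 'LABEL=floor,LABEL=floor' into a {label: floor} dict."""
    floors = {}
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        label, _, value = part.partition("=")
        floors[label.strip()] = float(value)
    return floors


def apply_floors(predictions, floors):
    """Keep predictions whose score meets the floor for their label.

    Assumption: a label with no configured floor is never filtered.
    """
    return [p for p in predictions if p["score"] >= floors.get(p["label"], 0.0)]


floors = parse_thresholds("ORGANIZATION=0.99,PERSON=0.0")
predictions = [
    {"label": "ORGANIZATION", "text": "株式会社サンプル", "score": 0.995},
    {"label": "ORGANIZATION", "text": "テスト", "score": 0.80},
    {"label": "PERSON", "text": "山田太郎", "score": 0.40},
]
kept = apply_floors(predictions, floors)
```

With these inputs the low-confidence ORGANIZATION span is dropped, while the PERSON span survives because its floor is 0.0.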

The HF backend adds ~600 MB of torch + transformers; the default install stays lightweight.

Subcommands

uvx pleno-pii-scanner dir <path>                # walk a directory
uvx pleno-pii-scanner git <path>                # working tree plus commit history
uvx pleno-pii-scanner github <owner>/<repo>     # shallow clone, then scan
uvx pleno-pii-scanner github --org <org>        # enumerate org repos via gh CLI, then scan all
uvx pleno-pii-scanner baseline <path>           # write current findings as a suppression list
uvx pleno-pii-scanner protect                   # scan only staged hunks for pre-commit hooks

Local vs. offload

Default mode runs Presidio, spaCy NER, and regex on this machine. Pass --base-url to offload the same pipeline to a remote pleno-anonymize endpoint.

uvx pleno-pii-scanner dir ./my-repo --base-url https://pleno-anonymize.fly.dev
PLENO_BASE_URL=... uvx pleno-pii-scanner dir ./my-repo
uvx pleno-pii-scanner dir ./my-repo --base-url ... --api-key "$PLENO_API_KEY"

Both modes return the same entity set. Git history scans always use regex only, since per-line NER is not worth the cost on short diff lines.

Detected entities

NER from ja_ner_ja plus Presidio: PERSON, ADDRESS, ORGANIZATION, DATE_OF_BIRTH, BANK_ACCOUNT

Regex plus checksum: PHONE_NUMBER, MY_NUMBER, MY_NUMBER_CORPORATE, CREDIT_CARD, PASSPORT, DRIVER_LICENSE, HEALTH_INSURANCE, RESIDENCE_CARD, POSTAL_CODE, EMAIL_ADDRESS, IP_ADDRESS, URL

URL, HEALTH_INSURANCE, and DRIVER_LICENSE are excluded from the default profile because they fire too often in source repos. Pass --entities ALL to include them, or --entities PHONE_NUMBER,EMAIL_ADDRESS to scan a specific subset.
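
The selection rules above can be sketched in Python as follows. The entity names come from the lists in this section; resolve_entities is a hypothetical helper, and rejecting unknown names with an error is an assumption about how the CLI behaves.

```python
NER_ENTITIES = ["PERSON", "ADDRESS", "ORGANIZATION", "DATE_OF_BIRTH", "BANK_ACCOUNT"]
REGEX_ENTITIES = [
    "PHONE_NUMBER", "MY_NUMBER", "MY_NUMBER_CORPORATE", "CREDIT_CARD",
    "PASSPORT", "DRIVER_LICENSE", "HEALTH_INSURANCE", "RESIDENCE_CARD",
    "POSTAL_CODE", "EMAIL_ADDRESS", "IP_ADDRESS", "URL",
]
ALL_ENTITIES = NER_ENTITIES + REGEX_ENTITIES
# Entities left out of the default profile because they are noisy in source repos.
NOISY_ENTITIES = {"URL", "HEALTH_INSURANCE", "DRIVER_LICENSE"}


def resolve_entities(value=None):
    """Turn an --entities value into the list of entities to scan for."""
    if value is None:
        # No flag given: default profile.
        return [e for e in ALL_ENTITIES if e not in NOISY_ENTITIES]
    if value == "ALL":
        return list(ALL_ENTITIES)
    requested = [v.strip() for v in value.split(",") if v.strip()]
    unknown = [v for v in requested if v not in ALL_ENTITIES]
    if unknown:
        # Assumption: unknown names are treated as a usage error.
        raise ValueError("unknown entities: " + ", ".join(unknown))
    return requested


default_profile = resolve_entities()
subset = resolve_entities("PHONE_NUMBER,EMAIL_ADDRESS")
```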

Verification

Each finding carries one of three labels.

  • passed — checksum validated by Luhn, My Number, or corporate-number rules, or a contextual keyword sits within range.
  • failed — checksum failed; likely a false positive.
  • unverified — no validator matched and no contextual keyword was found.

--only-verified keeps passed only.
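
As an illustration of the three labels, the Python sketch below pairs a standard Luhn check with a simple keyword test. The verify function, its parameters, and the rule that a failed checksum takes priority over a nearby keyword are assumptions made for the example; the scanner's actual precedence and keyword window are not specified here.

```python
def luhn_ok(number):
    """Return True if the digits in `number` satisfy the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def verify(value, validator=None, context="", keywords=()):
    """Label a candidate as 'passed', 'failed', or 'unverified'."""
    if validator is not None:
        # Assumption: when a checksum validator applies, its verdict is final.
        return "passed" if validator(value) else "failed"
    if any(k in context for k in keywords):
        return "passed"
    return "unverified"


card = verify("4111 1111 1111 1111", validator=luhn_ok)
bad_card = verify("4111 1111 1111 1112", validator=luhn_ok)
phone = verify("03-1234-5678", context="電話: 03-1234-5678", keywords=("電話",))
bare = verify("03-1234-5678")
```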

Output

--report-format   Use case
human             default; colorized table on stdout
json              machine-readable
sarif             SARIF 2.1.0 for GitHub Code Scanning

--report-path FILE writes to a file. Exit code is 0 for no findings, 1 when findings are present, 2 for usage errors.
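
For CI wrappers, the exit codes can be mapped to outcomes with a few lines of Python. This is a sketch that uses only the flags documented above; the helper names, the outcome strings, and the pleno.sarif output path are hypothetical.

```python
import subprocess

# Exit codes as documented above.
EXIT_MEANINGS = {0: "clean", 1: "findings", 2: "usage-error"}


def build_command(path, report_format="sarif", report_path="pleno.sarif"):
    """Build the argv for a directory scan that writes a report file."""
    return [
        "uvx", "pleno-pii-scanner", "dir", path,
        "--report-format", report_format,
        "--report-path", report_path,
    ]


def classify_exit(code):
    """Translate a process exit code into a short outcome string."""
    return EXIT_MEANINGS.get(code, "unknown")


def run_scan(path):
    """Run the scanner and return its outcome; requires uvx on PATH."""
    result = subprocess.run(build_command(path), check=False)
    return classify_exit(result.returncode)
```

Passing check=False matters here: exit code 1 signals findings, not a crash, so the wrapper should not raise on it.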

DB-cluster mode (recommended for repo audits)

Repository-level PII risk follows database shape, not single mentions. A contact email in a CODE_OF_CONDUCT is one identifiable person, not an exfiltration target; a CSV row with name + phone + email + my_number is.

--db-only keeps a finding only when its file or folder forms a cluster of co-occurring detections with multiple distinct values:

uvx pleno-pii-scanner dir ./my-repo --db-only
uvx pleno-pii-scanner github owner/repo --db-only

Tunables (defaults shown):

Flag                   Default  Meaning
--db-file-threshold    2        Minimum findings in one file to qualify as a DB cluster.
--db-folder-threshold  3        Minimum findings in one folder (for sharded-DB shape).

verification=failed findings (e.g. ISBN matched as MY_NUMBER) are excluded from cluster computation so an awesome-list of book links cannot promote a folder to DB-shaped. On the v0.2.4 ten-repo Japanese eval, this mode takes 6 of 10 repos from "findings to triage" to zero while keeping every real exposure (resumes, PII fixture banks, contributor lists).
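
The cluster rule can be approximated in Python as below. This is a sketch under stated assumptions, not the scanner's implementation: findings are hypothetical dicts with path, entity, value, and verification keys, a cluster is counted in distinct (entity, value) pairs, and a finding is kept when either its file or its parent folder reaches the threshold.

```python
import posixpath
from collections import defaultdict


def db_only(findings, file_threshold=2, folder_threshold=3):
    """Keep findings whose file or folder looks like a database of PII."""
    by_file = defaultdict(set)
    by_folder = defaultdict(set)
    for f in findings:
        if f["verification"] == "failed":
            # Failed findings do not count toward any cluster.
            continue
        key = (f["entity"], f["value"])
        by_file[f["path"]].add(key)
        by_folder[posixpath.dirname(f["path"])].add(key)

    def in_cluster(f):
        folder = posixpath.dirname(f["path"])
        return (len(by_file.get(f["path"], ())) >= file_threshold
                or len(by_folder.get(folder, ())) >= folder_threshold)

    return [f for f in findings if in_cluster(f)]


findings = [
    {"path": "CODE_OF_CONDUCT.md", "entity": "EMAIL_ADDRESS",
     "value": "conduct@example.com", "verification": "unverified"},
    {"path": "data/users.csv", "entity": "PERSON",
     "value": "山田太郎", "verification": "unverified"},
    {"path": "data/users.csv", "entity": "PHONE_NUMBER",
     "value": "090-1234-5678", "verification": "passed"},
    {"path": "data/users.csv", "entity": "EMAIL_ADDRESS",
     "value": "taro@example.com", "verification": "unverified"},
    {"path": "books/README.md", "entity": "MY_NUMBER",
     "value": "978410101010", "verification": "failed"},
    {"path": "books/README.md", "entity": "MY_NUMBER",
     "value": "978400000000", "verification": "failed"},
]
kept = db_only(findings)
```

In this example only the three data/users.csv findings survive: the lone contact email is a single mention, and the book-list matches never enter the cluster count.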

Suppression

A .plenoignore file at the repo root is read automatically.

docs/samples/**          # path glob in gitignore syntax
PHONE_NUMBER             # entity-wide
finding:7a3b8c9d         # specific finding fingerprint
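
The three rule kinds can be told apart with a small Python parser. This is a sketch only: the rule for recognizing an entity name (an all-uppercase token with underscores) and the handling of trailing # comments are assumptions based on the sample above, and parse_plenoignore is a hypothetical name.

```python
import re

# Assumption: entity rules look like PHONE_NUMBER (uppercase letters and underscores).
ENTITY_RE = re.compile(r"^[A-Z][A-Z_]*$")


def parse_plenoignore(text):
    """Split .plenoignore content into path globs, entities, and fingerprints."""
    rules = {"paths": [], "entities": [], "fingerprints": []}
    for raw in text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        if line.startswith("finding:"):
            rules["fingerprints"].append(line[len("finding:"):])
        elif ENTITY_RE.match(line):
            rules["entities"].append(line)
        else:
            rules["paths"].append(line)
    return rules


sample = """\
docs/samples/**          # path glob in gitignore syntax
PHONE_NUMBER             # entity-wide
finding:7a3b8c9d         # specific finding fingerprint
"""
rules = parse_plenoignore(sample)
```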

Inline directives:

SUPPORT_PHONE = "0120-123-456"  # pleno:ignore PHONE_NUMBER
EXAMPLE_EMAIL = "user@example.com"  # pleno:ignore
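
A directive like the ones above can be recognized with a single regular expression. The Python sketch below covers only #-style comments at the end of a line and a single optional entity name; inline_ignored is a hypothetical helper, and the scanner may accept other comment syntaxes.

```python
import re

# Matches "# pleno:ignore" with an optional entity name, at end of line.
DIRECTIVE_RE = re.compile(r"#\s*pleno:ignore(?:\s+([A-Z_]+))?\s*$")


def inline_ignored(line, entity):
    """Return True if `line` carries a directive that suppresses `entity`."""
    match = DIRECTIVE_RE.search(line)
    if match is None:
        return False
    scoped_entity = match.group(1)
    # A bare directive suppresses every entity on the line.
    return scoped_entity is None or scoped_entity == entity


phone_line = 'SUPPORT_PHONE = "0120-123-456"  # pleno:ignore PHONE_NUMBER'
email_line = 'EXAMPLE_EMAIL = "user@example.com"  # pleno:ignore'
```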

pleno-pii-scanner baseline writes a fingerprint list of current findings; passing --baseline FILE later suppresses those known findings.
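
The baseline round trip can be pictured with the Python sketch below. Everything about the fingerprint is an assumption made for illustration: the real hashing inputs and the baseline file layout are not documented here, so this example hashes path, entity, and value with SHA-256, keeps 8 hex characters to mirror the finding:7a3b8c9d sample, and stores a plain JSON list.

```python
import hashlib
import json


def fingerprint(path, entity, value):
    """Hypothetical fingerprint: 8 hex chars of SHA-256 over the finding fields."""
    payload = "\x00".join((path, entity, value)).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:8]


def write_baseline(findings):
    """Serialize the fingerprints of current findings as a JSON list."""
    return json.dumps(sorted({fingerprint(**f) for f in findings}))


def apply_baseline(findings, baseline_json):
    """Drop findings whose fingerprint is already in the baseline."""
    known = set(json.loads(baseline_json))
    return [f for f in findings if fingerprint(**f) not in known]


old = {"path": "docs/contact.md", "entity": "EMAIL_ADDRESS", "value": "info@example.com"}
new = {"path": "data/users.csv", "entity": "PHONE_NUMBER", "value": "090-1234-5678"}

baseline = write_baseline([old])
remaining = apply_baseline([old, new], baseline)
```

Only the finding introduced after the baseline was written is reported.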

Key flags

Flag                   Default          Role
--entities             default profile  restrict detection set, e.g. PHONE_NUMBER,EMAIL_ADDRESS or ALL
--language             ja               analysis language, ja or en
--base-url             unset            offload to a remote pleno-anonymize endpoint
--api-key              unset            Bearer token for offload
--concurrency          8                parallel HTTP requests in offload mode
--include / --exclude  unset            gitignore-style file filters
--max-file-size        1 MB             files larger than this are skipped
--only-verified        off              keep passed findings only
--report-format        human            human, json, or sarif
--baseline             unset            fingerprint JSON of known findings to suppress

Three filters are on by default: .gitignore rules, a built-in skip list (.git, node_modules, .venv, dist, build, vendor, and similar directories), and a NUL-byte binary check.
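
A rough Python sketch of the skip list, size limit, and NUL-byte check follows. It leaves .gitignore handling out, lists only the directories named above, and assumes that 1 MB means 1,000,000 bytes and that the binary check inspects the first 8192 bytes; the helper names are hypothetical.

```python
import os

# Only the directories named in this README; the built-in list is longer.
SKIP_DIRS = {".git", "node_modules", ".venv", "dist", "build", "vendor"}
MAX_FILE_SIZE = 1_000_000  # assumption: 1 MB interpreted as 10**6 bytes


def looks_binary(path, probe=8192):
    """Treat a file as binary if a NUL byte appears in its first `probe` bytes."""
    with open(path, "rb") as fh:
        return b"\x00" in fh.read(probe)


def iter_candidate_files(root):
    """Yield files under `root` that pass the skip-list, size, and binary filters."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune skipped directories in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if d not in SKIP_DIRS]
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if os.path.getsize(path) > MAX_FILE_SIZE:
                continue
            if looks_binary(path):
                continue
            yield path
```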
