pii-sweep

Scan dataset files for personally identifiable information, with a confidence per column and a CI gate, before the data leaves your hands.

Before a dataset is shared, copied to a notebook, or pushed to a bucket, it is worth knowing whether a column quietly holds emails, card numbers or national IDs. pii-sweep samples each column, runs a set of detectors, and reports which columns look like PII and how strongly.

$ pii-sweep scan customers.parquet
severity  column        type         confidence
high      card_number   credit_card  100%
high      tax_id        ssn          98%
medium    contact       email        91%

Install

$ pip install pii-sweep                 # from PyPI
$ pip install git+https://github.com/jmweb-org/pii-sweep   # latest from source

Reads CSV, Parquet and JSON Lines through polars.

Usage

$ pii-sweep scan data.csv                 # human-readable table
$ pii-sweep scan data.parquet --json      # machine-readable findings
$ pii-sweep scan data.csv --sample 5000   # cap values scanned per column
$ pii-sweep scan data.csv --threshold 0.3 # flag at a lower match fraction
$ pii-sweep scan data.csv --check         # exit non-zero if PII is found

In CI

Stop a dataset with PII from being committed or published:

- run: pii-sweep scan data/export.parquet --check --fail-on medium
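
The gate compares each finding's severity against --fail-on. A minimal sketch of that comparison follows; it assumes the ordering low < medium < high implied by "at or above", and the names SEVERITY_RANK and gate_fails are hypothetical, not part of pii-sweep:

```python
# Sketch of an "at or above" severity gate.
# Assumption: severities are ordered low < medium < high.
# SEVERITY_RANK and gate_fails are illustrative names, not pii-sweep's API.
from typing import Iterable

SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2}


def gate_fails(finding_severities: Iterable[str], fail_on: str) -> bool:
    """Return True if any finding is at or above the fail_on severity."""
    floor = SEVERITY_RANK[fail_on]
    return any(SEVERITY_RANK[s] >= floor for s in finding_severities)


if __name__ == "__main__":
    # With --fail-on medium, an ipv4 (low) finding alone would pass the gate,
    # while an email (medium) or credit_card (high) finding would fail it.
    print(gate_fails(["low"], "medium"))
    print(gate_fails(["low", "high"], "medium"))
```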

What it detects

Type         Severity  How
credit_card  high      13-19 digits passing the Luhn checksum
iban         high      Country format plus the mod-97 checksum
ssn          high      US social-security format with valid ranges
email        medium    Standard address pattern
phone        medium    International or grouped number, 9-15 digits
ipv4         low       Dotted-quad address
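
For reference, the two checksums named in the table can be sketched in a few lines of Python. This is an illustration of the standard Luhn and IBAN mod-97 algorithms, not pii-sweep's own code; the function names are hypothetical, and the per-country IBAN length check is left out:

```python
# Standard Luhn and IBAN mod-97 checks, written as standalone helpers.
# luhn_valid and iban_checksum_valid are illustrative names only.

def luhn_valid(number: str) -> bool:
    """Check a 13-19 digit string against the Luhn checksum."""
    compact = number.replace(" ", "").replace("-", "")
    if not (compact.isascii() and compact.isdigit()):
        return False
    if not 13 <= len(compact) <= 19:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for i, ch in enumerate(reversed(compact)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def iban_checksum_valid(iban: str) -> bool:
    """Check the mod-97 checksum only (no per-country length check)."""
    compact = iban.replace(" ", "").upper()
    if not (compact.isascii() and compact.isalnum()):
        return False
    if not 15 <= len(compact) <= 34:
        return False
    # Move the country code and check digits to the end, then map
    # letters to numbers (A=10 ... Z=35) and take the remainder mod 97.
    rearranged = compact[4:] + compact[:4]
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1


if __name__ == "__main__":
    print(luhn_valid("4111 1111 1111 1111"))                  # common test card number
    print(iban_checksum_valid("GB82 WEST 1234 5698 7654 32"))  # widely used example IBAN
```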

Detectors with a checksum (cards, IBAN) are strict: only about one in ten random digit strings passes Luhn, so a column of random 13-digit numbers scores a low confidence and is not flagged as cards. The confidence is the fraction of sampled non-null values a detector matched; --threshold sets how high that fraction must be to flag a column.
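
The confidence calculation can be sketched as follows. This is a rough illustration under stated assumptions: the function names and the email pattern are hypothetical, and treating the threshold as inclusive (>=) is a guess, since the text does not say which comparison pii-sweep uses:

```python
# Sketch: confidence = matched non-null values / sampled non-null values.
# column_confidence, is_flagged and EMAIL_RE are illustrative, not pii-sweep's API.
import re
from typing import Callable, Iterable, Optional

EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")  # simplified stand-in pattern


def looks_like_email(value: str) -> bool:
    return EMAIL_RE.fullmatch(value) is not None


def column_confidence(
    values: Iterable[Optional[object]],
    detector: Callable[[str], bool],
) -> float:
    """Fraction of non-null values that the detector matched."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return 0.0
    matched = sum(1 for v in non_null if detector(str(v)))
    return matched / len(non_null)


def is_flagged(confidence: float, threshold: float) -> bool:
    # Assumption: a column is flagged when confidence reaches the threshold.
    return confidence >= threshold


if __name__ == "__main__":
    sample = ["a@example.com", None, "b@example.org", "n/a", "c@example.net"]
    conf = column_confidence(sample, looks_like_email)
    print(conf)                   # 3 of 4 non-null values match: 0.75
    print(is_flagged(conf, 0.3))
```

Nulls are excluded from the denominator, so a sparse column with a few real emails still scores high.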

Exit codes

Code  Meaning
0     Scanned; nothing at or above the fail severity (or --check not set)
1     --check found PII at or above --fail-on
2     The file was missing or in an unsupported format
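
A Python pipeline could call the CLI and branch on these codes. The sketch below uses only the flags shown in this README; the helper names run_gate and describe_exit_code are hypothetical, and it assumes pii-sweep is on PATH:

```python
# Sketch: run the scan as a subprocess and interpret its exit code.
# run_gate and describe_exit_code are illustrative helpers, not part of pii-sweep.
import subprocess
import sys

EXIT_MEANINGS = {
    0: "no PII at or above the fail severity",
    1: "PII found at or above --fail-on",
    2: "file missing or unsupported format",
}


def describe_exit_code(code: int) -> str:
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_gate(path: str, fail_on: str = "medium") -> int:
    """Run `pii-sweep scan PATH --check --fail-on LEVEL` and return its exit code."""
    result = subprocess.run(
        ["pii-sweep", "scan", path, "--check", "--fail-on", fail_on]
    )
    return result.returncode


if __name__ == "__main__":
    code = run_gate(sys.argv[1])
    print(describe_exit_code(code), file=sys.stderr)
    sys.exit(code)
```

Distinguishing code 2 from code 1 lets a pipeline tell "the export never ran" apart from "the export contains PII".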

Scope

pii-sweep finds structured PII with clear patterns. It does not detect free-text names or addresses, and a clean report is not a compliance guarantee. Treat it as a fast guardrail, not a substitute for review.

License

MIT. See LICENSE.
