pii-sweep
Scan dataset files for personally identifiable information, with a per-column confidence score and a CI gate, before the data leaves your hands.
Before a dataset is shared, copied to a notebook, or pushed to a bucket, it is
worth knowing whether a column quietly holds emails, card numbers or national
IDs. pii-sweep samples each column, runs a set of detectors, and reports which
columns look like PII and how strongly.
$ pii-sweep scan customers.parquet
severity  column       type         confidence
high      card_number  credit_card  100%
high      tax_id       ssn          98%
medium    contact      email        91%
Install
$ pip install pii-sweep # from PyPI, once released
$ pip install git+https://github.com/jmweb-org/pii-sweep # latest, available now
Reads CSV, Parquet and JSON Lines through polars.
Usage
$ pii-sweep scan data.csv # human-readable table
$ pii-sweep scan data.parquet --json # machine-readable findings
$ pii-sweep scan data.csv --sample 5000 # cap values scanned per column
$ pii-sweep scan data.csv --threshold 0.3 # flag at a lower match fraction
$ pii-sweep scan data.csv --check # exit non-zero if PII is found
In CI
Stop a dataset with PII from being committed or published:
- run: pii-sweep scan data/export.parquet --check --fail-on medium
What it detects
| Type | Severity | How |
|---|---|---|
| credit_card | high | 13-19 digits passing the Luhn checksum |
| iban | high | Country format plus the mod-97 checksum |
| ssn | high | US social-security format with valid ranges |
| email | medium | Standard address pattern |
| phone | medium | International or grouped number, 9-15 digits |
| ipv4 | low | Dotted-quad address |
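To illustrate the two checksum rows in the table, here is a minimal Python sketch of a Luhn check and an IBAN mod-97 check. It is not pii-sweep's own code: the function names `luhn_valid` and `iban_valid` are hypothetical, and the IBAN check only enforces the general shape (two letters, two check digits, 11-30 alphanumerics) rather than per-country lengths.

```python
import re


def luhn_valid(value: str) -> bool:
    """Return True if value holds 13-19 digits that pass the Luhn checksum."""
    # Assumption: spaces and hyphens are allowed as group separators.
    stripped = value.replace(" ", "").replace("-", "")
    if not (stripped.isascii() and stripped.isdigit()):
        return False
    if not 13 <= len(stripped) <= 19:
        return False
    total = 0
    # Walk from the rightmost digit and double every second one.
    for i, ch in enumerate(reversed(stripped)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as summing the two digits of the product
        total += d
    return total % 10 == 0


def iban_valid(value: str) -> bool:
    """Return True if value has the general IBAN shape and passes mod-97."""
    iban = value.replace(" ", "").upper()
    # General shape only; real country formats fix the exact length.
    if not re.fullmatch(r"[A-Z]{2}[0-9]{2}[A-Z0-9]{11,30}", iban):
        return False
    # Move the country code and check digits to the end.
    rearranged = iban[4:] + iban[:4]
    # Map letters to numbers (A=10 ... Z=35); digits stay as they are.
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1


if __name__ == "__main__":
    print(luhn_valid("4111 1111 1111 1111"))
    print(iban_valid("GB82 WEST 1234 5698 7654 32"))
```

Both functions take a single string so they can be applied value by value to a sampled column.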
Detectors with a checksum (cards, IBAN) are strict, so a column of random
13-digit numbers is not flagged as cards: only about one in ten random digit
strings passes Luhn by chance, which keeps the match fraction low. The
confidence is the fraction of sampled non-null values a detector matched;
--threshold sets how high that fraction must be to flag a column.
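The confidence calculation described above can be sketched in a few lines of Python. This is an illustration under assumptions, not the tool's implementation: `column_confidence`, `is_flagged` and `looks_like_email` are hypothetical names, the sample is taken as the first N non-null values (the README does not say how sampling is done), and a column is treated as flagged when the confidence is greater than or equal to the threshold.

```python
import re
from typing import Callable, Iterable, Optional


def column_confidence(
    values: Iterable[object],
    detector: Callable[[str], bool],
    sample: Optional[int] = None,
) -> float:
    """Fraction of sampled non-null values that the detector matched."""
    non_null = [v for v in values if v is not None]
    if sample is not None:
        # Assumption: cap the scan at the first `sample` non-null values.
        non_null = non_null[:sample]
    if not non_null:
        return 0.0  # nothing to match in an empty or all-null column
    matched = sum(1 for v in non_null if detector(str(v)))
    return matched / len(non_null)


def is_flagged(confidence: float, threshold: float) -> bool:
    """Assumption: a column is flagged at or above the threshold."""
    return confidence >= threshold


def looks_like_email(value: str) -> bool:
    """Deliberately simple stand-in for an email detector."""
    return re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", value) is not None


if __name__ == "__main__":
    column = ["a@example.com", None, "b@example.org", "n/a", "c@example.net"]
    confidence = column_confidence(column, looks_like_email)
    print(confidence, is_flagged(confidence, threshold=0.3))
```

In this example the null is skipped, three of the four remaining values match, and the resulting 0.75 is compared against whatever threshold is passed in.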
Exit codes
| Code | Meaning |
|---|---|
| 0 | Scanned; nothing at or above the fail severity (or --check not set) |
| 1 | --check found PII at or above --fail-on |
| 2 | The file was missing or in an unsupported format |
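As a reading aid for the table, the decision it describes can be written as a small Python function. This is a sketch of the documented behaviour, not the tool's source: `exit_code` and `SEVERITY_RANK` are hypothetical names, and the ordering low < medium < high is assumed from the severity labels.

```python
from typing import Iterable

# Assumed ordering of the severities listed under "What it detects".
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2}


def exit_code(
    finding_severities: Iterable[str],
    check: bool,
    fail_on: str,
    file_ok: bool = True,
) -> int:
    """Map a scan result to the exit codes in the table above."""
    if not file_ok:
        return 2  # file missing or in an unsupported format
    if not check:
        return 0  # without --check the scan only reports
    limit = SEVERITY_RANK[fail_on]
    if any(SEVERITY_RANK[s] >= limit for s in finding_severities):
        return 1  # PII found at or above the --fail-on severity
    return 0


if __name__ == "__main__":
    print(exit_code(["high", "medium"], check=True, fail_on="medium"))
    print(exit_code(["low"], check=True, fail_on="medium"))
```

A CI step only needs to look at whether the code is zero; distinguishing 1 from 2 is useful when a missing export should be reported differently from a PII finding.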
Scope
pii-sweep finds structured PII with clear patterns. It does not detect free-text
names or addresses, and a clean report is not a compliance guarantee. Treat it
as a fast guardrail, not a substitute for review.
License
MIT. See LICENSE.