A batteries-included Python toolkit for detecting, transforming, masking, pseudonymizing, and auditing PII across data engineering workflows

Maintainers

ay-mich

NoPII

A Python package for detecting, transforming, and auditing Personally Identifiable Information (PII) in your data. Supports multiple data sources including CSV, JSON, Parquet, and pandas DataFrames with policy-driven configuration.

Features

🔍 PII Detection

Built-in Detectors: Identifies email addresses, phone numbers, credit cards, SSNs, IP addresses, names, addresses, and dates of birth
Confidence Scoring: Each detection includes a confidence score (0-100%) with configurable thresholds to balance precision and recall
Custom Pattern Support: Create your own detectors using regex patterns or implement the BaseDetector interface for complex logic
Multi-language Support: Localized detection patterns for different regions and formats (US phone numbers, international emails, etc.)
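
As a rough illustration of regex-based detection combined with a confidence threshold, the sketch below uses only the standard library. It does not use NoPII's BaseDetector interface: the Finding dataclass, the PATTERNS table, and the detect function are hypothetical names, the patterns are deliberately simple, and the confidence values are arbitrary placeholders.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    pii_type: str
    value: str
    start: int
    end: int
    confidence: float


# Hypothetical pattern table: PII type -> (compiled regex, fixed confidence).
# Real detectors would score matches individually; these values are placeholders.
PATTERNS = {
    "email": (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), 0.95),
    "phone": (re.compile(r"\b\d{3}-\d{3}-\d{4}\b"), 0.80),
    "ssn": (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), 0.60),
}


def detect(text, min_confidence=0.7):
    """Return findings whose confidence meets the threshold, ordered by position."""
    findings = []
    for pii_type, (pattern, confidence) in PATTERNS.items():
        if confidence < min_confidence:
            continue  # this detector is below the configured threshold
        for match in pattern.finditer(text):
            findings.append(
                Finding(pii_type, match.group(), match.start(), match.end(), confidence)
            )
    return sorted(findings, key=lambda f: f.start)


findings = detect("Contact john@example.com or 555-123-4567, SSN 123-45-6789")
```

Raising min_confidence drops lower-scoring detectors (higher precision), while lowering it lets more candidates through (higher recall).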

🛡️ Transformation Strategies

Masking: Replace characters with asterisks or custom symbols while preserving format (e.g., john@example.com → ****@example.com)
Redacting: Replace entire PII values with placeholder text (e.g., john@example.com → [REDACTED])
Hashing: One-way cryptographic transformation using SHA-256 or other algorithms, with optional salt for security
Tokenization: Replace with reversible tokens for data analysis while maintaining referential integrity across datasets
Nullification: Replace with null/empty values for complete data removal
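
The strategies above can be sketched in plain Python to show how they differ. This is a minimal standard-library illustration, not NoPII's implementation: mask_email, redact, hash_value, and Tokenizer are hypothetical names, and the in-memory token table is an assumption made to keep the example short.

```python
import hashlib
import secrets


def mask_email(value, mask_char="*"):
    """Mask the local part of an email address and keep the domain readable."""
    local, sep, domain = value.partition("@")
    return mask_char * len(local) + sep + domain


def redact(value, placeholder="[REDACTED]"):
    """Replace the whole value with a fixed placeholder."""
    return placeholder


def hash_value(value, salt="", algorithm="sha256"):
    """One-way hash; the optional salt is prepended before hashing."""
    return hashlib.new(algorithm, (salt + value).encode("utf-8")).hexdigest()


class Tokenizer:
    """Reversible tokenization backed by an in-memory lookup table."""

    def __init__(self):
        self._forward = {}
        self._reverse = {}

    def tokenize(self, value):
        # The same input always maps to the same token, which keeps
        # joins across datasets intact.
        if value not in self._forward:
            token = "tok_" + secrets.token_hex(8)
            self._forward[value] = token
            self._reverse[token] = value
        return self._forward[value]

    def detokenize(self, token):
        return self._reverse[token]
```

Masking and redaction discard information, hashing is one-way but stable for a given salt, and tokenization is the only strategy here that can be reversed, which is why it needs a protected lookup table.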

📊 Data Processing

Pandas DataFrames: Process tabular data with vectorized operations for performance, supporting column-wise scanning and transformation
File Formats: Direct support for CSV, JSON, Parquet, and Excel files with streaming for large datasets
Text & Dictionaries: Scan and transform plain text strings and Python dictionaries for flexible data handling
Memory Efficient: Streaming processing for large files to avoid loading entire datasets into memory

📋 Policy Management

YAML Configuration: Human-readable policy files defining detection rules, transformation actions, and confidence thresholds
Rule-based System: Match PII types (email, phone, ssn) to specific actions (mask, redact, hash) with customizable options
Exception Handling: Define patterns to skip (e.g., company email domains, test data) with regex-based exclusions
Policy Validation: Built-in validation ensures policy syntax is correct and transformation options are compatible

🔧 CLI & SDK

Command Line Interface: Five main commands (scan, transform, report, diff, policy) for file processing and policy management
Python SDK: High-level NoPIIClient for quick operations and low-level Scanner/Transform classes for fine-grained control
Audit Reporting: JSON audit trails with HTML/Markdown report generation including coverage metrics and PII type breakdowns
Coverage Scoring: Quantitative metrics showing percentage of data scanned and residual risk assessment

Installation

pip install nopii

The base installation includes core PII detection and transformation capabilities for text files, JSON, and basic CSV processing.

Optional Dependencies

Install optional extras for extended functionality:

# Pandas support for DataFrame operations and advanced tabular file formats
# Enables: Excel files, Parquet, advanced CSV operations, column-wise processing
pip install "nopii[pandas]"

# HTML reporting with styled templates and interactive elements
# Enables: Rich HTML reports, charts, detailed PII breakdowns, export options
pip install "nopii[report-html]"

# Install all optional dependencies
pip install "nopii[pandas,report-html]"

Quick Start

CLI Usage

The CLI provides five main commands for different PII processing workflows:

# Scan: Detect PII without modifying data
# Outputs findings with confidence scores and locations
nopii scan data.csv --format json --output scan_results.json

# Transform: Remove or mask PII from files
# Creates cleaned data + audit trail of what was changed
nopii transform data.csv transformed_data.csv --audit-report audit.json

# Report: Generate human-readable reports from audit data
# Convert JSON audit logs into HTML/Markdown with charts and summaries
nopii report audit.json --format html --output report.html

# Diff: Compare original vs transformed files
# Shows exactly what PII was detected and how it was changed
nopii diff original.csv transformed.csv

# Policy: Manage detection and transformation rules
# Validate YAML policies or create new ones
nopii policy validate my_policy.yaml

# Create a new policy file
nopii policy create new_policy.yaml --default-action redact

Note: the CLI is also available under the alias 'no-pii':

no-pii scan data.csv --format json


Exit codes:

- `0` when no PII is detected
- `1` when PII is found
- Non‑zero on errors
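
These exit codes make it possible to gate a CI job on a scan. The following is a small sketch of that idea: classify_exit_code and scan_file are hypothetical helper names, and treating every code other than 0 and 1 as an error is an assumption of this example rather than documented behavior.

```python
import subprocess


def classify_exit_code(code):
    """Map a scan exit code to a coarse outcome label."""
    # 0 and 1 follow the list above; treating every other code as an
    # error is an assumption of this sketch.
    if code == 0:
        return "clean"
    if code == 1:
        return "pii_found"
    return "error"


def scan_file(path):
    """Run `nopii scan` on a file and return (outcome, captured stdout)."""
    proc = subprocess.run(
        ["nopii", "scan", path, "--format", "json"],
        capture_output=True,
        text=True,
    )
    return classify_exit_code(proc.returncode), proc.stdout
```

A pipeline step could call scan_file("data.csv") and fail the build on anything other than "clean".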

Python SDK / Core

The SDK provides two levels of access: low-level core classes for fine-grained control and a high-level client for quick operations.

Core Classes (Low-level API)

Use the Scanner and Transform classes directly when you need precise control over detection and transformation:

from nopii.core.scanner import Scanner
from nopii.core.transform import Transform
from nopii.policy.loader import create_default_policy, load_policy

# Load policy (default or custom YAML)
policy = create_default_policy()  # or load_policy("policy.yaml")

# Scanner: Detect PII without modifying data
# Returns list of Finding objects with location, confidence, and PII type
scanner = Scanner(policy)
findings = scanner.scan_text("Contact john@example.com or 555-123-4567")
print(f"Found {len(findings)} findings")

# Transform: Apply policy actions (mask, redact, hash) to PII
# Returns tuple of (cleaned_text, findings_list)
transformer = Transform(policy)
transformed_text, findings = transformer.transform_text("Contact john@example.com or 555-123-4567")
print(f"Transformed: {transformed_text}")

# DataFrame operations (requires pandas extra)
import pandas as pd
df = pd.DataFrame({"email": ["john@example.com"], "phone": ["555-123-4567"]})

# Scan entire DataFrame, get detailed results per column
scan_result = scanner.scan_dataframe(df, dataset_name="contacts")

# Transform DataFrame, get cleaned data + comprehensive audit report
df_transformed, audit_report = transformer.transform_dataframe(df, dataset_name="contacts")
print(f"Coverage: {audit_report.coverage_score:.1%}, Risk: {audit_report.residual_risk}")

High-Level Client (Quick Operations)

Use NoPIIClient for simple, one-line operations with sensible defaults:

from nopii.sdk import NoPIIClient

client = NoPIIClient()

# Scan text
findings = client.scanner.scan_text("Contact john@example.com")
print(f"Found {len(findings)} PII items")

# Transform text
result = client.transform_text("Contact john@example.com")
print(result)  # "Contact ****@example.com"

DataFrame Processing

import pandas as pd
from nopii.core.scanner import Scanner
from nopii.core.transform import Transform
from nopii.policy.loader import create_default_policy

policy = create_default_policy()

scanner = Scanner(policy)
transformer = Transform(policy)

# Load and process data
df = pd.read_csv("customer_data.csv")
scan_result = scanner.scan_dataframe(df, dataset_name="customers")
transformed_df, audit = transformer.transform_dataframe(df, dataset_name="customers")

# Review results
print(f"Processed {len(df)} rows, coverage: {audit.coverage_score:.1%}")
print(f"PII types found: {[f.pii_type for f in scan_result.findings]}")
print(f"Columns affected: {len(audit.column_reports)}")

Performance & Streaming

NoPII is designed for efficient processing of large datasets:

Memory-Efficient Streaming:

CLI and SDK automatically stream .csv and .txt/.md files to avoid loading entire files into memory
Processes files line-by-line or in configurable chunks (default: 1000 rows)
Suitable for multi-GB files on standard hardware

In-Memory Operations:

JSON/Parquet files and DataFrame operations require pandas and load data into memory
Recommended for files under 1GB or when you need full DataFrame functionality
For very large JSON, consider line-delimited JSON (JSONL) and chunked processing.
Coverage metrics for streaming scans are computed without a full DataFrame, using policy rules and detected items.
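
To make the chunked approach concrete, here is a minimal standard-library sketch of reading a CSV in fixed-size chunks so that only one chunk is held in memory at a time. It is not NoPII's streaming code: iter_csv_chunks and count_matching_cells are hypothetical names, and the default of 1000 rows simply mirrors the chunk size mentioned above.

```python
import csv
from itertools import islice


def iter_csv_chunks(path, chunk_size=1000):
    """Yield lists of row dicts, at most chunk_size rows per list."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        while True:
            chunk = list(islice(reader, chunk_size))
            if not chunk:
                break
            yield chunk


def count_matching_cells(path, predicate, chunk_size=1000):
    """Count cells for which predicate(value) is true, one chunk at a time."""
    total = 0
    for chunk in iter_csv_chunks(path, chunk_size):
        for row in chunk:
            total += sum(1 for value in row.values() if value and predicate(value))
    return total
```

Because each chunk is discarded before the next is read, peak memory depends on chunk_size rather than on file size.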

Policy Configuration (YAML)

name: my_policy
default_action: mask
thresholds:
  min_confidence: 0.7
rules:
  - match: email
    action: mask
    options:
      mask_char: "*"
  - match: phone
    action: redact
  - match: ssn
    action: hash
    options:
      algorithm: sha256
exceptions: []

Rule Options Validation

Policy rule options are validated based on the rule action:

mask
- mask_char: string
- preserve_first: integer
- preserve_last: integer
hash
- algorithm: one of md5, sha1, sha256, sha512
- max_length: integer
tokenize
- deterministic: boolean
- token_length: integer

Invalid or mismatched types will be reported by PolicyValidator as errors when loading/validating a policy.
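
The kind of check described here can be sketched as a small type table plus a validation function. This is an illustration only and not the PolicyValidator implementation: OPTION_TYPES, ALLOWED_HASH_ALGORITHMS, and validate_rule are hypothetical names, and rejecting unknown options is an assumption of the sketch.

```python
# Expected option types per action, transcribed from the list above.
ALLOWED_HASH_ALGORITHMS = {"md5", "sha1", "sha256", "sha512"}
OPTION_TYPES = {
    "mask": {"mask_char": str, "preserve_first": int, "preserve_last": int},
    "hash": {"algorithm": str, "max_length": int},
    "tokenize": {"deterministic": bool, "token_length": int},
}


def validate_rule(rule):
    """Return a list of error strings for one policy rule dict."""
    errors = []
    action = rule.get("action")
    schema = OPTION_TYPES.get(action, {})
    for name, value in (rule.get("options") or {}).items():
        expected = schema.get(name)
        if expected is None:
            # Assumption: options not listed for the action are rejected.
            errors.append(f"{action}: unknown option {name!r}")
        # bool is a subclass of int in Python, so reject it for integer options.
        elif not isinstance(value, expected) or (
            expected is int and isinstance(value, bool)
        ):
            errors.append(f"{action}.{name}: expected {expected.__name__}")
        elif name == "algorithm" and value not in ALLOWED_HASH_ALGORITHMS:
            errors.append(f"hash.algorithm: unsupported value {value!r}")
    return errors
```

For example, validate_rule({"action": "mask", "options": {"preserve_first": "2"}}) returns one error because the option is a string rather than an integer.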

Performance

Streams large CSV/text files to avoid memory issues
Processes multi-GB files efficiently
DataFrame operations require pandas (in-memory)

License

This project is licensed under the Apache License, Version 2.0 - see the LICENSE file for details.

Release history

0.1.3 (this version): Oct 22, 2025
0.1.1: Oct 22, 2025

Download files

Source Distribution

nopii-0.1.3.tar.gz (77.3 kB)

Uploaded Oct 22, 2025 Source

Built Distribution

nopii-0.1.3-py3-none-any.whl (78.5 kB)

Uploaded Oct 22, 2025 Python 3

File details

Details for the file nopii-0.1.3.tar.gz.

File metadata

Download URL: nopii-0.1.3.tar.gz
Upload date: Oct 22, 2025
Size: 77.3 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for nopii-0.1.3.tar.gz
Algorithm	Hash digest
SHA256	`e4b3038426364368a9bfc2e377c93b94537db1a4aaf1fa713fd065d5f9005269`
MD5	`a1081e774f6e7bea3e4ae612422efb41`
BLAKE2b-256	`af50730e7cc19826e3171195769b375c1e71b6214fd05a350f5620cd539297f3`

Provenance

The following attestation bundles were made for nopii-0.1.3.tar.gz:

Publisher: publish.yml on ay-mich/nopii

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: nopii-0.1.3.tar.gz
- Subject digest: e4b3038426364368a9bfc2e377c93b94537db1a4aaf1fa713fd065d5f9005269
- Sigstore transparency entry: 629427661
- Sigstore integration time: Oct 22, 2025
Source repository:
- Permalink: ay-mich/nopii@f3a950eca72bd0cfb2a224c03fb1603ce7425735
- Branch / Tag: refs/tags/v0.1.3
- Owner: https://github.com/ay-mich
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@f3a950eca72bd0cfb2a224c03fb1603ce7425735
- Trigger Event: push

File details

Details for the file nopii-0.1.3-py3-none-any.whl.

File metadata

Download URL: nopii-0.1.3-py3-none-any.whl
Upload date: Oct 22, 2025
Size: 78.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for nopii-0.1.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`18a0d5e489f294490339eb70b45dff2f6ea994a99a56b563b54751795946e859`
MD5	`de95a5fb755aa69956b8ce738f70c1a2`
BLAKE2b-256	`1c382ee96bc2a5282ac04afc44ed54d43ef7744be8209259625ea755319c66af`

Provenance

The following attestation bundles were made for nopii-0.1.3-py3-none-any.whl:

Publisher: publish.yml on ay-mich/nopii

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: nopii-0.1.3-py3-none-any.whl
- Subject digest: 18a0d5e489f294490339eb70b45dff2f6ea994a99a56b563b54751795946e859
- Sigstore transparency entry: 629427668
- Sigstore integration time: Oct 22, 2025
Source repository:
- Permalink: ay-mich/nopii@f3a950eca72bd0cfb2a224c03fb1603ce7425735
- Branch / Tag: refs/tags/v0.1.3
- Owner: https://github.com/ay-mich
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@f3a950eca72bd0cfb2a224c03fb1603ce7425735
- Trigger Event: push
