
Python SDK for Sentinel PII Redaction - State-of-the-art PII detection and redaction using fine-tuned Granite models


Sentinel PII SDK

State-of-the-art PII detection and redaction using the Sentinel model

Sentinel PII SDK is a Python library for identifying and redacting Personally Identifiable Information (PII) in text.

Features

  • High-accuracy PII detection (95%+ recall)
  • Multiple handling modes: TAG, REDACT, or REPLACE
  • Batch processing support

Installation

From PyPI

pip install sentinel-pii-sdk

With faker support for REPLACE mode:

pip install 'sentinel-pii-sdk[faker]'

From Source

git clone https://github.com/cernis-intelligence/sentinel-pii-sdk.git
cd sentinel-pii-sdk
pip install -e .

Quick Start

from sentinel_pii import SentinelPIIRedactor

# Initialize (model loads from HuggingFace on first use)
redactor = SentinelPIIRedactor()

# Detect PII and replace each match with its category tag
text = "My name is John Smith and my email is john@email.com"
result = redactor.redact_text(text)
print(result)
# Output: "My name is [PERSON_NAME] and my email is [EMAIL_ADDRESS]"

Usage Examples

Basic PII Detection

from sentinel_pii import SentinelPIIRedactor, PIIHandlingMode

redactor = SentinelPIIRedactor()

text = "Contact John Smith at john@email.com or call (555) 123-4567"

# TAG mode - Show PII categories
result = redactor.redact_text(text, mode=PIIHandlingMode.TAG)
print(result)
# "Contact [PERSON_NAME] at [EMAIL_ADDRESS] or call [PHONE_NUMBER]"

# REDACT mode - Same as TAG
result = redactor.redact_text(text, mode=PIIHandlingMode.REDACT)
print(result)
# "Contact [PERSON_NAME] at [EMAIL_ADDRESS] or call [PHONE_NUMBER]"

# REPLACE mode - Replace with fake data (requires faker)
result = redactor.redact_text(text, mode=PIIHandlingMode.REPLACE)
print(result)
# "Contact Jane Doe at jane.doe@example.com or call (555) 987-6543"

Batch Processing

from sentinel_pii import detect_pii_batch, PIIHandlingMode

documents = [
    "My email is john@email.com",
    "Patient DOB: 1990-05-15, diagnosed with diabetes"
]

results = detect_pii_batch(documents, mode=PIIHandlingMode.TAG)
for result in results:
    print(result)
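
For large document lists it can help to feed the batch function in fixed-size chunks. The following is a minimal sketch, not part of the SDK: `chunked`, `process_in_chunks`, and `fake_batch` are hypothetical helpers, and the sketch assumes the batch function returns one result per input document, in input order. A stand-in replaces `detect_pii_batch` so the code runs without downloading the model.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def process_in_chunks(
    documents: Sequence[str],
    batch_fn: Callable[[List[str]], List[str]],
    size: int = 32,
) -> List[str]:
    """Call `batch_fn` on each chunk and concatenate the results in input order."""
    results: List[str] = []
    for chunk in chunked(documents, size):
        results.extend(batch_fn(chunk))
    return results


def fake_batch(docs: List[str]) -> List[str]:
    """Stand-in for detect_pii_batch (assumption: one output per input)."""
    return [doc.upper() for doc in docs]


docs = ["a", "b", "c", "d", "e"]
chunk_sizes = [len(chunk) for chunk in chunked(docs, 2)]
processed = process_in_chunks(docs, fake_batch, size=2)
print(chunk_sizes)  # [2, 2, 1]
print(processed)    # ['A', 'B', 'C', 'D', 'E']
```

With the SDK installed, `batch_fn` could be something like `lambda chunk: detect_pii_batch(chunk, mode=PIIHandlingMode.TAG)`, provided the function behaves as assumed above.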

Dataset Cleaning

from sentinel_pii import clean_dataset, PIIHandlingMode

# Clean a JSONL dataset file
clean_dataset(
    input_filename="input_data.jsonl",
    output_filename="output_data.jsonl",
    mode=PIIHandlingMode.TAG
)
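
The README does not say which JSONL field `clean_dataset` reads. The sketch below only illustrates the general shape of line-by-line JSONL cleaning. It is not the SDK's implementation: the `text` field name, the `clean_jsonl` helper, and the regex-based `stand_in_redact` (which tags email addresses only) are all assumptions made so the example runs without the model.

```python
import json
import re
import tempfile
from pathlib import Path
from typing import Callable

# Toy pattern for illustration only; the Sentinel model is not regex-based.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def stand_in_redact(text: str) -> str:
    """Hypothetical stand-in for redactor.redact_text: tags emails only."""
    return EMAIL_RE.sub("[EMAIL_ADDRESS]", text)


def clean_jsonl(input_path, output_path, redact: Callable[[str], str],
                field: str = "text") -> int:
    """Rewrite each JSON object line, redacting `field` when it is a string.

    Returns the number of records written. The field name is an assumption.
    """
    count = 0
    with open(input_path, encoding="utf-8") as src, \
            open(output_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            if isinstance(record, dict) and isinstance(record.get(field), str):
                record[field] = redact(record[field])
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


with tempfile.TemporaryDirectory() as tmp:
    src_path = Path(tmp) / "in.jsonl"
    dst_path = Path(tmp) / "out.jsonl"
    src_path.write_text(
        '{"id": 1, "text": "My email is john@email.com"}\n', encoding="utf-8"
    )
    written = clean_jsonl(src_path, dst_path, stand_in_redact)
    cleaned = [
        json.loads(line)
        for line in dst_path.read_text(encoding="utf-8").splitlines()
    ]

print(written)  # 1
print(cleaned)  # [{'id': 1, 'text': 'My email is [EMAIL_ADDRESS]'}]
```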

Supported PII Categories

The Sentinel model detects 20+ PII categories:

Identity: PERSON_NAME, USERNAME, AGE, GENDER, DEMOGRAPHIC_GROUP

Contact: EMAIL_ADDRESS, PHONE_NUMBER, STREET_ADDRESS, CITY, STATE, POSTCODE, COUNTRY

Dates: DATE, DATE_OF_BIRTH

ID Numbers: PERSONAL_ID, PASSPORT, DRIVERLICENSE

Financial: CREDIT_CARD_INFO, BANKING_NUMBER

Security: PASSWORD, SECURE_CREDENTIAL

Medical: MEDICAL_CONDITION

Other: ORGANIZATION_NAME, DOMAIN_NAME, NATIONALITY, RELIGIOUS_AFFILIATION
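
Since TAG mode writes each category in square brackets, the output can be summarized with a small regular expression. This is a sketch under that assumption, not an SDK feature: `count_tags` is a hypothetical helper, and it will also count any bracketed upper-case token that was already present in the source text.

```python
import re
from collections import Counter

# Matches tags such as [PERSON_NAME] or [EMAIL_ADDRESS].
TAG_PATTERN = re.compile(r"\[([A-Z_]+)\]")


def count_tags(tagged_text: str) -> Counter:
    """Count how often each bracketed category appears in TAG-mode output."""
    return Counter(TAG_PATTERN.findall(tagged_text))


sample = "Contact [PERSON_NAME] at [EMAIL_ADDRESS] or call [PHONE_NUMBER]"
counts = count_tags(sample)
print(dict(counts))
# {'PERSON_NAME': 1, 'EMAIL_ADDRESS': 1, 'PHONE_NUMBER': 1}
```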

API Reference

SentinelPIIRedactor

Main class for PII detection.

redactor = SentinelPIIRedactor(pii_categories=None)

Parameters:

  • pii_categories (optional): Custom PII categories string

Methods:

  • redact_text(text, mode=PIIHandlingMode.TAG, locale="en_US") - Process single text
  • detect_pii(documents, mode=PIIHandlingMode.TAG, locale="en_US", show_progress=True) - Process list of documents

Utility Functions

  • detect_pii_batch(documents, mode=PIIHandlingMode.TAG, locale="en_US") - Batch processing
  • clean_dataset(input_filename, output_filename, mode=PIIHandlingMode.TAG, locale="en_US") - Clean JSONL files

PIIHandlingMode

Enum for handling modes:

  • PIIHandlingMode.TAG - Show PII categories in brackets
  • PIIHandlingMode.REDACT - Same as TAG
  • PIIHandlingMode.REPLACE - Replace with fake data (requires faker)
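
Because REPLACE mode depends on the optional faker package, a caller may want to fall back to TAG when faker is missing. The sketch below uses only the standard library. `choose_mode_name` is a hypothetical helper that returns the mode's name as a string, so it runs without the SDK installed.

```python
import importlib.util


def choose_mode_name(prefer_replace: bool) -> str:
    """Return "REPLACE" only if requested and faker is importable, else "TAG"."""
    faker_available = importlib.util.find_spec("faker") is not None
    if prefer_replace and faker_available:
        return "REPLACE"
    return "TAG"


mode_name = choose_mode_name(prefer_replace=True)
print(mode_name)  # "REPLACE" if faker is installed, otherwise "TAG"
```

If `PIIHandlingMode` is a standard Python `Enum`, as the description suggests, `PIIHandlingMode[mode_name]` would convert the name into the corresponding member.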

Model Information

  • Model: cernis-intelligence/sentinel on HuggingFace
  • Performance: 95%+ recall, ~100 docs/min on GPU
  • License: Apache 2.0
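
The throughput figure above gives a rough way to budget processing time. The helper below is plain arithmetic on the quoted ~100 docs/min. Actual speed will depend on hardware and document length, so treat the result as an estimate only. `estimated_minutes` is a hypothetical name.

```python
def estimated_minutes(num_docs: int, docs_per_minute: float = 100.0) -> float:
    """Estimate processing time in minutes at a given throughput."""
    if docs_per_minute <= 0:
        raise ValueError("docs_per_minute must be positive")
    return num_docs / docs_per_minute


print(estimated_minutes(10_000))                  # 100.0 minutes
print(round(estimated_minutes(250_000) / 60, 1))  # 41.7 hours
```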

Requirements

  • Python >= 3.9
  • transformers >= 4.36.0
  • torch >= 2.0.0
  • accelerate >= 0.20.0
  • tqdm >= 4.65.0
  • faker >= 20.0.0 (optional, for REPLACE mode)
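
One way to check an environment against the minimums listed above is sketched here with the standard library only. `parse_version`, `meets_minimum`, and `check_requirements` are hypothetical helpers. The comparison looks only at leading numeric components, so pre-release and local version suffixes are ignored. This is a simplification, not full PEP 440 handling.

```python
import re
import sys
from importlib import metadata
from typing import Dict, Optional, Tuple

# Minimum versions copied from the requirements list.
MINIMUMS = {
    "transformers": "4.36.0",
    "torch": "2.0.0",
    "accelerate": "0.20.0",
    "tqdm": "4.65.0",
}


def parse_version(version: str) -> Tuple[int, ...]:
    """Extract leading numeric components, e.g. '2.1.0+cu118' -> (2, 1, 0)."""
    parts = []
    for piece in version.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare two versions after padding both to the same length with zeros."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    return a + (0,) * (width - len(a)) >= b + (0,) * (width - len(b))


def check_requirements(minimums: Dict[str, str]) -> Dict[str, Optional[bool]]:
    """Map each package to True/False, or None if it is not installed."""
    status: Dict[str, Optional[bool]] = {}
    for name, minimum in minimums.items():
        try:
            status[name] = meets_minimum(metadata.version(name), minimum)
        except metadata.PackageNotFoundError:
            status[name] = None
    return status


python_ok = sys.version_info >= (3, 9)
report = check_requirements(MINIMUMS)
print("Python >= 3.9:", python_ok)
for package, ok in report.items():
    print(package, "missing" if ok is None else ("ok" if ok else "too old"))
```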

Examples

The examples/ directory contains working sample scripts:

# Basic single-text PII detection
python3.11 examples/basic_usage.py

# Process multiple documents at once
python3.11 examples/batch_processing.py

# Clean JSONL dataset files
python3.11 examples/dataset_cleaning.py

# Validate package structure (no model download)
python3.11 examples/test_all_examples.py

You can also use the included sample_data.jsonl for testing:

from sentinel_pii import clean_dataset, PIIHandlingMode

clean_dataset(
    "examples/sample_data.jsonl",
    "output.jsonl",
    mode=PIIHandlingMode.TAG
)

Contributing

Contributions welcome! Please submit a Pull Request.

License

Apache 2.0 License - see LICENSE file for details.
