Python SDK for Sentinel PII Redaction - State-of-the-art PII detection and redaction using fine-tuned Granite models
Sentinel PII SDK
State-of-the-art PII detection and redaction using the Sentinel model
Sentinel PII SDK is a Python library for identifying and redacting Personally Identifiable Information (PII) in text.
Features
- High-accuracy PII detection (95%+ recall)
- Multiple handling modes: TAG, REDACT, or REPLACE
- Batch processing support
Installation
From PyPI
pip install sentinel-pii-sdk
With faker support for REPLACE mode:
pip install 'sentinel-pii-sdk[faker]'
From Source
git clone https://github.com/cernis-intelligence/sentinel-pii-sdk.git
cd sentinel-pii-sdk
pip install -e .
Quick Start
from sentinel_pii import SentinelPIIRedactor
# Initialize (model loads from HuggingFace on first use)
redactor = SentinelPIIRedactor()
# Detect PII in text
text = "My name is John Smith and my email is john@email.com"
result = redactor.redact_text(text)
print(result)
# Output: "My name is [PERSON_NAME] and my email is [EMAIL_ADDRESS]"
Usage Examples
Basic PII Detection
from sentinel_pii import SentinelPIIRedactor, PIIHandlingMode
redactor = SentinelPIIRedactor()
text = "Contact John Smith at john@email.com or call (555) 123-4567"
# TAG mode - Show PII categories
result = redactor.redact_text(text, mode=PIIHandlingMode.TAG)
print(result)
# "Contact [PERSON_NAME] at [EMAIL_ADDRESS] or call [PHONE_NUMBER]"
# REDACT mode - Same as TAG
result = redactor.redact_text(text, mode=PIIHandlingMode.REDACT)
print(result)
# "Contact [PERSON_NAME] at [EMAIL_ADDRESS] or call [PHONE_NUMBER]"
# REPLACE mode - Replace with fake data (requires faker)
result = redactor.redact_text(text, mode=PIIHandlingMode.REPLACE)
print(result)
# "Contact Jane Doe at jane.doe@example.com or call (555) 987-6543"
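To make the difference between the modes concrete, here is a minimal sketch that does not use the Sentinel model at all. It assumes the PII spans have already been found and only shows how tagging and replacement differ in what gets substituted. The `Span` type and the `apply_tags` and `apply_replacements` functions are hypothetical names for illustration, not part of the SDK.

```python
from typing import Dict, List, NamedTuple


class Span(NamedTuple):
    """A detected PII span: character offsets plus a category label (hypothetical type)."""
    start: int
    end: int
    category: str


def apply_tags(text: str, spans: List[Span]) -> str:
    """Replace each span with its bracketed category, e.g. [EMAIL_ADDRESS]."""
    # Work right to left so earlier offsets stay valid after each substitution.
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        text = text[:span.start] + f"[{span.category}]" + text[span.end:]
    return text


def apply_replacements(text: str, spans: List[Span], fakes: Dict[str, str]) -> str:
    """Replace each span with a stand-in value for its category."""
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        # Fall back to the bracketed tag when no stand-in value is supplied.
        substitute = fakes.get(span.category, f"[{span.category}]")
        text = text[:span.start] + substitute + text[span.end:]
    return text


text = "Contact John Smith at john@email.com"
spans = [Span(8, 18, "PERSON_NAME"), Span(22, 36, "EMAIL_ADDRESS")]

tagged = apply_tags(text, spans)
# tagged == "Contact [PERSON_NAME] at [EMAIL_ADDRESS]"

replaced = apply_replacements(text, spans, {"PERSON_NAME": "Jane Doe"})
# replaced == "Contact Jane Doe at [EMAIL_ADDRESS]"
```

In the SDK itself, span detection is done by the model and REPLACE mode draws its stand-in values from faker; the fixed dictionary here is only a placeholder for that step.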
Batch Processing
from sentinel_pii import detect_pii_batch, PIIHandlingMode
documents = [
    "My email is john@email.com",
    "Patient DOB: 1990-05-15, diagnosed with diabetes"
]
results = detect_pii_batch(documents, mode=PIIHandlingMode.TAG)
for result in results:
    print(result)
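For large corpora it may help to feed documents through in fixed-size chunks, so that memory use stays bounded and partial results can be saved along the way. The sketch below wraps any batch function this way. `chunked`, `process_in_chunks`, and `chunk_size` are hypothetical names, and a stand-in function is used in place of `detect_pii_batch` so the block runs without downloading the model.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], chunk_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most chunk_size items."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


def process_in_chunks(
    documents: Sequence[str],
    batch_fn: Callable[[List[str]], List[str]],
    chunk_size: int = 32,
) -> List[str]:
    """Apply batch_fn to each chunk and concatenate the results in input order."""
    results: List[str] = []
    for chunk in chunked(documents, chunk_size):
        results.extend(batch_fn(list(chunk)))
    return results


def fake_batch(docs: List[str]) -> List[str]:
    """Stand-in for a real batch function; it only upper-cases its input."""
    return [d.upper() for d in docs]


out = process_in_chunks(["a", "b", "c", "d", "e"], fake_batch, chunk_size=2)
# out == ["A", "B", "C", "D", "E"]
```

With the SDK installed, `batch_fn` could be something like `lambda docs: detect_pii_batch(docs, mode=PIIHandlingMode.TAG)`, on the assumption that `detect_pii_batch` returns one result per input document in the same order.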
Dataset Cleaning
from sentinel_pii import clean_dataset, PIIHandlingMode
# Clean a JSONL dataset file
clean_dataset(
    input_filename="input_data.jsonl",
    output_filename="output_data.jsonl",
    mode=PIIHandlingMode.TAG
)
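Because the quoted recall is 95%+ rather than 100%, a cheap spot-check of the cleaned output can be worthwhile. The following sketch scans each line of a JSONL file for email-like strings using a simple regular expression. The `find_residual_emails` helper and the pattern are illustrative assumptions, not part of the SDK, and the check treats each line as raw text so it does not depend on the field names in your records.

```python
import re
from pathlib import Path
from typing import List, Tuple

# Deliberately simple pattern: good enough for a spot-check, not a validator.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_residual_emails(path: str) -> List[Tuple[int, str]]:
    """Return (line_number, match) pairs for email-like strings left in the file."""
    hits: List[Tuple[int, str]] = []
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            for match in EMAIL_RE.findall(line):
                hits.append((line_number, match))
    return hits
```

Calling `find_residual_emails("output_data.jsonl")` after cleaning returns an empty list when no email-like strings remain; any hits point to lines worth reviewing by hand.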
Supported PII Categories
The Sentinel model detects 20+ PII categories:
Identity: PERSON_NAME, USERNAME, AGE, GENDER, DEMOGRAPHIC_GROUP
Contact: EMAIL_ADDRESS, PHONE_NUMBER, STREET_ADDRESS, CITY, STATE, POSTCODE, COUNTRY
Dates: DATE, DATE_OF_BIRTH
ID Numbers: PERSONAL_ID, PASSPORT, DRIVERLICENSE
Financial: CREDIT_CARD_INFO, BANKING_NUMBER
Security: PASSWORD, SECURE_CREDENTIAL
Medical: MEDICAL_CONDITION
Other: ORGANIZATION_NAME, DOMAIN_NAME, NATIONALITY, RELIGIOUS_AFFILIATION
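When working with TAG-mode output it can be handy to roll the bracketed labels up into the groups listed above. The sketch below is a standalone helper, not SDK functionality: `CATEGORY_GROUPS` simply transcribes the list above, and `summarize_tags` is a hypothetical function that counts `[CATEGORY]` tags per group.

```python
import re
from collections import Counter
from typing import Dict, List

# Transcribed from the category list above; group names are for reporting only.
CATEGORY_GROUPS: Dict[str, List[str]] = {
    "Identity": ["PERSON_NAME", "USERNAME", "AGE", "GENDER", "DEMOGRAPHIC_GROUP"],
    "Contact": ["EMAIL_ADDRESS", "PHONE_NUMBER", "STREET_ADDRESS", "CITY",
                "STATE", "POSTCODE", "COUNTRY"],
    "Dates": ["DATE", "DATE_OF_BIRTH"],
    "ID Numbers": ["PERSONAL_ID", "PASSPORT", "DRIVERLICENSE"],
    "Financial": ["CREDIT_CARD_INFO", "BANKING_NUMBER"],
    "Security": ["PASSWORD", "SECURE_CREDENTIAL"],
    "Medical": ["MEDICAL_CONDITION"],
    "Other": ["ORGANIZATION_NAME", "DOMAIN_NAME", "NATIONALITY",
              "RELIGIOUS_AFFILIATION"],
}

# Reverse lookup: category label -> group name.
GROUP_OF = {cat: group for group, cats in CATEGORY_GROUPS.items() for cat in cats}

# Matches bracketed upper-case labels such as [EMAIL_ADDRESS].
TAG_RE = re.compile(r"\[([A-Z_]+)\]")


def summarize_tags(tagged_text: str) -> Counter:
    """Count bracketed category tags per group; unknown labels are ignored."""
    counts: Counter = Counter()
    for label in TAG_RE.findall(tagged_text):
        group = GROUP_OF.get(label)
        if group is not None:
            counts[group] += 1
    return counts


summary = summarize_tags(
    "Contact [PERSON_NAME] at [EMAIL_ADDRESS] or call [PHONE_NUMBER]"
)
# dict(summary) == {"Identity": 1, "Contact": 2}
```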
API Reference
SentinelPIIRedactor
Main class for PII detection.
redactor = SentinelPIIRedactor(pii_categories=None)
Parameters:
- pii_categories (optional): Custom PII categories string
Methods:
- redact_text(text, mode=PIIHandlingMode.TAG, locale="en_US") - Process a single text
- detect_pii(documents, mode=PIIHandlingMode.TAG, locale="en_US", show_progress=True) - Process a list of documents
Utility Functions
- detect_pii_batch(documents, mode=PIIHandlingMode.TAG, locale="en_US") - Batch processing
- clean_dataset(input_filename, output_filename, mode=PIIHandlingMode.TAG, locale="en_US") - Clean JSONL files
PIIHandlingMode
Enum for handling modes:
- PIIHandlingMode.TAG - Show PII categories in brackets
- PIIHandlingMode.REDACT - Same as TAG
- PIIHandlingMode.REPLACE - Replace with fake data (requires faker)
Model Information
- Model: cernis-intelligence/sentinel on HuggingFace
- Performance: 95%+ recall, ~100 docs/min on GPU
- License: Apache 2.0
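The ~100 docs/min figure gives a rough way to budget run time. As a back-of-the-envelope sketch (the function name is hypothetical, and real throughput will depend on hardware and document length):

```python
def estimate_minutes(num_documents: int, docs_per_minute: float = 100.0) -> float:
    """Rough run-time estimate from a documents-per-minute throughput figure."""
    if docs_per_minute <= 0:
        raise ValueError("docs_per_minute must be positive")
    return num_documents / docs_per_minute


minutes = estimate_minutes(25_000)  # 25,000 documents at ~100 docs/min
hours = minutes / 60
# minutes == 250.0, which is a little over four hours
```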
Requirements
- Python >= 3.9
- transformers >= 4.36.0
- torch >= 2.0.0
- accelerate >= 0.20.0
- tqdm >= 4.65.0
- faker >= 20.0.0 (optional, for REPLACE mode)
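The minimum versions above can be checked against the current environment with the standard library. The sketch below is one possible approach, not something shipped with the SDK: `version_tuple` and `check_requirements` are hypothetical helpers, and the version parsing is intentionally simplified (the `packaging` library handles pre-releases and other edge cases more rigorously).

```python
import re
import sys
from importlib import metadata
from typing import Dict, Optional, Tuple

# Minimum versions copied from the requirements list above.
MINIMUMS = {
    "transformers": "4.36.0",
    "torch": "2.0.0",
    "accelerate": "0.20.0",
    "tqdm": "4.65.0",
}


def version_tuple(version: str) -> Tuple[int, ...]:
    """Turn '2.1.0+cu118' into (2, 1, 0); local and pre-release suffixes are dropped."""
    parts = []
    for piece in version.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    # Pad so that '4.36' compares equal to '4.36.0'.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def check_requirements() -> Dict[str, Optional[bool]]:
    """Map each package to True (new enough), False (too old), or None (not installed)."""
    report: Dict[str, Optional[bool]] = {}
    for name, minimum in MINIMUMS.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None
            continue
        report[name] = version_tuple(installed) >= version_tuple(minimum)
    return report


python_ok = sys.version_info >= (3, 9)
report = check_requirements()
```

Any package mapped to `None` or `False` in `report` needs to be installed or upgraded before the SDK is likely to work.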
Examples
The examples/ directory contains working sample scripts:
# Basic single-text PII detection
python3.11 examples/basic_usage.py
# Process multiple documents at once
python3.11 examples/batch_processing.py
# Clean JSONL dataset files
python3.11 examples/dataset_cleaning.py
# Validate package structure (no model download)
python3.11 examples/test_all_examples.py
You can also use the included sample_data.jsonl for testing:
from sentinel_pii import clean_dataset, PIIHandlingMode
clean_dataset(
    "examples/sample_data.jsonl",
    "output.jsonl",
    mode=PIIHandlingMode.TAG
)
Contributing
Contributions welcome! Please submit a Pull Request.
License
Apache 2.0 License - see LICENSE file for details.
Support
- HuggingFace: cernis-intelligence/sentinel
- Issues: GitHub Issues
Acknowledgments
- Built on IBM Granite 4.0
- Training data from AI4Privacy
Download files
File details
Details for the file sentinel_pii_sdk-0.1.0.tar.gz.
File metadata
- Download URL: sentinel_pii_sdk-0.1.0.tar.gz
- Upload date:
- Size: 12.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2237923604f2466aea4cab8a17319d57d01c79945c310c3f270463a4a55ecdcc |
| MD5 | 0c4f847871f58413bf8e4254c0a34563 |
| BLAKE2b-256 | 0b0ea705a6fab484893c8c6fb3ba718bb3f9cc01a27397cd2e69015185d65306 |
File details
Details for the file sentinel_pii_sdk-0.1.0-py3-none-any.whl.
File metadata
- Download URL: sentinel_pii_sdk-0.1.0-py3-none-any.whl
- Upload date:
- Size: 13.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9aa772b41c9746f0fc4442da49a39bb78959b57d770b38cc594b119f315b2a61 |
| MD5 | afea282841028f5d4c3758d68971c7f5 |
| BLAKE2b-256 | 6a91fb5b761aa08ef5e8dc46e29e9939650e14ad19fc256715dedf2758abc22d |
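A downloaded file can be checked against the SHA256 digests listed above using only the standard library. This is a generic sketch; `sha256_of` and `matches` are hypothetical helper names, and the path you pass is wherever you saved the file.

```python
import hashlib


def sha256_of(path: str, block_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches(path: str, expected_hex: str) -> bool:
    """Compare case-insensitively, ignoring surrounding whitespace."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches("sentinel_pii_sdk-0.1.0.tar.gz", "<SHA256 from the table>")` returns `True` only when the file on disk is byte-for-byte what was published.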