A package to mask PII in text using transformers
ai4privacy python module 🛡️
A Python package for state-of-the-art PII detection and masking using advanced transformer models.
Features
- Protect Mode: Anonymize text by replacing detected PII with placeholders.
- Observe Mode: Get statistics and a detailed "privacy mask" of found PII without altering the original text.
- Multiple Models: Built-in support for:
  - English-specific detection.
  - Multilingual detection.
  - Categorical detection (e.g., GIVENNAME, EMAIL, CITY).
- Tunable Sensitivity: An adjustable score_threshold to balance detection accuracy with false positives.
- Verbose & Developer Modes: Rich outputs for detailed analysis and debugging.
Installation
pip install ai4privacy
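To confirm that the install succeeded before running the examples, a small check with the standard library can help. This is a minimal sketch: the helper name `installed_version` is an illustrative assumption, and it only reports what pip recorded, not whether the models load correctly.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints a version string if the package is installed, otherwise None.
print(installed_version("ai4privacy"))
```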
Quick Start
The simplest way to use the library is to call the protect function, which masks PII with placeholders.
from ai4privacy import protect
text = "Email me at developers@ai4privacy.com or call me at +41763223001."
masked_text = protect(text)
print(masked_text)
# Expected Output: Email me at [PII_1] or call me at [PII_2]
Advanced Usage
Using Different Models
You can easily switch between models using the multilingual and classify_pii flags.
from ai4privacy import protect
text = "Je m'appelle Pierre et j'habite à Paris."
# Use the multilingual model for non-English text
masked_multilingual = protect(text, multilingual=True)
print(f"Multilingual: {masked_multilingual}")
# Expected Output: Multilingual: Je m'appelle [PII_1] et j'habite à [PII_2]
# Use the categorical model to see the PII types
details = protect(text, classify_pii=True, verbose=True)
print(f"Categorical Labels: {[r['label'] for r in details['replacements']]}")
# Expected Output: Categorical Labels: ['GIVENNAME', 'CITY']
Observe Mode
To analyze text without changing it, use observe(). It returns a dictionary containing statistics and the privacy_mask, a detailed list of all PII entities found.
from ai4privacy import observe
import json
text = "My name is Alice and I live in Berlin."
report = observe(text, classify_pii=True)
print(json.dumps(report, indent=2))
{
  "num_texts_processed": 1,
  "num_texts_with_pii": 1,
  "pii_entity_counts": {
    "GIVENNAME": 1,
    "CITY": 1
  },
  "total_pii_entities_found": 2,
  "privacy_mask": [
    {
      "label": "GIVENNAME",
      "start": 11,
      "end": 16,
      "activation": 0.98,
      "value": "Alice"
    },
    {
      "label": "CITY",
      "start": 30,
      "end": 36,
      "activation": 0.99,
      "value": "Berlin"
    }
  ]
}
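As a minimal sketch of how such a report might be consumed, the snippet below works on a hard-coded copy of the sample dictionary above rather than calling the library, so it runs without downloading a model. The helper name `entities_above` and the 0.985 cut-off are illustrative assumptions, not part of the ai4privacy API.

```python
from collections import Counter

# Hard-coded copy of the sample report above, trimmed to the fields used here.
report = {
    "total_pii_entities_found": 2,
    "privacy_mask": [
        {"label": "GIVENNAME", "start": 11, "end": 16, "activation": 0.98, "value": "Alice"},
        {"label": "CITY", "start": 30, "end": 36, "activation": 0.99, "value": "Berlin"},
    ],
}


def entities_above(report, min_activation):
    """Return privacy_mask entries whose activation is at least min_activation."""
    return [e for e in report["privacy_mask"] if e["activation"] >= min_activation]


# Recompute the per-label counts from the mask itself.
label_counts = Counter(e["label"] for e in report["privacy_mask"])
print(dict(label_counts))

# Keep only the most confident detections (hypothetical cut-off).
confident = entities_above(report, 0.985)
print([e["value"] for e in confident])
```

Filtering on activation after the fact is one way to review borderline detections without re-running the model.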
Verbose and Developer Modes
Set verbose=True to get a dictionary containing the original text, masked text, and replacement details. For deep debugging, developer_verbose=True adds a token-by-token breakdown of the model's predictions.
from ai4privacy import protect
text = "Senden Sie es an Eva Schmidt."
details = protect(text, classify_pii=True, verbose=True)
print(details['replacements'])
# Expected Output: [{'label': 'GIVENNAME', 'start': 18, 'end': 22, ...}, {'label': 'SURNAME', 'start': 23, 'end': 30, ...}]
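To make the relationship between replacement spans and the masked text concrete, here is a rough sketch of how numbered placeholders could be rebuilt from a list of spans. It is an illustration only, not the library's implementation: the function `apply_placeholders` is hypothetical, and it assumes spans follow Python slice semantics (end exclusive) and do not overlap. Offsets are computed with `str.find` so the example does not depend on the library's own offset convention.

```python
def apply_placeholders(text, spans):
    """Replace each span in text with a numbered [PII_n] placeholder.

    Illustration only, not the library's implementation. Spans are dicts with
    'start' and 'end' keys (end exclusive) and must not overlap.
    """
    ordered = sorted(spans, key=lambda s: s["start"])
    pieces = []
    cursor = 0
    for n, span in enumerate(ordered, start=1):
        pieces.append(text[cursor:span["start"]])  # untouched text before the span
        pieces.append(f"[PII_{n}]")                # numbered placeholder
        cursor = span["end"]
    pieces.append(text[cursor:])                   # trailing text after the last span
    return "".join(pieces)


text = "Senden Sie es an Eva Schmidt."

# Build spans by locating each value in the text (assumed labels for illustration).
spans = []
for label, value in [("GIVENNAME", "Eva"), ("SURNAME", "Schmidt")]:
    start = text.find(value)
    spans.append({"label": label, "start": start, "end": start + len(value)})

print(apply_placeholders(text, spans))
# Prints: Senden Sie es an [PII_1] [PII_2].
```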
Adjusting Sensitivity
The score_threshold (default: 0.01) controls how confident the model must be to flag a token as PII.
- A lower value increases sensitivity (finds more PII, but may have more false positives).
- A higher value increases precision (detections are more likely correct, but may miss some PII).
from ai4privacy import protect
text = "Maybe this is a name, maybe not. Contact John."
# High precision (less likely to flag "Maybe")
masked_high_prec = protect(text, score_threshold=0.5)
print(f"High Precision: {masked_high_prec}")
# Expected Output: High Precision: Maybe this is a name, maybe not. Contact [PII_1]
# High sensitivity (more likely to flag "Maybe" if the model is unsure)
masked_high_sens = protect(text, score_threshold=0.001)
print(f"High Sensitivity: {masked_high_sens}")
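The trade-off can be pictured as a simple cut-off over per-span confidence scores. The sketch below uses made-up scores and a hypothetical `flagged` helper; real scores come from the model, and whether the library compares with `>=` or `>` is an assumption here.

```python
# Hypothetical per-span scores; real scores come from the model.
candidates = [
    {"value": "Maybe", "score": 0.04},
    {"value": "John", "score": 0.97},
]


def flagged(candidates, score_threshold):
    """Keep the values of candidates whose score meets the threshold."""
    return [c["value"] for c in candidates if c["score"] >= score_threshold]


print(flagged(candidates, 0.5))    # ['John']
print(flagged(candidates, 0.001))  # ['Maybe', 'John']
```

With the higher threshold only the confident detection survives; with the lower one the uncertain candidate is flagged as well.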
Disclaimer 📢
Ai4Privacy is trained on the world's largest open-source privacy dataset. For production use, please evaluate results carefully on your own datasets. For assistance, visit https://ai4privacy.com or email support@ai4privacy.com.
File details
Details for the file ai4privacy-0.5.0.tar.gz.
File metadata
- Download URL: ai4privacy-0.5.0.tar.gz
- Upload date:
- Size: 12.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b23498e8e8c87c4f106388d5fc93e4a7c3c3253f0f4d64a7b8fba57f0cad4c40 |
| MD5 | 71ff15c5345635616109bfdb3a4c0690 |
| BLAKE2b-256 | f2f0707d94ccc38b4e364eb31f6aaf70c84d4d3f9437e527ff9c83a3f0caa07b |
File details
Details for the file ai4privacy-0.5.0-py3-none-any.whl.
File metadata
- Download URL: ai4privacy-0.5.0-py3-none-any.whl
- Upload date:
- Size: 12.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ebeda689e43e39a293cbe8e94650196bdac545a9ec28502f978d68661b4db9ef |
| MD5 | 749e25f288e052457c727b545fdbe588 |
| BLAKE2b-256 | 5236c69787862218964f40606cce94daffcfa9aa269a35e24611cf1a4c39d316 |