A package to mask PII in text using transformers

ai4privacy python module 🛡️

A Python package for state-of-the-art PII detection and masking using advanced transformer models.


Features

  • Protect Mode: Anonymize text by replacing detected PII with placeholders.
  • Observe Mode: Get statistics and a detailed "privacy mask" of found PII without altering the original text.
  • Multiple Models: Built-in support for:
    • English-specific detection.
    • Multilingual detection.
    • Categorical detection (e.g., GIVENNAME, EMAIL, CITY).
  • Tunable Sensitivity: An adjustable score_threshold to trade off sensitivity (catching more PII) against false positives.
  • Verbose & Developer Modes: Rich outputs for detailed analysis and debugging.

Installation

pip install ai4privacy

Quick Start

The simplest way to use the library is to call the protect function, which masks PII with placeholders.

from ai4privacy import protect

text = "Email me at developers@ai4privacy.com or call me at +41763223001."
masked_text = protect(text)

print(masked_text)
# Expected Output: Email me at [PII_1] or call me at [PII_2]
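To make the placeholder idea concrete, here is a minimal sketch of numbered placeholder substitution in plain Python. It does not call ai4privacy and is not the library's implementation: the helper name apply_placeholders is hypothetical, and the spans are located with str.find instead of a model.

```python
def apply_placeholders(text, spans):
    """Replace each (start, end) span with a numbered [PII_n] placeholder."""
    pieces = []
    cursor = 0
    for index, (start, end) in enumerate(sorted(spans), start=1):
        pieces.append(text[cursor:start])  # untouched text before the span
        pieces.append(f"[PII_{index}]")    # numbered placeholder
        cursor = end
    pieces.append(text[cursor:])           # tail after the last span
    return "".join(pieces)


text = "Email me at developers@ai4privacy.com or call me at +41763223001."

# Hand-picked spans standing in for model detections (not real model output).
email = "developers@ai4privacy.com"
phone = "+41763223001"
spans = [
    (text.find(email), text.find(email) + len(email)),
    (text.find(phone), text.find(phone) + len(phone)),
]

masked = apply_placeholders(text, spans)
print(masked)
```

In this sketch, any character outside a span (such as the final period) is left untouched; whether the real model includes adjacent punctuation in a detection depends on its predictions.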

Advanced Usage

Using Different Models

You can easily switch between models using the multilingual and classify_pii flags.

from ai4privacy import protect

text = "Je m'appelle Pierre et j'habite à Paris."

# Use the multilingual model for non-English text
masked_multilingual = protect(text, multilingual=True)
print(f"Multilingual: {masked_multilingual}")
# Expected Output: Multilingual: Je m'appelle [PII_1] et j'habite à [PII_2]

# Use the categorical model to see the PII types
details = protect(text, classify_pii=True, verbose=True)
print(f"Categorical Labels: {[r['label'] for r in details['replacements']]}")
# Expected Output: Categorical Labels: ['GIVENNAME', 'CITY']

Observe Mode

To analyze text without changing it, use observe(). It returns a dictionary containing statistics and the privacy_mask, a detailed list of all PII entities found.

from ai4privacy import observe
import json

text = "My name is Alice and I live in Berlin."
report = observe(text, classify_pii=True)

print(json.dumps(report, indent=2))

Expected Output:

{
  "num_texts_processed": 1,
  "num_texts_with_pii": 1,
  "pii_entity_counts": {
    "GIVENNAME": 1,
    "CITY": 1
  },
  "total_pii_entities_found": 2,
  "privacy_mask": [
    {
      "label": "GIVENNAME",
      "start": 11,
      "end": 16,
      "activation": 0.98,
      "value": "Alice"
    },
    {
      "label": "CITY",
      "start": 31,
      "end": 37,
      "activation": 0.99,
      "value": "Berlin"
    }
  ]
}
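A report shaped like the one above can be post-processed with ordinary Python. The sketch below counts entities per label and optionally drops low-activation entries. The helper summarize_mask and its min_activation parameter are hypothetical, and the sample entries only copy the field names shown in the report; they are not real model output.

```python
from collections import Counter


def summarize_mask(privacy_mask, min_activation=0.0):
    """Count entities per label, skipping entries below min_activation."""
    kept = [e for e in privacy_mask if e["activation"] >= min_activation]
    return dict(Counter(e["label"] for e in kept))


# Sample entries reusing the field names from the report above;
# the values are illustrative, not real model output.
sample_mask = [
    {"label": "GIVENNAME", "activation": 0.98, "value": "Alice"},
    {"label": "CITY", "activation": 0.99, "value": "Berlin"},
]

all_counts = summarize_mask(sample_mask)
confident_counts = summarize_mask(sample_mask, min_activation=0.985)
print(all_counts)
print(confident_counts)
```

Filtering after the fact like this is separate from score_threshold, which is applied by the library during detection.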

Verbose and Developer Modes

Set verbose=True to get a dictionary containing the original text, masked text, and replacement details. For deep debugging, developer_verbose=True adds a token-by-token breakdown of the model's predictions.

from ai4privacy import protect

text = "Senden Sie es an Eva Schmidt."
details = protect(text, classify_pii=True, verbose=True)

print(details['replacements'])
# Expected Output: [{'label': 'GIVENNAME', 'start': 17, 'end': 20, ...}, {'label': 'SURNAME', 'start': 21, 'end': 28, ...}]
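The start and end fields can be used to recover the matched substring from the original text. The sketch below assumes zero-based offsets with an exclusive end (Python slice semantics), which is an assumption and not something the package documentation states. The helper spans_to_values is hypothetical, and the entries are built with str.find instead of a model.

```python
def spans_to_values(text, replacements):
    """Return (label, substring) pairs using each entry's start/end offsets."""
    return [(r["label"], text[r["start"]:r["end"]]) for r in replacements]


text = "Senden Sie es an Eva Schmidt."

# Illustrative entries; offsets are computed with str.find, not by a model.
replacements = []
for label, value in [("GIVENNAME", "Eva"), ("SURNAME", "Schmidt")]:
    start = text.find(value)
    replacements.append({"label": label, "start": start, "end": start + len(value)})

pairs = spans_to_values(text, replacements)
print(pairs)
```

A check like this is a quick way to confirm that reported offsets line up with the text you passed in.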

Adjusting Sensitivity

The score_threshold (default: 0.01) controls how confident the model must be to flag a token as PII.

  • A lower value increases sensitivity (finds more PII, but may have more false positives).
  • A higher value increases precision (detections are more likely correct, but may miss some PII).

from ai4privacy import protect

text = "Maybe this is a name, maybe not. Contact John."

# High precision (less likely to flag "Maybe")
masked_high_prec = protect(text, score_threshold=0.5) 
print(f"High Precision: {masked_high_prec}")
# Expected Output: High Precision: Maybe this is a name, maybe not. Contact [PII_1]

# High sensitivity (more likely to flag "Maybe" if the model is unsure)
masked_high_sens = protect(text, score_threshold=0.001)
print(f"High Sensitivity: {masked_high_sens}")
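The effect of the threshold can be illustrated without running a model. The sketch below filters hand-made (token, score) pairs against a threshold. The scores are invented for illustration, the helper flag_tokens is hypothetical, and the use of >= for the comparison is an assumption about how a threshold is typically applied, not a description of the library's internals.

```python
def flag_tokens(scored_tokens, score_threshold):
    """Keep tokens whose PII score meets or exceeds the threshold."""
    return [token for token, score in scored_tokens if score >= score_threshold]


# Invented scores for illustration only (not real model output).
scored_tokens = [("Maybe", 0.004), ("name", 0.0005), ("John", 0.97)]

high_precision = flag_tokens(scored_tokens, score_threshold=0.5)
high_sensitivity = flag_tokens(scored_tokens, score_threshold=0.001)

print(high_precision)
print(high_sensitivity)
```

With these made-up scores, the stricter threshold keeps only the confident detection, while the looser one also flags the borderline token.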

Disclaimer 📢

Ai4Privacy is trained on the world's largest open-source privacy dataset. For production use, please evaluate results carefully on your own datasets. For assistance, contact us at our website https://ai4privacy.com or email support@ai4privacy.com.
