A package to mask PII in text using transformers

ai4privacy python module 🛡️

A Python package for state-of-the-art PII detection and masking using advanced transformer models.


Features

  • Protect Mode: Anonymize text by replacing detected PII with placeholders.
  • Observe Mode: Get statistics and a detailed "privacy mask" of found PII without altering the original text.
  • Multiple Models: Built-in support for:
    • English-specific detection.
    • Multilingual detection.
    • Categorical detection (e.g., GIVENNAME, EMAIL, CITY).
  • Tunable Sensitivity: An adjustable score_threshold to trade off sensitivity (catching more PII) against false positives.
  • Verbose & Developer Modes: Rich outputs for detailed analysis and debugging.

Installation

pip install ai4privacy

Quick Start

The simplest way to use the library is to call the protect function, which masks PII with placeholders.

from ai4privacy import protect

text = "Email me at developers@ai4privacy.com or call me at +41763223001."
masked_text = protect(text)

print(masked_text)
# Expected Output: Email me at [PII_1] or call me at [PII_2]

Advanced Usage

Using Different Models

You can easily switch between models using the multilingual and classify_pii flags.

from ai4privacy import protect

text = "Je m'appelle Pierre et j'habite à Paris."

# Use the multilingual model for non-English text
masked_multilingual = protect(text, multilingual=True)
print(f"Multilingual: {masked_multilingual}")
# Expected Output: Multilingual: Je m'appelle [PII_1] et j'habite à [PII_2]

# Use the categorical model to see the PII types
details = protect(text, classify_pii=True, verbose=True)
print(f"Categorical Labels: {[r['label'] for r in details['replacements']]}")
# Expected Output: Categorical Labels: ['GIVENNAME', 'CITY']

Observe Mode

To analyze text without changing it, use observe(). It returns a dictionary containing statistics and the privacy_mask, a detailed list of all PII entities found.

from ai4privacy import observe
import json

text = "My name is Alice and I live in Berlin."
report = observe(text, classify_pii=True)

print(json.dumps(report, indent=2))

Expected Output:

{
  "num_texts_processed": 1,
  "num_texts_with_pii": 1,
  "pii_entity_counts": {
    "GIVENNAME": 1,
    "CITY": 1
  },
  "total_pii_entities_found": 2,
  "privacy_mask": [
    {
      "label": "GIVENNAME",
      "start": 11,
      "end": 16,
      "activation": 0.98,
      "value": "Alice"
    },
    {
      "label": "CITY",
      "start": 31,
      "end": 37,
      "activation": 0.99,
      "value": "Berlin"
    }
  ]
}
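
The privacy_mask entries can also be applied to the original text after the fact. The sketch below is not part of ai4privacy: apply_mask and make_entity are hypothetical helpers, and the code assumes that start and end are character offsets with an exclusive end. The sample entries are built by hand rather than returned by observe().

```python
def apply_mask(text, privacy_mask):
    """Replace each span in privacy_mask with a [LABEL] placeholder.

    Hypothetical helper, not part of ai4privacy. Assumes 'start' and 'end'
    are character offsets into text, with 'end' exclusive.
    """
    masked = text
    # Work right to left so earlier offsets stay valid after each replacement.
    for entity in sorted(privacy_mask, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{entity['label']}]"
        masked = masked[:entity["start"]] + placeholder + masked[entity["end"]:]
    return masked


def make_entity(text, value, label):
    """Build a hand-made mask entry by locating value in text."""
    start = text.index(value)
    return {"label": label, "start": start, "end": start + len(value), "value": value}


text = "My name is Alice and I live in Berlin."
sample_mask = [
    make_entity(text, "Alice", "GIVENNAME"),
    make_entity(text, "Berlin", "CITY"),
]
masked_text = apply_mask(text, sample_mask)
print(masked_text)
# My name is [GIVENNAME] and I live in [CITY].
```

Replacing from the last span backwards avoids recomputing offsets after each substitution, which matters because placeholders rarely have the same length as the text they replace.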

Verbose and Developer Modes

Set verbose=True to get a dictionary containing the original text, masked text, and replacement details. For deep debugging, developer_verbose=True adds a token-by-token breakdown of the model's predictions.

from ai4privacy import protect

text = "Senden Sie es an Eva Schmidt."
details = protect(text, classify_pii=True, verbose=True)

print(details['replacements'])
# Expected Output: [{'label': 'GIVENNAME', 'start': 17, 'end': 20, ...}, {'label': 'SURNAME', 'start': 21, 'end': 28, ...}]
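
When inspecting replacements, it can help to confirm that each reported span covers the text you expect. The helper below is a hypothetical sketch, not an ai4privacy function. It assumes each entry carries start and end character offsets with an exclusive end, and the sample entries are hand-written stand-ins for what protect(..., verbose=True) might return.

```python
def spans_with_text(text, replacements):
    """Pair each replacement's label with the substring its offsets select.

    Hypothetical helper, not part of ai4privacy. Spans that fall outside
    the text are reported as None instead of being silently truncated.
    """
    rows = []
    for r in replacements:
        in_bounds = 0 <= r["start"] < r["end"] <= len(text)
        covered = text[r["start"]:r["end"]] if in_bounds else None
        rows.append((r["label"], covered))
    return rows


text = "Senden Sie es an Eva Schmidt."
sample_replacements = [
    {"label": "GIVENNAME", "start": 17, "end": 20},
    {"label": "SURNAME", "start": 21, "end": 28},
    {"label": "SURNAME", "start": 25, "end": 40},  # deliberately out of bounds
]
rows = spans_with_text(text, sample_replacements)
for label, covered in rows:
    print(f"{label}: {covered!r}")
```

A None in the second column flags offsets that do not fit the input, which usually means the offsets were computed against a different string than the one being checked.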

Adjusting Sensitivity

The score_threshold (default: 0.01) controls how confident the model must be to flag a token as PII.

  • A lower value increases sensitivity (finds more PII, but may have more false positives).
  • A higher value increases precision (detections are more likely correct, but may miss some PII).

from ai4privacy import protect

text = "Maybe this is a name, maybe not. Contact John."

# High precision (less likely to flag "Maybe")
masked_high_prec = protect(text, score_threshold=0.5) 
print(f"High Precision: {masked_high_prec}")
# Expected Output: High Precision: Maybe this is a name, maybe not. Contact [PII_1]

# High sensitivity (more likely to flag "Maybe" if the model is unsure)
masked_high_sens = protect(text, score_threshold=0.001)
print(f"High Sensitivity: {masked_high_sens}")

Disclaimer 📢

Ai4Privacy is trained on the world's largest open-source privacy dataset. For production use, evaluate the results carefully on your own data. For assistance, visit https://ai4privacy.com or email support@ai4privacy.com.
