
Easy, lightweight PII scanning for network requests and logs—powered by Microsoft Presidio and en_spacy_pii_fast

piiscan

A simple Python library that detects PII in text. It is optimized for network traffic and structured data such as JSON objects and HTTP requests.

Powered by beki/en_spacy_pii_fast and Microsoft Presidio.

Installation

This package is published on PyPI and can be installed using pip.

pip3 install piiscan

Usage

The module exposes two functions:

  • scan() returns a list of Presidio RecognizerResult objects.
  • annotate() returns a human-readable list of all fields detected and the corresponding entity type.

Scanning structured data for PII

import piiscan

original_value = '''
{
    "name": "John Doe",
    "email": "john@doe.com",
    "address": "123 Front Street, San Francisco, CA",
    "phone": "+1 (415) 123-4567"
}
'''

detected_pii = piiscan.scan(original_value)

print(detected_pii)

Expected result printed in console:

[
    type: EMAIL_ADDRESS, start: 41, end: 53, score: 1.0, 
    type: PERSON, start: 16, end: 24, score: 0.85, 
    type: LOCATION, start: 72, end: 107, score: 0.85, 
    type: URL, start: 46, end: 53, score: 0.5, 
    type: PHONE_NUMBER, start: 127, end: 141, score: 0.4
]
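Each result carries character offsets into the scanned string, so the matched text can be recovered by slicing. The following is a minimal sketch of that idea, not part of piiscan: `Finding` is a hypothetical stand-in for Presidio's RecognizerResult (which exposes the same `entity_type`, `start`, `end`, and `score` attributes), and `matched_spans` is an illustrative helper name. The offsets are copied from the expected output above so the sketch runs without the library installed.

```python
from collections import namedtuple

# Hypothetical stand-in for Presidio's RecognizerResult, so this sketch runs
# without piiscan installed. Real results expose the same four attributes.
Finding = namedtuple("Finding", ["entity_type", "start", "end", "score"])

text = '''
{
    "name": "John Doe",
    "email": "john@doe.com",
    "address": "123 Front Street, San Francisco, CA",
    "phone": "+1 (415) 123-4567"
}
'''

# Values copied from the expected output shown above.
findings = [
    Finding("EMAIL_ADDRESS", 41, 53, 1.0),
    Finding("PERSON", 16, 24, 0.85),
    Finding("LOCATION", 72, 107, 0.85),
    Finding("URL", 46, 53, 0.5),
    Finding("PHONE_NUMBER", 127, 141, 0.4),
]


def matched_spans(text, results, min_score=0.0):
    """Return (entity_type, matched_text, score) tuples in document order.

    Results scoring below min_score are dropped.
    """
    kept = [r for r in results if r.score >= min_score]
    kept.sort(key=lambda r: (r.start, r.end))
    return [(r.entity_type, text[r.start:r.end], r.score) for r in kept]


# Print only the findings with a score of at least 0.5.
for entity_type, value, score in matched_spans(text, findings, min_score=0.5):
    print(f"{entity_type:<14} {score:.2f}  {value!r}")
```

With real data, the list returned by `piiscan.scan(original_value)` could be passed in place of `findings`. The threshold of 0.5 is an arbitrary choice for illustration.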

Annotating data to display discovered PII

import piiscan

original_value = '''
{
    "name": "John Doe",
    "email": "john@doe.com",
    "address": "123 Front Street, San Francisco, CA",
    "phone": "+1 (415) 123-4567"
}
'''

detected_pii = piiscan.scan(original_value)

annotated = piiscan.annotate(original_value, detected_pii)

print(annotated)

Expected result printed in console (cleaned for readability):

{  
    "name": "('John Doe', 'PERSON')",  
    "email": "('john@doe.com', 'EMAIL_ADDRESS'), ('doe.com', 'URL')",   
    "address": "('123 Front Street, San Francisco, CA', 'LOCATION')", 
    "phone": "('+1 (415) 123-4567', 'PHONE_NUMBER')"
}
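The annotated output above reports the email field twice, once as EMAIL_ADDRESS and once as URL, because the two detections overlap. If you want one label per span, for example before masking values in a log line, one possible approach is to keep the highest-scoring result among overlapping ones and then substitute placeholders from right to left. The sketch below is an assumption-laden illustration, not piiscan functionality: `Finding`, `drop_overlaps`, and `redact` are hypothetical names, and the sample offsets were worked out by hand rather than produced by the library.

```python
from collections import namedtuple

# Hypothetical stand-in for Presidio's RecognizerResult (same attribute names).
Finding = namedtuple("Finding", ["entity_type", "start", "end", "score"])


def drop_overlaps(results):
    """Keep the highest-scoring result among any results that overlap."""
    kept = []
    # Visit results from highest to lowest score; accept one only if it does
    # not overlap anything already accepted.
    for r in sorted(results, key=lambda r: (-r.score, r.start)):
        if all(r.end <= k.start or r.start >= k.end for k in kept):
            kept.append(r)
    return sorted(kept, key=lambda r: r.start)


def redact(text, results):
    """Replace each non-overlapping span with an <ENTITY_TYPE> placeholder."""
    out = text
    # Work right to left so earlier offsets stay valid after each replacement.
    for r in reversed(drop_overlaps(results)):
        out = out[:r.start] + f"<{r.entity_type}>" + out[r.end:]
    return out


# Hand-written sample: the URL finding lies inside the email finding.
sample = "contact: john@doe.com"
sample_findings = [
    Finding("EMAIL_ADDRESS", 9, 21, 1.0),
    Finding("URL", 14, 21, 0.5),
]

print(redact(sample, sample_findings))
```

Preferring the higher score is only one way to resolve overlaps; preferring the longer span is another reasonable rule.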

Credits

The basis for this project is derived from Pixie Labs' blog post on using NLP to anonymize sensitive PII in structured data, and its corresponding Hugging Face Space.

This project uses Benjamin Kilimnik's en_spacy_pii_fast spaCy model for NER. It is trained using a dataset derived from Privy's synthetic payload generator. The entire model is less than 7 MiB!

This library was designed for use in resource-constrained environments, hence the small en_spacy_pii_fast model. If you would like increased accuracy, the original blog post describes alternatives.

This library also makes use of Microsoft Presidio and spaCy.
