Easy, lightweight PII scanning for network requests and logs—powered by Microsoft Presidio and en_spacy_pii_fast

piiscan

A simple Python library that detects PII in text. Optimized for network traffic and structured data (e.g. JSON objects, HTTP requests).

Powered by beki/en_spacy_pii_fast and Microsoft Presidio.

Installation

This package is published on PyPI and can be installed using pip.

pip3 install piiscan

Usage

The module exposes two functions:

  • scan() returns a list of Presidio RecognizerResult objects.
  • annotate() returns a human-readable list of all fields detected and the corresponding entity type.

Scanning structured data for PII

import piiscan

original_value = '''
{
    "name": "John Doe",
    "email": "john@doe.com",
    "address": "123 Front Street, San Francisco, CA",
    "phone": "+1 (415) 123-4567"
}
'''

detected_pii = piiscan.scan(original_value)

print(detected_pii)

Expected result printed in console:

[
    type: EMAIL_ADDRESS, start: 41, end: 53, score: 1.0, 
    type: PERSON, start: 16, end: 24, score: 0.85, 
    type: LOCATION, start: 72, end: 107, score: 0.85, 
    type: URL, start: 46, end: 53, score: 0.5, 
    type: PHONE_NUMBER, start: 127, end: 141, score: 0.4
]
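
Each result's start and end values are character offsets into the original string, so the matched text can be recovered by slicing. The following is a minimal sketch of that step, not part of piiscan itself. It assumes results expose entity_type, start, end, and score attributes, as Presidio's RecognizerResult does. The Result class is a stand-in so the snippet runs without the library installed, extract_matches is a hypothetical helper, and the 0.5 threshold is an arbitrary choice for illustration.

```python
from typing import NamedTuple


class Result(NamedTuple):
    """Stand-in with the same attribute names as Presidio's RecognizerResult."""
    entity_type: str
    start: int
    end: int
    score: float


original_value = '''
{
    "name": "John Doe",
    "email": "john@doe.com",
    "address": "123 Front Street, San Francisco, CA",
    "phone": "+1 (415) 123-4567"
}
'''

# Values copied from the expected output above (assumed, not produced here).
detected_pii = [
    Result("EMAIL_ADDRESS", 41, 53, 1.0),
    Result("PERSON", 16, 24, 0.85),
    Result("LOCATION", 72, 107, 0.85),
    Result("URL", 46, 53, 0.5),
    Result("PHONE_NUMBER", 127, 141, 0.4),
]


def extract_matches(text, results, min_score=0.0):
    """Hypothetical helper: pair each entity type with the text it covers,
    skipping results whose score is below min_score."""
    return [
        (r.entity_type, text[r.start:r.end])
        for r in results
        if r.score >= min_score
    ]


matches = extract_matches(original_value, detected_pii, min_score=0.5)
for entity_type, value in matches:
    print(f"{entity_type}: {value}")
```

With real scan() output, the same slicing should apply as long as the text passed to the helper is exactly the string that was scanned.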

Annotating data to display discovered PII

import piiscan

original_value = '''
{
    "name": "John Doe",
    "email": "john@doe.com",
    "address": "123 Front Street, San Francisco, CA",
    "phone": "+1 (415) 123-4567"
}
'''

detected_pii = piiscan.scan(original_value)

annotated = piiscan.annotate(original_value, detected_pii)

print(annotated)

Expected result printed in console (cleaned for readability):

{  
    "name": "('John Doe', 'PERSON')",  
    "email": "('john@doe.com', 'EMAIL_ADDRESS'), ('doe.com', 'URL')",   
    "address": "('123 Front Street, San Francisco, CA', 'LOCATION')", 
    "phone": "('+1 (415) 123-4567', 'PHONE_NUMBER')"
}
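
In the output above, the email field carries two labels because the URL match ('doe.com') sits inside the EMAIL_ADDRESS match. If one label per span is preferred, a post-processing step can drop any result that is fully contained in a higher-scoring one. The sketch below is an assumption about how that could be done, not something piiscan provides: drop_nested is a hypothetical helper, and Result is a stand-in mirroring the attribute names of Presidio's RecognizerResult.

```python
from typing import NamedTuple


class Result(NamedTuple):
    """Stand-in with the same attribute names as Presidio's RecognizerResult."""
    entity_type: str
    start: int
    end: int
    score: float


def drop_nested(results):
    """Hypothetical helper: keep a result unless another result with a
    strictly higher score fully contains its span."""
    kept = []
    for r in results:
        contained = any(
            o is not r
            and o.start <= r.start
            and r.end <= o.end
            and o.score > r.score
            for o in results
        )
        if not contained:
            kept.append(r)
    return kept


# Values copied from the scan() example output (assumed, not produced here).
detected_pii = [
    Result("EMAIL_ADDRESS", 41, 53, 1.0),
    Result("PERSON", 16, 24, 0.85),
    Result("LOCATION", 72, 107, 0.85),
    Result("URL", 46, 53, 0.5),
    Result("PHONE_NUMBER", 127, 141, 0.4),
]

filtered = drop_nested(detected_pii)
print([r.entity_type for r in filtered])
```

Here the URL result is removed because its span (46 to 53) lies within the EMAIL_ADDRESS span (41 to 53) and has a lower score. The remaining results keep their original order.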

Credits

The basis for this project is derived from Pixie Labs' blog post on using NLP to anonymize sensitive PII in structured data, and its corresponding Hugging Face Space.

This project uses Benjamin Kilimnik's en_spacy_pii_fast spaCy model for NER. It is trained using a dataset derived from Privy's synthetic payload generator. The entire model is less than 7 MiB!

This library was designed for use in resource-constrained environments, hence the small en_spacy_pii_fast model (under 7 MiB). If you need higher accuracy, the original blog post describes alternatives.

This library also makes use of Microsoft Presidio and spaCy.
