PII Codex - PII Detection, Categorization, and Severity Assessment

License: Hippocratic 3.0 | Python 3.9

The PII Codex project was built as a core part of an ongoing research effort in Personally Identifiable Information (PII) detection and risk assessment. The research needed not only to detect PII in unstructured text, but also to identify its severity and its associated categorizations in cybersecurity research and policy documentation, and to give others in similar research efforts a way to reproduce or extend the work. PII Codex combines systematic research, conceptual frameworks, third-party open source software, and cloud service provider integrations. The categorizations are directly influenced by the research of Milne et al. (2016), while the ranking assigns each category a severity on the scale provided by Schwartz and Solove (2012): Non-Identifiable, Semi-Identifiable, and Identifiable.
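As a rough illustration of the three-step severity scale, the sketch below models it as an integer enum and averages the levels of several detections. The class name RiskLevel and the helper label() are hypothetical and not part of pii_codex. The value 3 for Identifiable matches the sample output in this README; the values 1 and 2 for the other two levels are assumptions.

```python
from enum import IntEnum
from statistics import mean


class RiskLevel(IntEnum):
    """Hypothetical model of the Schwartz and Solove (2012) scale."""

    NON_IDENTIFIABLE = 1   # assumed numeric value
    SEMI_IDENTIFIABLE = 2  # assumed numeric value
    IDENTIFIABLE = 3       # matches "risk_level": 3 in the sample output


def label(level: RiskLevel) -> str:
    """Turn an enum member into a readable label, e.g. 'Semi-Identifiable'."""
    return level.name.replace("_", "-").title()


# Three made-up detections found in one string.
levels = [RiskLevel.IDENTIFIABLE, RiskLevel.IDENTIFIABLE, RiskLevel.SEMI_IDENTIFIABLE]

# Averaging the levels gives one severity figure for the whole string.
string_risk_mean = mean([int(level) for level in levels])

print([label(level) for level in levels], string_risk_mean)
```

An integer enum keeps the levels ordered and lets them be averaged directly, which is one plausible way to arrive at a per-string mean risk score.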

The primary PII Codex analysis and adapter functions return AnalysisResult or AnalysisResultSet objects that list detections, severities, mean risk scores for each string processed, and summary statistics on the analysis. The final outputs do not contain the original texts; instead, they record where each detection was found, should the end user need this information in their analysis.
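Because the outputs carry positions rather than the text itself, recovering or masking a detection means going back to the original string. The sketch below assumes that start and end are character offsets usable as a Python slice (end exclusive). The sample text, the detection dict, and the redact() helper are all hypothetical and only mirror the field names of the sample output.

```python
# Hypothetical input string; the analysis output would not contain it.
text = "Please reach out to Ada at the office."

# Hypothetical detection using the field names from the sample output.
detection = {"entity_type": "PERSON", "start": 20, "end": 23}


def redact(source: str, found: dict) -> str:
    """Replace the detected span with a placeholder such as <PERSON>."""
    placeholder = f"<{found['entity_type']}>"
    return source[: found["start"]] + placeholder + source[found["end"] :]


# Slice the original text to see what was detected.
detected_span = text[detection["start"] : detection["end"]]
redacted = redact(text, detection)

print(detected_span)
print(redacted)
```

When a string has several detections, applying replacements from the last offset to the first keeps the earlier offsets valid.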


Importing

Releases are hosted on PyPI.

Using pip:

pip3 install pii-codex

Using Poetry:

poetry add pii-codex

If you need the integrated Microsoft Presidio Analyzer, you'll also need to install the PII-Codex detections extra and the spaCy en_core_web_lg model:

poetry add pii-codex --extras="detections"
python3 -m spacy download en_core_web_lg

Usage

Sample Input / Output

The built-in analyzer uses Microsoft Presidio. Feed in a collection of strings with analyze_collection() or just a single string with analyze_item(). Those analyzing a collection of strings will also be provided with statistics calculated on the risk scores for detected items.

from pii_codex.services.analysis_service import PIIAnalysisService

# Analyze a list of strings; the result also carries summary statistics
results = PIIAnalysisService().analyze_collection(
    texts=["your collection of strings"],
    language_code="en",
    collection_name="Data Set Label",  # Optional labeling
    collection_type="SAMPLE",  # Defaults to POPULATION; used in stats calculations
)

For those analyzing social media posts, you can pass a data param (a dataframe with a text column and a metadata column) instead of a simple array of strings. The metadata types currently supported are URL, LOCATION, and SCREEN_NAME.

Sample output (results object converted to dict from notebook):

{
    "collection_name": "PII Collection 1",
    "collection_type": "POPULATION",
    "analyses": [
        {
            "analysis": [
                {
                    "pii_type_detected": "PERSON",
                    "risk_level": 3,
                    "risk_level_definition": "Identifiable",
                    "cluster_membership_type": "Financial Information",
                    "hipaa_category": "Protected Health Information",
                    "dhs_category": "Linkable",
                    "nist_category": "Directly PII",
                    "entity_type": "PERSON",
                    "score": 0.85,
                    "start": 21,
                    "end": 24,
                }
            ],
            "index": 0,
            "risk_score_mean": 3,
        },
        ...
    ],
    "detection_count": 5,
    "risk_scores": [3, 2.6666666666666665, 1, 2, 1],
    "risk_score_mean": 1.9333333333333333,
    "risk_score_mode": 1,
    "risk_score_median": 2,
    "risk_score_standard_deviation": 0.8273115763993905,
    "risk_score_variance": 0.6844444444444444,
    "detected_pii_types": [
        "LOCATION",
        "EMAIL_ADDRESS",
        "URL",
        "PHONE_NUMBER",
        "PERSON",
    ],
    "detected_pii_type_frequencies": {
        "PERSON": 1,
        "EMAIL_ADDRESS": 1,
        "PHONE_NUMBER": 1,
        "URL": 1,
        "LOCATION": 1,
    },
}
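The summary statistics in the sample output can be reproduced from its risk_scores list with Python's standard statistics module. The sketch below is not the library's implementation. The helper summarize_risk_scores() is hypothetical, and treating collection_type="SAMPLE" as a switch to sample (n - 1) variance and standard deviation is an assumption. The POPULATION branch does match the values shown above.

```python
import statistics

# Risk scores copied from the sample output above.
risk_scores = [3, 2.6666666666666665, 1, 2, 1]


def summarize_risk_scores(scores, collection_type="POPULATION"):
    """Hypothetical helper: summary statistics for a list of risk scores."""
    # Assumption: SAMPLE selects the n - 1 estimators, POPULATION the n ones.
    is_sample = collection_type == "SAMPLE"
    stdev = statistics.stdev if is_sample else statistics.pstdev
    variance = statistics.variance if is_sample else statistics.pvariance
    return {
        "risk_score_mean": statistics.mean(scores),
        "risk_score_mode": statistics.mode(scores),
        "risk_score_median": statistics.median(scores),
        "risk_score_standard_deviation": stdev(scores),
        "risk_score_variance": variance(scores),
    }


population_summary = summarize_risk_scores(risk_scores)
sample_summary = summarize_risk_scores(risk_scores, collection_type="SAMPLE")

print(population_summary)
```

For the same scores, the sample estimators give a larger variance than the population ones, so the collection type matters most for small collections.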

Docs

For more information on using PII-Codex, see the documentation listed here.

Topic | Document | Description
PII Type Mappings | PII Mappings | Overview of how to perform mappings between PII types and how to review stored PII types
PII Detections and Analysis | PII Detection and Analysis | Overview of how to detect and analyze strings
Local Repo Setup | Local Repo Setup | Instructions for local repository setup
Example Analysis | Example Analysis Notebook | Notebook with example analysis using Microsoft Presidio

Community Guidelines

Contributions

In general, you can contribute to this project by creating issues. You are also welcome to contribute to the source code directly by forking the project, modifying the code, and creating pull requests. Please use clear and organized descriptions when creating issues and pull requests and leverage the templates when possible.

Bug Report and Support Requests

You can use issues to report bugs and seek support. Before creating any new issues, please check for similar ones in the issue list first.

Attributions

This project benefited greatly from a number of PII research works, such as those by Milne et al. (2016) and Schwartz and Solove (2012), and from the documentation by NIST, DHS, and HIPAA. Special thanks go to the open source projects and frameworks that made the setup and structuring of this project much easier, including Poetry, Microsoft Presidio, spaCy, Jupyter, and several others.
