pii-codex

Development Status
- 1 - Planning
Intended Audience
- Science/Research
License
- OSI Approved :: BSD License
- Other/Proprietary License
Project description

PII Detection, Categorization, and Severity Assessment

The PII Codex project was built as a core part of an ongoing research effort in Personally Identifiable Information (PII) detection and risk assessment (to be publicly released later in 2023). The research needed not only to detect PII in text, but also to identify its severity and its associated categorizations in cybersecurity research and policy documentation, and to give others in similar research efforts a way to reproduce or extend the work. PII Codex combines systematic research, conceptual frameworks, third-party open source software, and cloud service provider integrations. The categorizations are directly influenced by the research of Milne et al. (2016), while the ranking places category severities on the scale provided by Schwartz and Solove (2012): Non-Identifiable, Semi-Identifiable, and Identifiable.

The primary PII Codex analysis and adapter functions return AnalysisResult or AnalysisResultSet objects, which list the detections, their severities, the mean risk score for each string processed, and summary statistics on the analysis. The final outputs do not contain the original texts; instead, they record where each detection was found, should the end user need this information in their analysis.
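To make the severity scale and the per-string mean risk score concrete, here is a minimal sketch. It is not the pii-codex implementation: the `RiskLevel` enum and `mean_risk_score` helper are hypothetical names, and only the value 3 for Identifiable is taken from the sample output further down; the values 1 and 2, and the handling of strings without detections, are assumptions.

```python
from enum import IntEnum
from statistics import mean


class RiskLevel(IntEnum):
    """Hypothetical stand-in for the three-level severity scale."""

    NON_IDENTIFIABLE = 1   # assumed value
    SEMI_IDENTIFIABLE = 2  # assumed value
    IDENTIFIABLE = 3       # matches "risk_level": 3 in the sample output


def mean_risk_score(levels):
    """Average the severity levels detected in one string.

    Assumption: a string with no detections is scored as NON_IDENTIFIABLE.
    """
    if not levels:
        return float(RiskLevel.NON_IDENTIFIABLE)
    return mean(int(level) for level in levels)


# Two Identifiable detections and one Semi-Identifiable detection in one string
score = mean_risk_score(
    [RiskLevel.IDENTIFIABLE, RiskLevel.IDENTIFIABLE, RiskLevel.SEMI_IDENTIFIABLE]
)
print(round(score, 2))
```

Under these assumptions, a string whose detections are all Identifiable scores 3, and mixed detections land between the levels.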

Statement of Need

The general knowledge base of identifiable data, the usage restrictions of this data, and the associated policies surrounding it have shifted drastically over the years. The tech industry has had to adjust to many policy changes regarding the tracking of individuals, the usage of data from online profiles and platforms, and the right to be forgotten entirely from a service or platform (GDPR). While the shift has provided data protections around the globe, the majority of technology users continue to have little to no control over their personal information with third-party data consumers (Trepte, 2020).

Knowing whether identifiable data types exist in a data set makes it possible to detect them before the data is shared, which helps prevent accidental disclosure. This software package additionally presents sanitized strings, gives the reasons why a token was considered to be PII, and allows the results to be published.
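As an illustration of what sanitizing a string from detection offsets involves, the sketch below replaces character spans with a placeholder. The `redact_spans` helper is hypothetical and not part of pii-codex; it assumes half-open `[start, end)` offsets that do not overlap.

```python
def redact_spans(text, spans, placeholder="<REDACTED>"):
    """Replace each (start, end) character span in text with a placeholder.

    Hypothetical helper, not part of pii-codex. Spans are assumed to be
    half-open [start, end) offsets that do not overlap.
    """
    # Work right to left so earlier offsets stay valid after each replacement.
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + placeholder + text[end:]
    return text


original = "Hi! My name is Ann"
print(redact_spans(original, [(15, 18)]))  # Hi! My name is <REDACTED>
```

Processing spans from the end of the string backwards avoids having to recompute offsets after each replacement changes the string length.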

Potential Usages

Potential usages include sanitizing of dataset strings (e.g. a collection of social media posts), presenting results to users for software examining their interactions (e.g. UX research on user-awareness in cybersecurity applications), etc.

Running Locally

This project uses Poetry. To run this project, install Poetry and follow the instructions under /docs/LOCAL_SETUP.md.

Note: This project has only been tested on Ubuntu and macOS.

Importing

Before adding pii-codex to your project, download the spaCy en_core_web_lg model:

python3 -m spacy download en_core_web_lg

The repository releases are hosted on PyPI.

Using pip (requires the latest pip version and Python 3.9 or 3.10):

pip install --upgrade pip
pip install pii-codex
pip install "pii-codex[detections]"

Using Poetry:

poetry update
poetry add pii-codex
poetry add "pii-codex[detections]"
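The prerequisites above (Python 3.9 or 3.10 and the en_core_web_lg model) can be checked before installing with a short script. This is a sketch using only the standard library; the function names are hypothetical and not part of pii-codex.

```python
import importlib.util
import sys


def python_version_supported(version=None):
    """Return True for Python 3.9 or 3.10, the versions named above."""
    major, minor = (version or sys.version_info)[:2]
    return (major, minor) in {(3, 9), (3, 10)}


def spacy_model_installed(model="en_core_web_lg"):
    """Return True if the spaCy model package can be found.

    spaCy models install as regular Python packages, so looking up the
    package spec is enough; nothing is imported or loaded here.
    """
    return importlib.util.find_spec(model) is not None


if __name__ == "__main__":
    print("Python version supported:", python_version_supported())
    print("en_core_web_lg installed:", spacy_model_installed())
```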

For those using Google Colab, check out the example notebook.

Usage

Sample Input / Output

The built-in analyzer uses Microsoft Presidio. Feed in a collection of strings with analyze_collection() or just a single string with analyze_item(). Those analyzing a collection of strings will also be provided with statistics calculated on the risk scores for detected items.

from pii_codex.services.analysis_service import PIIAnalysisService

# Analyze a collection of strings; the result also carries summary statistics
results = PIIAnalysisService().analyze_collection(
    texts=["your collection of strings"],
    language_code="en",
    collection_name="Data Set Label",  # Optional label for the collection
    collection_type="SAMPLE",  # Defaults to POPULATION; used in stats calculations
)

Instead of a simple text array, you can pass in a data param (a dataframe) with a text column and a metadata column. This is intended for those analyzing social media posts. The metadata types currently supported are URL, LOCATION, and SCREEN_NAME.

Sample output (results object converted to dict from notebook):

{
    "collection_name": "PII Collection 1",
    "collection_type": "POPULATION",
    "analyses": [
        {
            "analysis": [
                {
                    "pii_type_detected": "PERSON",
                    "risk_level": 3,
                    "risk_level_definition": "Identifiable",
                    "cluster_membership_type": "Financial Information",
                    "hipaa_category": "Protected Health Information",
                    "dhs_category": "Linkable",
                    "nist_category": "Directly PII",
                    "entity_type": "PERSON",
                    "score": 0.85,
                    "start": 21,
                    "end": 24,
                }
            ],
            "index": 0,
            "risk_score_mean": 3,
            "sanitized_text": "Hi! My name is <REDACTED>",
        },
        ...
    ],
    "detection_count": 5,
    "risk_scores": [3, 2.6666666666666665, 1, 2, 1],
    "risk_score_mean": 1.9333333333333333,
    "risk_score_mode": 1,
    "risk_score_median": 2,
    "risk_score_standard_deviation": 0.8273115763993905,
    "risk_score_variance": 0.6844444444444444,
    "detected_pii_types": {
        "LOCATION",
        "EMAIL_ADDRESS",
        "URL",
        "PHONE_NUMBER",
        "PERSON",
    },
    "detected_pii_type_frequencies": {
        "PERSON": 1,
        "EMAIL_ADDRESS": 1,
        "PHONE_NUMBER": 1,
        "URL": 1,
        "LOCATION": 1,
    },
}
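The summary statistics in the sample output can be reproduced from its risk_scores list with Python's standard statistics module. The sketch below is not the library's code; the `summarize` helper is hypothetical, and it assumes that a POPULATION collection uses the population formulas (dividing by n) while a SAMPLE collection uses the sample formulas (dividing by n - 1).

```python
import statistics

risk_scores = [3, 2.6666666666666665, 1, 2, 1]  # copied from the sample output


def summarize(scores, collection_type="POPULATION"):
    """Summary statistics over per-string risk scores.

    Assumption: POPULATION maps to the population formulas (divide by n)
    and SAMPLE to the sample formulas (divide by n - 1).
    """
    population = collection_type == "POPULATION"
    stdev = statistics.pstdev if population else statistics.stdev
    variance = statistics.pvariance if population else statistics.variance
    return {
        "risk_score_mean": statistics.mean(scores),
        "risk_score_mode": statistics.mode(scores),
        "risk_score_median": statistics.median(scores),
        "risk_score_standard_deviation": stdev(scores),
        "risk_score_variance": variance(scores),
    }


print(summarize(risk_scores))
print(summarize(risk_scores, collection_type="SAMPLE"))
```

With the population formulas, the results agree with the sample output above up to floating-point rounding. The SAMPLE call yields a larger variance and standard deviation for the same scores.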

Docs

For more information on usage, check out the respective documentation for guidance on using PII-Codex.

Topic	Document	Description
PII Type Mappings	PII Mappings	Overview of how to perform mappings between PII types and how to review stored PII types.
PII Detections and Analysis	PII Detection and Analysis	Overview of how to detect and analyze strings.
Local Repo Setup	Local Repo Setup	Instructions for local repository setup
Example Analysis	Example Analysis Notebook	Notebook with example analysis using MSFT Presidio

Attributions

This project benefited greatly from a number of PII research works: Milne et al. (2016) for the definition of the types and categories; Schwartz and Solove (2012) for the severity levels of Non-Identifiable, Semi-Identifiable, and Identifiable; and the documentation by NIST, DHS (2012), and HIPAA. Special thanks go to all the open source projects and frameworks that made the setup and structuring of this project much easier, such as Poetry, Microsoft Presidio, spaCy (2017), Jupyter, and several others.

Release history

Version	Release date
0.4.6	Jun 18, 2023
0.4.5	Jun 18, 2023
0.4.4 (this version)	Apr 25, 2023
0.4.3	Dec 28, 2022
0.4.2	Dec 27, 2022
0.4.1	Dec 20, 2022
0.4.0	Dec 16, 2022
0.3.0	Nov 16, 2022
0.2.3	Oct 24, 2022
0.2.2	Oct 23, 2022
0.2.1	Oct 16, 2022
0.2.0	Oct 16, 2022
0.1.0	Oct 9, 2022
0.0.7	Oct 9, 2022
0.0.6	Oct 9, 2022

Download files

Source Distribution

pii_codex-0.4.4.tar.gz (26.6 kB)

Uploaded Apr 25, 2023 Source

Built Distribution

pii_codex-0.4.4-py3-none-any.whl (32.3 kB)

Uploaded Apr 25, 2023 Python 3

Hashes for pii_codex-0.4.4.tar.gz
Algorithm	Hash digest
SHA256	`816f2cdd8ef2409a8de2b0d49b06d741eae10d8cc3a9a131dd2e93a5e7ea163d`
MD5	`71ff40b12442104419370c114ced2968`
BLAKE2b-256	`36d0b7a22a768d80f064fcaa6ab5f576825ce765bdfbae11bb932611cce59c32`

Hashes for pii_codex-0.4.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1bcb34481f325bcbdd561e684bdaf068814cb5b697fc39947b586552841a111e`
MD5	`dd736d40a75bac232bcd1d8eba7b59a7`
BLAKE2b-256	`10c7e4d91a79d2d620c16a96d63b5cbec2bf6f804277e3cbe538b66a028b46b0`