
Presidio-research

This package provides evaluation and data-science capabilities for Presidio and PII detection models in general.

It also includes a fake data generator that creates synthetic sentences based on templates and fake PII.

Who should use it?

  • Anyone interested in developing or evaluating PII detection models, an existing Presidio instance or a Presidio PII recognizer.
  • Anyone interested in generating new data based on previous datasets or sentence templates (e.g., to increase the coverage of entity values) for Named Entity Recognition models.

Getting started

Using notebooks

The easiest way to get started is by reviewing the notebooks.

  • Notebook 1: Shows how to use the PII data generator.
  • Notebook 2: Shows a simple analysis of the PII dataset.
  • Notebook 3: Provides tools to split the dataset into train/test/validation sets while avoiding leakage due to the same pattern appearing in multiple folds (only applicable for synthetically generated data).
  • Notebook 4: Shows how to use the evaluation tools to evaluate how well Presidio detects PII. Note that this uses vanilla (out-of-the-box) Presidio, so the results are not very accurate.
  • Notebook 5: Shows how to configure Presidio to detect PII much more accurately and boost the F-score by ~30%.

Installation

Note: Presidio evaluator requires Python version 3.9 or higher.

From PyPI

conda create --name presidio python=3.9
conda activate presidio
pip install presidio-evaluator
python -m spacy download en_core_web_sm # for tokenization
python -m spacy download en_core_web_lg # for NER

From source

To install the package:

  1. Clone the repo
  2. Install all dependencies:
# Install package+dependencies
pip install poetry
poetry install --with=dev

# Download the spaCy pipeline used for tokenization
poetry run python -m spacy download en_core_web_sm

# To install with all additional NER dependencies (e.g. Flair, Stanza), run:
# poetry install --with='ner,dev'

# To use the default Presidio configuration, a spaCy model is required:
poetry run python -m spacy download en_core_web_lg

# Verify installation
pytest

Note that some dependencies (such as Flair and Stanza) are not automatically installed to reduce installation complexity.

What's in this package?

  1. Fake data generator for PII recognizers and NER models
  2. Data representation layer for data generation, modeling and analysis
  3. Multiple Model/Recognizer evaluation files (e.g. for Presidio, spaCy, Flair, Azure AI Language)
  4. Training and modeling code for multiple models
  5. Helper functions for results analysis

1. Data generation

See Data Generator README for more details.

The data generation process takes a file with templates, e.g. My name is {{name}}. It then creates new synthetic sentences by sampling templates and fake PII values. It also tokenizes the data and creates tags (in the IO, BIO, or BILUO scheme) and spans for the newly created samples.
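The template-filling idea can be sketched in plain Python. This is a toy illustration only; the template, entity values, and function below are invented for this sketch and are not the package's API (the real generator supports many entity types, providers, and locales):

```python
import random
import re

# Invented templates and fake values for illustration purposes only.
TEMPLATES = ["My name is {{name}}.", "Please call {{name}} at {{phone}}."]
FAKE_VALUES = {
    "name": ["Jane Doe", "John Smith"],
    "phone": ["555-0100", "555-0199"],
}

def fill_template(template, rng):
    """Replace each {{entity}} placeholder with a sampled fake value and
    return the sentence plus its (start, end, entity_type) spans."""
    spans, parts, cursor = [], [], 0
    for match in re.finditer(r"\{\{(\w+)\}\}", template):
        parts.append(template[cursor:match.start()])
        value = rng.choice(FAKE_VALUES[match.group(1)])
        start = sum(len(p) for p in parts)  # character offset of the value
        parts.append(value)
        spans.append((start, start + len(value), match.group(1)))
        cursor = match.end()
    parts.append(template[cursor:])
    return "".join(parts), spans

sentence, spans = fill_template(TEMPLATES[0], random.Random(42))
```

The spans recorded during filling are what later make it possible to emit token-level tags without any manual annotation.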

Once data is generated, it can be split into train/test/validation sets while ensuring that each template only exists in one set. See this notebook for more details.
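The key property of such a split is that all samples generated from the same template land in the same fold. A minimal sketch of that idea, with an invented helper function (the package ships a ready-made splitter; see Notebook 3):

```python
import random

def split_by_template(samples, ratios=(0.7, 0.2, 0.1), seed=0):
    """samples: list of (template_id, sentence) pairs.
    Returns (train, test, validation) lists such that no template_id
    appears in more than one fold, preventing template leakage."""
    templates = sorted({t for t, _ in samples})
    random.Random(seed).shuffle(templates)
    n = len(templates)
    cut1 = int(n * ratios[0])
    cut2 = cut1 + int(n * ratios[1])
    # Assign each template to exactly one fold.
    fold_of = {t: 0 for t in templates[:cut1]}
    fold_of.update({t: 1 for t in templates[cut1:cut2]})
    fold_of.update({t: 2 for t in templates[cut2:]})
    buckets = ([], [], [])
    for t, sent in samples:
        buckets[fold_of[t]].append((t, sent))
    return buckets
```

Splitting by sample instead of by template would let near-duplicate sentences (same template, different PII values) appear in both train and test, inflating the measured scores.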

2. Data representation

To standardize the process, we use specific data objects that hold all the information needed for generating, analyzing, modeling and evaluating data and models. Specifically, see data_objects.py.
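As a rough mental model, each sample bundles the raw text, its entity spans, and the token-level tags. The dataclasses below are an illustrative stand-in, not the exact classes from data_objects.py:

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative stand-in for the fields an InputSample-like object holds;
# the real classes in data_objects.py carry additional functionality
# (tokenization, tag-scheme conversion, format export, etc.).
@dataclass
class Span:
    entity_type: str
    entity_value: str
    start_position: int
    end_position: int

@dataclass
class Sample:
    full_text: str
    spans: List[Span] = field(default_factory=list)
    tokens: List[str] = field(default_factory=list)
    tags: List[str] = field(default_factory=list)  # e.g. BIO labels, aligned with tokens

sample = Sample(
    full_text="My name is Jane Doe",
    spans=[Span("PERSON", "Jane Doe", 11, 19)],
    tokens=["My", "name", "is", "Jane", "Doe"],
    tags=["O", "O", "O", "B-PERSON", "I-PERSON"],
)
```

Keeping spans and tags together in one object is what lets the same dataset be exported to CoNLL, spaCy, Flair, or JSON, as shown below.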

The standardized structure, List[InputSample], can be translated into different formats:

  • CoNLL

    • To CoNLL:

      from presidio_evaluator import InputSample
      dataset = InputSample.read_dataset_json("data/synth_dataset_v2.json")
      conll = InputSample.create_conll_dataset(dataset)
      conll.to_csv("dataset.csv", sep="\t")
      
    • From CoNLL

      from pathlib import Path
      from presidio_evaluator.dataset_formatters import CONLL2003Formatter
      # Read from a folder containing CoNLL-2003 files
      conll_formatter = CONLL2003Formatter(files_path=Path("data/conll2003").resolve())
      train_samples = conll_formatter.to_input_samples(fold="train")
      
  • spaCy v3

    from presidio_evaluator import InputSample
    dataset = InputSample.read_dataset_json("data/synth_dataset_v2.json")
    InputSample.create_spacy_dataset(dataset, output_path="dataset.spacy")
    
  • Flair

    from presidio_evaluator import InputSample
    dataset = InputSample.read_dataset_json("data/synth_dataset_v2.json")
    flair = InputSample.create_flair_dataset(dataset)
    
  • json

    from presidio_evaluator import InputSample
    dataset = InputSample.read_dataset_json("data/synth_dataset_v2.json")
    InputSample.to_json(dataset, output_file="dataset_json")
    

3. PII models evaluation

The presidio-evaluator framework allows you to evaluate Presidio as a system, a NER model, or a specific PII recognizer for precision, recall, and error analysis. See Notebook 5 for an example.
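At its simplest, token-level precision and recall reduce to counting agreements between predicted and gold tags. The function below is a self-contained sketch of that computation (the framework itself reports richer, per-entity metrics and error analysis):

```python
def precision_recall(gold, predicted, positive="PII"):
    """Token-level precision/recall for one positive class,
    given aligned gold and predicted tag sequences."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, predicted) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, predicted) if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

gold      = ["O", "PII", "PII", "O", "PII"]
predicted = ["O", "PII", "O",   "O", "PII"]
p, r = precision_recall(gold, predicted)  # p = 1.0, r = 2/3
```

In PII detection, recall is often weighted more heavily than precision, since a missed entity (false negative) leaks data while a false positive merely over-redacts.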

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Copyright notice:

Fake Name Generator identities by the Fake Name Generator are licensed under a Creative Commons Attribution-Share Alike 3.0 United States License. Fake Name Generator and the Fake Name Generator logo are trademarks of Corban Works, LLC.
