Project description

Fork of Presidio-research, modifying some utility functions

This package provides data science utilities for developing new recognizers for Presidio. It supports evaluating the entire system as well as specific PII recognizers or PII detection models. In addition, it contains a fake data generator which creates synthetic sentences from templates and fake PII values.

Who should use it?

Anyone interested in developing or evaluating PII detection models, an existing Presidio instance or a Presidio PII recognizer.
Anyone interested in generating new data based on previous datasets or sentence templates (e.g. to increase the coverage of entity values) for Named Entity Recognition models.

Getting started

To install the package:

Clone the repo
Install all dependencies, preferably in a virtual environment:

# Create conda env (optional)
conda create --name presidio python=3.9
conda activate presidio

# Install package+dependencies
pip install -r requirements.txt
python setup.py install

# Download a spaCy model used by presidio-analyzer
python -m spacy download en_core_web_lg

# Verify installation
pytest

Note that some dependencies (such as Flair and Stanza) are not automatically installed to reduce installation complexity.

What's in this package?

Fake data generator for PII recognizers and NER models
Data representation layer for data generation, modeling and analysis
Multiple model/recognizer evaluation files (e.g. for spaCy, Flair, CRF, Presidio API, Presidio Analyzer Python package, specific Presidio recognizers)
Training and modeling code for multiple models
Helper functions for results analysis

1. Data generation

See Data Generator README for more details.

The data generation process receives a file with templates, e.g. My name is {{name}}. It then creates new synthetic sentences by sampling templates and PII values. Finally, it tokenizes the data and creates tags (IO, BIO, or BILUO) and spans for the newly created samples.
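The process described above can be illustrated with a minimal, self-contained sketch. It does not use the package's own generator: the `FAKE_VALUES` table, the `fill_template` and `bio_tags` helpers, and the whitespace tokenizer are all hypothetical stand-ins chosen for illustration.

```python
import random
import re

# Hypothetical fake PII values; a real generator would draw from a larger pool.
FAKE_VALUES = {
    "name": ["Alice Smith", "Bob Jones"],
    "city": ["Paris", "New York"],
}

PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def fill_template(template, rng):
    """Replace each {{entity}} placeholder and record character spans."""
    text = ""
    spans = []  # (start, end, entity_type) offsets in the generated sentence
    cursor = 0
    for match in PLACEHOLDER.finditer(template):
        text += template[cursor:match.start()]
        entity = match.group(1)
        value = rng.choice(FAKE_VALUES[entity])
        spans.append((len(text), len(text) + len(value), entity.upper()))
        text += value
        cursor = match.end()
    text += template[cursor:]
    return text, spans


def bio_tags(text, spans):
    """Split on whitespace and assign a BIO tag to every token."""
    tokens, tags = [], []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, label in spans:
            if start <= match.start() < end:
                # The first token of a span gets B-, later tokens get I-.
                tag = ("B-" if match.start() == start else "I-") + label
                break
        tokens.append(match.group())
        tags.append(tag)
    return tokens, tags


rng = random.Random(0)
text, spans = fill_template("My name is {{name}} and I live in {{city}}", rng)
tokens, tags = bio_tags(text, spans)
print(text)
print(list(zip(tokens, tags)))
```

Recording the spans while the sentence is being assembled avoids having to search for the inserted values afterwards, which would be ambiguous if a value also occurred elsewhere in the template text.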

For information on data generation/augmentation, see the data generator README.
For an example of running the generation process, see this notebook.
For an understanding of the underlying fake PII data used, see this exploratory data analysis notebook.

Once data is generated, it can be split into train/test/validation sets while ensuring that each template appears in only one set. See this notebook for more details.
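A template-aware split can be sketched as follows. This is an illustration under stated assumptions rather than the package's implementation: the `split_by_template` function and the `template_id` field name are hypothetical, and samples are plain dictionaries instead of `InputSample` objects.

```python
import random
from collections import defaultdict


def split_by_template(samples, ratios, seed=0):
    """Assign whole templates to sets so that no template spans two sets.

    samples: list of dicts with a "template_id" key (hypothetical field name).
    ratios: dict such as {"train": 0.7, "test": 0.2, "validation": 0.1}.
    """
    by_template = defaultdict(list)
    for sample in samples:
        by_template[sample["template_id"]].append(sample)

    # Shuffle template ids, not samples, so a template stays in one piece.
    template_ids = sorted(by_template)
    random.Random(seed).shuffle(template_ids)

    splits = {name: [] for name in ratios}
    names = list(ratios)
    boundary = 0.0
    start = 0
    for i, name in enumerate(names):
        boundary += ratios[name]
        # The last set takes whatever remains, which absorbs rounding.
        if i == len(names) - 1:
            end = len(template_ids)
        else:
            end = round(boundary * len(template_ids))
        for template_id in template_ids[start:end]:
            splits[name].extend(by_template[template_id])
        start = end
    return splits


samples = [
    {"template_id": t, "text": f"sentence {i} from template {t}"}
    for t in range(10)
    for i in range(3)
]
splits = split_by_template(samples, {"train": 0.7, "test": 0.2, "validation": 0.1})
print({name: len(part) for name, part in splits.items()})
```

Splitting on templates instead of individual sentences keeps a model from being tested on sentence patterns it has already seen during training.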

2. Data representation

In order to standardize the process, we use specific data objects that hold all the information needed for generating, analyzing, modeling and evaluating data and models. Specifically, see data_objects.py.

The standardized structure, List[InputSample], can be translated into different formats:

CONLL

from presidio_evaluator import InputSample
dataset = InputSample.read_dataset_json("data/synth_dataset_v2.json")
conll = InputSample.create_conll_dataset(dataset)
conll.to_csv("dataset.csv", sep="\t")

spaCy v3

from presidio_evaluator import InputSample
dataset = InputSample.read_dataset_json("data/synth_dataset_v2.json")
InputSample.create_spacy_dataset(dataset, output_path="dataset.spacy")

Flair

from presidio_evaluator import InputSample
dataset = InputSample.read_dataset_json("data/synth_dataset_v2.json")
flair = InputSample.create_flair_dataset(dataset)

JSON

from presidio_evaluator import InputSample
dataset = InputSample.read_dataset_json("data/synth_dataset_v2.json")
InputSample.to_json(dataset, output_file="dataset_json")

3. PII models evaluation

The presidio-evaluator framework allows you to evaluate Presidio as a system, a NER model, or a specific PII recognizer in terms of precision and recall, and to perform error analysis.
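To make the metrics concrete, here is a minimal sketch of token-level precision and recall per entity type, together with a list of mismatches for error analysis. It is not the framework's evaluator: the `token_level_scores` function is hypothetical and assumes IO-style tags that are already aligned token by token.

```python
from collections import Counter


def token_level_scores(gold_tags, predicted_tags):
    """Compute per-entity precision and recall from aligned token tags.

    Both inputs are lists of IO-style labels such as "PERSON" or "O".
    """
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted tags must be aligned")

    true_pos, false_pos, false_neg = Counter(), Counter(), Counter()
    errors = []  # (token index, gold, predicted), kept for error analysis
    for i, (gold, pred) in enumerate(zip(gold_tags, predicted_tags)):
        if gold == pred:
            if gold != "O":
                true_pos[gold] += 1
            continue
        errors.append((i, gold, pred))
        if pred != "O":
            false_pos[pred] += 1
        if gold != "O":
            false_neg[gold] += 1

    scores = {}
    for entity in set(true_pos) | set(false_pos) | set(false_neg):
        tp, fp, fn = true_pos[entity], false_pos[entity], false_neg[entity]
        scores[entity] = {
            # Report 0.0 when an entity was never predicted or never present.
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0,
        }
    return scores, errors


gold = ["O", "O", "PERSON", "PERSON", "O", "LOCATION"]
pred = ["O", "O", "PERSON", "O", "O", "PERSON"]
scores, errors = token_level_scores(gold, pred)
print(scores)
print(errors)
```

Keeping the mismatched tokens alongside the scores makes it possible to inspect which entity types are confused with each other, not only how often.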

Examples:

4. Training PII detection models

CRF

To train a vanilla CRF on a new dataset, see this notebook. To evaluate, see this notebook.

spaCy

To train a new spaCy model, first save the dataset in a spaCy format:

# dataset is a List[InputSample]
InputSample.create_spacy_dataset(dataset, output_path="dataset.spacy")

To evaluate, see this notebook.

Flair

To train Flair models, see this helper class or this snippet:

from presidio_evaluator.models import FlairTrainer
train_samples = "data/generated_train.json"
test_samples = "data/generated_test.json"
val_samples = "data/generated_validation.json"

trainer = FlairTrainer()
trainer.create_flair_corpus(train_samples, test_samples, val_samples)

corpus = trainer.read_corpus("")
trainer.train(corpus)

Note that the three JSON files are created using InputSample.to_json.

For more information

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Fake Name Generator identities by the Fake Name Generator are licensed under a Creative Commons Attribution-Share Alike 3.0 United States License. Fake Name Generator and the Fake Name Generator logo are trademarks of Corban Works, LLC.

Project details

Release history

0.0.71

Nov 3, 2022

0.0.70

Nov 3, 2022

0.0.69

Nov 3, 2022

0.0.67

Nov 3, 2022

0.0.66

Nov 3, 2022

0.0.65

Nov 3, 2022

0.0.64

Nov 3, 2022

0.0.63

Nov 3, 2022

0.0.62

Nov 1, 2022

0.0.61

Oct 19, 2022

0.0.60

Oct 19, 2022

0.0.59

Oct 19, 2022

0.0.58

Oct 19, 2022

0.0.57

Oct 19, 2022

0.0.56

Oct 19, 2022

0.0.55

Oct 19, 2022

0.0.54

Oct 19, 2022

0.0.53

Oct 19, 2022

0.0.52

Oct 19, 2022

0.0.51

Sep 28, 2022

0.0.50

Sep 28, 2022

0.0.49

Sep 28, 2022

0.0.48

Sep 24, 2022

0.0.47

Sep 24, 2022

0.0.46

Sep 24, 2022

0.0.45

Sep 24, 2022

0.0.44

Sep 23, 2022

0.0.43

Sep 23, 2022

0.0.42

Sep 23, 2022

0.0.41

Sep 23, 2022

0.0.40

Sep 23, 2022

0.0.39

Sep 23, 2022

0.0.38

Sep 22, 2022

0.0.37

Sep 22, 2022

0.0.36

Sep 22, 2022

0.0.35

Sep 22, 2022

0.0.34

Sep 22, 2022

0.0.33

Sep 22, 2022

0.0.32

Sep 22, 2022

0.0.31

Sep 22, 2022

0.0.30

Sep 22, 2022

0.0.29

Sep 22, 2022

0.0.28

Sep 22, 2022

0.0.27

Sep 20, 2022

0.0.26

Sep 20, 2022

0.0.25

Sep 20, 2022

0.0.24

Sep 20, 2022

0.0.23

Sep 20, 2022

0.0.22

Sep 20, 2022

0.0.21

Sep 19, 2022

0.0.20

Sep 18, 2022

0.0.19

Sep 18, 2022

0.0.18

Sep 18, 2022

0.0.17

Sep 17, 2022

0.0.16

Sep 17, 2022

0.0.15

Sep 15, 2022

0.0.14

Sep 3, 2022

0.0.13

Sep 3, 2022

0.0.12

Sep 3, 2022

0.0.11

Sep 2, 2022

0.0.10

Aug 31, 2022

0.0.9

Aug 31, 2022

0.0.8

Aug 31, 2022

0.0.7

Aug 31, 2022

0.0.6

Aug 31, 2022

0.0.5

Aug 31, 2022

0.0.4

Aug 31, 2022

0.0.3

Aug 31, 2022

0.0.2

Aug 31, 2022

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

privy-presidio-utils-0.0.48.tar.gz (444.9 kB view details)

Uploaded Sep 24, 2022 Source

Built Distribution

privy_presidio_utils-0.0.48-py3-none-any.whl (854.9 kB view details)

Uploaded Sep 24, 2022 Python 3

File details

Details for the file privy-presidio-utils-0.0.48.tar.gz.

File metadata

Download URL: privy-presidio-utils-0.0.48.tar.gz
Upload date: Sep 24, 2022
Size: 444.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.1 CPython/3.7.7

File hashes

Hashes for privy-presidio-utils-0.0.48.tar.gz
Algorithm	Hash digest
SHA256	`2a75e31cbb32af7dd6ba6759e5f092dc926d8c51104bb2a3ffa1c7ba5270a583`
MD5	`38201a986357ca49076d09828c773f49`
BLAKE2b-256	`e56e395cbae86fa44fe0154af100cd9b8f2da36e416904999ea69b9a4049b194`
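A downloaded archive can be checked against the SHA256 digest listed above with the standard library. The following is a minimal sketch: the helper names `sha256_of_file` and `matches_expected` are hypothetical, and the local file path in the usage comment is an assumption about where the archive was saved.

```python
import hashlib

# SHA256 digest listed above for privy-presidio-utils-0.0.48.tar.gz.
EXPECTED_SHA256 = "2a75e31cbb32af7dd6ba6759e5f092dc926d8c51104bb2a3ffa1c7ba5270a583"


def sha256_of_file(path, chunk_size=65536):
    """Hash a file in chunks so large archives are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Return True if the file's SHA256 digest equals the expected value."""
    return sha256_of_file(path) == expected_hex.lower()


# Example usage, assuming the archive is in the current directory:
# matches_expected("privy-presidio-utils-0.0.48.tar.gz", EXPECTED_SHA256)
```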

File details

Details for the file privy_presidio_utils-0.0.48-py3-none-any.whl.

File metadata

Download URL: privy_presidio_utils-0.0.48-py3-none-any.whl
Upload date: Sep 24, 2022
Size: 854.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.1 CPython/3.7.7

File hashes

Hashes for privy_presidio_utils-0.0.48-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a061ae7ca066153d06dae2048e65b9966fc40447358a98e971841a7776b284a9`
MD5	`608d035d597a74a0c15cf208d43a24eb`
BLAKE2b-256	`bd0667dbc39af79d8f707b7a57704076ed2267032475c30dacaabd3b4e528767`
