A Python library to perform NER on structured data and generate PII with Faker

Nerpii

Nerpii is a Python library that performs Named Entity Recognition (NER) on structured datasets and synthesizes Personally Identifiable Information (PII).

NER is performed with Presidio and with an NLP model available on HuggingFace, while PII generation is based on Faker.

Installation

You can install Nerpii by using pip:

pip install nerpii

Quickstart

Named Entity Recognition

You can import the NamedEntityRecognizer using

from nerpii.named_entity_recognizer import NamedEntityRecognizer

You can create a recognizer by passing either the path to a CSV file or a pandas DataFrame as the parameter:

recognizer = NamedEntityRecognizer('./csv_path.csv')

Note that if the dataset has a column containing full names (first and last name together, e.g. John Smith), you must split it into two columns named first_name and last_name with the function split_name() before creating a recognizer.

from nerpii.named_entity_recognizer import split_name

df = split_name('./csv_path.csv', name_of_column_to_split)
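To make the expected result of the split concrete, here is a minimal sketch in plain Python. It is not nerpii's implementation: the helper split_full_name, the sample data, and the rule "first token is the first name, the rest is the last name" are all assumptions made for illustration.

```python
import csv
import io


def split_full_name(rows, column):
    """Return new rows where `column` is replaced by first_name and last_name.

    Assumption: the first whitespace-separated token is the first name and
    everything after it is the last name.
    """
    result = []
    for row in rows:
        # Copy every field except the full-name column being split.
        new_row = {key: value for key, value in row.items() if key != column}
        parts = row[column].strip().split(maxsplit=1)
        new_row["first_name"] = parts[0] if parts else ""
        new_row["last_name"] = parts[1] if len(parts) > 1 else ""
        result.append(new_row)
    return result


# Tiny in-memory CSV standing in for './csv_path.csv' (made-up data).
sample_csv = io.StringIO("name,city\nJohn Smith,Boston\nMaria De Luca,Rome\n")
rows = list(csv.DictReader(sample_csv))
split_rows = split_full_name(rows, "name")
print(split_rows[0])
```

Multi-word surnames such as "De Luca" stay together under this rule, but names with several given names would be split incorrectly, which is one reason to check the resulting columns before running NER.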

The NamedEntityRecognizer class contains three methods to perform NER on a dataset:

recognizer.assign_entities_with_presidio()

which assigns the entities supported by Presidio (see the list of supported entities in the Presidio documentation)

recognizer.assign_entities_manually()

which manually assigns the ZIPCODE and CREDIT_CARD_NUMBER entities

recognizer.assign_organization_entity_with_model()

which assigns the ORGANIZATION entity using an NLP model available on HuggingFace.
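The manual step can be pictured as rule-based column tagging. The sketch below is an assumption about what such rules might look like, not nerpii's actual logic: it uses a US-style ZIP pattern and the Luhn checksum, and the names tag_column, looks_like_card, and the 0.8 threshold are hypothetical.

```python
import re

# US-style ZIP or ZIP+4 (assumed format; other countries need other patterns).
ZIP_PATTERN = re.compile(r"\d{5}(-\d{4})?")


def luhn_valid(number):
    """Check a string of ASCII digits with the Luhn checksum used by payment cards."""
    total = 0
    # Double every second digit, counting from the rightmost digit.
    for index, char in enumerate(reversed(number)):
        digit = int(char)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def looks_like_card(value):
    """True if the value is 13 to 19 digits (spaces or dashes allowed) and passes Luhn."""
    cleaned = re.sub(r"[ -]", "", value)
    return bool(re.fullmatch(r"[0-9]{13,19}", cleaned)) and luhn_valid(cleaned)


def tag_column(values, threshold=0.8):
    """Return (entity, score) for a column, or None if no rule matches often enough."""
    values = [v.strip() for v in values if v and v.strip()]
    if not values:
        return None
    rules = {
        "ZIPCODE": lambda v: bool(ZIP_PATTERN.fullmatch(v)),
        "CREDIT_CARD_NUMBER": looks_like_card,
    }
    for entity, rule in rules.items():
        # Score is the fraction of non-empty values that satisfy the rule.
        score = sum(rule(v) for v in values) / len(values)
        if score >= threshold:
            return entity, score
    return None


print(tag_column(["02134", "90210-1234", "10001"]))
print(tag_column(["4111 1111 1111 1111", "5500-0000-0000-0004"]))
```

The card numbers used here are well-known test numbers, not real accounts. Scoring by the fraction of matching values lets a column with a few malformed entries still be tagged.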

To perform NER, you have to run these three methods sequentially, as reported below:

recognizer.assign_entities_with_presidio()
recognizer.assign_entities_manually()
recognizer.assign_organization_entity_with_model()

The final output is a dictionary whose keys are column names and whose values hold the assigned entity and a confidence score.

This dictionary can be accessed using

recognizer.dict_global_entities
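As a rough illustration of working with such a dictionary, the sketch below filters columns by confidence. The exact layout of the values is not documented here, so the "entity" and "confidence_score" key names, the sample columns, and the scores are all assumptions; inspect your own recognizer.dict_global_entities and adjust accordingly.

```python
# Hypothetical stand-in for recognizer.dict_global_entities. The real value
# layout may differ; the key names and scores below are made up.
example_entities = {
    "first_name": {"entity": "PERSON", "confidence_score": 0.95},
    "zip": {"entity": "ZIPCODE", "confidence_score": 1.0},
    "notes": {"entity": None, "confidence_score": None},
}


def confident_columns(entities, min_score=0.9):
    """Return {column: entity} for columns tagged with at least min_score."""
    selected = {}
    for column, info in entities.items():
        entity = info.get("entity")
        score = info.get("confidence_score")
        # Skip columns with no entity assigned or a score below the cutoff.
        if entity is not None and score is not None and score >= min_score:
            selected[column] = entity
    return selected


print(confident_columns(example_entities))
```

A filter like this is one way to review which columns would be treated as PII before any values are replaced.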

PII generation

After performing NER on a dataset, you can generate new PII using Faker.

You can import the FakerGenerator using

from nerpii.faker_generator import FakerGenerator

You can create a generator using

generator = FakerGenerator(recognizer.dataset, recognizer.dict_global_entities)

To generate new PII you can run

generator.get_faker_generation()

The method above can generate the following kinds of PII:

  • address
  • phone number
  • email address
  • first name
  • last name
  • city
  • state
  • url
  • zipcode
  • credit card
  • ssn
  • country
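Each kind in the list above has a counterpart among Faker's standard provider methods. The sketch below shows one possible mapping and a small helper that fills a column with synthetic values. It is not how FakerGenerator is implemented: the dictionary keys, PROVIDER_BY_KIND, and synthesize_column are hypothetical names, and the helper expects a Faker instance to be passed in by the caller.

```python
# Maps each PII kind from the list above to a Faker provider method name.
# The keys are illustrative labels, not nerpii's internal entity names.
PROVIDER_BY_KIND = {
    "address": "address",
    "phone number": "phone_number",
    "email address": "email",
    "first name": "first_name",
    "last name": "last_name",
    "city": "city",
    "state": "state",
    "url": "url",
    "zipcode": "zipcode",
    "credit card": "credit_card_number",
    "ssn": "ssn",
    "country": "country",
}


def synthesize_column(fake, kind, count):
    """Generate `count` synthetic values of the given kind.

    `fake` is expected to be a faker.Faker instance (or any object that
    exposes the same provider methods).
    """
    provider = getattr(fake, PROVIDER_BY_KIND[kind])
    return [provider() for _ in range(count)]
```

With Faker installed, calling synthesize_column(Faker(), "city", 3) after `from faker import Faker` would return three random city names. Some providers, such as state and zipcode, depend on the locale; the default en_US locale provides both.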

Examples

You can find an example notebook in the notebook folder.
