A Python library to perform NER on structured data and generate PII with Faker
Nerpii
Nerpii is a Python library for performing Named Entity Recognition (NER) on structured datasets and synthesizing Personally Identifiable Information (PII).
NER is performed with Presidio and with an NLP model available on HuggingFace, while PII generation is based on Faker.
Installation
You can install Nerpii by using pip:
pip install nerpii
Quickstart
Named Entity Recognition
You can import the NamedEntityRecognizer using
from nerpii.named_entity_recognizer import NamedEntityRecognizer
You can create a recognizer by passing either the path to a CSV file or a pandas DataFrame as the parameter:
recognizer = NamedEntityRecognizer('./csv_path.csv')
Please note that if the dataset contains a column holding full names (first and last name together, e.g. John Smith), you must split it into two columns called first_name and last_name with the function split_name() before creating a recognizer.
from nerpii.named_entity_recognizer import split_name
df = split_name('./csv_path.csv', name_of_column_to_split)
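To make the expected result of this step concrete, here is a minimal plain-Python sketch of the transformation. It is not nerpii's implementation: the helpers `split_full_name` and `split_name_column` are hypothetical, rows are represented as plain dictionaries rather than a DataFrame, and splitting on the first space is a simplifying assumption.

```python
def split_full_name(full_name):
    """Split a full name on the first space (a simplifying assumption)."""
    first, _, last = full_name.strip().partition(" ")
    return first, last.strip()


def split_name_column(rows, column):
    """Replace `column` in each row with first_name and last_name."""
    result = []
    for row in rows:
        # Copy every field except the column being split.
        new_row = {key: value for key, value in row.items() if key != column}
        new_row["first_name"], new_row["last_name"] = split_full_name(row[column])
        result.append(new_row)
    return result


rows = [{"name": "John Smith", "city": "Boston"}]
print(split_name_column(rows, "name"))
# [{'city': 'Boston', 'first_name': 'John', 'last_name': 'Smith'}]
```

Names with more than two parts keep everything after the first space in last_name under this assumption, so real data may need a more careful rule.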
The NamedEntityRecognizer class contains three methods to perform NER on a dataset:
recognizer.assign_entities_with_presidio()
which assigns the entities supported by Presidio (listed in the Presidio documentation)
recognizer.assign_entities_manually()
which manually assigns the ZIPCODE and CREDIT_CARD_NUMBER entities
recognizer.assign_organization_entity_with_model()
which assigns the ORGANIZATION entity using an NLP model available on HuggingFace.
To perform NER, run these three methods in the order shown below:
recognizer.assign_entities_with_presidio()
recognizer.assign_entities_manually()
recognizer.assign_organization_entity_with_model()
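How the manual step decides on ZIPCODE and CREDIT_CARD_NUMBER is not described here. As an illustration only, the sketch below shows one way a rule-based check of this kind could work: a US-style zip code pattern and the Luhn checksum, applied to the values of a column. Every name in it (`looks_like_zipcode`, `passes_luhn`, `guess_entity`, the 0.9 threshold) is hypothetical and should not be read as nerpii's actual logic.

```python
import re

# Assumption: US-style zip codes, 5 digits with an optional 4-digit suffix.
ZIP_RE = re.compile(r"\d{5}(-\d{4})?")


def looks_like_zipcode(value):
    return ZIP_RE.fullmatch(str(value).strip()) is not None


def passes_luhn(value):
    """Return True if the digits in `value` satisfy the Luhn checksum."""
    digits = [int(ch) for ch in str(value) if ch.isdigit()]
    if not 13 <= len(digits) <= 19:  # typical card number lengths
        return False
    total = 0
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:  # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def guess_entity(values, threshold=0.9):
    """Return (label, score) if enough values match a rule, else None."""
    values = [v for v in values if str(v).strip()]
    if not values:
        return None
    checks = (("CREDIT_CARD_NUMBER", passes_luhn), ("ZIPCODE", looks_like_zipcode))
    for label, check in checks:
        score = sum(check(v) for v in values) / len(values)
        if score >= threshold:
            return label, score
    return None


print(guess_entity(["02139", "94105-1234", "10001"]))
print(guess_entity(["4111 1111 1111 1111"]))
```

The score here is simply the fraction of non-empty values that pass the rule, which is one plausible way to attach a confidence to a column-level label.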
The final output is a dictionary whose keys are the column names and whose values hold the entity assigned to each column together with its confidence score.
This dictionary can be accessed using
recognizer.dict_global_entities
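The exact layout of the values is not spelled out above, so the following sketch assumes a hypothetical shape in which each column maps to a dictionary with `entity` and `confidence_score` keys. Under that assumption, it shows how you might keep only the columns whose entity was assigned with reasonable confidence; `confident_columns` and the sample data are illustrative, not part of nerpii.

```python
# Hypothetical example of the dictionary's shape; the real keys may differ.
dict_global_entities = {
    "first_name": {"entity": "PERSON", "confidence_score": 0.95},
    "zip": {"entity": "ZIPCODE", "confidence_score": 0.80},
    "notes": {"entity": "ORGANIZATION", "confidence_score": 0.35},
}


def confident_columns(entities, threshold=0.5):
    """Map column name to entity for columns at or above the threshold."""
    return {
        column: info["entity"]
        for column, info in entities.items()
        if info["confidence_score"] >= threshold
    }


print(confident_columns(dict_global_entities))
# {'first_name': 'PERSON', 'zip': 'ZIPCODE'}
```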
PII generation
After performing NER on a dataset, you can generate new PII using Faker.
You can import the FakerGenerator using
from nerpii.faker_generator import FakerGenerator
You can create a generator using
generator = FakerGenerator(dataset, recognizer.dict_global_entities)
To generate new PII you can run
generator.get_faker_generation()
The method above can generate the following PII:
- address
- phone number
- email address
- first name
- last name
- city
- state
- url
- zipcode
- credit card
- ssn
- country
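Each item in the list above has a corresponding method in Faker's default (en_US) locale. The sketch below maps a set of entity labels to those method names and generates replacement values for a column. The method names are real Faker calls; the entity labels on the left and the helper `synthesize_column` are assumptions made for illustration and may not match what nerpii uses internally.

```python
# Hypothetical entity labels mapped to real Faker method names (en_US locale).
ENTITY_TO_FAKER_METHOD = {
    "ADDRESS": "address",
    "PHONE_NUMBER": "phone_number",
    "EMAIL_ADDRESS": "email",
    "FIRST_NAME": "first_name",
    "LAST_NAME": "last_name",
    "CITY": "city",
    "STATE": "state",
    "URL": "url",
    "ZIPCODE": "zipcode",
    "CREDIT_CARD_NUMBER": "credit_card_number",
    "SSN": "ssn",
    "COUNTRY": "country",
}


def synthesize_column(entity, n_rows, fake=None):
    """Return n_rows fake values for `entity`, or None if it is unsupported."""
    method_name = ENTITY_TO_FAKER_METHOD.get(entity)
    if method_name is None:
        return None
    if fake is None:
        # Imported lazily so the mapping can be inspected without Faker installed.
        from faker import Faker  # third-party: pip install Faker

        fake = Faker()
    method = getattr(fake, method_name)
    return [method() for _ in range(n_rows)]
```

Calling `synthesize_column("EMAIL_ADDRESS", 3)` without a `fake` argument creates a default `Faker()` instance, so the Faker package must be installed; passing your own instance lets you control the locale or seed.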
Examples
You can find a notebook example in the notebook folder.
File details
Details for the file nerpii-0.1.3.tar.gz.
File metadata
- Download URL: nerpii-0.1.3.tar.gz
- Upload date:
- Size: 8.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.2.2 CPython/3.9.16 Linux/5.15.0-1034-azure
File hashes
Algorithm | Hash digest
---|---
SHA256 | 1e21cceb77e721a8348fe59d48f194b522f01e7f4ee68b0933a2e3b963c5aba6
MD5 | de5f5053d0df7e0f7d126fc6c5422bbb
BLAKE2b-256 | 14b4884fd14b5763ca2f5f966f2facc662ddf5a30efe920df266a23ff35ca638
File details
Details for the file nerpii-0.1.3-py3-none-any.whl.
File metadata
- Download URL: nerpii-0.1.3-py3-none-any.whl
- Upload date:
- Size: 8.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.2.2 CPython/3.9.16 Linux/5.15.0-1034-azure
File hashes
Algorithm | Hash digest
---|---
SHA256 | 22899daf1de8c140089ea0d11f3460ef065c5dd4a4a927dc3721ebddc95af0c1
MD5 | 6b99dae9945800a1da7d828e0e0f796f
BLAKE2b-256 | 4137f26bdb58d0e220d1eff65322d27e66749e5fc14a7026f1f36bf234477487