Regex modules for the extraction of PII from text chunks
Pii Extractor plugin: regex
This repository builds a Python package that installs a pii-extract-base
plugin to perform PII detection on text data based on regular expressions
(with optional context). The name of the plugin entry point is
piisa-detectors-regex.
The PII Tasks in the package are structured by language and country, since many PII elements are language- and/or country-dependent.
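To check whether the plugin is visible in a given Python environment, one option is to look up the entry point name mentioned above among the installed distributions. The following is a minimal sketch that uses only the standard library; the helper name find_entry_points is hypothetical, and the sketch scans every entry point group because the group used by pii-extract-base is not stated here.

```python
from importlib import metadata

# Entry point name given in this README
PLUGIN_NAME = "piisa-detectors-regex"


def find_entry_points(name=PLUGIN_NAME):
    """Return every installed entry point with the given name, in any group.

    Hypothetical helper for illustration; it is not part of the package.
    """
    found = []
    for dist in metadata.distributions():
        for ep in dist.entry_points:
            if ep.name == name:
                found.append(ep)
    return found


if __name__ == "__main__":
    # Prints nothing if the plugin is not installed in this environment
    for ep in find_entry_points():
        print(f"{ep.group}: {ep.name} -> {ep.value}")
```

An empty result suggests the package is not installed in the active environment.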
Requirements
The package
- needs at least Python 3.8
- needs the pii-data and the pii-extract-base base packages
- uses the regex package (instead of the standard re package in the core Python library)
- uses the python-stdnum package to validate many identifiers (and the python-phonenumbers package to validate phone numbers)
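The general approach implied by these requirements (a regular expression proposes candidates, a validator discards false positives, and an optional context word narrows the match) can be sketched as follows. This is an illustration only, not the package's code: it uses the standard re module rather than regex, a hand-written Luhn checksum stands in for the validators that python-stdnum provides, and the names CARD_PATTERN, luhn_valid and find_card_numbers are hypothetical.

```python
import re

# Candidate pattern: 13 to 19 digits, optionally separated by spaces or hyphens
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(number):
    """Luhn checksum; a stand-in for a library validator."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_numbers(text, context=None, width=32):
    """Return (offset, match) pairs for candidates that pass validation.

    If context is given, that word must also appear within `width`
    characters around the match (case-insensitive).
    """
    found = []
    for m in CARD_PATTERN.finditer(text):
        if not luhn_valid(m.group()):
            continue
        if context is not None:
            start = max(0, m.start() - width)
            window = text[start:m.end() + width].lower()
            if context.lower() not in window:
                continue
        found.append((m.start(), m.group()))
    return found


text = "My card is 4111 1111 1111 1111, not 4111 1111 1111 1112."
print(find_card_numbers(text))                  # only the first number passes the checksum
print(find_card_numbers(text, context="card"))  # same match, context word is nearby
```

Validation after matching keeps the regular expression simple, since the pattern only needs to describe the shape of an identifier and not its checksum rules.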
Usage
The package does not have any user-facing entry points, and it is used automatically by the PIISA framework.
Building
The provided Makefile can be used to process the package:
- make pkg will build the Python package, creating a file that can be installed with pip
- make unit will launch all unit tests (using pytest, so pytest must be available)
- make install will install the package in a Python virtualenv. The virtualenv will be chosen as, in this order:
  - the one defined in the VENV environment variable, if it is defined
  - if there is a virtualenv activated in the shell, it will be used
  - otherwise, a default is chosen as /opt/venv/bigscience (it will be created if it does not exist)
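The virtualenv selection order described above can be expressed as a small function. This is only a sketch of the stated precedence, not the Makefile's actual logic: the name choose_venv is hypothetical, an activated virtualenv is assumed to be detected through the VIRTUAL_ENV variable that activation scripts set, and an empty value is treated as undefined.

```python
import os

# Default location named in this README
DEFAULT_VENV = "/opt/venv/bigscience"


def choose_venv(environ=None):
    """Pick the target virtualenv following the documented order.

    Hypothetical helper; `environ` defaults to the process environment.
    """
    env = os.environ if environ is None else environ
    # 1. An explicit VENV variable wins (empty counts as undefined: assumption)
    if env.get("VENV"):
        return env["VENV"]
    # 2. Otherwise, use a virtualenv already activated in the shell
    if env.get("VIRTUAL_ENV"):
        return env["VIRTUAL_ENV"]
    # 3. Otherwise, fall back to the default location
    return DEFAULT_VENV


print(choose_venv({"VENV": "/tmp/myenv", "VIRTUAL_ENV": "/tmp/active"}))
print(choose_venv({}))
```

Passing a plain dictionary instead of reading os.environ directly makes the precedence easy to check without changing the shell environment.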
Contributing
To add a new PII processing task, please see the contributing instructions.
Download files
Source Distribution
Hashes for pii-extract-plg-regex-0.5.1.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 1ff882fa5a36c39633aa93c38731869acecd5f71bbecc46c856e2219b07d1d85
MD5 | 526e698972703cf3240043bd5eadb52c
BLAKE2b-256 | ab8168ee28e00787824e53f0f00c81eec7d5cdd3ef160889dd7524d1f3632112