Regex modules for the extraction of PII from text chunks
Pii Extractor plugin: regex
This repository builds a Python package that installs a pii-extract-base
plugin to perform PII detection on text data based on regular expressions
(with optional context). The name of the plugin entry point is
piisa-detectors-regex.
The PII Tasks in the package are structured by language & country, since many of the PII elements are language- and/or country-dependent.
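As an illustration of what "regular expressions with optional context" can mean, here is a minimal sketch. It is not code from this package: the pattern, the context words, the window width and the function name are all hypothetical, and it uses the standard re module only to stay self-contained. A candidate match is kept only when a context word appears close to it.

```python
import re

# Hypothetical pattern: 8 digits followed by an uppercase letter.
# This is an illustrative shape, not the specification of any real identifier.
ID_PATTERN = re.compile(r"\b\d{8}[A-Z]\b")

# Hypothetical context words that make a match more likely to be real PII.
CONTEXT_PATTERN = re.compile(r"\b(?:id|identity|document)\b", re.IGNORECASE)


def find_with_context(text, width=30):
    """Return (matched_text, start_offset) for matches that have a
    context word within `width` characters before or after them."""
    found = []
    for m in ID_PATTERN.finditer(text):
        start = max(0, m.start() - width)
        # Text surrounding the match, excluding the match itself
        window = text[start:m.start()] + text[m.end():m.end() + width]
        if CONTEXT_PATTERN.search(window):
            found.append((m.group(), m.start()))
    return found
```

Requiring context trades recall for precision: a bare digit sequence is often not PII, so a nearby keyword is used as corroboration.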
Requirements
The package
- needs at least Python 3.8
- needs the pii-data and the pii-extract-base base packages
- uses the regex package (instead of the standard re package in the core Python library)
- uses the python-stdnum package to validate many identifiers (and the python-phonenumbers package to validate phone numbers)
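The idea of validating a regex candidate before reporting it can be sketched as follows. This is only an assumption-laden illustration: the package relies on python-stdnum for validation, whereas this sketch implements a Luhn checksum by hand and uses the standard re module so that it runs without extra dependencies. The pattern and function names are hypothetical.

```python
import re

# Hypothetical pattern: 13 to 19 digits, optionally separated by spaces or dashes
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_valid(number):
    """Check the Luhn checksum over the digits in `number`
    (non-digit separators are ignored)."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            # Double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_valid_cards(text):
    """Return regex candidates that also pass the checksum."""
    return [m.group() for m in CARD_PATTERN.finditer(text) if luhn_valid(m.group())]
```

The regular expression only proposes candidates; the checksum then discards digit runs that merely look like an identifier.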
Usage
The package does not have any user-facing entry points, and it is used automatically by the PIISA framework.
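To check whether the plugin is visible in the current environment, the installed entry points can be inspected with the standard library. This is a minimal sketch: the entry point group is not stated here, so the hypothetical helper below scans every group for the given name and also copes with the older dictionary-style return value of importlib.metadata.entry_points().

```python
from importlib.metadata import entry_points


def find_entry_points(name):
    """Return (group, name, value) for every installed entry point
    called `name`, whatever group it belongs to."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: selectable entry points
        candidates = eps.select(name=name)
    else:
        # Python 3.8 / 3.9: a dict mapping group name to entry points
        candidates = [ep for group in eps.values() for ep in group if ep.name == name]
    return [(ep.group, ep.name, ep.value) for ep in candidates]


if __name__ == "__main__":
    # An empty list means the plugin is not installed in this environment
    print(find_entry_points("piisa-detectors-regex"))
```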
Building
The provided Makefile can be used to process the package:
- make pkg will build the Python package, creating a file that can be installed with pip
- make unit will launch all unit tests (using pytest, so pytest must be available)
- make install will install the package in a Python virtualenv. The virtualenv will be chosen as, in this order:
  - the one defined in the VENV environment variable, if it is defined
  - if there is a virtualenv activated in the shell, it will be used
  - otherwise, a default is chosen as /opt/venv/bigscience (it will be created if it does not exist)
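The virtualenv selection order used by make install can be expressed as a short sketch. This is not the Makefile's code: the function name is hypothetical, and it assumes an activated virtualenv is detected through the VIRTUAL_ENV variable, which is what the standard activate scripts set.

```python
import os

DEFAULT_VENV = "/opt/venv/bigscience"


def choose_venv(environ=None):
    """Pick the target virtualenv path following the documented order.

    Note: an empty VENV or VIRTUAL_ENV value is treated as undefined here.
    """
    env = os.environ if environ is None else environ
    # 1. explicit VENV, 2. activated virtualenv (assumed: VIRTUAL_ENV), 3. default
    return env.get("VENV") or env.get("VIRTUAL_ENV") or DEFAULT_VENV
```

For example, choose_venv({"VIRTUAL_ENV": "/home/user/env"}) returns "/home/user/env", while choose_venv({}) falls back to the default path.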
Contributing
To add a new PII processing task, please see the contributing instructions.
Source Distribution
Hashes for pii-extract-plg-regex-0.4.1.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 79fcfa0b5d3db41cd2de8fe90908825cbba6d008ad266447b4fb7de1069a2aa7
MD5 | fa8ebf88cf25bf9f8081f55979ee0c81
BLAKE2b-256 | aeb9b4723a5ec6e16ce4cb3fb235d7f172ceeab6545e68fac579444d9b146ade