Regex modules for the extraction of PII from text chunks
Project description
Pii Extractor plugin: regex
This repository builds a Python package that installs a pii-extract-base
plugin that performs PII detection for text data based on regular expressions
(with optional context). The name of the plugin entry point is
piisa-detectors-regex.
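To check whether the plugin is visible in the current environment, the installed entry points can be searched by name. The following is a minimal sketch using only the standard library; the entry point group is not stated in this README, so the hypothetical helper find_entry_points scans every group instead of assuming one.

```python
from importlib.metadata import entry_points


def find_entry_points(name):
    """Return all installed entry points with the given name, in any group."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: the returned object supports selection by attribute
        return list(eps.select(name=name))
    # Python 3.8 / 3.9: a plain dict mapping group name -> entry points
    return [ep for group in eps.values() for ep in group if ep.name == name]


if __name__ == "__main__":
    # Prints one line per match; prints nothing if the plugin is not installed
    for ep in find_entry_points("piisa-detectors-regex"):
        print(ep.group, ep.name, ep.value)
```

An empty result simply means that no installed distribution declares an entry point with that name.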
The PII Tasks in the package are structured by language and country, since many of the PII elements are language- and/or country-dependent.
Requirements
The package
- needs at least Python 3.8
- needs the pii-data and the pii-extract-base base packages
- uses the regex package (instead of the standard re package in the core Python library)
- uses the python-stdnum package to validate many identifiers (and the python-phonenumbers package to validate phone numbers)
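The general approach (a regular expression to find candidates, a validation step to discard false positives, and an optional context check) can be sketched as follows. This is an illustration only, not the plugin's code: it uses the standard re module rather than regex, a hand-written Luhn checksum as a stand-in for the validators in python-stdnum, and hypothetical names (CARD_RE, CONTEXT_RE, find_cards) and context words chosen for the example.

```python
import re

# Candidate pattern: 13 to 19 digits, optionally separated by spaces or hyphens
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")
# Hypothetical context words that make a match more credible
CONTEXT_RE = re.compile(r"card|visa|mastercard", re.IGNORECASE)


def luhn_valid(number):
    """Luhn checksum (stand-in for a library validator)."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_cards(text, require_context=False, width=30):
    """Return (position, match) pairs for validated card-like numbers."""
    found = []
    for m in CARD_RE.finditer(text):
        if not luhn_valid(m.group()):
            continue  # matches the pattern but fails the checksum
        if require_context:
            # Look for a context word within `width` characters of the match
            window = text[max(0, m.start() - width):m.end() + width]
            if not CONTEXT_RE.search(window):
                continue
        found.append((m.start(), m.group()))
    return found
```

With require_context=False the checksum alone decides; with require_context=True a valid number is reported only if one of the context words appears nearby.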
Usage
The package does not have any user-facing entry points, and it is used automatically by the PIISA framework.
Building
The provided Makefile can be used to process the package:
- make pkg will build the Python package, creating a file that can be installed with pip
- make unit will launch all unit tests (using pytest, so pytest must be available)
- make install will install the package in a Python virtualenv. The virtualenv is chosen in this order:
  - the one defined in the VENV environment variable, if it is defined
  - if there is a virtualenv activated in the shell, it will be used
  - otherwise, a default is chosen as /opt/venv/bigscience (it will be created if it does not exist)
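The selection order for the virtualenv can be expressed as a small function. This is a sketch of the documented order, not a transcription of the Makefile: the helper name choose_venv is hypothetical, an empty VENV is treated as undefined, and an activated virtualenv is assumed to be detectable through the VIRTUAL_ENV variable that activation scripts export.

```python
import os

DEFAULT_VENV = "/opt/venv/bigscience"  # default path stated in this README


def choose_venv(environ=None):
    """Pick the target virtualenv following the order described above."""
    env = os.environ if environ is None else environ
    # 1. An explicitly defined VENV variable wins
    if env.get("VENV"):
        return env["VENV"]
    # 2. Otherwise use the virtualenv activated in the shell (assumption:
    #    detected via VIRTUAL_ENV, which activation scripts set)
    if env.get("VIRTUAL_ENV"):
        return env["VIRTUAL_ENV"]
    # 3. Otherwise fall back to the default location
    return DEFAULT_VENV
```

For instance, running VENV=/path/to/venv make install would correspond to the first branch.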
Contributing
To add a new PII processing task, please see the contributing instructions.
Source Distribution
File details
Details for the file pii-extract-plg-regex-0.5.1.tar.gz.
File metadata
- Download URL: pii-extract-plg-regex-0.5.1.tar.gz
- Upload date:
- Size: 24.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.12
File hashes
Algorithm | Hash digest
---|---
SHA256 | 1ff882fa5a36c39633aa93c38731869acecd5f71bbecc46c856e2219b07d1d85
MD5 | 526e698972703cf3240043bd5eadb52c
BLAKE2b-256 | ab8168ee28e00787824e53f0f00c81eec7d5cdd3ef160889dd7524d1f3632112