binonymizer

Binonymizer is a tool in Python that aims at tagging personal data¹ in a parallel corpus.

For example, for an input like:

URL1  URL2  My name is Marta and my email is fake@email.com    Mi nombre es Marta y mi email es fake@email.com

Binonymizer's output will be:

URL1 URL2 My name is <entity class="PER">Marta</entity> and my email is <entity class="EMAIL">fake@email.com</entity> Mi nombre es <entity class="PER">Marta</entity> y mi email es <entity class="EMAIL">fake@email.com</entity>
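
The tagged output can be post-processed with a few lines of Python. The following is a minimal sketch, assuming the tags always follow the `<entity class="...">...</entity>` shape shown above and are never nested; `extract_entities` is a hypothetical helper, not part of Binonymizer.

```python
import re

# Matches the tag shape shown in the example output above.
# Assumption: tags are not nested and class names are upper-case letters.
ENTITY_RE = re.compile(r'<entity class="([A-Z]+)">(.*?)</entity>')


def extract_entities(segment):
    """Return a list of (entity_class, text) pairs found in one segment."""
    return ENTITY_RE.findall(segment)


line = ('My name is <entity class="PER">Marta</entity> and my email is '
        '<entity class="EMAIL">fake@email.com</entity>')
print(extract_entities(line))
# prints [('PER', 'Marta'), ('EMAIL', 'fake@email.com')]
```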

Detectable entity types

Currently, the Binonymizer is able to detect and tag the following types of entities:

PER: person names
ORG: organization and company names
EMAIL: email addresses
PHONE: phone numbers
ADDRESS: addresses
ID: personal card IDs (such as Spanish DNIs)
MISC: other personal data, or data whose type is uncertain
OTHER: other
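
Because each tag carries its entity class, a downstream script can treat classes differently, for instance by masking some and keeping others. This is only a sketch under the assumption that the output uses the tag format from the example above; `redact` and the `[CLASS]` placeholder are illustrative choices, not Binonymizer features.

```python
import re

# Entity classes listed above.
ENTITY_CLASSES = {"PER", "ORG", "EMAIL", "PHONE", "ADDRESS", "ID", "MISC", "OTHER"}

# Assumption: tags look like <entity class="PER">...</entity> and are not nested.
ENTITY_RE = re.compile(r'<entity class="([A-Z]+)">(.*?)</entity>')


def redact(segment, classes=ENTITY_CLASSES):
    """Replace tagged entities of the given classes with a [CLASS] placeholder.

    Entities of any other class keep their text but lose the tag.
    """
    def _substitute(match):
        entity_class, text = match.group(1), match.group(2)
        return "[{}]".format(entity_class) if entity_class in classes else text

    return ENTITY_RE.sub(_substitute, segment)


tagged = ('My name is <entity class="PER">Marta</entity> and my email is '
          '<entity class="EMAIL">fake@email.com</entity>')
print(redact(tagged))                     # mask every class
print(redact(tagged, classes={"EMAIL"}))  # mask only email addresses
```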

Installation & Requirements

Binonymizer works with Python 3.6, and can be installed with pip:

python3.6 -m pip install binonymizer

After installation, two binary files (binonymizer and binonymizer-lite) will be located in your python/installation/prefix/bin directory.

Language-dependent packages and models are automatically downloaded and installed at runtime, if needed.

Extra instructions for Basque

If you plan to binonymize Basque data, you need to download binonymizer from GitHub and run the following steps:

cd binonymizer
git submodule sync
git submodule update --init --recursive --remote
cd prompsit_python_bindings
python3.6 setup.py install

Please note that you need to have access to Prompsit's private repository. Contact us if you need further details.

Usage

Binonymizer can be run with:

binonymizer [-h] --format {tmx,cols} [--tmp_dir TMP_DIR]
                     [-b BLOCK_SIZE] [-p PROCESSES] [-q] [--debug]
                     [--logfile LOGFILE] [-v]
                     input [output] srclang trglang

Parameters

positional arguments:
- input: File to be anonymized (See format below)
- output: File with anonymization annotations (default: standard output)
- srclang: Source language code of the input
- trglang: Target language code of the input
optional arguments:
- -h, --help: show this help message and exit
Mandatory:
- --format {tmx,cols}: Input file format. Values: cols, tmx ("cols" format: URL1 URL2 SOURCE_SENTENCE TARGET_SENTENCE [extra columns] tab-separated)
Optional:
- --tmp_dir TMP_DIR: Temporary directory in which this program creates its temporary files (default: default system temp dir, defined by the environment variable TMPDIR in Unix)
- -b BLOCK_SIZE, --block_size BLOCK_SIZE: Sentence pairs per block (default: 10000)
- -p PROCESSES, --processes PROCESSES: Number of processes to use (default: all CPUs minus one)
Logging:
- -q, --quiet: Silent logging mode (default: False)
- --debug: Debug logging mode (default: False)
- --logfile LOGFILE: Store log to a file (default: standard error output)
- -v, --version: show version of this script and exit
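
The "cols" format described under --format is plain tab-separated text, so an input file can be produced with a short script. The sketch below is one possible way to do it; `to_cols_line` is a hypothetical helper, and collapsing internal whitespace is an assumption made here so that stray tabs or newlines inside a sentence cannot shift the columns.

```python
def to_cols_line(url1, url2, source, target, *extra):
    """Build one "cols" line: URL1, URL2, source, target and optional extra columns."""
    fields = [url1, url2, source, target, *extra]
    # Assumption: collapsing runs of whitespace is acceptable for the corpus;
    # it guarantees that no field contains a tab or a newline.
    cleaned = [" ".join(field.split()) for field in fields]
    return "\t".join(cleaned)


row = to_cols_line("URL1", "URL2",
                   "My name is Marta and my email is fake@email.com",
                   "Mi nombre es Marta y mi email es fake@email.com")
print(row.split("\t")[2])
```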

Example

binonymizer corpus.en-es.raw corpus.en-es.anon en es --format cols  --tmp_dir /tmpdir -b50000 -p31

This reads the corpus "corpus.en-es.raw", which is in the column-based format, tags the personal data it finds and writes the tagged output to "corpus.en-es.anon". Binonymizer will work in blocks of 50000 sentence pairs, using 31 processes, and write its temporary files to /tmpdir.
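
The same call can be driven from Python. The following is a minimal sketch that only uses the options documented above; `build_command` and `run_binonymizer` are hypothetical wrappers, and it assumes the `binonymizer` executable is on the PATH.

```python
import subprocess


def build_command(input_path, output_path, srclang, trglang, fmt="cols",
                  tmp_dir=None, block_size=None, processes=None):
    """Assemble the argument list for the binonymizer command line."""
    cmd = ["binonymizer", input_path, output_path, srclang, trglang,
           "--format", fmt]
    if tmp_dir is not None:
        cmd += ["--tmp_dir", tmp_dir]
    if block_size is not None:
        cmd += ["-b", str(block_size)]
    if processes is not None:
        cmd += ["-p", str(processes)]
    return cmd


def run_binonymizer(*args, **kwargs):
    """Run binonymizer and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_command(*args, **kwargs), check=True)


# Same settings as the example above; nothing is executed here.
cmd = build_command("corpus.en-es.raw", "corpus.en-es.anon", "en", "es",
                    tmp_dir="/tmpdir", block_size=50000, processes=31)
print(" ".join(cmd))
```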

Lite version

Although binonymizer parallelizes its work by distributing it across the available cores, some users may prefer to implement their own parallelization strategy. For that reason, a single-threaded version of the script is provided: binonymizer_lite. Its usage is exactly the same as that of the full version, except that the block size (-b) and processes (-p) parameters are omitted.
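
As an illustration of a custom strategy, the sketch below splits a corpus into chunks and runs the lite script on each chunk in parallel. It is an assumption-laden example, not the project's own method: `chunk_lines`, `run_lite` and `binonymize_chunks` are hypothetical helpers, the chunk files are assumed to have been written to disk already, and the executable name is a parameter because it depends on how the script is installed on your system.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def chunk_lines(lines, chunk_size):
    """Split a list of lines into consecutive chunks of at most chunk_size lines."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def run_lite(chunk_path, out_path, srclang, trglang, exe="binonymizer-lite"):
    """Run the single-threaded script on one chunk file (cols format)."""
    # Assumption: exe names the lite script as installed on this machine.
    subprocess.run([exe, chunk_path, out_path, srclang, trglang,
                    "--format", "cols"], check=True)


def binonymize_chunks(chunk_paths, srclang, trglang, workers=4):
    """Process chunk files concurrently; each one gets an ".anon" output file."""
    # Threads are enough here: each one only waits for its own subprocess.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(run_lite, path, path + ".anon", srclang, trglang)
                   for path in chunk_paths]
        for future in futures:
            future.result()  # re-raises any error from the subprocess


print(chunk_lines(["a", "b", "c", "d", "e"], 2))
# prints [['a', 'b'], ['c', 'd'], ['e']]
```

Concatenating the ".anon" files in chunk order then yields the tagged corpus.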

TO DO

Fully support TMX input/output
Address recognition
GPU support
Automate Prompsit-python-bindings submodule (git submodule update --remote, python3.6 setup.py install)

¹: See EC definition of "personal information": https://ec.europa.eu/info/law/law-topic/data-protection/reform/what-personal-data_en

Release history

0.1.1 (this version): Feb 27, 2019
0.1: Feb 27, 2019

Download files

Source Distribution

binonymizer-0.1.1.tar.gz (15.0 kB view details)

Uploaded Feb 27, 2019 Source

Built Distribution

binonymizer-0.1.1-py3-none-any.whl (33.7 kB view details)

Uploaded Feb 27, 2019 Python 3

File details

Details for the file binonymizer-0.1.1.tar.gz.

File metadata

Download URL: binonymizer-0.1.1.tar.gz
Upload date: Feb 27, 2019
Size: 15.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.31.1 CPython/3.6.3

File hashes

Hashes for binonymizer-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`8c72e8c1191564ea98bd65e41f86896d278929a3a024b6a84f21dcde12b2a76f`
MD5	`9a9f1ec55072fbaf5dabc84dd0017fe8`
BLAKE2b-256	`5065bf7e08f216262b6ea4fcbdcf3b873c38e35f53c35c140f0331b880291cd2`

File details

Details for the file binonymizer-0.1.1-py3-none-any.whl.

File metadata

Download URL: binonymizer-0.1.1-py3-none-any.whl
Upload date: Feb 27, 2019
Size: 33.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.31.1 CPython/3.6.3

File hashes

Hashes for binonymizer-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`5f6e3cb226daae912843ca82a19860b24b633dc0dbebb67ae4af4717816e4658`
MD5	`6d9411a6444e307518ace451a41c3d65`
BLAKE2b-256	`43c222b06d02e1187ce12716b5b339ad23606abf15e85fde9c5a706ba96381d2`
