binonymizer

Binonymizer is a Python tool that tags personal data [1] in a parallel corpus.

For example, for an input like:

URL1  URL2  My name is Marta and my email is fake@email.com    Mi nombre es Marta y mi email es fake@email.com

Binonymizer's output will be:

URL1 URL2 My name is <entity class="PER">Marta</entity> and my email is <entity class="EMAIL">fake@email.com</entity> Mi nombre es <entity class="PER">Marta</entity> y mi email es <entity class="EMAIL">fake@email.com</entity>
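
The tagged output can be post-processed with ordinary text tools. As a minimal sketch, assuming the tags always look exactly like those in the example above (the regular expression and the function name extract_entities are illustrative and not part of Binonymizer), the entities in a tagged sentence could be listed like this:

```python
import re

# Assumption: tags have exactly the shape shown in the example output,
# i.e. <entity class="CLASS">text</entity>, with no nesting.
ENTITY_RE = re.compile(r'<entity class="([A-Z]+)">(.*?)</entity>')


def extract_entities(tagged_text):
    """Return (entity_class, surface_text) pairs in order of appearance."""
    return ENTITY_RE.findall(tagged_text)


tagged = ('My name is <entity class="PER">Marta</entity> and my email is '
          '<entity class="EMAIL">fake@email.com</entity>')

for entity_class, surface in extract_entities(tagged):
    print(entity_class, surface)
```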

Detectable entity types

Binonymizer can currently detect and tag the following types of entities:

  • PER: person names
  • ORG: organization and company names
  • EMAIL: email addresses
  • PHONE: phone numbers
  • ADDRESS: addresses
  • ID: personal card IDs (such as Spanish DNIs)
  • MISC: other personal data, or data whose type is uncertain
  • OTHER: other
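
Because every tag carries its entity class, downstream tools can treat classes differently. The following is a sketch only, not Binonymizer functionality: the helper mask_entities and the "[CLASS]" placeholder style are assumptions made for illustration. It replaces entities of selected classes with a placeholder and strips the markup from the rest:

```python
import re

# Entity classes as listed above.
ENTITY_CLASSES = {"PER", "ORG", "EMAIL", "PHONE", "ADDRESS", "ID", "MISC", "OTHER"}

# Assumption: tags look like <entity class="CLASS">text</entity>, not nested.
ENTITY_RE = re.compile(r'<entity class="([A-Z]+)">(.*?)</entity>')


def mask_entities(tagged_text, classes_to_mask):
    """Replace entities of the given classes with "[CLASS]"; untag the others."""
    unknown = set(classes_to_mask) - ENTITY_CLASSES
    if unknown:
        raise ValueError("Unknown entity classes: {}".format(sorted(unknown)))

    def replace(match):
        entity_class, surface = match.group(1), match.group(2)
        if entity_class in classes_to_mask:
            return "[{}]".format(entity_class)
        return surface  # keep the original text, drop only the markup

    return ENTITY_RE.sub(replace, tagged_text)


tagged = ('My name is <entity class="PER">Marta</entity> and my email is '
          '<entity class="EMAIL">fake@email.com</entity>')
print(mask_entities(tagged, {"EMAIL"}))
```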

Installation & Requirements

Binonymizer works with Python 3.6.

Requirements can be installed by using pip:

python3.6 -m pip install -r requirements.txt

Language-dependent packages and models are automatically downloaded and installed at runtime, if needed.

Usage

Binonymizer can be run with:

binonymizer.py [-h] --format {tmx,cols} [--tmp_dir TMP_DIR]
                     [-b BLOCK_SIZE] [-p PROCESSES] [-q] [--debug]
                     [--logfile LOGFILE] [-v]
                     input [output] srclang trglang

Parameters

  • positional arguments:
    • input: File to be anonymized (See format below)
    • output: File with anonymization annotations (default: standard output)
    • srclang: Source language code of the input
    • trglang: Target language code of the input
  • optional arguments:
    • -h, --help: show this help message and exit
  • Mandatory:
    • --format {tmx,cols}: Input file format. Values: cols, tmx ("cols" format: URL1 URL2 SOURCE_SENTENCE TARGET_SENTENCE [extra columns] tab-separated)
  • Optional:
    • --tmp_dir TMP_DIR: Directory in which this program creates its temporary files (default: default system temp dir, defined by the environment variable TMPDIR in Unix)
    • -b BLOCK_SIZE, --block_size BLOCK_SIZE: Sentence pairs per block (default: 10000)
    • -p PROCESSES, --processes PROCESSES: Number of processes to use (default: all CPUs minus one)
  • Logging:
    • -q, --quiet: Silent logging mode (default: False)
    • --debug: Debug logging mode (default: False)
    • --logfile LOGFILE: Store log to a file (default: standard error output)
    • -v, --version: show version of this script and exit
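
To make the "cols" format concrete, here is a minimal sketch of reading one such line. It assumes only what the --format description states (four tab-separated columns, optionally followed by extra ones); the names ColsRecord and parse_cols_line are hypothetical and do not come from Binonymizer:

```python
from collections import namedtuple

# Hypothetical record type for one line of a "cols" file.
ColsRecord = namedtuple("ColsRecord", ["url1", "url2", "source", "target", "extra"])


def parse_cols_line(line):
    """Split a tab-separated "cols" line into its four fields plus any extras."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 4:
        raise ValueError(
            "Expected at least 4 tab-separated columns, got {}".format(len(fields)))
    return ColsRecord(fields[0], fields[1], fields[2], fields[3], fields[4:])


record = parse_cols_line("URL1\tURL2\tMy name is Marta\tMi nombre es Marta\n")
print(record.source)
print(record.target)
```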

Example

python3.6 binonymizer.py corpus.en-es.raw corpus.en-es.anon en es --format cols --tmp_dir /tmpdir -b 50000 -p 31

This reads the corpus "corpus.en-es.raw", which is in the column-based format, tags the personal data it finds, and writes the tagged output to "corpus.en-es.anon". Binonymizer processes the corpus in blocks of 50000 sentence pairs, uses 31 processes, and writes its temporary files to /tmpdir.
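
The idea of working in blocks can be illustrated with a short sketch. This is not Binonymizer's internal code; iter_blocks is a hypothetical helper that only shows how a stream of sentence pairs might be grouped into blocks of at most block_size items, each of which could then be handed to a worker process:

```python
from itertools import islice


def iter_blocks(lines, block_size):
    """Yield consecutive lists of at most block_size items from lines."""
    if block_size < 1:
        raise ValueError("block_size must be a positive integer")
    iterator = iter(lines)
    while True:
        block = list(islice(iterator, block_size))
        if not block:
            return  # input exhausted
        yield block


# Five sentence pairs in blocks of two: the last block is shorter.
pairs = ["pair{}".format(i) for i in range(5)]
for block in iter_blocks(pairs, 2):
    print(len(block), block)
```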

Lite version

Although Binonymizer parallelizes its work by distributing the workload across the available cores, some users might prefer to implement their own parallelization strategy. For that reason, a single-threaded version of the script is provided: binonymizer_lite.py. Its usage is exactly the same as that of the full version, except that the block size (-b) and processes (-p) parameters are omitted.
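
One possible custom strategy is to split the corpus into shards beforehand and launch one binonymizer_lite.py process per shard. The sketch below assumes that binonymizer_lite.py is in the current directory and accepts the positional arguments and --format option listed under Usage; the helper names, the shard file names and the choice of a thread pool to supervise the subprocesses are all illustrative assumptions:

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def build_lite_command(input_path, output_path, srclang, trglang, fmt="cols"):
    """Build the argument list for one binonymizer_lite.py run (no -b / -p)."""
    return [sys.executable, "binonymizer_lite.py",
            input_path, output_path, srclang, trglang, "--format", fmt]


def run_shards(shards, srclang, trglang, max_workers=4):
    """Run one lite process per (input, output) shard; return the exit codes."""
    commands = [build_lite_command(inp, out, srclang, trglang)
                for inp, out in shards]
    # Threads only wait on the child processes; the work happens in the children.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda cmd: subprocess.run(cmd).returncode, commands))


# Hypothetical shard names; printing the command does not execute anything.
print(build_lite_command("part0.raw", "part0.anon", "en", "es"))
```

Calling run_shards([("part0.raw", "part0.anon"), ("part1.raw", "part1.anon")], "en", "es") would then start the actual runs, and the tagged shards could be concatenated afterwards.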

TO DO

  • Pip installable
  • Fully support TMX input/output
  • Address recognition
  • GPU support
  • Automate Prompsit-python-bindings submodule (git submodule update --remote, python3.6 setup.py install)

[1] See the EC definition of "personal information": https://ec.europa.eu/info/law/law-topic/data-protection/reform/what-personal-data_en
