Skip to main content

A simple pipeline wrapped around SpaCy and DaCy for anonymizing danish corpora.

Project description

docs/imgs/header.png

Anonymization tool for Danish text

https://img.shields.io/pypi/v/DaAnonymization.svg Downloads

Description

A simple pipeline wrapped around SpaCy and DaCy for anonymizing danish corpora. The pipeline allows for custom functions to be implemented and piped in combination with custom functions.

The DaCy model is built on multilingual RoBERTa which enables cross-lingual transfer for other languagues ultimately providing a robust named entity recognition model for anonymization that is able to handle noisy Danish text data which could include other languages.

Languages used in the multilingual RoBERTa can be found in appendix A of XLM-RoBERTa paper: Unsupervised Cross lingual Representation Learning at Scale

  • Free software: Apache-2.0 license

Disclaimer: As the pipeline utilizes predictive models and regex function to identify entities, there is no guarantee that all sensitive information have been remove.

Features

  • Regex for CPRs, telephone numbers, emails

  • Integration of custom functions as part of the pipeline

  • Named Entity Models for Danish language implemented (PER, LOC, ORG, MISC):
    • DaCy: https://github.com/KennethEnevoldsen/DaCy

    • Default entities to mask: PER, LOC and ORG (MISC can be specified but covers many different entitites)

    • Batch mode and multiprocessing

    • DaCy is robust to language changes as it is fine tuned from a multilingual RoBERTa model

  • Allow anonymizing using suppression

  • Allow masking to be aware of prior knowledge about individuals occuring in the texts

  • Pseudonymization module (Person 1, Person 2 etc.)

  • Logging to masked_corpus function, enabling tracking of warning if no person was found in a text

  • Beta version: Masking or noising with laplace epsilon noise of numbers

Installation

Install from pip

pip install DaAnonymization

Install from source

pip install git+https://github.com/martincjespersen/DaAnonymization.git

Quickstart

DaAnonymization’s two main components are:

  • TextAnonymizer

  • TextPseudonymizer

Both components uses their mask_corpus function to anonymize/pseudonymize text by removing person, location, organization, email, telephone number and CPR. The order of these masking methods are by default CPR, telephone number, email and NER (PER,LOC,ORG) as NER will identify names in the emails. The following example shows an example of applying default anonymization and how it also cross-lingual transfer to english.

from textprivacy import TextAnonymizer

# list of texts (example with cross-lingual transfer to english)
corpus = [
    "Hej, jeg hedder Martin Jespersen og er fra Danmark og arbejder i "
    "Deloitte, mit cpr er 010203-2010, telefon: +4545454545 "
    "og email: martin.martin@gmail.com",
    "Hi, my name is Martin Jespersen and work in Deloitte. "
    "I used to be a PhD. at DTU in Machine Learning and B-cell immunoinformatics "
    "at Anker Engelunds Vej 1 Bygning 101A, 2800 Kgs. Lyngby.",
]

Anonymizer = TextAnonymizer(corpus)

# Anonymize person, location, organization, emails, CPR and telephone numbers
anonymized_corpus = Anonymizer.mask_corpus()

for text in anonymized_corpus:
    print(text)

Running this script outputs the following:

Hej, jeg hedder [PERSON] og er fra [LOKATION] og arbejder i [ORGANISATION], mit cpr er [CPR],
telefon: [TELEFON] og email: [EMAIL]

Hi, my name is [PERSON] and work in [ORGANISATION]. I used to be a PhD. at [ORGANISATION]
in Machine Learning and B-cell immunoinformatics at [LOKATION].

Using custom masking functions

As each project can have specific needs, DaAnonymization supports adding custom functions to the pipeline for masking additional features which are not implemented by default.

from textprivacy import TextAnonymizer
import re

# Takes string as input and returns a set of all occurences
example_custom_function = lambda x: set(list(re.findall(r"\d+ år", x)))

# list of texts
corpus = [
    "Hej, jeg hedder Martin Jespersen, er 20 år, er fra Danmark og arbejder i "
    "Deloitte, mit cpr er 010203-2010, telefon: +4545454545 "
    "og email: martin.martin@gmail.com",
]

Anonymizer = TextAnonymizer(corpus)

# update the mapping to include new custom function entity finder and replacement placeholder
Anonymizer.mapping.update({"ALDER": "[ALDER]"})

# add the name to masking_order in the desired order
# add custom function to custom_functions to update pool of possible masking functions
anonymized_corpus = Anonymizer.mask_corpus(
    masking_order=["CPR", "TELEFON", "EMAIL", "NER", "ALDER"],
    custom_functions={"ALDER": example_custom_function},
)

for text in anonymized_corpus:
    print(text)
Hej, jeg hedder [PERSON], er [ALDER], er fra [LOKATION] og arbejder i [ORGANISATION],
mit cpr er [CPR], telefon: [TELEFON] og email: [EMAIL]

Pseudonymization with prior knowledge

Sometimes it can be useful to maintain some context regarding sensitive information within the text. Pseudonymization allows for maintaining the connection between entities while masking them. Essentially this means adding a unique identifier for each individual and their information in the text.

By using the optional input argument individuals, you can add prior information about known individuals in the text you want to mask. The structure of individuals needs to be as shown below. The first dictionary provides a key for index of the text in the corpus, the next the unique identifier (integer) of the individuals and finally a dictionary of entities known prior for each individual.

from textprivacy import TextPseudonymizer

# prior information about the text
individuals = {1:
                {1:
                    {'PER': set(['Martin Jespersen', 'Martin', 'Jespersen, Martin']),
                     'CPR': set(['010203-2010']),
                     'EMAIL': set(['martin.martin@gmail.com']),
                     'LOC': set(['Danmark']),
                     'ORG': set(['Deloitte'])
                     },
                2:
                    {'PER': set(['Kristina']),
                     'ORG': set(['Novo Nordisk'])
                     }
                 }

              }

# list of texts
corpus = [
    "Første tekst om intet, blot Martin",
    "Hej, jeg hedder Martin Jespersen og er fra Danmark og arbejder i "
    "Deloitte, mit cpr er 010203-2010, telefon: +4545454545 "
    "og email: martin.martin@gmail.com. Martin er en 20 årig mand. "
    "Kristina er en person som arbejder i Novo Nordisk. "
    "Frank er en mand som bor i Danmark og arbejder i Netto",
]

Pseudonymizer = TextPseudonymizer(corpus, individuals=individuals)

# Pseudonymize person, location, organization, emails, CPR and telephone numbers
pseudonymized_corpus = Pseudonymizer.mask_corpus()

for text in pseudonymized_corpus:
    print(text)
Første tekst om intet, blot Person 1

Hej, jeg hedder Person 1 og er fra Lokation 1 og arbejder i Organisation 1, mit cpr er CPR 1,
telefon: Telefon 5 og email: Email 1. Person 1 er en 20 årig mand. Person 2 er en person som
arbejder i Organisation 2. Person 3 er en mand som bor i Lokation 1 og arbejder i Organisation 4

Demo using streamlit

DaAnonymization is now available with an easy demo website created in streamlit.

pip install streamlit==1.2.0
streamlit run app.py

Running the code above will result in a website demoing the use of DaAnonymization.

docs/imgs/streamlit_app.png

Fairness evaluations

Evaluations on gender and error biases are conducted in DaCy documentation.

Next up

  • When SpaCy fixed multiprocessing in nlp.pipe, remove current hack

History

0.1.0 (2021-03-04)

  • First release on PyPI.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

DaAnonymization-0.1.1.tar.gz (444.2 kB view details)

Uploaded Source

Built Distribution

DaAnonymization-0.1.1-py2.py3-none-any.whl (17.7 kB view details)

Uploaded Python 2 Python 3

File details

Details for the file DaAnonymization-0.1.1.tar.gz.

File metadata

  • Download URL: DaAnonymization-0.1.1.tar.gz
  • Upload date:
  • Size: 444.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.8.13

File hashes

Hashes for DaAnonymization-0.1.1.tar.gz
Algorithm Hash digest
SHA256 f7c4c528d4db1fc9c5ded6daae3027f610a3443f926561e97c819d7fd5091cd1
MD5 985e0b5169c7a29ddc6c7252f7f2c1e7
BLAKE2b-256 27c1c09bee15822651d33ed912fcd77700a9f226198d5fb2ce2dd982a8fa0f5e

See more details on using hashes here.

File details

Details for the file DaAnonymization-0.1.1-py2.py3-none-any.whl.

File metadata

File hashes

Hashes for DaAnonymization-0.1.1-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 dd6cd8034b72d5fc1ee642e9e410be82a9283f587294e65ded5b38d49adee602
MD5 2b78d5334035cbb23e32e44266bf7d11
BLAKE2b-256 e4fdf8d726fc140a6f2ecaed055fc9b367a25f3c1b3e17af4562c798600a1e95

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page