Wowool Anonymizer

Ensuring data privacy

The anonymizer app detects and redacts personally identifiable information (PII) and sensitive entities from unstructured text. Its goal is to preserve privacy while retaining the utility of the original content for downstream processing or analysis.

Options

AnonymizerOptions

interface AnonymizerOptions {
    annotations?: string[];
    pseudonyms?: Record<string, string[]>;
    formatters?: Record<string, string>;
}

with:

| Property | Description |
| --- | --- |
| annotations | List of annotations to anonymize. If not provided, all annotations are anonymized. |
| pseudonyms | Mapping from an entity URI, such as Person or Company, to the pseudonyms used for that entity type. |
| formatters | Mapping from an entity URI to the formatter (f-string-like) used to convert the input data. |
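
As a rough illustration of the `annotations` rule, the following Python sketch mirrors the interface as a `TypedDict` and decides whether an entity URI would be anonymized. The `should_anonymize` helper is hypothetical and not part of the wowool API. It only encodes the documented behavior that a missing `annotations` list means every annotation is anonymized. Whether the library accepts such a dictionary directly is not confirmed here.

```python
from typing import TypedDict


class AnonymizerOptions(TypedDict, total=False):
    """Python mirror of the interface above; every key is optional."""

    annotations: list[str]
    pseudonyms: dict[str, list[str]]
    formatters: dict[str, str]


def should_anonymize(uri: str, options: AnonymizerOptions) -> bool:
    """Hypothetical helper: True if entities with this URI would be anonymized."""
    annotations = options.get("annotations")
    # No list provided: all annotations are anonymized.
    return annotations is None or uri in annotations


options: AnonymizerOptions = {
    "annotations": ["Person"],
    "formatters": {"default": "{'.'*len(literal)}"},
}
print(should_anonymize("Person", options))   # True
print(should_anonymize("Company", options))  # False
print(should_anonymize("Company", {}))       # True
```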

Formatters

Predefined variables can be used to format the input data:

| Variable | Description |
| --- | --- |
| uri | URI of the entity |
| literal | Literal text of the entity |
| canonical | Normalized or canonicalized text, e.g. John Doe instead of he |
| concept | Concept that you can use to anonymize (e.g. concept.gender) |
| anonymized | Converted data |

For example, consider the following formatters:

"formatters": {
    "Person": "#{uri}-{concept.position}-#{nr}",
    "PersonalIdentificationNumber": "#{\"*\"* (len(literal)-3)}{literal[-2:]}",
    "default": "{'.'*len(literal)}"
}
  • The first formatter replaces a Person entity with the URI, the position concept and a counter (nr). For instance, John Doe will be redacted as #Person-Lawyer-#3
  • The second creates a mask based on the literal's length, prefixed with # and keeping the last two characters. For instance, 11-22-333 will be masked as #******33
  • The last one, the default formatter, masks the whole length of the literal using dots. For instance, Ikea will be entirely redacted as ....
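
To see what these templates produce, here is a minimal sketch that evaluates them as ordinary Python f-strings. The `render` helper is hypothetical; it is not how the library is known to evaluate formatters, and the values passed for `concept` and `nr` are made up for the example. Read as a plain f-string, the second formatter keeps its leading `#` and emits six asterisks for a nine-character literal.

```python
from types import SimpleNamespace


def render(template: str, **variables) -> str:
    """Evaluate an f-string-like template (hypothetical helper, not the wowool API)."""
    # Only `len` is exposed as a builtin. eval() is still unsafe for untrusted templates.
    scope = {"__builtins__": {"len": len}, **variables}
    return eval("f" + repr(template), scope)


person = render(
    "#{uri}-{concept.position}-#{nr}",
    uri="Person",
    concept=SimpleNamespace(position="Lawyer"),  # assumed concept attribute
    nr=3,                                        # assumed counter value
)
masked_id = render('#{"*"* (len(literal)-3)}{literal[-2:]}', literal="11-22-333")
dots = render("{'.'*len(literal)}", literal="Ikea")

print(person)     # #Person-Lawyer-#3
print(masked_id)  # #******33
print(dots)       # ....
```

Because the templates are evaluated as code in this sketch, they should only ever come from trusted configuration.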

Results

AnonymizerResults

interface AnonymizerResults {
    text: string;
    locations: Location[];
}

with:

| Property | Description |
| --- | --- |
| text | Anonymized text |
| locations | Structured information about the changes that have been made |

Location

interface Location {
    uri: string;
    text: string;
    anonymized: string;
    begin_offset: number;
    end_offset: number;
    byte_begin_offset: number;
    byte_end_offset: number;
}

with:

| Property | Description |
| --- | --- |
| uri | URI of the entity that was anonymized, e.g. Person or Company |
| text | Original text segment that was anonymized |
| anonymized | Anonymized or pseudonymized version of the original text |
| begin_offset | Starting character offset in the input document |
| end_offset | Ending character offset in the input document |
| byte_begin_offset | Starting byte offset in the input document |
| byte_end_offset | Ending byte offset in the input document |
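
Character and byte offsets coincide for pure ASCII input but diverge as soon as the text contains multi-byte characters. The sketch below shows the relationship, assuming the byte offsets refer to a UTF-8 encoding of the input (the encoding is not stated in this description). `to_byte_offsets` is a hypothetical helper, not part of the library.

```python
def to_byte_offsets(text: str, begin: int, end: int, encoding: str = "utf-8") -> tuple[int, int]:
    """Convert a character span into a byte span for the given encoding."""
    byte_begin = len(text[:begin].encode(encoding))
    byte_end = byte_begin + len(text[begin:end].encode(encoding))
    return byte_begin, byte_end


# "\u00eb" (e with diaeresis) is one character but two bytes in UTF-8.
text = "Zo\u00eb Smith works for Ikea."
begin = text.index("Ikea")
end = begin + len("Ikea")

print((begin, end))                       # (20, 24)
print(to_byte_offsets(text, begin, end))  # (21, 25)
```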

API

Examples

You will need to install the English language module to run the samples:

pip install wowool-english

Anonymize known entities

This script finds entities in a sentence and replaces each character of those entities with a dot, then prints the anonymized output and structured information.

DefaultWriter(formatters={"default": "{'.'*len(literal)}"}) sets up a writer that replaces each character of any entity with a dot (.), matching the entity’s length.

from wowool.sdk import Pipeline
from wowool.anonymizer import Anonymizer, DefaultWriter
from json import dumps

# Replace every character of each entity with a dot.
english = Pipeline("english,entity")
document = english("John Smith works for Ikea.")
writer = DefaultWriter(formatters={"default": "{'.'*len(literal)}"})
anonymizer = Anonymizer(writer=writer)
document = anonymizer(document)
results = document.results(Anonymizer.ID)
print(dumps(results, indent=2))

results:

{
  "text": ".......... works for .....",
  "locations": [
    {
      "begin_offset": 0,
      "end_offset": 10,
      "text": "John Smith",
      "uri": "Person",
      "anonymized": "..........",
      "byte_begin_offset": 0,
      "byte_end_offset": 10
    },
    {
      "begin_offset": 21,
      "end_offset": 25,
      "text": "IKEA",
      "uri": "Company",
      "anonymized": "....",
      "byte_begin_offset": 21,
      "byte_end_offset": 25
    }
  ]
}
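
The `locations` list carries enough information to reproduce the redaction from the original sentence. The following sketch does so for the output above, assuming `begin_offset` and `end_offset` index into the input text, as they do in this example. `apply_locations` is a hypothetical helper and the dictionaries are trimmed copies of the entries shown above.

```python
def apply_locations(text: str, locations: list[dict]) -> str:
    """Rebuild the anonymized text from the original text and its locations."""
    parts, cursor = [], 0
    for loc in sorted(locations, key=lambda item: item["begin_offset"]):
        parts.append(text[cursor:loc["begin_offset"]])  # untouched text before the entity
        parts.append(loc["anonymized"])                 # replacement for the entity
        cursor = loc["end_offset"]
    parts.append(text[cursor:])                         # untouched tail
    return "".join(parts)


original = "John Smith works for Ikea."
locations = [
    {"begin_offset": 0, "end_offset": 10, "anonymized": ".........."},
    {"begin_offset": 21, "end_offset": 25, "anonymized": "...."},
]

print(apply_locations(original, locations))  # .......... works for .....
```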

Custom pseudonyms

This script replaces detected person and company names in the text with your chosen pseudonyms, then prints the anonymized result.

from wowool.sdk import Pipeline
from wowool.anonymizer import Anonymizer, DefaultWriter

# note you can use the default pseudonyms if you want
# from wowool.anonymizer.core.anonymizer_config import DEFAULT_PSEUDONYMS
from json import dumps

# Replace detected entities with the chosen pseudonyms.
english = Pipeline("english,entity")
document = english("John Smith works for Ikea.")
pseudonyms = {
    "Person": ["Badman"],
    "Company": ["Monster Inc."],
}
writer = DefaultWriter(pseudonyms)
anonymizer = Anonymizer(writer=writer)
document = anonymizer(document)
results = document.results(Anonymizer.ID)
print(dumps(results, indent=2))

results:

{
  "text": "Badman works for Monster Inc..",
  "locations": [
    {
      "begin_offset": 0,
      "end_offset": 6,
      "text": "John Smith",
      "uri": "Person",
      "anonymized": "Badman",
      "byte_begin_offset": 0,
      "byte_end_offset": 10
    },
    {
      "begin_offset": 17,
      "end_offset": 29,
      "text": "IKEA",
      "uri": "Company",
      "anonymized": "Monster Inc.",
      "byte_begin_offset": 21,
      "byte_end_offset": 25
    }
  ]
}
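
In this sample output, `begin_offset` and `end_offset` line up with the anonymized text (for example, characters 17 to 29 of the result are "Monster Inc."), while the byte offsets still match the original sentence. Under that reading, the original wording can be restored from the results alone. The sketch below is an illustration based only on the output shown here; `restore` is a hypothetical helper, not a library function.

```python
def restore(anonymized_text: str, locations: list[dict]) -> str:
    """Put the original segments back, using offsets into the anonymized text."""
    parts, cursor = [], 0
    for loc in sorted(locations, key=lambda item: item["begin_offset"]):
        begin, end = loc["begin_offset"], loc["end_offset"]
        # Sanity check: the span must contain the pseudonym that was written.
        if anonymized_text[begin:end] != loc["anonymized"]:
            raise ValueError(f"offsets do not match pseudonym {loc['anonymized']!r}")
        parts.append(anonymized_text[cursor:begin])
        parts.append(loc["text"])
        cursor = end
    parts.append(anonymized_text[cursor:])
    return "".join(parts)


results = {
    "text": "Badman works for Monster Inc..",
    "locations": [
        {"begin_offset": 0, "end_offset": 6, "text": "John Smith", "anonymized": "Badman"},
        {"begin_offset": 17, "end_offset": 29, "text": "IKEA", "anonymized": "Monster Inc."},
    ],
}

print(restore(results["text"], results["locations"]))  # John Smith works for IKEA.
```

Since the results make the redaction reversible, the `locations` list should be stored and shared as carefully as the original text.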

License

In both cases, you will need to acquire a license file at https://www.wowool.com

Non-Commercial

This library is licensed under the GNU AGPLv3 for non-commercial use.
For commercial use, a separate license must be purchased.

Commercial License Terms

1. Grants the right to use this library in proprietary software.
2. Requires a valid license key.
3. Redistribution in SaaS requires a commercial license.

