Data anonymization library

Project description

Anonymizers 🎭 - NER, regex, FlashText anonymizer

This project is a data anonymization library built with Rust 🦀.

It uses three techniques to anonymize data:

  • Named Entity Recognition (NER),
  • Flash Text,
  • Regular Expressions (Regex).

The library can be used as:

  • a Python library 🐍
  • a Rust library 🦀
  • a REST API 🌐
  • a Docker image 🐠

Anonymizers

Named Entity Recognition (NER)

This method enables the library to identify and anonymize sensitive named entities in your data, like names, organizations, locations, and other personal identifiers.

The Anonymizers library uses ML models in ONNX format (run with tract).

To prepare an ONNX model you will need a few additional libraries (they are only used for converting the model to ONNX, so you don't need them during inference):

pip install torch onnx sacremoses transformers[onnx]

You can use existing models from Hugging Face (please note that the repository license applies only to the library code), e.g.:

import os
from pathlib import Path

import transformers
from transformers import AutoModelForTokenClassification, AutoTokenizer
from transformers.onnx import FeaturesManager

model_id = "dslim/bert-base-NER"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

# Check that the model supports ONNX export for this task and build the export config.
feature = "token-classification"
model_kind, model_onnx_config = FeaturesManager.check_supported_model_or_raise(
    model, feature=feature
)
onnx_config = model_onnx_config(model.config)

output_dir = "./dslim"
os.makedirs(output_dir, exist_ok=True)

# Export the model to ONNX.
onnx_inputs, onnx_outputs = transformers.onnx.export(
    preprocessor=tokenizer,
    model=model,
    config=onnx_config,
    opset=13,
    output=Path(output_dir) / "model.onnx",
)

print(onnx_inputs)
print(onnx_outputs)

# Save the tokenizer files (this produces tokenizer.json next to the model).
tokenizer.save_pretrained(output_dir)

Configuration file config.yaml:

pipeline:
  - kind: ner
    model_path: ./examples/dslim/model.onnx
    tokenizer_path: ./examples/dslim/tokenizer.json
    id2label:
      "0": ["O", false]
      "1": ["B-MISC", true]
      "2": ["I-MISC", true]
      "3": ["B-PER", true]
      "4": ["I-PER", true]
      "5": ["B-ORG", true]
      "6": ["I-ORG", true]
      "7": ["B-LOC", true]
      "8": ["I-LOC", true]
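
The id2label mapping pairs each model output id with a label and a flag that marks whether matching tokens should be replaced. A minimal sketch of how such a mapping can drive replacement (an illustration of the idea under assumed behaviour, not the library's internals; `anonymize_tokens` is a hypothetical helper):

```python
# Hypothetical sketch: each output id maps to (label, should_replace);
# flagged tokens become numbered placeholders derived from the label.
id2label = {
    0: ("O", False),     # outside any entity - kept as-is
    3: ("B-PER", True),  # beginning of a person name - replaced
    7: ("B-LOC", True),  # beginning of a location - replaced
}

def anonymize_tokens(tokens, predicted_ids, id2label):
    """Replace tokens whose predicted label is flagged for replacement."""
    out, items, counters = [], {}, {}
    for token, pred in zip(tokens, predicted_ids):
        label, replace = id2label[pred]
        if replace:
            n = counters.get(label, 0)
            counters[label] = n + 1
            placeholder = f"{label}{n}"
            items[placeholder] = token  # remember the original value
            out.append(placeholder)
        else:
            out.append(token)
    return " ".join(out), items

text, items = anonymize_tokens(
    ["My", "name", "is", "Sarah"], [0, 0, 0, 3], id2label
)
# text  -> "My name is B-PER0"
# items -> {"B-PER0": "Sarah"}
```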

Flash Text

A fast method for searching and replacing words in large datasets, used to anonymize predefined sensitive information.

Configuration file config.yaml:

pipeline:
  - kind: flashText
    name: FRUIT_FLASH
    file: ./tests/config/fruits.txt
    keywords:
    - apple
    - banana
    - plum
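
The core idea of FlashText can be sketched in plain Python: keywords are matched on word boundaries in a single pass over the text and swapped for numbered placeholders. This is an illustration only, not the library's trie-based implementation, and the plural handling here is a simplifying assumption:

```python
import re

def flash_anonymize(text, name, keywords):
    """Single-pass whole-word keyword replacement (illustrative sketch)."""
    keyword_set = {k.lower() for k in keywords}
    items = {}

    def replace(match):
        word = match.group(0)
        # Match whole words only; crude plural handling by stripping a trailing "s".
        if word.lower() in keyword_set or word.lower().rstrip("s") in keyword_set:
            placeholder = f"{name}{len(items)}"
            items[placeholder] = word  # keep the original for deanonymization
            return placeholder
        return word

    return re.sub(r"\b\w+\b", replace, text), items

text, items = flash_anonymize(
    "I like to eat apples and bananas and plums.",
    "FRUIT_FLASH",
    ["apple", "banana", "plum"],
)
# text -> "I like to eat FRUIT_FLASH0 and FRUIT_FLASH1 and FRUIT_FLASH2."
```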

Regex

This method provides a flexible way to identify and anonymize data patterns like credit card numbers, social security numbers, etc.

Configuration file config.yaml:

pipeline:
  - kind: regex
    name: FRUIT_REGEX
    file: ./tests/config/fruits_regex.txt
    patterns:
    - \bapple\w*\b
    - \bbanana\w*\b
    - \bplum\w*\b
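
The regex stage can be sketched as applying each configured pattern in turn and recording every match as a numbered placeholder. This is assumed behaviour reconstructed from the response format, not the library's actual code:

```python
import re

def regex_anonymize(text, name, patterns):
    """Replace every match of each pattern with a numbered placeholder."""
    items = {}

    def replace(match):
        placeholder = f"{name}{len(items)}"
        items[placeholder] = match.group(0)  # keep the original for deanonymization
        return placeholder

    for pattern in patterns:
        text = re.sub(pattern, replace, text)
    return text, items

text, items = regex_anonymize(
    "I like to eat plums and plumcots.",
    "FRUIT_REGEX",
    [r"\bplum\w*\b"],
)
# text -> "I like to eat FRUIT_REGEX0 and FRUIT_REGEX1."
```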

Usage

REST API

The library exposes a simple and user-friendly REST API, making it easy to integrate this anonymization functionality into your existing systems or applications.

git clone https://github.com/qooba/anonymize-rs
cd anonymize-rs

cargo run -- server --host 0.0.0.0 --port 8080 --config config.yaml

where config.yaml is:

pipeline:
  - kind: flashText
    name: FRUIT_FLASH
    file: ./tests/config/fruits.txt
  - kind: regex
    name: FRUIT_REGEX
    file: ./tests/config/fruits_regex.txt
  - kind: ner
    model_path: ./examples/dslim/model.onnx
    tokenizer_path: ./examples/dslim/tokenizer.json
    token_type_ids_included: true
    id2label:
      "0": ["O", false] # [label, should_replace]
      "1": ["B-MISC", true]
      "2": ["I-MISC", true]
      "3": ["B-PER", true]
      "4": ["I-PER", true]
      "5": ["B-ORG", true]
      "6": ["I-ORG", true]
      "7": ["B-LOC", true]
      "8": ["I-LOC", true]

Anonymization

curl -X POST "http://localhost:8080/api/anonymize" -H "accept: application/json" -H "Content-Type: application/json" -d '{"text":"I like to eat apples and bananas and plums"}'

or

curl -X GET "http://localhost:8080/api/anonymize?text=I like to eat apples and bananas and plums" -H "accept: application/json" -H "Content-Type: application/json"

Response:

{
    "text": "I like to eat FRUIT_FLASH0 and FRUIT_FLASH1 and FRUIT_REGEX0",
    "items": {
        "FRUIT_FLASH0": "apples",
        "FRUIT_FLASH1": "bananas",
        "FRUIT_REGEX0": "plums"
    }
}

Deanonymization

curl -X POST "http://localhost:8080/api/deanonymize" -H "accept: application/json" -H "Content-Type: application/json" -d '{
    "text": "I like to eat FRUIT_FLASH0 and FRUIT_FLASH1 and FRUIT_REGEX0",
    "items": {
        "FRUIT_FLASH0": "apples",
        "FRUIT_FLASH1": "bananas",
        "FRUIT_REGEX0": "plums"
    }
}'

Response:

{
    "text": "I like to eat apples and bananas and plums"
}
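
As the response shows, deanonymization is plain placeholder substitution: each placeholder in the items map is swapped back for its original value. A minimal sketch:

```python
def deanonymize(text, items):
    """Substitute each placeholder back for its recorded original value."""
    for placeholder, original in items.items():
        text = text.replace(placeholder, original)
    return text

restored = deanonymize(
    "I like to eat FRUIT_FLASH0 and FRUIT_FLASH1 and FRUIT_REGEX0",
    {"FRUIT_FLASH0": "apples", "FRUIT_FLASH1": "bananas", "FRUIT_REGEX0": "plums"},
)
# restored -> "I like to eat apples and bananas and plums"
```

Note that this naive substitution assumes no placeholder is a prefix of another (e.g. FRUIT_FLASH1 vs FRUIT_FLASH10); a robust implementation would replace longest placeholders first.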

Docker image

You can simply run the anonymization server using the Docker image (note that the container-side mount path must be absolute):

docker run -it -v $(pwd)/config.yaml:/config.yaml -p 8080:8080 qooba/anonymize-rs server --host 0.0.0.0 --port 8080 --config /config.yaml

Python

pip install anonymizers
>>> from anonymizers import Ner, Regex, FlashText
>>> id2label={"0":("O",False),"1": ("B-MISC", True),"2": ("I-MISC", True),"3": ("B-PER", True),"4": ("I-PER", True),"5": ("B-ORG", True),"6": ("I-ORG", True),"7": ("B-LOC", True),"8": ("I-LOC", True)}
>>> ner_anonymizer = Ner("./dslim/model.onnx","./dslim/tokenizer.json", id2label)
MODEL LOADED: 3.25s
TOKENIZER LOADED: 14.10ms
>>> ner_anonymizer.anonymize("My name is Sarah and I live in London. I like London.")
('My name is B-PER0 and I live in B-LOC0. I like B-LOC0.', {'B-PER0': 'Sarah', 'B-LOC0': 'London'})
>>> flash_anonymizer = FlashText("FRUIT", None, ["apple","banana","plum"])
>>> flash_anonymizer.anonymize("I like to eat apples and bananas and plums.")
('I like to eat FRUIT0 and FRUIT1 and FRUIT2.', {'FRUIT2': 'plums', 'FRUIT1': 'bananas', 'FRUIT0': 'apples'})

⚠️ Note: The Anonymizers library can help identify sensitive/PII data in structured and unstructured text. However, it uses automated detection mechanisms, and there is no guarantee that it will find all sensitive information. Consequently, additional systems and protections should be employed. This tool is meant to be a part of your privacy protection suite, not the entirety of it. Always ensure your data protection measures are comprehensive and multi-layered.

Download files

Source Distributions

No source distribution files are available for this release.

Built Distribution

anonymizers-0.0.2-cp39-cp39-manylinux_2_31_x86_64.whl (7.8 MB), CPython 3.9, manylinux: glibc 2.31+ x86-64

File hashes

Hashes for anonymizers-0.0.2-cp39-cp39-manylinux_2_31_x86_64.whl:

Algorithm   Hash digest
SHA256      46d46e51c38a96b7ac49e95a4b1e68bfe58ff2a10031bcd518bfbce9eb31ef42
MD5         c0807f64a2b66f09e78186cdebf82ed8
BLAKE2b-256 5daf4dafb6660337a0906a7cf8cd00998a54218dad177f210416f991f8b9a73f
