Data anonymization library
Project description
Anonymizers 🎭 - ner, regex, flash text anonymizer
This project is a data anonymization library built with Rust 🦀.
It uses three techniques to anonymize data:
- Named Entity Recognition (NER),
- Flash Text,
- Regular Expressions (Regex).
Library can be used:
- python library 🐍
- rust library 🦀
- REST API 🌐
- Docker image 🐠
Anonymizers
Named Entity Recognition (NER)
This method enables the library to identify and anonymize sensitive named entities in your data, like names, organizations, locations, and other personal identifiers.
Anonymizers library uses ML models in onnx format (using tract).
To prepare onnx model (they are only for converting model to onnx thus you don't need them during inference) additional libraries will be required:
pip install torch onnx sacremoses transformers[onnx]
You can use exsting models from HuggingFace (please note that repository license is only associated with library code) eg
- en dslim/bert-base-NER
- pl clarin-pl/FastPDN
- multilanguage xlm-roberta-large-finetuned-conll03-english
import os
import transformers
from transformers import AutoModelForMaskedLM, AutoTokenizer, AutoModelForTokenClassification
from transformers.onnx import FeaturesManager
from pathlib import Path
from transformers import pipeline
model_id='dslim/bert-base-NER'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)
feature='token-classification'
model_kind, model_onnx_config = FeaturesManager.check_supported_model_or_raise(model, feature=feature)
onnx_config = model_onnx_config(model.config)
output_dir = "./dslim"
os.makedirs(output_dir, exist_ok=True)
# export
onnx_inputs, onnx_outputs = transformers.onnx.export(
preprocessor=tokenizer,
model=model,
config=onnx_config,
opset=13,
output=Path(output_dir+"/model.onnx")
)
print(onnx_inputs)
print(onnx_outputs)
tokenizer.save_pretrained(output_dir)
configuration file config.yaml
:
pipeline:
- kind: ner
model_path: ./examples/dslim/model.onnx
tokenizer_path: ./examples/dslim/tokenizer.json
id2label:
"0": ["O", false]
"1": ["B-MISC", true]
"2": ["I-MISC", true]
"3": ["B-PER", true]
"4": ["I-PER", true]
"5": ["B-ORG", true]
"6": ["I-ORG", true]
"7": ["B-LOC", true]
"8": ["I-LOC", true]
Flash Text
A fast method for searching and replacing words in large datasets, used to anonymize predefined sensitive information.
configuration file config.yaml
:
pipeline:
- kind: flashText
name: FRUIT_FLASH
file: ./tests/config/fruits.txt
keywords:
- apple
- banana
- plum
Regex
This method provides a flexible way to identify and anonymize data patterns like credit card numbers, social security numbers, etc.
configuration file config.yaml
:
pipeline:
- kind: regex
name: FRUIT_REGEX
file: ./tests/config/fruits_regex.txt
patterns:
- \bapple\w*\b
- \bbanana\w*\b
- \bplum\w*\b
Usage
REST API
The library exposes a simple and user-friendly REST API, making it easy to integrate this anonymization functionality into your existing systems or applications.
git clone https://github.com/qooba/anonymize-rs
cd anonymize-rs
cargo run -- server --host 0.0.0.0 --port 8089 --config config.yaml
where config.yaml
:
pipeline:
- kind: flashText
name: FRUIT_FLASH
file: ./tests/config/fruits.txt
- kind: regex
name: FRUIT_REGEX
file: ./tests/config/fruits_regex.txt
- kind: ner
model_path: ./examples/dslim/model.onnx
tokenizer_path: ./examples/dslim/tokenizer.json
token_type_ids_included: true
id2label:
"0": ["O", false] # [replacement, is_to_replaced]
"1": ["B-MISC", true]
"2": ["I-MISC", true]
"3": ["B-PER", true]
"4": ["I-PER", true]
"5": ["B-ORG", true]
"6": ["I-ORG", true]
"7": ["B-LOC", true]
"8": ["I-LOC", true]
Anonymization
curl -X POST "http://localhost:8080/api/anonymize" -H "accept: application/json" -H "Content-Type: application/json" -d '{"text":"I like to eat apples and bananas and plums"}'
or
curl -X GET "http://localhost:8080/api/anonymize?text=I like to eat apples and bananas and plums" -H "accept: application/json" -H "Content-Type: application/json"
Response:
{
"text": "I like to eat FRUIT_FLASH0 and FRUIT_FLASH1 and FRUIT_REGEX0",
"items": {
"FRUIT_FLASH0": "apples",
"FRUIT_FLASH1": "banans",
"FRUIT_REGEX0": "plums"
}
}
Deanonymization
curl -X POST "http://localhost:8080/api/deanonymize" -H "accept: application/json" -H "Content-Type: application/json" -d '{
"text": "I like to eat FRUIT_FLASH0 and FRUIT_FLASH1 and FRUIT_REGEX0",
"items": {
"FRUIT_FLASH0": "apples",
"FRUIT_FLASH1": "banans",
"FRUIT_REGEX0": "plums"
}
}'
Response:
{
"text": "I like to eat apples and bananas and plums"
}
Docker image
You can simply run anonymization server using docker image:
docker run -it -v $(pwd)/config.yaml:config.yaml -p 8080:8080 qooba/anonymize-rs server --host 0.0.0.0 --port 8080 --config config.yaml
Python
pip install anonymizers
>>> from anonymizers import Ner, Regex, FlashText
>>> id2label={"0":("O",False),"1": ("B-MISC", True),"2": ("I-MISC", True),"3": ("B-PER", True),"4": ("I-PER", True),"5": ("B-ORG", True),"6": ("I-ORG", True),"7": ("B-LOC", True),"8": ("I-LOC", True)}
>>> ner_anonymizer = Ner("./dslim/model.onnx","./dslim/tokenizer.json", id2label)
MODEL LOADED: 3.25s
TOKENIZER LOADED: 14.10ms
>>> ner_anonymizer.anonymize("My name is Sarah and I live in London. I like London.")
('My name is B-PER0 and I live in B-LOC0. I like B-LOC0.', {'B-PER0': 'Sarah', 'B-LOC0': 'London'})
>>> from anonymizers import Ner, Regex, FlashText
>>> flash_anonymizer = FlashText("FRUIT", None, ["apple","banana","plum"])
>>> flash_anonymizer.anonymize("I like to eat apples and bananas and plums.")
('I like to eat FRUIT0 and FRUIT1 and FRUIT2.', {'FRUIT2': 'plums', 'FRUIT1': 'bananas', 'FRUIT0': 'apples'})
⚠️ Note: Anonymizers library can help identify sensitive/PII data in un/structured text. However, it uses automated detection mechanisms, and there is no guarantee that it will find all sensitive information. Consequently, additional systems and protections should be employed. This tool is meant to be a part of your privacy protection suite, not the entirety of it. Always ensure your data protection measures are comprehensive and multi-layered.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distributions
Built Distribution
File details
Details for the file anonymizers-0.0.2-cp39-cp39-manylinux_2_31_x86_64.whl
.
File metadata
- Download URL: anonymizers-0.0.2-cp39-cp39-manylinux_2_31_x86_64.whl
- Upload date:
- Size: 7.8 MB
- Tags: CPython 3.9, manylinux: glibc 2.31+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.2
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 46d46e51c38a96b7ac49e95a4b1e68bfe58ff2a10031bcd518bfbce9eb31ef42 |
|
MD5 | c0807f64a2b66f09e78186cdebf82ed8 |
|
BLAKE2b-256 | 5daf4dafb6660337a0906a7cf8cd00998a54218dad177f210416f991f8b9a73f |