Redact Text with HuggingFace Models

Project description

🤗 Redactions

HuggingFace Redactions (hufr) redacts personally identifiable information (PII) from text using pretrained language models from the HuggingFace model repository. This package wraps token classification models to streamline PII redaction from free text. This project is not associated with the official HuggingFace organization; it is a fun side project by an individual contributor.

Installation

To install this package, run `pip install hufr`.

Usage

The snippet below loads a specific token classification model from the HuggingFace model hub and uses it to redact a name:

from hufr.models import TokenClassificationTransformer
from hufr.redact import redact_text

model_path = "dslim/bert-base-NER"
model = TokenClassificationTransformer(
    model=model_path,
    tokenizer=model_path
)

text = "Hello! My name is Rob"
redact_text(
    text,
    redaction_map={'PER': '<PERSON>'},
    model=model
)

> `"Hello! My name is <PERSON>"`

If you don't want to instantiate and supply a specific token classification model, you can rely on the package defaults for a quick redaction:

from hufr.redact import redact_text

text = "Hello! My name is Rob"
redact_text(text)

To also get the predicted entity for each word in the original text, pass return_preds=True:

from hufr.redact import redact_text

text = "Hello! My name is Rob"
redact_text(text, return_preds=True)

> "Hello! My name is <PERSON>", ['O', 'O', 'O', 'O', 'PER']
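
The returned predictions can be lined up with the words of the input to see which ones were flagged. The following is a minimal sketch, not part of hufr: `pair_words_with_labels` is a hypothetical helper, and it assumes the predictions align one-to-one with the whitespace-separated words of the original text, as the example output above suggests.

```python
def pair_words_with_labels(text, preds):
    """Pair each whitespace-separated word with its predicted label.

    Hypothetical helper for illustration; assumes one label per word.
    """
    words = text.split()
    if len(words) != len(preds):
        raise ValueError("expected exactly one prediction per word")
    return list(zip(words, preds))


text = "Hello! My name is Rob"
# Predictions copied from the example output above.
preds = ["O", "O", "O", "O", "PER"]

pairs = pair_words_with_labels(text, preds)
# Keep only the words whose label is not the "outside" tag 'O'.
flagged = [word for word, label in pairs if label != "O"]
print(flagged)  # ['Rob']
```

If the tokenization used by the library differs from a plain whitespace split, the length check raises an error instead of silently misaligning words and labels.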

By default, personally identifiable information is predicted by the dslim/bert-base-NER model, and predicted entities are mapped to redaction placeholders using the following mapping:

'PER': '<PERSON>',
'MIS': '<OTHER>',
'ORG': '<ORGANIZATION>',
'LOC': '<LOCATION>'
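
To illustrate how such a mapping turns word-level predictions into redacted text, here is a rough sketch in plain Python. It is not hufr's actual implementation: `apply_redaction_map` is a hypothetical helper, the dictionary simply copies the mapping listed above, and words are assumed to be whitespace-separated with one label each.

```python
# Entity-to-placeholder mapping, copied from the table above.
DEFAULT_REDACTION_MAP = {
    "PER": "<PERSON>",
    "MIS": "<OTHER>",
    "ORG": "<ORGANIZATION>",
    "LOC": "<LOCATION>",
}


def apply_redaction_map(words, labels, redaction_map=DEFAULT_REDACTION_MAP):
    """Replace each word whose label appears in redaction_map; keep the rest.

    Hypothetical helper for illustration only; not part of hufr.
    """
    if len(words) != len(labels):
        raise ValueError("words and labels must have the same length")
    # Labels missing from the map (such as 'O') leave the word unchanged.
    return " ".join(
        redaction_map.get(label, word) for word, label in zip(words, labels)
    )


words = "Hello! My name is Rob".split()
labels = ["O", "O", "O", "O", "PER"]
print(apply_redaction_map(words, labels))  # Hello! My name is <PERSON>
```

Passing a smaller dictionary, such as `{'PER': '<PERSON>'}`, redacts only that entity type in this sketch, which mirrors the `redaction_map` argument shown in the usage example. Note that this simplified version replaces every labeled word separately, so a multi-word entity yields one placeholder per word.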

Download files

Source Distribution

hufr-2.0.1.tar.gz (11.1 kB)

Uploaded Source

Built Distribution

hufr-2.0.1-py3-none-any.whl (14.0 kB)

Uploaded Python 3

File details

Details for the file hufr-2.0.1.tar.gz.

File metadata

  • Download URL: hufr-2.0.1.tar.gz
  • Size: 11.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.18

File hashes

Hashes for hufr-2.0.1.tar.gz
Algorithm Hash digest
SHA256 ac4b1a781db5bce0446162ba0bd94cd8cf9a4e54cdcfdd4e5a72260c689372a5
MD5 c9e7b523e602b3f25122c8abd770ed7c
BLAKE2b-256 a615dbd64cac250f4575c069481c98a839ebf83bc271f9140c27e645a0cad477
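
A downloaded file can be checked against the digests listed above with Python's standard `hashlib` module. This is a minimal sketch: `file_digests` and `matches_published` are hypothetical helper names, and the local file path in the usage comment is an assumption about where the archive was saved.

```python
import hashlib


def file_digests(path, chunk_size=65536):
    """Compute the SHA256 and BLAKE2b-256 hex digests of a file."""
    sha256 = hashlib.sha256()
    # BLAKE2b-256 is BLAKE2b with a 32-byte (256-bit) digest.
    blake2b = hashlib.blake2b(digest_size=32)
    with open(path, "rb") as fh:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            sha256.update(chunk)
            blake2b.update(chunk)
    return {"sha256": sha256.hexdigest(), "blake2b_256": blake2b.hexdigest()}


def matches_published(path, expected_sha256):
    """Return True if the file's SHA256 digest equals the published one."""
    return file_digests(path)["sha256"] == expected_sha256.strip().lower()


# Example usage (assumes the archive was saved in the current directory):
# matches_published(
#     "hufr-2.0.1.tar.gz",
#     "ac4b1a781db5bce0446162ba0bd94cd8cf9a4e54cdcfdd4e5a72260c689372a5",
# )
```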

File details

Details for the file hufr-2.0.1-py3-none-any.whl.

File metadata

  • Download URL: hufr-2.0.1-py3-none-any.whl
  • Size: 14.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.18

File hashes

Hashes for hufr-2.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 b90b52a1d14063eb97186ac4522a628a0cafe23283762d7de8fb4fd5cd870936
MD5 37a0c40e67704ee364c6c8abc9c8922c
BLAKE2b-256 cf855cc65e9777dc6e80261f638d163709455ff4a55a136e2407f3aaa5ec25a4
