Redact Text with HuggingFace Models

🤗 Redactions

HuggingFace Redactions (hufr) is a Python wrapper around HuggingFace token classification models that helps redact personally identifiable information (PII) from free text. This project is not associated with the official HuggingFace organization; it is a side project by an individual contributor.

Installation

To install this package, first clone the repository, then run pip install hufr/ from the directory that contains the cloned hufr folder.

Usage

The snippet below loads a specific token classification model from the HuggingFace model hub and uses it to redact a person's name:

from hufr.models import TokenClassificationTransformer
from hufr.redact import redact_text

# Hub ID (or local path) of the token classification model and its tokenizer.
model_path = "dslim/bert-base-NER"
model = TokenClassificationTransformer(
    model=model_path,
    tokenizer=model_path
)

text = "Hello! My name is Rob"
# redaction_map pairs an entity label with the placeholder that replaces it.
redact_text(
    text,
    redaction_map={'PER': '<PERSON>'},
    model=model
)

This will output:

"Hello! My name is <PERSON>"
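
To make the role of the redaction map concrete, here is a minimal sketch of how detected entity spans could be swapped for placeholders. It is not hufr's implementation: apply_redactions is a hypothetical helper, and the entity list is hand-written, assumed to follow the shape of an aggregated transformers token classification pipeline result (dicts with "entity_group", "start", and "end" character offsets).

```python
from typing import Dict, List


def apply_redactions(text: str, entities: List[dict], redaction_map: Dict[str, str]) -> str:
    """Replace each detected entity span with its placeholder.

    Hypothetical helper for illustration only. `entities` is assumed to hold
    dicts with "entity_group", "start", and "end" character offsets.
    """
    # Work right to left so earlier offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = redaction_map.get(ent["entity_group"])
        if placeholder is None:
            continue  # label not in the map: leave that span untouched
        text = text[: ent["start"]] + placeholder + text[ent["end"]:]
    return text


# Hand-written entity list standing in for real model predictions.
text = "Hello! My name is Rob"
entities = [{"entity_group": "PER", "start": 18, "end": 21}]
redacted = apply_redactions(text, entities, {"PER": "<PERSON>"})
print(redacted)
```

Replacing from the end of the string backwards avoids recomputing offsets after each substitution, since placeholders rarely have the same length as the text they replace.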

If you don't want to instantiate and supply a specific token classification model, you can rely on the repository defaults for a quick redaction:

from hufr.redact import redact_text

text = "Hello! My name is Rob"
redact_text(text)

See the constants.py module for the default model paths and the default entity-to-redaction mapping.
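
When swapping in a different model, it can help to check which entity names it emits before writing a custom redaction map. The sketch below assumes the model uses BIO-style labels (as dslim/bert-base-NER does) and that the keys of redaction_map should match the entity names without the B-/I- prefix, as 'PER' does in the example above. The entity_groups helper, the hand-written id2label dict, and the angle-bracket placeholders are all hypothetical; with a real model the label table is available as model.config.id2label in transformers.

```python
from typing import Dict, List


def entity_groups(id2label: Dict[int, str]) -> List[str]:
    """Return the sorted entity names behind a BIO-style label set."""
    groups = set()
    for label in id2label.values():
        if label == "O":
            continue  # "outside" tag: not an entity
        # Drop the "B-" / "I-" prefix if present.
        groups.add(label.split("-", 1)[-1])
    return sorted(groups)


# Hand-written stand-in for a model config's id2label table.
id2label = {
    0: "O",
    1: "B-PER", 2: "I-PER",
    3: "B-ORG", 4: "I-ORG",
    5: "B-LOC", 6: "I-LOC",
}
groups = entity_groups(id2label)

# Hypothetical placeholder choice: wrap each entity name in angle brackets.
redaction_map = {group: f"<{group}>" for group in groups}
print(redaction_map)
```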

Download files

Source Distribution

hufr-0.1.0.tar.gz (7.3 kB)

Built Distribution

hufr-0.1.0-py3-none-any.whl (8.8 kB, Python 3)
