A Python module for de-identifying personally identifiable information in text

Deidentification

A Python module that removes personally identifiable information (PII) from text documents, focusing on personal names and gender-specific pronouns. It combines spaCy's Named Entity Recognition (NER) to find names with custom pronoun handling, and replaces both with configurable tokens.

Key Features

  • Accurately identifies and replaces personal names using spaCy's NER
  • Handles gender-specific pronouns with customizable replacements
  • Supports both plain text and HTML output formats
  • Applies replacements from the end of the text backward, so earlier character positions stay valid
  • Repeats processing until no new person entities are detected
  • Configurable replacement tokens and debug output
  • GPU acceleration support through spaCy

Installation

pip install text-deidentification

# or install directly from GitHub:

pip install git+https://github.com/jftuga/deidentification.git

Requirements

  • Python 3.10 or higher
  • spaCy's en_core_web_trf model (or another compatible model)

Download the required spaCy model:

python -m spacy download en_core_web_trf

Usage

Command Line Interface

The package includes a command-line tool for quick de-identification of text files:

deidentify input_file [options]
# or:
python -m deidentification.deidentify input_file [options]

Options:

  • -r, --replacement TEXT: Specify replacement text for identified names (default: "PERSON")
  • -o, --output FILE: Output file (defaults to stdout)
  • -H, --html: Output in HTML format with highlighted replacements
  • -d, --debug: Enable debug mode
  • -t, --tokens: Save identified elements to a JSON file (filename--tokens.json)
  • -v, --version: Display version information

Example:

# De-identify a text file and save with HTML markup
deidentify input.txt -H -o output.html -r "[REDACTED]"

Python API Usage

from deidentification import Deidentification

# Create a deidentification instance with default settings
deidentifier = Deidentification()

# Process text
text = "John Smith went to the store. He bought some groceries."
deidentified_text = deidentifier.deidentify(text)
print(deidentified_text)
# Output: "PERSON went to the store. HE/SHE bought some groceries."

HTML Output

# Generate HTML output with highlighted replacements
html_output = deidentifier.deidentify_with_wrapped_html(text)
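
To give a feel for what "highlighted replacements" could mean in practice, here is a minimal standalone sketch that wraps replacement tokens in a span element. The function names, the span tag, and the `deidentified` class name are all assumptions made for illustration; the markup actually produced by deidentify_with_wrapped_html may differ.

```python
from html import escape


def wrap_replacement(token: str, css_class: str = "deidentified") -> str:
    """Wrap a replacement token in a span so CSS can highlight it.

    The tag and class name are hypothetical, not the library's markup.
    """
    return f'<span class="{css_class}">{escape(token)}</span>'


def render_html(parts: list[tuple[str, bool]]) -> str:
    """Join (text, is_replacement) pieces into one HTML fragment.

    Ordinary text is escaped; replacement tokens are escaped and wrapped.
    """
    return "".join(
        wrap_replacement(piece) if is_replacement else escape(piece)
        for piece, is_replacement in parts
    )


parts = [
    ("PERSON", True),
    (" went to the store. ", False),
    ("HE/SHE", True),
    (" bought some groceries.", False),
]
html_fragment = render_html(parts)
print(html_fragment)
```

Escaping both the surrounding text and the token keeps stray angle brackets in the input from being interpreted as markup.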

Custom Configuration

from deidentification import (
    Deidentification,
    DeidentificationConfig,
    DeidentificationOutputStyle,
)

config = DeidentificationConfig(
    spacy_model="en_core_web_trf",
    output_style=DeidentificationOutputStyle.HTML,
    replacement="[REDACTED]",
    debug=True
)
deidentifier = Deidentification(config)

Configuration Options

The DeidentificationConfig class supports the following options:

  • spacy_load (bool): Whether to load the spaCy model (default: True)
  • spacy_model (str): Name of the spaCy model to use (default: "en_core_web_trf")
  • output_style (DeidentificationOutputStyle): Output format - TEXT or HTML (default: TEXT)
  • replacement (str): Replacement text for identified names (default: "PERSON")
  • debug (bool): Enable debug output (default: False)
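
The defaults listed above can be pictured as a small dataclass. The sketch below is a hypothetical stand-in (ConfigSketch and OutputStyleSketch are illustrative names, not the library's classes) that only mirrors the documented option names and defaults, to show that overriding one option leaves the others at their defaults.

```python
from dataclasses import dataclass
from enum import Enum


class OutputStyleSketch(Enum):
    """Stand-in for the two documented output styles."""
    TEXT = "text"
    HTML = "html"


@dataclass
class ConfigSketch:
    """Illustrative mirror of the documented options; not the real class."""
    spacy_load: bool = True
    spacy_model: str = "en_core_web_trf"
    output_style: OutputStyleSketch = OutputStyleSketch.TEXT
    replacement: str = "PERSON"
    debug: bool = False


# Override a single option; every other field keeps its documented default.
config = ConfigSketch(replacement="[REDACTED]")
print(config)
```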

How It Works

The de-identification process follows these steps:

  1. Text is normalized for consistent processing
  2. spaCy processes the text to identify person entities
  3. Gender-specific pronouns are identified using a predefined list
  4. Entities and pronouns are sorted by their position in reverse order
  5. Replacements are made from end to beginning to maintain position accuracy
  6. The process repeats until no new entities are detected
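
Step 3 can be sketched with a regular expression built from a small pronoun table. This is only an illustration under stated assumptions: the table below is hypothetical and deliberately short, the "HIMSELF/HERSELF" token is invented for the example, and the module's real list and matching logic may differ.

```python
import re

# Hypothetical pronoun table; the module's actual list may be longer
# and may use different replacement tokens.
PRONOUN_REPLACEMENTS = {
    "he": "HE/SHE",
    "she": "HE/SHE",
    "himself": "HIMSELF/HERSELF",
    "herself": "HIMSELF/HERSELF",
}

# Word boundaries keep "he" from matching inside words such as "the".
PRONOUN_PATTERN = re.compile(
    r"\b(" + "|".join(sorted(PRONOUN_REPLACEMENTS, key=len, reverse=True)) + r")\b",
    re.IGNORECASE,
)


def find_pronouns(text: str) -> list[tuple[int, int, str]]:
    """Return (start, end, replacement) for each pronoun found in text."""
    return [
        (m.start(), m.end(), PRONOUN_REPLACEMENTS[m.group(1).lower()])
        for m in PRONOUN_PATTERN.finditer(text)
    ]


text = "John Smith went to the store. He bought some groceries."
spans = find_pronouns(text)
print(spans)  # [(30, 32, 'HE/SHE')]
```

Each result carries character offsets, which is what allows the spans to be sorted by position in step 4.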

The backward-processing strategy is key to accurate replacements, as it prevents position shifts from affecting subsequent replacements.
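
The idea can be shown without spaCy at all. The following is a minimal sketch, assuming each detected entity or pronoun is available as a (start, end, replacement) tuple of character offsets; replace_spans is a hypothetical helper name, not a function of this module.

```python
def replace_spans(text: str, spans: list[tuple[int, int, str]]) -> str:
    """Apply (start, end, replacement) edits from the end of the text backward.

    Editing the rightmost span first means the offsets of every span
    to its left are still valid when their turn comes.
    """
    for start, end, replacement in sorted(spans, key=lambda s: s[0], reverse=True):
        text = text[:start] + replacement + text[end:]
    return text


text = "John Smith went to the store. He bought some groceries."
spans = [
    (0, 10, "PERSON"),   # "John Smith"
    (30, 32, "HE/SHE"),  # "He"
]
result = replace_spans(text, spans)
print(result)  # PERSON went to the store. HE/SHE bought some groceries.
```

Applying the same spans front to back would corrupt the text: replacing the 10-character name with the 6-character token shifts every later offset by four, so the second edit would land in the wrong place.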

Debug Output

When debug mode is enabled, the tool provides detailed information about:

  • Identified person entities
  • Found pronouns
  • Replacement positions and actions
  • Processing iterations

Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

License

This project is licensed under the MIT License - see the LICENSE file for details.
