A Python module for de-identifying personally identifiable information in text

These details have not been verified by PyPI

Project links

Homepage

Development Status
- 5 - Production/Stable
Intended Audience
License
- OSI Approved :: MIT License
Natural Language
- English
Operating System
- OS Independent
Programming Language
Topic
- Security
- Text Processing

Project description

Deidentification

A Python module that removes personally identifiable information (PII) from text documents, focusing on personal names and gender-specific pronouns. This tool uses spaCy's Named Entity Recognition (NER) capabilities combined with custom pronoun handling to provide thorough text de-identification.

Key Features

- Identifies and replaces personal names using spaCy's NER
- Handles gender-specific pronouns with customizable replacements
- Supports both plain text and HTML output formats
- Applies replacements from the end of the text backward, so earlier character positions stay valid
- Repeats processing until no new entities are detected
- Configurable replacement tokens and debug output
- GPU acceleration support through spaCy

Installation

pip install text-deidentification

# or...

pip install git+https://github.com/jftuga/deidentification.git

Requirements

- Python 3.10 or higher
- spaCy's en_core_web_trf model (or another compatible model)

Download the required spaCy model:

python -m spacy download en_core_web_trf
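
spaCy models are installed as ordinary Python packages, so one way to confirm the model is present before running the tool is a standard-library lookup. This is a minimal sketch; is_model_installed is a hypothetical helper and not part of this package.

```python
import importlib.util


def is_model_installed(model_name: str) -> bool:
    """Return True if a top-level package with this name can be found."""
    # find_spec returns None when no importable package has this name.
    return importlib.util.find_spec(model_name) is not None


model = "en_core_web_trf"
if is_model_installed(model):
    print(f"{model} is installed")
else:
    print(f"{model} is missing; run: python -m spacy download {model}")
```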

Usage

Command Line Interface

The package includes a command-line tool for quick de-identification of text files:

deidentify input_file [options]
# or:
python -m deidentification.deidentify input_file [options]

Options:

- -r, --replacement TEXT: Replacement text for identified names (default: "PERSON")
- -o, --output FILE: Output file (defaults to stdout)
- -H, --html: Output in HTML format with highlighted replacements
- -d, --debug: Enable debug mode
- -t, --tokens: Save identified elements to a JSON file (filename--tokens.json)
- -x, --exclude EXCLUDE: Comma-delimited list of entities to exclude from de-identification; the delimiter can be changed with the DEIDENTIFY_EXCLUDE_DELIM environment variable
- -v, --version: Display version information

Example:

# De-identify a text file and save with HTML markup
deidentify input.txt -H -o output.html -r "[REDACTED]"

Python API Usage

from deidentification import Deidentification

# Create a deidentification instance with default settings
deidentifier = Deidentification()

# Process text
text = "John Smith went to the store. He bought some groceries."
deidentified_text = deidentifier.deidentify(text)
print(deidentified_text)
# Output: "PERSON went to the store. HE/SHE bought some groceries."

HTML Output

# Generate HTML output with highlighted replacements
html_output = deidentifier.deidentify_with_wrapped_html(text)

HTML Output Demo

[Image: deidentification HTML demo]

Custom Configuration

from deidentification import (
    Deidentification,
    DeidentificationConfig,
    DeidentificationOutputStyle,
)

config = DeidentificationConfig(
    spacy_model="en_core_web_trf",
    output_style=DeidentificationOutputStyle.HTML,
    replacement="[REDACTED]",
    excluded_entities={"Joe Smith", "Alice Jones"},
    debug=True,
)
deidentifier = Deidentification(config)

Configuration Options

The DeidentificationConfig class supports the following options:

- spacy_load (bool): Whether to load the spaCy model (default: True)
- spacy_model (str): Name of the spaCy model to use (default: "en_core_web_trf")
- output_style (DeidentificationOutputStyle): Output format, TEXT or HTML (default: TEXT)
- replacement (str): Replacement text for identified names (default: "PERSON")
- excluded_entities (set of str): Names to leave unchanged, as in the Custom Configuration example
- debug (bool): Enable debug output (default: False)

How It Works

The de-identification process follows these steps:

1. Text is normalized for consistent processing
2. spaCy processes the text to identify person entities
3. Gender-specific pronouns are identified using a predefined list
4. Entities and pronouns are sorted by their position in reverse order
5. Replacements are made from end to beginning to maintain position accuracy
6. The process repeats until no new entities are detected
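
The pronoun and sorting steps can be illustrated with a small regex-based sketch. The pronoun table here is an assumption made for illustration (only the "He" to "HE/SHE" mapping appears in the example output above); the package's own list and replacement strings may differ.

```python
import re

# Hypothetical pronoun table, not the package's actual list.
PRONOUNS = {
    "he": "HE/SHE",
    "she": "HE/SHE",
    "himself": "HIMSELF/HERSELF",
    "herself": "HIMSELF/HERSELF",
}

# Word boundaries keep "he" from matching inside words such as "the".
PRONOUN_RE = re.compile(
    r"\b(" + "|".join(sorted(PRONOUNS, key=len, reverse=True)) + r")\b",
    re.IGNORECASE,
)


def find_pronouns(text: str) -> list[tuple[int, int, str]]:
    """Return (start, end, replacement) spans, last match first."""
    spans = [
        (m.start(), m.end(), PRONOUNS[m.group(1).lower()])
        for m in PRONOUN_RE.finditer(text)
    ]
    # Reverse order by start position, matching the sorting step above.
    return sorted(spans, key=lambda span: span[0], reverse=True)


for start, end, replacement in find_pronouns("He left. She stayed."):
    print(start, end, replacement)
```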

The backward-processing strategy is key to accurate replacements, as it prevents position shifts from affecting subsequent replacements.
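
To see why working backward helps, consider a minimal sketch of the replacement step. The character spans are hard-coded here for illustration; in the real tool they would come from spaCy and the pronoun list. The function replace_backward is hypothetical and not the package's implementation.

```python
def replace_backward(text: str, spans: list[tuple[int, int, str]]) -> str:
    """Apply (start, end, replacement) spans from the end of the text to the start."""
    for start, end, replacement in sorted(spans, key=lambda s: s[0], reverse=True):
        # Only text after `start` changes length, so the offsets of
        # spans that begin earlier remain valid.
        text = text[:start] + replacement + text[end:]
    return text


text = "John Smith went to the store. He bought some groceries."
spans = [(0, 10, "PERSON"), (30, 32, "HE/SHE")]
print(replace_backward(text, spans))
# PERSON went to the store. HE/SHE bought some groceries.
```

Had the name been replaced first, the text would have shrunk by four characters and the pronoun's stored offsets (30 to 32) would no longer point at "He".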

Debug Output

When debug mode is enabled, the tool provides detailed information about:

- Identified person entities
- Found pronouns
- Replacement positions and actions
- Processing iterations

Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Release history

- 1.3.2: May 3, 2025
- 1.3.1: Mar 24, 2025
- 1.3.0: Jan 10, 2025 (this version)
- 1.2.1: Jan 4, 2025

Download files

Source Distribution

text_deidentification-1.3.0.tar.gz (16.3 kB view details)

Uploaded Jan 10, 2025 Source

Built Distribution

text_deidentification-1.3.0-py3-none-any.whl (16.4 kB view details)

Uploaded Jan 10, 2025 Python 3

File details

Details for the file text_deidentification-1.3.0.tar.gz.

File metadata

Download URL: text_deidentification-1.3.0.tar.gz
Upload date: Jan 10, 2025
Size: 16.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.0.1 CPython/3.12.7

File hashes

Hashes for text_deidentification-1.3.0.tar.gz
Algorithm	Hash digest
SHA256	`c7602ea4475af4c892e06949c62760743fd78ff6a8931e47cc31962ccf355e0b`
MD5	`1df1279cc05d3ce7da3e275b869443f6`
BLAKE2b-256	`929920a0fc180ba9a153ae6c886691eab76675397de2a992485e6c4997caca7e`

File details

Details for the file text_deidentification-1.3.0-py3-none-any.whl.

File metadata

Download URL: text_deidentification-1.3.0-py3-none-any.whl
Upload date: Jan 10, 2025
Size: 16.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.0.1 CPython/3.12.7

File hashes

Hashes for text_deidentification-1.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b85425875d57a60e08ba0e94c4fc6dcd3e3842bdeb6090cd0a8376e9b1d95f65`
MD5	`722a01ce949b7815fbf1aac3d24a8e82`
BLAKE2b-256	`419ae4a5c3db42ce494ada11133ceda5f719e8fe9ee6f989a00f40bfb470b744`
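
A downloaded file can be checked against the SHA256 digests listed above with the standard library. This is a minimal sketch; sha256_of and matches_published are hypothetical helper names, and the file path is whatever location the archive was saved to.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path: str, expected_hex: str) -> bool:
    """Compare a file's digest with a published hex digest, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, matches_published("text_deidentification-1.3.0.tar.gz", "c7602ea4475af4c892e06949c62760743fd78ff6a8931e47cc31962ccf355e0b") should return True for an unmodified copy of the source distribution.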
