A simple text deidentification tool, built on spaCy's state-of-the-art named entity recognition pipeline
pydeidentify
A simple tool for text deidentification, built on spaCy's state-of-the-art named entity recognition pipeline
Usage
View more detailed examples at https://github.com/Lucasc-99/pydeidentify
DISCLAIMER: this tool is not 100% accurate and may miss some entities.
The model is also case-sensitive, so accuracy decreases if the text is all lower-case.
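Because all-lower-case input reduces accuracy, it may help to check for it before deidentifying. The helper below is a minimal sketch using only the standard library; `is_all_lowercase` is a hypothetical name and is not part of pydeidentify.

```python
def is_all_lowercase(text: str) -> bool:
    """Return True if the text has cased letters and none are upper-case."""
    # str.islower() is False for text with no cased characters (e.g. "1234").
    return text.islower()


sample = "my name is joe biden, i'm from scranton, pennsylvania."
if is_all_lowercase(sample):
    # Hypothetical handling: warn the caller rather than silently continuing.
    print("Warning: input is all lower-case; entity recognition may miss names.")
```

What to do with such input (warn, skip, or restore casing first) is left to the caller.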
# Basic usage, see examples/long_example.py for more
from pydeidentify import Deidentifier, DeidentifiedText
# Deidentify using this Deidentifier class
d = Deidentifier()
d_text: DeidentifiedText = d.deidentify(
"""My name is Joe Biden, I'm from Scranton, Pennsylvania and I like to create python packages. I was born 12-1-1999."""
)
# View output of deidentification using DeidentifiedText class
print(d_text.original()) # My name is Joe Biden, I'm from Scranton, Pennsylvania and I like to create python packages. I was born 12-1-1999.
print(d_text) # My name is PERSON0, I'm from GPE0, GPE1 and I like to create python packages. I was born DATE0.
print(d_text.encode_mapping) # {'Joe Biden': 'PERSON0', 'Scranton': 'GPE0', 'Pennsylvania': 'GPE1', '12-1-1999': 'DATE0'}
print(d_text.decode_mapping) # {'PERSON0': 'Joe Biden', 'GPE0': 'Scranton', 'GPE1': 'Pennsylvania', 'DATE0': '12-1-1999'}
print(d_text.counts) # {'ORG': 0, 'LOC': 0, 'PERSON': 1, 'GPE': 2, 'DATE': 1, 'FAC': 0}
# Use any spaCy model that supports named entity recognition by passing its name in the spacy_model parameter
# The line below loads the Chinese counterpart of the default English model, 'en_core_web_trf'
d_chinese = Deidentifier(spacy_model="zh_core_web_trf")
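The `decode_mapping` shown in the example can, in principle, be used to put the original values back into deidentified text. The sketch below does this with plain string substitution; `restore` is a hypothetical helper, not a pydeidentify function, and the mapping and text are copied from the example output rather than produced by running the model.

```python
import re
from typing import Dict


def restore(text: str, decode_mapping: Dict[str, str]) -> str:
    """Replace each placeholder (e.g. 'PERSON0') with its original value."""
    if not decode_mapping:
        return text
    # Try longer placeholders first so 'PERSON10' is not matched as 'PERSON1'.
    keys = sorted(decode_mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: decode_mapping[m.group(0)], text)


# Values copied from the example output (assumed, not recomputed here).
decode_mapping = {
    "PERSON0": "Joe Biden",
    "GPE0": "Scranton",
    "GPE1": "Pennsylvania",
    "DATE0": "12-1-1999",
}
masked = (
    "My name is PERSON0, I'm from GPE0, GPE1 and I like to create "
    "python packages. I was born DATE0."
)
restored = restore(masked, decode_mapping)
print(restored)
```

One caveat of this approach: if the source text already contains a token that looks like a placeholder (for example a literal "GPE0"), it will be replaced as well.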
License