Create your own document de-identifier using docdeid, a simple framework independent of language or domain.
docdeid
Installation - Getting started - Features - Documentation - Development and contributing - Authors - License
Note that docdeid is still on version 0.x.x, and breaking changes might occur. If you plan to do extensive work involving docdeid, feel free to get in touch to coordinate.
Installation
Grab the latest version from PyPI:
pip install docdeid
Getting started
import re

from docdeid import DocDeid
from docdeid.tokenize import WordBoundaryTokenizer
from docdeid.process import SingleTokenLookupAnnotator, RegexpAnnotator, SimpleRedactor
deidentifier = DocDeid()
deidentifier.tokenizers["default"] = WordBoundaryTokenizer()
deidentifier.processors.add_processor(
    "name_lookup",
    SingleTokenLookupAnnotator(lookup_values=["John", "Mary"], tag="name"),
)
deidentifier.processors.add_processor(
    "name_regexp",
    RegexpAnnotator(regexp_pattern=re.compile(r"[A-Z]\w+"), tag="name"),
)
deidentifier.processors.add_processor(
    "redactor",
    SimpleRedactor(),
)
text = "John loves Mary, but Mary loves William."
doc = deidentifier.deidentify(text)
Find the relevant info in the Document object:
print(doc.annotations)
AnnotationSet({
    Annotation(text='John', start_char=0, end_char=4, tag='name', length=4),
    Annotation(text='Mary', start_char=11, end_char=15, tag='name', length=4),
    Annotation(text='Mary', start_char=21, end_char=25, tag='name', length=4),
    Annotation(text='William', start_char=32, end_char=39, tag='name', length=7)
})
print(doc.deidentified_text)
'[NAME-1] loves [NAME-2], but [NAME-2] loves [NAME-3].'
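To make the numbering in the output above easier to follow, here is a minimal pure-Python sketch of that style of redaction. It is not docdeid's implementation: the `redact` function and its `(start, end, tag)` tuples are hypothetical names used only for illustration. It assumes the annotations do not overlap, and that identical strings with the same tag share one number.

```python
def redact(text, annotations):
    """Replace each annotated span with [TAG-n]; identical strings share n.

    Hypothetical helper for illustration, not part of docdeid.
    `annotations` is a list of non-overlapping (start, end, tag) tuples.
    """
    counters = {}  # tag -> {annotated string -> assigned number}
    parts = []
    cursor = 0
    for start, end, tag in sorted(annotations):
        seen = counters.setdefault(tag, {})
        # Reuse the number if this exact string was seen before for this tag.
        number = seen.setdefault(text[start:end], len(seen) + 1)
        parts.append(text[cursor:start])
        parts.append(f"[{tag.upper()}-{number}]")
        cursor = end
    parts.append(text[cursor:])
    return "".join(parts)


text = "John loves Mary, but Mary loves William."
annotations = [(0, 4, "name"), (11, 15, "name"), (21, 25, "name"), (32, 39, "name")]
result = redact(text, annotations)
print(result)
```

The character offsets are the ones listed in the annotations above, so the sketch yields the same de-identified sentence.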
Features
Additionally, docdeid features:
- Ability to create your own Annotator, AnnotationProcessor, Redactor and Tokenizer components
- Some basic re-usable components included (e.g. regexp, token lookup, token patterns)
- Callable from one interface (DocDeid.deidentify())
- String processing and filtering
- Fast lookup based on sets or tries
- Anything you add! PRs welcome.
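As a rough illustration of why sets and tries make lookup fast, the sketch below uses only the standard library. It does not use docdeid's own classes: `TokenTrie` and its methods are hypothetical names. A set answers "is this single token a known value?" in constant time on average, while a trie keyed on tokens can find the longest multi-token phrase starting at a given position without comparing against every phrase.

```python
class TokenTrie:
    """Hypothetical token-level trie for phrase lookup (not docdeid's class)."""

    def __init__(self):
        self.children = {}   # token -> TokenTrie
        self.is_end = False  # True if a stored phrase ends at this node

    def add(self, tokens):
        node = self
        for token in tokens:
            node = node.children.setdefault(token, TokenTrie())
        node.is_end = True

    def longest_match(self, tokens, start=0):
        """Return the longest stored phrase beginning at tokens[start], or None."""
        node = self
        best = None
        for i in range(start, len(tokens)):
            node = node.children.get(tokens[i])
            if node is None:
                break
            if node.is_end:
                best = tokens[start:i + 1]
        return best


# Single-token lookup with a set.
names = {"John", "Mary"}
tokens = ["John", "left", "New", "York", "City", "for", "Boston"]
single_hits = [t for t in tokens if t in names]

# Multi-token lookup with a trie.
trie = TokenTrie()
trie.add(["New", "York"])
trie.add(["New", "York", "City"])
trie.add(["Boston"])
phrase_hits = [m for i in range(len(tokens)) if (m := trie.longest_match(tokens, i))]

print(single_hits)
print(phrase_hits)
```

The trie walk stops as soon as a token has no matching child, so the cost of each lookup depends on the phrase length rather than on the number of stored phrases.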
For a more in-depth tutorial, see: docs/tutorial
Documentation
For full documentation and API, see: https://docdeid.readthedocs.io/en/latest/
Development and contributing
For setting up a dev environment, see: docs/environment
For contributing, see: docs/contributing
Authors
Vincent Menger - Author, maintainer
License
This project is licensed under the MIT license - see the LICENSE.md file for details.