Create your own document de-identifier using docdeid, a simple framework independent of language or domain.
docdeid
Installation - Getting started - Features - Documentation - Development and contributing - Authors - License
Note that docdeid is still on version 0.x.x, and breaking changes might occur. If you plan to do extensive work involving docdeid, feel free to get in touch to coordinate.
Installation
Grab the latest version from PyPI:
pip install docdeid
Getting started
import re

from docdeid import DocDeid
from docdeid.tokenize import WordBoundaryTokenizer
from docdeid.process import SingleTokenLookupAnnotator, RegexpAnnotator, SimpleRedactor
deidentifier = DocDeid()
deidentifier.tokenizers["default"] = WordBoundaryTokenizer()
deidentifier.processors.add_processor(
    "name_lookup",
    SingleTokenLookupAnnotator(lookup_values=["John", "Mary"], tag="name"),
)
deidentifier.processors.add_processor(
    "name_regexp",
    RegexpAnnotator(regexp_pattern=re.compile(r"[A-Z]\w+"), tag="name"),
)
deidentifier.processors.add_processor(
    "redactor",
    SimpleRedactor(),
)
text = "John loves Mary, but Mary loves William."
doc = deidentifier.deidentify(text)
Find the relevant info in the Document object:
print(doc.annotations)
AnnotationSet({
    Annotation(text='John', start_char=0, end_char=4, tag='name', length=4),
    Annotation(text='Mary', start_char=11, end_char=15, tag='name', length=4),
    Annotation(text='Mary', start_char=21, end_char=25, tag='name', length=4),
    Annotation(text='William', start_char=32, end_char=39, tag='name', length=7)
})
print(doc.deidentified_text)
'[NAME-1] loves [NAME-2], but [NAME-2] loves [NAME-3].'
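To make the output above easier to read, here is a minimal plain-Python sketch of the two steps involved: annotating character spans with a regexp, then replacing them with numbered placeholders. It does not use docdeid and is not how docdeid is implemented; `Span`, `annotate_regexp` and `redact` are hypothetical names chosen for illustration, and the numbering rule (identical texts share a number) is an assumption inferred from the example output.

```python
import re
from typing import NamedTuple


class Span(NamedTuple):
    """Hypothetical stand-in for an annotation; not a docdeid class."""
    text: str
    start_char: int
    end_char: int
    tag: str


def annotate_regexp(text: str, pattern: re.Pattern, tag: str) -> list[Span]:
    """Tag every regexp match, keeping its character offsets."""
    return [Span(m.group(0), m.start(), m.end(), tag) for m in pattern.finditer(text)]


def redact(text: str, spans: list[Span]) -> str:
    """Replace each span with [TAG-n]; identical texts share one number."""
    ordered = sorted(spans, key=lambda s: s.start_char)
    numbering: dict[tuple[str, str], int] = {}
    counters: dict[str, int] = {}
    for span in ordered:
        key = (span.tag, span.text)
        if key not in numbering:
            counters[span.tag] = counters.get(span.tag, 0) + 1
            numbering[key] = counters[span.tag]
    # Substitute from right to left so earlier offsets stay valid.
    for span in reversed(ordered):
        placeholder = f"[{span.tag.upper()}-{numbering[(span.tag, span.text)]}]"
        text = text[: span.start_char] + placeholder + text[span.end_char:]
    return text


text = "John loves Mary, but Mary loves William."
spans = annotate_regexp(text, re.compile(r"[A-Z]\w+"), tag="name")
redacted = redact(text, spans)
print(redacted)
```

Substituting from the end of the text backwards is the key design choice here: placeholders differ in length from the text they replace, so working left to right would shift every later offset.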
Features
Additionally, docdeid features:
- Ability to create your own Annotator, AnnotationProcessor, Redactor and Tokenizer components
- Some basic re-usable components included (e.g. regexp, token lookup, token patterns)
- Callable from one interface (DocDeid.deidentify())
- String processing and filtering
- Fast lookup based on sets or tries
- Anything you add! PRs welcome.
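The "fast lookup based on sets or tries" item can be illustrated with a small plain-Python sketch. It is not docdeid's implementation and uses none of its classes; `build_trie`, `longest_match` and `find_phrases` are hypothetical helpers, and the token lists are made-up examples. A set handles single-token lookups, while a trie over token sequences handles multi-token phrases without testing every candidate phrase at every position.

```python
def build_trie(phrases: list[list[str]]) -> dict:
    """Build a nested-dict trie over token sequences."""
    root: dict = {}
    for phrase in phrases:
        node = root
        for token in phrase:
            node = node.setdefault(token, {})
        node[None] = True  # the None key marks the end of a complete phrase
    return root


def longest_match(trie: dict, tokens: list[str], start: int) -> int:
    """Return the length of the longest phrase starting at `start`, or 0."""
    node, best = trie, 0
    for offset, token in enumerate(tokens[start:], start=1):
        if token not in node:
            break
        node = node[token]
        if None in node:
            best = offset
    return best


def find_phrases(trie: dict, tokens: list[str]) -> list[tuple[int, int]]:
    """Scan left to right, returning (start, end) token indices of matches."""
    matches, i = [], 0
    while i < len(tokens):
        length = longest_match(trie, tokens, i)
        if length:
            matches.append((i, i + length))
            i += length  # skip past the matched phrase
        else:
            i += 1
    return matches


single_names = {"John", "Mary"}  # set: constant-time membership test per token
trie = build_trie([["New", "York"], ["New", "York", "City"], ["Utrecht"]])
tokens = ["Mary", "moved", "from", "New", "York", "City", "to", "Utrecht"]

name_hits = [i for i, tok in enumerate(tokens) if tok in single_names]
place_hits = find_phrases(trie, tokens)
print(name_hits, place_hits)
```

Preferring the longest match means "New York City" is reported once rather than as the shorter "New York".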
For a more in-depth tutorial, see: docs/tutorial
Documentation
For full documentation and API, see: https://docdeid.readthedocs.io/en/latest/
Development and contributing
For setting up the dev environment, see: docs/environment
For contributing, see: docs/contributing
Authors
Vincent Menger - Author, maintainer
License
This project is licensed under the MIT license - see the LICENSE.md file for details.