Detect duplicated zones in (clinical) text
Project description
Duplicate Text Finder
duptextfinder is a Python library that detects duplicated zones in text. It is primarily meant to detect
copy/paste across medical documents. It is designed to be faster than Python's built-in
difflib algorithm and more robust to whitespace, newlines and other irrelevant
characters.
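For comparison, the difflib baseline mentioned above can be sketched with the standard library alone. The sample strings, the function name `difflib_duplicates` and the `min_length` threshold are illustrative assumptions and are not part of duptextfinder. The second call shows how a single newline inside a copied passage splits the match into shorter blocks.

```python
from difflib import SequenceMatcher


def difflib_duplicates(source, target, min_length=15):
    """Return (source_start, target_start, length) for blocks shared by both texts."""
    # autojunk=False disables difflib's heuristic that ignores very frequent characters
    matcher = SequenceMatcher(None, source, target, autojunk=False)
    return [
        (block.a, block.b, block.size)
        for block in matcher.get_matching_blocks()
        if block.size >= min_length
    ]


source = "Patient admitted for chest pain. No fever."
target = "History: Patient admitted for chest pain. Discharged."
same_layout = difflib_duplicates(source, target)

# The same passage, but with a newline where the source has a space
wrapped = "History: Patient admitted\nfor chest pain. Discharged."
wrapped_layout = difflib_duplicates(source, wrapped)

print(same_layout)
print(wrapped_layout)
```

Because difflib compares characters exactly, any whitespace difference ends a matching block, which is the kind of fragility the description above refers to.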
Installation
duptextfinder can be installed through pip:
pip install duptextfinder
Usage
from pathlib import Path

from duptextfinder import CharFingerprintBuilder, DuplicateFinder

# load some text files
texts = [p.read_text() for p in Path("some/dir").glob("*.txt")]

# init fingerprint and duplicate finder
fingerprintBuilder = CharFingerprintBuilder(fingerprintLength=15)
duplicateFinder = DuplicateFinder(fingerprintBuilder, minDuplicateLength=15)

# call findDuplicates() on each file
for i, text in enumerate(texts):
    # named doc_id to avoid shadowing the built-in id()
    doc_id = f"D{i}"
    duplicates = duplicateFinder.findDuplicates(doc_id, text)
    for duplicate in duplicates:
        print(
            f"sourceDoc={duplicate.sourceDocId}, "
            f"sourceStart={duplicate.sourceSpan.start}, "
            f"sourceEnd={duplicate.sourceSpan.end}, "
            f"targetStart={duplicate.targetSpan.start}, "
            f"targetEnd={duplicate.targetSpan.end}"
        )
        # span ends are used as exclusive slice bounds
        duplicated_text = text[duplicate.targetSpan.start : duplicate.targetSpan.end]
        print(duplicated_text)
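Once spans have been collected, a common follow-up is to measure how much of a document is duplicated. The helper below is a hypothetical sketch and not part of duptextfinder: it takes plain (start, end) pairs, such as the targetSpan.start and targetSpan.end values printed above, and merges overlapping spans before summing their lengths. It assumes the end offset is exclusive, as in the slice used above.

```python
def duplicated_ratio(spans, text_length):
    """Fraction of a text covered by (start, end) spans, with end exclusive."""
    if text_length == 0:
        return 0.0
    covered = 0
    current_start, current_end = None, None
    for start, end in sorted(spans):
        if current_end is None or start > current_end:
            # Disjoint span: close the previous run and start a new one
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent span: extend the current run
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered / text_length


# Two overlapping spans (0-10 and 5-20) and one separate span (30-40) in a 50-character text
ratio = duplicated_ratio([(0, 10), (5, 20), (30, 40)], 50)
print(f"{ratio:.0%} of the text is duplicated")
```

Merging first matters because the same zone can be reported several times when it was copied from more than one earlier document.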
WordFingerprintBuilder can be used instead of CharFingerprintBuilder. For
more details, refer to the docstrings of DuplicateFinder,
CharFingerprintBuilder and WordFingerprintBuilder.
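To give an intuition for character fingerprints, here is a minimal sketch of the general idea: index fixed-length windows of a normalized source text, then look up the windows of a target text. This is an illustration under simplifying assumptions and not duptextfinder's actual implementation; `normalize`, `build_index` and `find_shared_windows` are hypothetical names, and the offsets it returns refer to the normalized text rather than the original one.

```python
import re
from collections import defaultdict


def normalize(text):
    """Lowercase the text and drop whitespace, punctuation and underscores."""
    return re.sub(r"[\W_]+", "", text.lower())


def build_index(text, length):
    """Map each window of `length` normalized characters to its start offsets."""
    chars = normalize(text)
    index = defaultdict(list)
    for start in range(len(chars) - length + 1):
        index[chars[start:start + length]].append(start)
    return index


def find_shared_windows(source, target, length=15):
    """Return (source_offset, target_offset) pairs in normalized coordinates."""
    index = build_index(source, length)
    chars = normalize(target)
    return [
        (source_start, target_start)
        for target_start in range(len(chars) - length + 1)
        for source_start in index.get(chars[target_start:target_start + length], [])
    ]


source = "Patient admitted for chest pain."
target = "Note:  patient admitted\nfor chest pain, then discharged."
shared = find_shared_windows(source, target)
print(len(shared), "shared windows, first at", shared[0])
```

Because whitespace and punctuation are removed before windows are compared, the extra spaces and the newline in the target do not prevent a match. A complete implementation would also map offsets back to the original text and merge consecutive windows into spans.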
How to run tests
- Install the package in editable mode with the test and extra dependencies by running
pip install -e ".[tests, ncls, intervaltree]" in the repo directory
- Launch pytest tests/
About ncls and intervaltree
This tool can be used without any additional dependencies, but performance can be improved by using interval trees. To benefit from this, you will need to install either the ncls package or the intervaltree package.
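To check which of the two optional packages is importable in the current environment, a small standard-library sketch is shown below. The function name `available_interval_backends` is hypothetical, and the sketch makes no assumption about how duptextfinder itself picks a backend.

```python
from importlib.util import find_spec


def available_interval_backends(candidates=("ncls", "intervaltree")):
    """Return the optional interval tree packages importable in this environment."""
    return [name for name in candidates if find_spec(name) is not None]


backends = available_interval_backends()
if backends:
    print("Interval tree support available via:", ", ".join(backends))
else:
    print("No interval tree package found; install ncls or intervaltree with pip")
```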
References
- Evaluating the Impact of Text Duplications on a Corpus of More than 600,000 Clinical Narratives in a French Hospital. https://www.hal.inserm.fr/hal-02265124/
File details
Details for the file duptextfinder-0.3.0.tar.gz.
File metadata
- Download URL: duptextfinder-0.3.0.tar.gz
- Upload date:
- Size: 19.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/4.0.2 CPython/3.11.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a8ff3f3128bdc56157b2d09778e9ca2d73b093fea5419947418c07f83eaba08e |
| MD5 | d4e1d32863bedf4450d62748fed6fa7f |
| BLAKE2b-256 | 67668796a3b0156aa70584e768bd66002a219eec868c215b9c83a90e28d26c5b |
File details
Details for the file duptextfinder-0.3.0-py3-none-any.whl.
File metadata
- Download URL: duptextfinder-0.3.0-py3-none-any.whl
- Upload date:
- Size: 17.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/4.0.2 CPython/3.11.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b23072f5839a71240c43e288902024ee09a4491a64eb102619e4028f0253e37d |
| MD5 | 5fc4d746e4823e999cc8052541f37e4e |
| BLAKE2b-256 | c650c45b26f67e3301efeb9cfd11b7509f18c794439cbfffe250392380f11e1d |