Detect duplicated zones in (clinical) text

Duplicate Text Finder

duptextfinder is a Python library for detecting duplicated zones in text. It is primarily meant to detect copy/paste across medical documents. It is intended to be faster than the algorithm in Python's built-in difflib module and more robust to whitespace, newlines and other irrelevant characters.
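For comparison, here is a minimal sketch of the difflib baseline mentioned above, using only the standard library. It is not part of duptextfinder; the helper name find_shared_spans and the sample sentences are made up for illustration. Because difflib compares raw characters, an extra space or newline inside a copied passage splits the match into smaller blocks.

```python
from difflib import SequenceMatcher


def find_shared_spans(source, target, min_length=15):
    """Return (source_start, source_end, target_start, target_end) tuples
    for every matching block of at least min_length characters.

    Hypothetical helper for illustration; not part of duptextfinder.
    """
    # autojunk=False disables difflib's "popular element" heuristic,
    # which can otherwise hide matches in longer texts.
    matcher = SequenceMatcher(None, source, target, autojunk=False)
    spans = []
    for block in matcher.get_matching_blocks():
        if block.size >= min_length:
            spans.append(
                (block.a, block.a + block.size, block.b, block.b + block.size)
            )
    return spans


source = "Patient reports chest pain since Monday. No fever."
target = "Follow-up visit. Patient reports chest pain since Monday. Stable."

for s_start, s_end, t_start, t_end in find_shared_spans(source, target):
    print(repr(target[t_start:t_end]))
```

The span tuples play the same role as the sourceSpan and targetSpan values shown in the Usage section.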

Installation

duptextfinder can be installed through pip:

pip install duptextfinder

Usage

from pathlib import Path
from duptextfinder import CharFingerprintBuilder, DuplicateFinder

# load some text files
texts = [p.read_text() for p in Path("some/dir").glob("*.txt")]

# init fingerprint and duplicate finder
fingerprintBuilder = CharFingerprintBuilder(fingerprintLength=15)
duplicateFinder = DuplicateFinder(fingerprintBuilder, minDuplicateLength=15)

# call findDuplicates() on each file
for i, text in enumerate(texts):
    # build a unique identifier for each document (avoids shadowing the built-in id())
    doc_id = f"D{i}"
    duplicates = duplicateFinder.findDuplicates(doc_id, text)
    for duplicate in duplicates:
        print(
            f"sourceDoc={duplicate.sourceDocId}, "
            f"sourceStart={duplicate.sourceSpan.start}, "
            f"sourceEnd={duplicate.sourceSpan.end}, "
            f"targetStart={duplicate.targetSpan.start}, "
            f"targetEnd={duplicate.targetSpan.end}"
        )
        duplicated_text = text[duplicate.targetSpan.start : duplicate.targetSpan.end]
        print(duplicated_text)

WordFingerprintBuilder can be used instead of CharFingerprintBuilder. For more details, refer to the docstrings of DuplicateFinder, CharFingerprintBuilder and WordFingerprintBuilder.
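To give an intuition for what a character fingerprint of a fixed length can mean, here is a conceptual sketch of the general technique: index every fixed-length substring of the documents seen so far, then look up the substrings of a new document. This is an assumption-laden illustration, not duptextfinder's implementation; the function names build_index and find_candidates are hypothetical, and a real tool would additionally normalize whitespace and merge adjacent hits into longer spans.

```python
from collections import defaultdict


def build_index(doc_id, text, length, index):
    """Record where each substring of `length` characters occurs in `text`."""
    for start in range(len(text) - length + 1):
        index[text[start:start + length]].append((doc_id, start))


def find_candidates(text, length, index):
    """Return (source_doc_id, source_start, target_start) for every substring
    of `text` that was already indexed from an earlier document."""
    hits = []
    for start in range(len(text) - length + 1):
        fingerprint = text[start:start + length]
        # .get() avoids creating empty entries in the defaultdict
        for doc_id, source_start in index.get(fingerprint, []):
            hits.append((doc_id, source_start, start))
    return hits


index = defaultdict(list)
build_index("D0", "the quick brown fox", 5, index)
hits = find_candidates("a quick brown dog", 5, index)
for doc_id, source_start, target_start in hits:
    print(doc_id, source_start, target_start)
```

Consecutive hits with the same offset between source and target positions belong to one duplicated zone; a shorter fingerprint length finds more candidate matches at the cost of more spurious ones.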

How to run tests

  1. Install the package in editable mode, with the test and optional dependencies, by running pip install -e ".[tests, ncls, intervaltree]" in the repo directory
  2. Run pytest tests/

About ncls and intervaltree

This tool can be used without any additional dependencies, but performance can be improved by using interval trees. To benefit from this, you will need to install either the ncls package or the intervaltree package.
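A quick way to check whether either optional package is importable in the current environment, using only the standard library. The helper name available_interval_backends is hypothetical and is not part of duptextfinder; the sketch only reports what is installed and says nothing about which package the library prefers.

```python
from importlib.util import find_spec


def available_interval_backends():
    """Map each optional package name to True if it can be imported."""
    return {
        name: find_spec(name) is not None
        for name in ("ncls", "intervaltree")
    }


print(available_interval_backends())
```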
