
Check for duplicate word series between generated text and a dataset

Project description

Dupecheck

Analyze your generated text for passages plagiarized from your dataset

Use this to prevent "Thomas Riker" situations (unintended verbatim duplicates, after Star Trek's transporter copy of Will Riker) when generating text with machine learning systems.

At Eclectic Beams, when generating stories from fine-tuned GPT-2 models, I kept finding nice stories in which whole segments were copied verbatim from the dataset text. I started battling this by running my editor's "find" feature on each line of generated text, or on random subsequences, against the dataset. It was time-consuming, eye-straining, and frustrating: each time I read a story I liked, I'd discover it had large sections copied verbatim.

Time to automate the tedium away.

Dupecheck will search a larger text dataset for substrings of a given length for you. Use it to validate uniqueness before saving generated text, and you'll save yourself a huge headache.

It compares word series without punctuation.
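
Conceptually, this is a word-level n-gram membership test: normalize away punctuation, split both texts into words, and ask whether any run of N words from the generated text also occurs in the dataset. Here is a minimal sketch of the idea, not the library's implementation (the function name is illustrative):

import re

def find_copied_series(text, dataset, n=5):
    # Strip punctuation, lowercase, and split into words.
    norm = lambda s: re.sub(r"[^\w\s]", " ", s).lower().split()
    text_words, data_words = norm(text), norm(dataset)
    # Index every n-word series that occurs in the dataset.
    data_series = {tuple(data_words[i:i + n])
                   for i in range(len(data_words) - n + 1)}
    # Report each n-word series of the text that also appears in the dataset.
    return [" ".join(text_words[i:i + n])
            for i in range(len(text_words) - n + 1)
            if tuple(text_words[i:i + n]) in data_series]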

pip install dupecheck

from dupecheck import chunks, sliding_window, sliding_window2

Chunking method

Find word series at least 5 words long in the dataset. (Prone to false negatives from mid-pattern window splitting, since the window doesn't slide: a matching word series can be cut across two chunks and then not found, even though it appears in the dataset.)

duplicates = chunks.dupecheck(
              min=5,               # shortest word series to flag
              max=10,              # longest word series to check
              text=my_gen_text,    # generated text to vet
              dataset=my_dataset,  # source dataset to compare against
              verbose=False
            )
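
Assuming dupecheck returns a collection that is empty or falsy when nothing matched (check the return value in your version), you can gate saving on the result:

# save_story() is a hypothetical stand-in for however you persist output.
if duplicates:
    print("Rejected: segments copied from the dataset:", duplicates)
else:
    save_story(my_gen_text)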

Sliding window find, try 1

  • Find word series at least 5 words long from your text string in the dataset string. Because the window slides, this avoids the chunking method's split-pattern false negatives.
duplicates = sliding_window.dupecheck(
              min=5, 
              max=10, 
              text=my_gen_text, 
              dataset=my_dataset,
            )

Sliding window find, try 2

  • Find word series at least 5 words long from your text string in the dataset string.
  • Cleans the text and dataset for you, unless you supply pre-cleaned input and pass cleaned=True.
  • Built-in cleaning is currently VERY slow, so pre-clean with the helpers below when running repeated checks (see the sketch after the preprocessing helpers).
duplicates = sliding_window2.dupecheck(
              min=5,
              max=10,
              text=my_gen_text,
              dataset=my_dataset,
              cleaned=False,   # set True only if both inputs were pre-cleaned
              verbose=False
            )

Preprocessing helpers

from dupecheck.preprocess import pre_process_text

dataset = pre_process_text(dataset)
text = pre_process_text(text)
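
Since the built-in cleaning is slow, a practical workflow is to clean the dataset once and reuse it across many generated samples. A sketch, assuming cleaned=True tells dupecheck that both inputs have already been through pre_process_text (as the bullets above describe); generated_samples and keep() are hypothetical names for your own data and save function:

from dupecheck import sliding_window2
from dupecheck.preprocess import pre_process_text

clean_dataset = pre_process_text(my_dataset)      # clean the big text once

for sample in generated_samples:                  # hypothetical: your generated texts
    duplicates = sliding_window2.dupecheck(
                  min=5,
                  max=10,
                  text=pre_process_text(sample),  # clean each sample before checking
                  dataset=clean_dataset,
                  cleaned=True,                   # skip the slow internal cleaning pass
                  verbose=False
                )
    if not duplicates:                            # assumption: falsy when no matches
        keep(sample)                              # hypothetical: save the unique story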


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dupecheck-0.7.1.tar.gz (9.2 kB)


Built Distribution

dupecheck-0.7.1-py3-none-any.whl (10.2 kB)


File details

Details for the file dupecheck-0.7.1.tar.gz.

File metadata

  • Download URL: dupecheck-0.7.1.tar.gz
  • Size: 9.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.25.0 setuptools/47.1.0 requests-toolbelt/0.9.1 tqdm/4.51.0 CPython/3.8.5

File hashes

Hashes for dupecheck-0.7.1.tar.gz

  • SHA256: 07cf90005b87d76dd1d827f306d9d6e40df4131026f1dca1237d738e58dfb448
  • MD5: c56e184fd79d751b2867d345a4d807fb
  • BLAKE2b-256: f78db647b97b4c7ffae731e53535ed1802a154d4cb13ba7aeb320732dd3111c3

See the PyPI documentation for more details on using hashes.

File details

Details for the file dupecheck-0.7.1-py3-none-any.whl.

File metadata

  • Download URL: dupecheck-0.7.1-py3-none-any.whl
  • Size: 10.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.25.0 setuptools/47.1.0 requests-toolbelt/0.9.1 tqdm/4.51.0 CPython/3.8.5

File hashes

Hashes for dupecheck-0.7.1-py3-none-any.whl

  • SHA256: 96a196de60c343e998426645858efe29f97612ec9329255241f6eb091a585162
  • MD5: 3cf58f35fac19bdc7abcbedbd168111c
  • BLAKE2b-256: 568c5af92a372c0ab11706ccb3015ccd4adb3edafcd6678b66344ee73780d1a2

See the PyPI documentation for more details on using hashes.
