Check for duplicate word series between generated text and a dataset

Project description

Dupecheck

Use this to prevent "Thomas Riker" situations when generating text with machine learning systems.

At Eclectic Beams, when generating stories from fine-tuned GPT-2 models, I kept finding nice stories in which whole segments were copied verbatim from the dataset text. I started battling this by using my editor's "find" feature to search the dataset for each line of generated text, or for random subsequences. It was time-consuming and caused much eye strain and frustration: each time I read a story I liked, I would find that large sections of it had been copied verbatim.

Time to automate the tedious away.

Dupecheck searches a larger text dataset for word series of a given length taken from your generated text. Use it to validate uniqueness before saving generated text, and you'll save yourself a huge headache.

It compares series of words and ignores punctuation.

pip install dupecheck

from dupecheck import sliding_window

...
# find word series at least 5 words long (min) and at most 10 (max)
# from your text string in the dataset string
duplicates = sliding_window.dupecheck(
    min=5,
    max=10,
    text=my_gen_text,
    dataset=my_dataset,
)
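
To illustrate the idea of a sliding-window check over word series, here is a minimal sketch in plain Python. It is not dupecheck's actual implementation. The names `normalize` and `find_duplicate_series` and the parameters `min_words` and `max_words` are hypothetical, and stripping only ASCII punctuation while keeping case is an assumption about what "without punctuation" means.

```python
import string

# Hypothetical helpers for illustration; NOT dupecheck's real code or API.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text):
    """Strip ASCII punctuation and split on whitespace (case is kept; an assumption)."""
    return text.translate(_PUNCT_TABLE).split()


def find_duplicate_series(text, dataset, min_words=5, max_words=10):
    """Return word series from `text`, min_words..max_words long, that also occur in `dataset`."""
    text_words = normalize(text)
    dataset_words = normalize(dataset)
    found = []
    for size in range(min_words, max_words + 1):
        # Collect every window of this size in the dataset for fast membership tests.
        dataset_windows = {
            tuple(dataset_words[i:i + size])
            for i in range(len(dataset_words) - size + 1)
        }
        # Slide a window of the same size across the generated text.
        for i in range(len(text_words) - size + 1):
            window = tuple(text_words[i:i + size])
            if window in dataset_windows:
                found.append(" ".join(window))
    return found


if __name__ == "__main__":
    sample_dataset = "The quick brown fox jumps over the lazy dog."
    sample_text = "A quick, brown fox jumps over me"
    print(find_duplicate_series(sample_text, sample_dataset))
```

In this sketch, overlapping matches are reported separately, so a copied run of six words shows up as two 5-word hits plus one 6-word hit. An empty result means no word series in the checked length range was found in the dataset.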

Download files

Source Distribution

dupecheck-0.6.3.tar.gz (3.4 kB)

Uploaded Source

Built Distribution

dupecheck-0.6.3-py3-none-any.whl (4.7 kB)

Uploaded Python 3
