Check for duplicate word series between generated text and a dataset
Project description
Dupecheck
Use this to prevent "Thomas Riker" situations when generating text with machine learning systems.
At Eclectic Beams, when generating stories from fine-tuned GPT-2 models, I kept finding nice stories in which whole segments had been copied verbatim from the dataset. At first I fought this with my editor's "find" feature, checking each line of generated text (or random subsequences) against the dataset. It was time-consuming and frustrating: every time I read a story I liked, I'd discover large sections of it were copied verbatim.
Time to automate the tedious away.
Dupecheck will search a larger text dataset for substrings of a given length from your text. Use it to validate uniqueness before saving generated text, and you'll save yourself a huge headache.
It compares word series, ignoring punctuation.
pip install dupecheck
from dupecheck import sliding_window
...
# find word series of 5 to 10 words from your text string that also appear in the dataset string
duplicates = sliding_window.dupecheck(
min=5,
max=10,
text=my_gen_text,
dataset=my_dataset,
)
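To illustrate the idea behind a check like this, here is a minimal sketch of a sliding-window duplicate search in plain Python. It is not dupecheck's actual implementation; the function and parameter names (`normalize`, `find_duplicates`, `min_len`) are invented for this example:

```python
import re

def normalize(text):
    # Strip punctuation and lowercase so only the word series compare
    return re.sub(r"[^\w\s]", " ", text).lower().split()

def find_duplicates(text, dataset, min_len=5):
    """Return min_len-word windows from `text` that appear verbatim in `dataset`."""
    words = normalize(text)
    dataset_words = normalize(dataset)
    # Collect every min_len-word window in the dataset for O(1) membership tests
    dataset_ngrams = {
        tuple(dataset_words[i:i + min_len])
        for i in range(len(dataset_words) - min_len + 1)
    }
    hits = []
    # Slide a min_len-word window over the generated text
    for i in range(len(words) - min_len + 1):
        window = tuple(words[i:i + min_len])
        if window in dataset_ngrams:
            hits.append(" ".join(window))
    return hits
```

An empty result means no word series of the given length was copied; any hits are the overlapping spans you would want to reject before saving the generated text.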
Hashes for dupecheck-0.6.3-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 77af2d9ee035d1d0aa842a8cd72c27f7912b54b324118e17beb8c64fc17f0eec
MD5 | ac458347df76515bcc38affda649c4f3
BLAKE2b-256 | 09da9db7e1354abe1addb975967df7219e02370b68ba1bfa20c4c375ad9c0323