Dupecheck

Check for duplicate word series between generated text and a dataset

Project description
Use this to prevent "Thomas Riker" situations when generating text with machine learning systems.
At Eclectic Beams, when generating stories from finetuned GPT-2 models, I kept finding nice stories in which whole segments had been copied verbatim from the dataset text. I started battling this by running my editor's "find" feature on each line of generated text, or on random subsequences, against the dataset. It was time-consuming, eye-straining, and frustrating: each time I read a story I liked, I'd discover it had large sections copied verbatim.
Time to automate the tedious away.
Dupecheck searches a larger text dataset for substrings of your generated text of a given length. Use it to validate uniqueness before saving generated text, and you'll save yourself a huge headache.

It compares word series, ignoring punctuation.
pip install dupecheck
from dupecheck import sliding_window
...
# find duplicated word series at least 5 words long between
# your text string and the dataset string
duplicates = sliding_window.dupecheck(
    min=5,
    max=10,
    text=my_gen_text,
    dataset=my_dataset,
)
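To make the idea concrete, here is a minimal sketch of how a sliding-window duplicate check can work. This is an illustrative reimplementation, not dupecheck's actual internals; the names `normalize` and `find_duplicates` are hypothetical.

```python
import re

def normalize(text):
    # Lowercase and keep only word characters, dropping punctuation,
    # so comparisons happen on word series alone.
    return re.findall(r"[a-z0-9']+", text.lower())

def find_duplicates(text, dataset, min_len=5):
    """Return word series of at least min_len words from `text`
    that also appear verbatim in `dataset`."""
    words = normalize(text)
    # Pad with spaces so matches respect word boundaries.
    dataset_norm = " " + " ".join(normalize(dataset)) + " "
    hits = []
    i = 0
    while i <= len(words) - min_len:
        window = words[i:i + min_len]
        if f" {' '.join(window)} " in dataset_norm:
            # Grow the window while the match keeps extending.
            j = i + min_len
            while j < len(words) and f" {' '.join(words[i:j + 1])} " in dataset_norm:
                j += 1
            hits.append(" ".join(words[i:j]))
            i = j
        else:
            i += 1
    return hits
```

A naive scan like this is quadratic in the worst case; the library's sliding-window approach serves the same purpose of flagging any run of words shared verbatim with the dataset before you keep a generated story.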
Hashes for dupecheck-0.7.0-py3-none-any.whl

Algorithm | Hash digest
---|---
SHA256 | 56a323251d16e6048f83efaa28a6f2882767701bd51eb1979e0b8ae0179b18a6
MD5 | de8b1d0e9a33d2924b728449ba2e3971
BLAKE2b-256 | a7cdd83aa7dbabb1bd62a097ae51a84692aa720ec79f9f0597077ec9fd9150e5