Check for duplicate word series between generated text and a dataset
Project description
Dupecheck
Use this to prevent "Thomas Riker" situations when generating text with machine learning systems.
At Eclectic Beams, when generating stories from fine-tuned GPT-2 models, I kept finding nice stories in which whole segments had been copied verbatim from the dataset. At first I fought this with my editor's "find" feature, checking each line of generated text (or random subsequences) against the dataset. It was time-consuming and frustrating: every time I read a story I liked, I'd discover large sections of it were copied verbatim.
Time to automate the tedious away.
Dupecheck will search a larger text dataset for substrings of a given length from your text. Use it to validate uniqueness before saving generated text, and you'll save yourself a huge headache.
It compares word series, ignoring punctuation.
pip install dupecheck
from dupecheck import sliding_window
...
# find word series of 5 to 10 words from your text string that also appear in the dataset string
duplicates = sliding_window.dupecheck(
min=5,
max=10,
text=my_gen_text,
dataset=my_dataset,
)
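To illustrate the idea behind a check like this, here is a minimal sketch of a sliding-window duplicate search in plain Python. It is not dupecheck's actual implementation; the function and parameter names (`normalize`, `find_duplicates`, `min_len`) are invented for this example:

```python
import re

def normalize(text):
    # Strip punctuation and lowercase so only the word series compare
    return re.sub(r"[^\w\s]", " ", text).lower().split()

def find_duplicates(text, dataset, min_len=5):
    """Return min_len-word windows from `text` that appear verbatim in `dataset`."""
    words = normalize(text)
    dataset_words = normalize(dataset)
    # Collect every min_len-word window in the dataset for O(1) membership tests
    dataset_ngrams = {
        tuple(dataset_words[i:i + min_len])
        for i in range(len(dataset_words) - min_len + 1)
    }
    hits = []
    # Slide a min_len-word window over the generated text
    for i in range(len(words) - min_len + 1):
        window = tuple(words[i:i + min_len])
        if window in dataset_ngrams:
            hits.append(" ".join(window))
    return hits
```

An empty result means no word series of the given length was copied; any hits are the overlapping spans you would want to reject before saving the generated text.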
Hashes for dupecheck-0.6.3-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 77af2d9ee035d1d0aa842a8cd72c27f7912b54b324118e17beb8c64fc17f0eec
MD5 | ac458347df76515bcc38affda649c4f3
BLAKE2b-256 | 09da9db7e1354abe1addb975967df7219e02370b68ba1bfa20c4c375ad9c0323