Make plagiarism detection easier. This package finds similar sentences between the given files and highlights them in a side-by-side comparison.

Copy Spotter

About

This program processes the pdf, txt, and docx files found in the given input directory, finds similar sentences, calculates similarity percentages, and displays a similarity table with links to side-by-side comparisons in which the similar sentences are highlighted.
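
To give a rough idea of what a similarity percentage between two texts can mean, here is a minimal sketch that compares sets of word n-grams with the Jaccard index. This is an assumption made for illustration, not necessarily the measure Copy Spotter uses, and the function names `word_ngrams` and `jaccard_percentage` are hypothetical.

```python
import re


def word_ngrams(text: str, n: int = 2) -> set:
    """Return the set of n-word sequences found in text (lowercased)."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard_percentage(text_a: str, text_b: str, n: int = 2) -> float:
    """Share of n-grams common to both texts, as a percentage of all n-grams."""
    grams_a = word_ngrams(text_a, n)
    grams_b = word_ngrams(text_b, n)
    union = grams_a | grams_b
    if not union:
        # Neither text is long enough to contain an n-gram.
        return 0.0
    return 100.0 * len(grams_a & grams_b) / len(union)
```

Two identical texts score 100.0 and two texts with no shared n-gram score 0.0; anything in between depends on the chosen `n`.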

This project was made as part of my internship at the "Humans Interacting with Computers at University of Primorska" lab (HICUP Lab).

Usage

Usage: python -m scripts.main input_directory [OPTIONS]

  Performs a similarity analysis of all text files available in given input directory.
  Developed by Clément Delteil -> (Github: Wazzabeee)

Options:
  -block_size, -s  Set minimum number of consecutive and similar words detected. (Default is 2)
  -out_dir, -o     Set the output directory for html files. (Default is creating a new directory)
  -help, -h        Show this message and exit.
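
The `-block_size` option above sets the minimum number of consecutive similar words that counts as a match. A minimal sketch of that idea, assuming the standard library's `difflib.SequenceMatcher` applied to word lists (the helper name `matching_blocks` is hypothetical, and this is not necessarily how Copy Spotter implements it):

```python
from difflib import SequenceMatcher


def matching_blocks(text_a: str, text_b: str, block_size: int = 2) -> list:
    """Return runs of at least block_size consecutive words shared by both texts."""
    words_a, words_b = text_a.split(), text_b.split()
    # autojunk=False keeps frequent words from being ignored in long texts.
    matcher = SequenceMatcher(None, words_a, words_b, autojunk=False)
    return [
        " ".join(words_a[m.a:m.a + m.size])
        for m in matcher.get_matching_blocks()
        if m.size >= block_size  # also drops the zero-length terminator block
    ]
```

For example, `matching_blocks("the quick brown fox jumps", "a quick brown fox sleeps", 2)` returns `["quick brown fox"]`, while raising `block_size` to 4 returns an empty list.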

How to use

# Clone this repository
$ git clone https://github.com/Wazzabeee/copy_spotter

# Go into the repository
$ cd copy_spotter

# Install requirements
$ pip install -r requirements.txt

# Run the app
$ python -m scripts.main data/pdf/plagiarism -s 2

First run

On the first run you might get:

an ImportError from the pdfminer library

ImportError: cannot import name 'uint_value' from 'pdfminer.pdftypes' (C:/.../pdfminer/pdftypes.py)

To fix this, uninstall pdfminer3k and pdfminer.six:

$ pip uninstall pdfminer3k
$ pip uninstall pdfminer.six

Then install them again:

$ pip install pdfminer3k
$ pip install pdfminer.six

a TypeError from the Slate3k library

TypeError: __init__() missing 1 required positional arg 'parser' in "C:/.../slate3k/classes.py"

To fix this, modify class PDF(list): in C:/.../slate3k/classes.py. In def __init__(), change both occurrences of if PYTHON_3: to if not PYTHON_3: (lines 58 and 72).

Recommendations

Please make sure that all text files are closed before running the program.
For best results, provide text files written in the same language.
PDF files made from scanned images won't be processed correctly.
If a specific file is not processed correctly, feel free to contact me so that I can address the issue.

TODO

Add more tests
Add info in console for timing (tqdm)
Add CSS to HTML Template
Add support for other folder structures
Fix Slate3k by installing custom fork

Release history

0.1.16: May 9, 2024
0.1.15: May 4, 2024
0.1.14: May 4, 2024
0.1.13: May 4, 2024
0.1.12: May 4, 2024
0.1.9: May 4, 2024
0.1.8: May 4, 2024
0.1.7: May 4, 2024
0.1.6: May 4, 2024
0.1.5: May 4, 2024
0.1.4: May 1, 2024
0.1.3: Apr 26, 2024
0.1.2: Apr 24, 2024
0.1.1: Apr 24, 2024 (this version)
0.1.0: Apr 21, 2024
0.0.1: Apr 21, 2024

Download files

Source Distribution

copy-spotter-0.1.1.tar.gz (12.5 kB), uploaded Apr 24, 2024

Built Distribution

copy_spotter-0.1.1-py3-none-any.whl (14.8 kB), uploaded Apr 24, 2024, Python 3

Hashes for copy-spotter-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`92d709e56f4b01c9b53268ea3bf0b1d0c1d5d8bcc5b3ea9a6eaf165aefeb8841`
MD5	`a3bf2ce9f4e477e6e76c40b4ef82cf49`
BLAKE2b-256	`6c0b3ea2a703ee530376f9db51fb4e832decb0464429527b87c6d30220c38350`

Hashes for copy_spotter-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`337b006760b8e7f328037e0db6a6f589ab631b220447937624c5437b12cdccb6`
MD5	`6322888a55f1fb2f763e6b61ddfa1339`
BLAKE2b-256	`41feba311339d63b92dee9c8418ed51be6fb2dc280775b4ebee20289fd3788ab`
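
A downloaded file can be checked against the SHA256 digests listed above. The following is a minimal sketch using Python's standard `hashlib` module; the helper names `sha256_of_file` and `matches_published_hash` are hypothetical and are not part of Copy Spotter.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the hexadecimal SHA256 digest of the file at path."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published value, ignoring case."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

Calling `matches_published_hash` with the path of the downloaded wheel and the SHA256 value from the table should return `True` for an intact download.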