Make plagiarism detection easier. This package finds similar sentences between the given files and highlights them in a side-by-side comparison.
Copy Spotter
About
This program processes the pdf, txt, docx, and odt files found in the given input directory, finds similar sentences, and calculates similarity percentages. It then displays a similarity table with links to side-by-side comparisons in which similar sentences are highlighted.
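To make the idea of a similarity percentage concrete, here is a minimal sketch built on the standard library's `difflib`. It is not the package's actual algorithm: the function name `similarity_percentage` and the choice to count only shared runs of at least `block_size` words are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity_percentage(text_a: str, text_b: str, block_size: int = 2) -> float:
    """Share of words in text_a (in percent) that sit inside a run of at
    least `block_size` consecutive words also found in text_b.

    Hypothetical helper for illustration; not copy-spotter's own code.
    """
    words_a = text_a.lower().split()
    words_b = text_b.lower().split()
    if not words_a:
        return 0.0

    # Compare word lists rather than characters so blocks are counted in words.
    matcher = SequenceMatcher(None, words_a, words_b, autojunk=False)

    # Keep only matching runs that reach the minimum block size.
    matched = sum(
        block.size
        for block in matcher.get_matching_blocks()
        if block.size >= block_size
    )
    return 100.0 * matched / len(words_a)


if __name__ == "__main__":
    original = "the quick brown fox jumps"
    suspect = "a quick brown fox sleeps"
    print(similarity_percentage(original, suspect))
```

Raising `block_size` in this sketch makes it stricter, since short coincidental matches stop counting toward the percentage.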
Usage
$ pip install copy-spotter
$ copy-spotter [-s] [-o] [-h] input_directory
Positional Arguments:
input_directory: One directory that contains all files (pdf, txt, docx, odt) (see data/pdf/plagiarism for an example)
input_directory/
│
├── file_1.docx
├── file_2.pdf
└── file_3.pdf
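As a rough sketch of how such a directory could be scanned for the supported file types, the snippet below uses `pathlib`. The names `SUPPORTED_EXTENSIONS` and `list_supported_files` are hypothetical, and the package may discover files differently.

```python
from pathlib import Path

# Extensions listed in this README; the constant name is an assumption.
SUPPORTED_EXTENSIONS = {".pdf", ".txt", ".docx", ".odt"}


def list_supported_files(input_directory: str) -> list[Path]:
    """Return the supported files directly inside input_directory, sorted by name."""
    directory = Path(input_directory)
    return sorted(
        path
        for path in directory.iterdir()
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```

This sketch only looks at the top level of the directory, matching the flat layout shown above.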
Optional Arguments:
-s, --block-size: Set the minimum number of consecutive similar words to detect. (Default is 2)
-o, --out_dir: Set the output directory for HTML files. (Default is to create a new directory called results)
-h, --help: Show this message and exit.
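The options above map naturally onto an `argparse` parser. The sketch below mirrors the documented flags and the block-size default only; it is not the package's real parser, `build_parser` is a hypothetical name, and representing the default output directory as `None` is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented copy-spotter options."""
    parser = argparse.ArgumentParser(prog="copy-spotter")
    parser.add_argument(
        "input_directory",
        help="directory that contains all files (pdf, txt, docx, odt)",
    )
    parser.add_argument(
        "-s", "--block-size",
        type=int,
        default=2,
        help="minimum number of consecutive similar words to detect",
    )
    parser.add_argument(
        "-o", "--out_dir",
        default=None,  # assumption: None stands for "create a results directory"
        help="output directory for HTML files",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.input_directory, args.block_size, args.out_dir)
```

`argparse` adds `-h`/`--help` automatically, so it needs no explicit entry in the sketch.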
Examples
# Analyze documents in 'data/pdf/plagiarism', with default settings
$ copy-spotter data/pdf/plagiarism
# Analyze with custom block size and specify output directory
$ copy-spotter data/pdf/plagiarism -s 5 -o results/output
Development Setup:
# Clone this repository
$ git clone https://github.com/Wazzabeee/copy_spotter
# Go into the repository
$ cd copy_spotter
# Install requirements
$ pip install -r requirements.txt
$ pip install -r requirements_lint.txt
# Install pre-commit hooks
$ pip install pre-commit
$ pre-commit install
# Run tests
$ pip install pytest
$ pytest tests/
# Run package locally
$ python -m scripts.main [-s] [-o] [-h] input_directory
Recommendations
- Please make sure that all text files are closed before running the program.
- For best results, provide text files written in the same language.
- PDF files made from scanned images won't be processed correctly.
- Ensure you have write access when using the package.
- If a specific file is not processed correctly, feel free to contact me so that I can address the issue.
TODO
- Add more tests on existing functions
- Implement OCR with tesseract for scanned documents
- Add custom naming option for pdf files
Download files
Source Distribution
copy-spotter-0.1.16.tar.gz (12.4 kB)
Built Distribution
copy_spotter-0.1.16-py3-none-any.whl (15.2 kB)
File details
Details for the file copy-spotter-0.1.16.tar.gz.
File metadata
- Download URL: copy-spotter-0.1.16.tar.gz
- Upload date:
- Size: 12.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.11.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | d3051ebb7d5f3384b7c2279d92520f1a6f7d2e9f0a3ed8333ec3136011dacd13
MD5 | 76d85bb96e65450a3aaca18a2bd12708
BLAKE2b-256 | d8ad4d9ebc88ccd251829d1187cc5248aab12fa9192e24553905caa0b90dd9e9
File details
Details for the file copy_spotter-0.1.16-py3-none-any.whl.
File metadata
- Download URL: copy_spotter-0.1.16-py3-none-any.whl
- Upload date:
- Size: 15.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.11.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | de87eef4e86489ab4c19c08280479fd5cb66b610c3bd548ada2a0c770751e3b3
MD5 | ce5559d63b024ab397e3c2b83a9d7005
BLAKE2b-256 | 59e4ce25adf112a925b259ecdfde91f19c36c7e7b5d897af7e0751677dfb64cf
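The published digests can be checked against a downloaded file. The following is a small sketch using the standard library's `hashlib`; the helper name `sha256_of_file` is hypothetical and the local file path is whatever you saved the download as.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Comparing the returned string with the SHA256 value listed for the matching file shows whether the download is intact.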