Make plagiarism detection easier. This package finds similar sentences across the given files and highlights them in a side-by-side comparison.

Copy Spotter

About

This program processes the pdf, txt, docx, and odt files found in the given input directory, finds similar sentences, calculates a similarity percentage, and displays a similarity table with links to side-by-side comparisons in which the similar sentences are highlighted.

Usage

$ pip install copy-spotter
$ copy-spotter [-s] [-o] [-h] input_directory

Positional Arguments:

  • input_directory: A single directory that contains all the files to compare (pdf, txt, docx, odt); see data/pdf/plagiarism for an example.
input_directory/
│
├── file_1.docx
├── file_2.pdf
└── file_3.pdf
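
To check which files in a directory fall under the supported formats before running the tool, a minimal sketch along these lines may help. It is not taken from the package: the function name `list_supported_files` is hypothetical, and it assumes files are matched by extension and sit directly inside the directory, as in the layout above.

```python
from pathlib import Path

# Extensions listed in this README; the package's own discovery logic may differ.
SUPPORTED_EXTENSIONS = {".pdf", ".txt", ".docx", ".odt"}


def list_supported_files(input_directory):
    """Return supported files found directly inside input_directory, sorted by path.

    Hypothetical helper for illustration only, not part of copy-spotter.
    """
    root = Path(input_directory)
    if not root.is_dir():
        raise NotADirectoryError(f"{input_directory} is not a directory")
    return sorted(
        path
        for path in root.iterdir()
        # Compare extensions case-insensitively so "REPORT.PDF" is also listed.
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS
    )


if __name__ == "__main__":
    for path in list_supported_files("."):
        print(path.name)
```

Files with other extensions are simply left out of the returned list, which makes it easy to spot documents that would need converting first.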

Optional Arguments:

  • -s, --block-size: Set the minimum number of consecutive matching words required for a passage to be detected as similar. (Default is 2)
  • -o, --out_dir: Set the output directory for the html files. (By default, a new directory called results is created)
  • -h, --help: Show this message and exit.
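
To give an intuition for what the block size controls, here is a small sketch built on Python's standard `difflib` module. It is an illustration under stated assumptions, not the package's actual implementation: the function `matching_blocks` is hypothetical, and splitting on whitespace and lowercasing are simplifications chosen for the example.

```python
from difflib import SequenceMatcher


def matching_blocks(text_a, text_b, block_size=2):
    """Return runs of at least block_size consecutive words shared by both texts.

    Hypothetical helper for illustration only, not part of copy-spotter.
    """
    words_a = text_a.lower().split()
    words_b = text_b.lower().split()
    # autojunk=False stops difflib from ignoring very frequent words in long texts.
    matcher = SequenceMatcher(None, words_a, words_b, autojunk=False)
    return [
        " ".join(words_a[match.a:match.a + match.size])
        for match in matcher.get_matching_blocks()
        if match.size >= block_size
    ]


if __name__ == "__main__":
    first = "the quick brown fox jumps over the lazy dog"
    second = "a quick brown fox sleeps near the lazy cat"
    print(matching_blocks(first, second))                # ['quick brown fox', 'the lazy']
    print(matching_blocks(first, second, block_size=3))  # ['quick brown fox']
```

A larger block size filters out short coincidental overlaps such as "the lazy", at the cost of missing short copied fragments.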

Examples

# Analyze documents in 'data/pdf/plagiarism', with default settings
$ copy-spotter data/pdf/plagiarism

# Analyze with custom block size and specify output directory
$ copy-spotter data/pdf/plagiarism -s 5 -o results/output

Development Setup:

# Clone this repository
$ git clone https://github.com/Wazzabeee/copy_spotter

# Go into the repository
$ cd copy_spotter

# Install requirements
$ pip install -r requirements.txt
$ pip install -r requirements_lint.txt

# Install pre-commit hooks
$ pip install pre-commit
$ pre-commit install

# Run tests
$ pip install pytest
$ pytest tests/

# Run package locally
$ python -m scripts.main [-s] [-o] [-h] input_directory

Recommendations

  • Please make sure that all text files are closed before running the program.
  • For best results, provide text files written in the same language.
  • Pdf files made from scanned images won't be processed correctly.
  • Make sure you have write access to the output location when using the package.
  • If a specific file is not processed correctly, feel free to contact me so that I can address the issue.
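
The write-access recommendation can be checked up front with a short standard-library sketch like the one below. The helper `can_write_to` is hypothetical and not part of the package; it assumes that being able to create a temporary file in the directory is a good enough proxy for write access.

```python
import os
import tempfile


def can_write_to(directory):
    """Return True if a temporary file can be created inside directory.

    Hypothetical helper for illustration only, not part of copy-spotter.
    The directory is created if it does not exist yet.
    """
    try:
        os.makedirs(directory, exist_ok=True)
        # The temporary file is removed automatically when the block exits.
        with tempfile.TemporaryFile(dir=directory):
            pass
    except OSError:
        return False
    return True


if __name__ == "__main__":
    print(can_write_to("results"))
```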

TODO

  • Add more tests on existing functions
  • Implement OCR with tesseract for scanned documents
  • Add custom naming option for pdf files
