Compare each sentence of an input document with every sentence in a set of reference documents and report the ones that are very similar.

Plagiarism Checker

This is a command-line tool that checks how similar a given text is to a set of reference documents. It compares the input text with the reference documents sentence by sentence, using Jaccard similarity as the score.
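
As a rough illustration of the metric, the Jaccard similarity of two sentences is the size of the intersection of their token sets divided by the size of their union. The sketch below assumes lowercase, whitespace-separated word tokens; the `jaccard_similarity` helper is hypothetical, and the tool's own tokenization may differ.

```python
def jaccard_similarity(sentence_a: str, sentence_b: str) -> float:
    """Jaccard similarity of the word sets of two sentences (0.0 to 1.0)."""
    # Assumption: tokens are lowercase, whitespace-separated words.
    tokens_a = set(sentence_a.lower().split())
    tokens_b = set(sentence_b.lower().split())
    union = tokens_a | tokens_b
    if not union:
        # Two empty sentences: define the score as 0.0 to avoid dividing by zero.
        return 0.0
    return len(tokens_a & tokens_b) / len(union)


# 3 shared words out of 5 distinct words overall
score = jaccard_similarity("the quick brown fox", "the quick red fox")
print(round(score, 2))  # 0.6
```

Because the score depends only on which words occur, a single extra or missing word in a long sentence lowers the score only slightly.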

Installation

Install in an isolated environment using pipx (or normal pip):

pipx install sentence-plagiarism

CLI Usage

To run the plagiarism checker, use the following command:

sentence-plagiarism <path-to-input-file> <path-to-reference-file-1> <path-to-reference-file-2> ... [--threshold <threshold-value>] [--output-file <path-to-output-file>] [--quiet]
  • <path-to-input-file>: Path to the input file to be checked for plagiarism.
  • <path-to-reference-file-1> ...: Paths to the reference files to compare against.
  • --threshold (optional): Minimum similarity score for a sentence to be considered plagiarized. The value should be between 0 and 1.
  • --output-file (optional): Path to the output file where the results are saved in JSON format.
  • --quiet (optional): Flag that suppresses the display of similar sentences in the console.
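
To make the role of the threshold concrete, here is a minimal sketch of how a cutoff could select matching sentence pairs. It is not the tool's implementation: the `jaccard` and `find_similar` helpers, the whitespace tokenization, and the `>=` comparison are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of two sentences' lowercase word sets."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def find_similar(
    input_sentences: List[str],
    references: Dict[str, List[str]],
    threshold: float,
) -> List[Tuple[str, str, str, float]]:
    """Return (input sentence, reference sentence, document, score) tuples
    for every pair whose score reaches the threshold (assumed inclusive)."""
    matches = []
    for sentence in input_sentences:
        for doc_name, ref_sentences in references.items():
            for ref in ref_sentences:
                score = jaccard(sentence, ref)
                if score >= threshold:
                    matches.append((sentence, ref, doc_name, score))
    return matches


references = {"ref1.txt": ["the cat sat on the mat", "dogs bark loudly"]}
matches = find_similar(["the cat sat on a mat"], references, threshold=0.6)
for sentence, ref, doc_name, score in matches:
    print(f"{doc_name}: {score:.4f}")  # ref1.txt: 0.8333
```

Raising the threshold toward 1 keeps only near-verbatim matches, while lowering it also reports loosely related sentences.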

Example

The following command:

sentence-plagiarism input.txt --reference-files ref1.txt ref2.txt --similarity-threshold 0.8 --output-file results.json

can produce the following output on stdout:

Input Sentence:     The retriever and seq2seq modules commence their operations as pretrained models, and through a joint fine-tuning process, they adapt collaboratively, thus enhancing both retrieval and generation for specific downstream tasks.
Reference Sentence:  foobar  The retriever and seq2seq modules commence their operations as pretrained models, and through a joint fine-tuning process, they adapt collaboratively, thus enhancing both retrieval and generation for specific downstream tasks.
Reference Document: ref1.txt
Similarity Score: 0.9667

Input Sentence:      Closing thoughts  For a comprehensive understanding of the RAG technique, we offer an in-depth exploration, commencing with a simplified overview and progressively delving into more intricate technical facets.
Reference Sentence:  barfoo  For a comprehensive understanding of the RAG technique, we offer an in-depth exploration, commencing with a simplified overview and progressively delving into more intricate technical facets.
Reference Document: ref2.txt
Similarity Score: 0.8966

Results saved to results.json

The same results are also saved to results.json.

Programmatic Usage

from sentence_plagiarism import check

check(
    examined_file="txt/txt1.txt",
    reference_files=["txt/txt2.txt", "txt/txt3.txt"],
    similarity_threshold=0.8,
    output_file=None,
    quiet=False,
)

License

Distributed under the MIT License. See LICENSE for more information.

Contact

Krystian Safjan - ksafjan@gmail.com

Project Link: https://github.com/izikeros/sentence-plagiarism
