Compare sentences from an input document with all sentences from the reference documents and find the very similar ones.
Plagiarism Checker
This is a command-line tool for checking the similarity between a given text and a set of reference documents. The tool uses the Jaccard similarity algorithm to compare the input text with the reference documents.
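For reference, the Jaccard similarity of two sets A and B is the size of their intersection divided by the size of their union. Treating A and B as the sets of words in the two sentences being compared is an assumption here; the description above does not state how sentences are tokenized.

```latex
J(A, B) = \frac{|A \cap B|}{|A \cup B|}, \qquad 0 \le J(A, B) \le 1
```

A score of 1 means the two sets are identical, and a score of 0 means they share no elements.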
Installation
Install in an isolated environment using pipx (or normal pip):
pipx install sentence-plagiarism
CLI Usage
To run the plagiarism checker, use the following command:
sentence-plagiarism <path-to-input-file> <path-to-reference-file-1> <path-to-reference-file-2> ... [--threshold <threshold-value>] [--output-file <path-to-output-file>] [--quiet]
- <path-to-input-file>: Path to the input file to be checked for plagiarism.
- <path-to-reference-file-1> ...: Paths to the reference files to compare against.
- --threshold (optional): The minimum similarity score required to consider a sentence plagiarized. The value should be between 0 and 1.
- --output-file (optional): Path to the output file where the results are saved in JSON format.
- --quiet (optional): Flag to suppress the display of similar sentences in the console.
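To illustrate how a similarity threshold can act on sentence pairs, here is a minimal sketch of the compare-and-filter idea. It is not the package's implementation: the function names (split_sentences, jaccard, find_similar), the naive sentence splitter, the lowercase word-level tokenization, and the choice to keep scores greater than or equal to the threshold are all assumptions made for this example.

```python
import re


def split_sentences(text):
    """Naively split text after '.', '!' or '?' followed by whitespace (assumption)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def jaccard(a, b):
    """Jaccard similarity of the lowercase word sets of two sentences."""
    set_a = set(re.findall(r"\w+", a.lower()))
    set_b = set(re.findall(r"\w+", b.lower()))
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def find_similar(input_text, references, threshold=0.8):
    """Return (input sentence, reference sentence, document name, score) tuples.

    `references` maps a document name to its text. All names here are
    hypothetical and are not part of the sentence-plagiarism package.
    """
    matches = []
    for sentence in split_sentences(input_text):
        for name, ref_text in references.items():
            for ref_sentence in split_sentences(ref_text):
                score = jaccard(sentence, ref_sentence)
                # Keep only pairs at or above the threshold (assumed inclusive).
                if score >= threshold:
                    matches.append((sentence, ref_sentence, name, round(score, 4)))
    return matches


matches = find_similar(
    "Cats chase mice. Dogs bark loudly at night.",
    {"ref1.txt": "Birds sing. Dogs bark loudly at night!"},
    threshold=0.8,
)
print(matches)
```

Every input sentence is compared with every reference sentence, so the cost grows with the product of the two sentence counts. Raising the threshold toward 1 keeps only near-identical sentences; lowering it admits looser matches.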
Example
The following command:
sentence-plagiarism input.txt --reference-files ref1.txt ref2.txt --similarity-threshold 0.8 --output-file results.json
can produce the following output on stdout:
Input Sentence: The retriever and seq2seq modules commence their operations as pretrained models, and through a joint fine-tuning process, they adapt collaboratively, thus enhancing both retrieval and generation for specific downstream tasks.
Reference Sentence: foobar The retriever and seq2seq modules commence their operations as pretrained models, and through a joint fine-tuning process, they adapt collaboratively, thus enhancing both retrieval and generation for specific downstream tasks.
Reference Document: ref1.txt
Similarity Score: 0.9667
Input Sentence: Closing thoughts For a comprehensive understanding of the RAG technique, we offer an in-depth exploration, commencing with a simplified overview and progressively delving into more intricate technical facets.
Reference Sentence: barfoo For a comprehensive understanding of the RAG technique, we offer an in-depth exploration, commencing with a simplified overview and progressively delving into more intricate technical facets.
Reference Document: ref2.txt
Similarity Score: 0.8966
Results saved to results.json
and save the results to results.json.
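The two scores in the example output are consistent with a word-level Jaccard computation, which the following sketch checks. The tokenization (lowercasing, then taking runs of alphanumeric characters, so that "fine-tuning" counts as two words) is an assumption chosen for this example; it agrees with the scores shown, but the tool's actual tokenization is not documented here. The helper names are hypothetical.

```python
import re


def words(sentence):
    """Return the set of lowercase alphanumeric tokens (assumed tokenization)."""
    return set(re.findall(r"\w+", sentence.lower()))


def jaccard_score(a, b):
    """Jaccard similarity of two sentences' word sets, rounded to 4 places."""
    set_a, set_b = words(a), words(b)
    return round(len(set_a & set_b) / len(set_a | set_b), 4)


shared_1 = (
    "The retriever and seq2seq modules commence their operations as "
    "pretrained models, and through a joint fine-tuning process, they adapt "
    "collaboratively, thus enhancing both retrieval and generation for "
    "specific downstream tasks."
)
shared_2 = (
    "For a comprehensive understanding of the RAG technique, we offer an "
    "in-depth exploration, commencing with a simplified overview and "
    "progressively delving into more intricate technical facets."
)

score_1 = jaccard_score(shared_1, "foobar " + shared_1)
score_2 = jaccard_score("Closing thoughts " + shared_2, "barfoo " + shared_2)
print(score_1)  # 29 shared words / 30 words in the union
print(score_2)  # 26 shared words / 29 words in the union
```

In the first pair the reference sentence only adds the word "foobar", giving 29/30. In the second pair the input adds "Closing" and "thoughts" and the reference adds "barfoo", giving 26/29.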
Programmatic Usage
from sentence_plagiarism import check

check(
    examined_file="txt/txt1.txt",
    reference_files=["txt/txt2.txt", "txt/txt3.txt"],
    similarity_threshold=0.8,
    output_file=None,
    quiet=False,
)
License
Distributed under the MIT License. See LICENSE for more information.
Contact
Krystian Safjan - ksafjan@gmail.com
Project Link: https://github.com/izikeros/sentence-plagiarism
File details
Details for the file sentence_plagiarism-0.3.0.tar.gz.
File metadata
- Download URL: sentence_plagiarism-0.3.0.tar.gz
- Upload date:
- Size: 4.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.6.1 CPython/3.11.4 Darwin/22.5.0
File hashes
Algorithm | Hash digest
---|---
SHA256 | b439e1682525355deb31ca02e3bd7977e92020ef9648b5093d278f4a5c3e778c
MD5 | 3e5ca7cf7763cb2db623462b9237eed3
BLAKE2b-256 | b27f274131b0750b9ecddd300c87f140572e1973d5910a73bc62b0279ac3beec
File details
Details for the file sentence_plagiarism-0.3.0-py3-none-any.whl.
File metadata
- Download URL: sentence_plagiarism-0.3.0-py3-none-any.whl
- Upload date:
- Size: 6.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.6.1 CPython/3.11.4 Darwin/22.5.0
File hashes
Algorithm | Hash digest
---|---
SHA256 | 54beb049d43f97f1f135d8dd17bed30dc398a9b6ae71d4df205c3e79c3263554
MD5 | a61a52e9690edbc623c9d60134dee9ea
BLAKE2b-256 | 196dbc8f3159384584523d38ec6aaca0c35baab5493fe9925b69d5a00d44acc6