Compare sentences from an input document with all sentences from reference documents to find very similar ones.

Sentence Plagiarism Checker

A tool to compare sentences from an input document with all sentences from reference documents to find similar content.

Overview

A command-line tool for detecting sentence-level plagiarism. It compares an input document against one or more reference documents and reports similar sentences, using Jaccard similarity by default.

Features

Detects sentence-level plagiarism using Jaccard similarity (default) or another supported metric.
Configurable similarity threshold.
Filters sentences by minimum length.
Outputs results in text and JSON format.
Quiet mode to suppress console output.
Interactive HTML visualization of plagiarized content.

Text Splitting

The tool splits text into sentences using intelligent sentence boundary detection:

Uses a regex pattern to identify sentence endings (periods, question marks, exclamation points)
Avoids splitting abbreviations (e.g., "e.g.", "Dr.") or initials (e.g., "A.B.")
Tracks sentence positions within the original document for accurate reporting
Filters sentences by minimum length for more relevant comparisons
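
The splitting rules above can be sketched as follows. This is a minimal illustration, not the package's actual implementation: the regex patterns, the `ABBREVIATIONS` set, the `split_sentences` name, and the assumption that the minimum length is counted in characters are all hypothetical.

```python
import re
from typing import List, Tuple

# Hypothetical abbreviation list; the real tool may use a different one.
ABBREVIATIONS = {"e.g.", "i.e.", "etc.", "dr.", "mr.", "mrs.", "prof."}

# A run of sentence-ending punctuation followed by whitespace or end of text.
BOUNDARY = re.compile(r"[.!?]+(?=\s|$)")

# Initials such as "A." or "A.B." (single letters, each followed by a period).
INITIALS = re.compile(r"(?:[A-Za-z]\.)+")


def split_sentences(text: str, min_length: int = 10) -> List[Tuple[str, int, int]]:
    """Return (sentence, start, end) tuples with offsets into `text`."""
    sentences: List[Tuple[str, int, int]] = []

    def add(start: int, end: int) -> None:
        chunk = text[start:end]
        stripped = chunk.strip()
        # Assumption: min_length is measured in characters.
        if stripped and len(stripped) >= min_length:
            offset = start + len(chunk) - len(chunk.lstrip())
            sentences.append((stripped, offset, offset + len(stripped)))

    start = 0
    for match in BOUNDARY.finditer(text):
        words = text[start:match.end()].split()
        last_word = words[-1] if words else ""
        # Skip boundaries that belong to an abbreviation or to initials.
        if last_word.lower() in ABBREVIATIONS or INITIALS.fullmatch(last_word):
            continue
        add(start, match.end())
        start = match.end()
    add(start, len(text))  # trailing text without closing punctuation
    return sentences


text = "Dr. Smith met A.B. Jones, e.g. at noon. Was it late? No."
for sentence, start, end in split_sentences(text):
    print(start, end, sentence)
```

In this sketch the two longer sentences are returned with their character offsets, and the final "No." is dropped by the length filter. A sentence that really ends in a single letter (for example "Plan B.") would be misread as initials, which is the kind of edge case a production splitter has to handle separately.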

Supported Similarity Metrics

The tool supports several similarity metrics for comparing sentences:

Jaccard Similarity (default): Measures similarity based on the size of the intersection divided by the size of the union of word sets
Cosine Similarity: Measures the cosine of the angle between word frequency vectors
Jaro Similarity: String-based similarity measure that accounts for character matches and transpositions
Jaro-Winkler Similarity: An extension of Jaro similarity that gives higher weights to matches at the beginning of the strings
Overlap Similarity: Measures the size of the intersection of two word sets divided by the size of the smaller set
Sørensen-Dice Similarity: Calculates similarity as twice the number of common terms divided by the sum of the sizes of the two sets
Tversky Similarity: An asymmetric similarity measure that extends Jaccard similarity with parameters for weighting differences
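
The token-based metrics listed above can be written down in a few lines each. The sketch below uses the textbook definitions and is not taken from the package's source; the function names and the tokenization (lowercased, whitespace-separated words) are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Set


def words(sentence: str) -> Set[str]:
    # Assumption: tokens are lowercased, whitespace-separated words.
    return set(sentence.lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap(a: Set[str], b: Set[str]) -> float:
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


def sorensen_dice(a: Set[str], b: Set[str]) -> float:
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def tversky(a: Set[str], b: Set[str], alpha: float = 0.5, beta: float = 0.5) -> float:
    # alpha weights terms found only in a, beta weights terms found only in b.
    common = len(a & b)
    denominator = common + alpha * len(a - b) + beta * len(b - a)
    return common / denominator if denominator else 0.0


def cosine(s1: str, s2: str) -> float:
    # Cosine of the angle between word-frequency vectors.
    c1, c2 = Counter(s1.lower().split()), Counter(s2.lower().split())
    dot = sum(c1[w] * c2[w] for w in c1.keys() & c2.keys())
    norm = math.sqrt(sum(v * v for v in c1.values())) * math.sqrt(
        sum(v * v for v in c2.values())
    )
    return dot / norm if norm else 0.0


a, b = words("the cat sat"), words("the cat ran away")
print(jaccard(a, b), overlap(a, b), sorensen_dice(a, b))
```

For this pair the sets share two of five distinct words, so Jaccard is 0.4, overlap is 2/3, and Sørensen-Dice is 4/7. With `alpha = beta = 1` the Tversky index reduces to Jaccard, and with `alpha = beta = 0.5` it reduces to Sørensen-Dice.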

Installation

Install in an isolated environment using pipx:

```
pipx install sentence-plagiarism
```

CLI Usage

```
sentence-plagiarism <path-to-input-file> <path-to-reference-file-1> ... [--threshold <threshold-value>] [--output <path-to-output-file>] [--text_output <path-to-text-output-file>] [--quiet] [--min_length <min-length>] [--metric <metric-name>]
```

Arguments

<input_file>: Path to the input file to be checked for plagiarism.
<reference_files>: Paths to one or more reference files to compare against.
--threshold, -t: (optional) Minimum similarity score (0-1) to consider a sentence plagiarized. Default: 0.8.
--output, -o: (optional) Path to save results in JSON format. Default: results.json.
--text_output, -to: (optional) Path to save results in text format.
--quiet, -q: (optional) Suppress console output.
--min_length, -ml: (optional) Minimum sentence length to include in the comparison. Default: 10.
--metric, -m: (optional) Similarity metric to use for comparison. Options: jaccard_similarity, cosine_similarity, sorensen_dice_similarity, overlap_similarity, tversky_similarity, jaro_similarity, jaro_winkler_similarity. Default: jaccard_similarity.
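
To show how `--threshold` and `--min_length` interact, here is a rough sketch of an all-pairs comparison. It is an illustration under stated assumptions rather than the tool's code: `find_similar`, the shape of the returned dictionaries, the whitespace tokenization, and the reading of the minimum length as a character count are all hypothetical.

```python
from typing import Dict, List


def jaccard(s1: str, s2: str) -> float:
    # Assumption: tokens are lowercased, whitespace-separated words.
    a, b = set(s1.lower().split()), set(s2.lower().split())
    return len(a & b) / len(a | b) if a and b else 0.0


def find_similar(
    input_sentences: List[str],
    references: Dict[str, List[str]],
    threshold: float = 0.8,
    min_length: int = 10,
) -> List[dict]:
    """Compare every input sentence with every reference sentence."""
    matches = []
    for sentence in input_sentences:
        if len(sentence) < min_length:  # assumption: length in characters
            continue
        for doc_name, ref_sentences in references.items():
            for ref in ref_sentences:
                if len(ref) < min_length:
                    continue
                score = jaccard(sentence, ref)
                if score >= threshold:
                    matches.append(
                        {
                            "input_sentence": sentence,
                            "reference_sentence": ref,
                            "reference_document": doc_name,
                            "similarity": score,
                        }
                    )
    return matches


refs = {"ref1.txt": ["the quick brown fox jumps over the lazy cat"]}
inputs = ["The quick brown fox jumps over the lazy dog", "Short."]
print(len(find_similar(inputs, refs, threshold=0.7)))
```

The two long sentences share 7 of 9 distinct words (a score of about 0.78), so they match at a threshold of 0.7 but not at the default of 0.8, while "Short." is never compared because it is below the minimum length.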

Example

```
sentence-plagiarism input.txt ref1.txt ref2.txt --threshold 0.8 --output results.json --min_length 10 --metric jaccard_similarity
```

Visualization

The tool can create interactive HTML reports for easier plagiarism analysis.

CLI Visualization Usage

```
python -m sentence_plagiarism.plagiarism_visualizer --input <input-markdown-file> --plagiarism-data <json-results-file> --output <output-html-file>
```

Visualization Features

Color-coded highlighting of plagiarized content
Interactive filters to show/hide matches from different reference documents
Hover tooltips showing matching reference document and similarity score
Document legend for easy reference identification
Supports Markdown content with proper rendering
Opacity level indicating similarity strength (higher opacity = higher similarity) (TBD)
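
As a rough idea of how color-coded highlights with hover tooltips can be produced, the sketch below wraps a matched sentence in an HTML `<span>`. It does not reproduce the package's HTML generator: `assign_colors`, `highlight`, the evenly spaced hues, and the mapping of similarity to opacity (a feature the list above marks as TBD) are assumptions for illustration only.

```python
import colorsys
import html
from typing import Dict, List, Tuple

RGB = Tuple[int, int, int]


def assign_colors(doc_names: List[str]) -> Dict[str, RGB]:
    """Give each reference document its own hue, evenly spaced on the color wheel."""
    colors: Dict[str, RGB] = {}
    count = max(len(doc_names), 1)
    for index, name in enumerate(doc_names):
        r, g, b = colorsys.hls_to_rgb(index / count, 0.6, 0.8)
        colors[name] = (round(r * 255), round(g * 255), round(b * 255))
    return colors


def highlight(sentence: str, doc_name: str, score: float, colors: Dict[str, RGB]) -> str:
    """Wrap a sentence in a colored span whose tooltip names the source and score."""
    r, g, b = colors[doc_name]
    # Hypothetical mapping: higher similarity gives a more opaque background.
    alpha = max(0.2, min(score, 1.0))
    tooltip = html.escape(f"{doc_name}: similarity {score:.2f}", quote=True)
    return (
        f'<span style="background-color: rgba({r}, {g}, {b}, {alpha:.2f})" '
        f'title="{tooltip}">{html.escape(sentence)}</span>'
    )


colors = assign_colors(["ref1.txt", "ref2.txt"])
print(highlight("a < b in both texts", "ref1.txt", 0.85, colors))
```

Escaping both the sentence and the tooltip keeps characters such as `<` in the source text from breaking the generated markup.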

Programmatic Usage

See ./example.py.

```python
from sentence_plagiarism import check
from sentence_plagiarism.visualization.file_handlers import load_files, save_html
from sentence_plagiarism.visualization.html_generator import (
    create_html_with_highlights_md,
    generate_final_html,
)
from sentence_plagiarism.visualization.visualization_utils import (
    generate_document_colors,
)

# Basic usage: compare the examined file against the reference files
# and write the results to JSON and text files.
check(
    examined_file="tests/txt/txt1.txt",
    reference_files=["tests/txt/txt2.txt", "tests/txt/txt3.txt"],
    similarity_threshold=0.8,
    output_file="results.json",
    text_output_file="results.txt",
    quiet=False,
    min_length=10,
    similarity_metric="jaccard_similarity",
)

# Visualization: load the examined text and the JSON results written above.
markdown_content, plagiarism_matches = load_files(
    markdown_path="tests/txt/txt1.txt", json_path="results.json"
)

# Build the highlighted content using the generated document colors.
doc_colors = generate_document_colors(plagiarism_matches)
html_with_highlights = create_html_with_highlights_md(
    markdown_content, plagiarism_matches, doc_colors
)

# Assemble the final report and save it to disk.
final_html = generate_final_html(
    html_content=html_with_highlights,
    doc_colors=doc_colors,
    plagiarism_matches=plagiarism_matches,
    input_file="tests/txt/txt1.txt",
)
save_html(final_html, "plagiarism_report.html")
```

Testing

Run the test suite using:

```
pytest
```

Contributing

Fork the repository.

Clone your fork:

```
git clone https://github.com/your-username/sentence-plagiarism.git
```

Install dependencies:
```
pip install -r requirements.txt
```
Run tests:
```
pytest
```

FAQ

Why is my output empty?

Ensure that the sentences in your input and reference files meet the --min_length requirement, and that --threshold is not set higher than the similarity of the sentences you expect to match.

How do I install pipx?

Refer to the pipx documentation for installation instructions.

What are the typical use cases for the supported metrics in the task of sentence plagiarism detection?

Jaccard Similarity: Best for detecting copied sentences that reuse the same words, regardless of word order. Most effective for technical content where specific terminology must be preserved, because it focuses on the vocabulary shared between sentences.
Cosine Similarity: Ideal for longer texts where term frequency matters. It can detect plagiarism even when additional words are inserted or the sentence structure is modified, as it focuses on the angular similarity of word frequency vectors rather than exact matches.
Jaro Similarity: Well-suited for detecting typographical errors or minor spelling changes in plagiarized content. This metric is particularly effective for shorter sentences where character-level similarity is important.
Jaro-Winkler Similarity: Preferred when matching beginnings should count for more than matching endings. In the standard formulation the prefix bonus covers only the first few characters (at most four), so on full sentences it mainly rewards identical opening words. Can be useful for academic writing where introductory phrases are preserved while later portions are paraphrased.
Overlap Similarity: Best when one text is significantly shorter than the other or when dealing with partial matches. It's useful for identifying when a short, distinctive phrase from one document appears within a longer sentence in another.
Sørensen-Dice Similarity: Counts the shared terms twice relative to the combined size of both sets, so it scores partial matches higher than Jaccard does. Can be useful for medical or scientific texts where shared technical terminology is a strong indicator of plagiarism.
Tversky Similarity: Offers flexibility through asymmetric weighting, making it ideal when you want to emphasize either precision or recall. Use when checking if a student paper contains content from reference materials (high alpha) or if reference materials contain content from a student submission (high beta).
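
To make the character-level metrics above more concrete, the following sketch implements the standard textbook Jaro and Jaro-Winkler formulas. It is not the package's implementation, and the `prefix_scale` and `max_prefix` parameter names are chosen here for illustration (0.1 and 4 are the conventional values).

```python
def jaro(s1: str, s2: str) -> float:
    """Standard Jaro similarity between two strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters match only if they are equal and close enough in position.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, char in enumerate(s1):
        low = max(0, i - window)
        high = min(len2, i + window + 1)
        for j in range(low, high):
            if not matched2[j] and s2[j] == char:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2

    return (
        matches / len1 + matches / len2 + (matches - transpositions) / matches
    ) / 3


def jaro_winkler(
    s1: str, s2: str, prefix_scale: float = 0.1, max_prefix: int = 4
) -> float:
    """Jaro similarity boosted by the length of the common prefix."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1.0 - score)


print(round(jaro("MARTHA", "MARHTA"), 4), round(jaro_winkler("MARTHA", "MARHTA"), 4))
```

For the classic pair "MARTHA" and "MARHTA" this gives a Jaro score of about 0.9444 and a Jaro-Winkler score of about 0.9611, the difference coming from the shared three-character prefix.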

License

Distributed under the MIT License. See LICENSE for more information.

Contact

Krystian Safjan - ksafjan@gmail.com

Project Link: https://github.com/izikeros/sentence-plagiarism
