Calculates percentage of N-Tuples from file1 found in another file2

Project description

Calculates the percentage of N-tuples from one file (file1) that are also found in another file (file2):

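from plagiarismdetector.detector import Detector

# The three *_file_path names are placeholders for paths to your own files.
# n_tuples_value is the tuple size N.
print(Detector.detect(synonyms_file_path,
                      eval_file_path,
                      source_file_path,
                      n_tuples_value=3))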

Running

python plagiarismdetector/main.py synonyms_file_path eval_file_path source_file_path 3

Assumptions & Overview

  • Requires Python 2.7.

  • The tokenizer is adapted to English text only. It uses the Penn Treebank tokenizer, which splits strings based on structures of the English language and may fail for other languages, e.g. Hindi, where sentence separators and punctuation are entirely different.

  • The module is optimized for speed and memory use. Some of the optimizations are:
    • Only the n-grams of file2 are stored; file1 n-tuples are generated but never stored.

    • Generated n-grams are not held in memory as a list; a generator yields them one at a time.

    • A dictionary is built from the file2 n-grams so that each file1 tuple can be looked up in constant time.

    • The keys of the file2 n-gram dictionary are hashes of the tuples rather than the tuples themselves, which reduces memory use.

    • Since only the percentage of file1 n-tuples found in file2 matters, file1 tuples never need to be stored. The n-grams of file2 are generated first, and the count for file1 is then computed on the fly, instead of generating all file1 n-tuples and cross-referencing them with those of file2.
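
The optimizations listed above can be sketched roughly as follows. This is a minimal illustration, not the package's actual code: the names `ngrams` and `percent_found` are hypothetical, a `set` of hashes stands in for the dictionary described above, and the inputs are assumed to be already tokenized lists of words.

```python
def ngrams(tokens, n):
    """Lazily yield each run of n consecutive tokens as a tuple."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


def percent_found(file1_tokens, file2_tokens, n=3):
    """Percentage of file1 n-tuples that also occur in file2 (hypothetical helper)."""
    # Store only the hashes of the file2 n-grams, not the tuples themselves.
    file2_hashes = {hash(gram) for gram in ngrams(file2_tokens, n)}

    total = 0
    found = 0
    # file1 n-tuples are consumed one at a time and never stored.
    for gram in ngrams(file1_tokens, n):
        total += 1
        if hash(gram) in file2_hashes:
            found += 1

    # A file shorter than n tokens has no n-tuples at all.
    return 100.0 * found / total if total else 0.0
```

One trade-off of keeping only hashes is that two different tuples with the same hash would be counted as a match; for typical text this should be rare, but it is a consequence of the design rather than an exact comparison.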

Testing

python -m unittest discover tests

Help

python plagiarismdetector/main.py -h

positional arguments

synonym_file_path

Path to file to be used for synonyms

evaluation_file_path

Path to file to be evaluated

source_file_path

Path to file to be used as source for matching

n-tuples

Number of N-tuples, Optional and Defaults to 3

optional arguments

-h, --help

show this help message and exit
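
For readers who want to reproduce a similar command line, the help text above could come from an `argparse` setup along these lines. This is only a sketch under assumptions: the function name `build_parser` and the destination name `n_tuples` are hypothetical, and the real `main.py` may be organized differently.

```python
import argparse


def build_parser():
    """Build a parser with three required paths and an optional tuple size."""
    parser = argparse.ArgumentParser(
        description="Calculates percentage of N-Tuples from file1 found in another file2")
    parser.add_argument("synonym_file_path",
                        help="Path to file to be used for synonyms")
    parser.add_argument("evaluation_file_path",
                        help="Path to file to be evaluated")
    parser.add_argument("source_file_path",
                        help="Path to file to be used as source for matching")
    # nargs="?" makes the positional argument optional; default=3 applies when omitted.
    parser.add_argument("n_tuples", metavar="n-tuples", nargs="?", type=int, default=3,
                        help="Number of N-tuples, Optional and Defaults to 3")
    return parser
```

With this sketch, `build_parser().parse_args(["syn.txt", "eval.txt", "src.txt"])` leaves `n_tuples` at its default of 3, while a fourth argument such as `"4"` overrides it.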

Example

Returns

100.0

Evaluation File

go for a run

Source File

go for a jog

N-tuples

3
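
The 100.0 result above only makes sense if "run" and "jog" are treated as synonyms. The sketch below shows one way that could work. It is an illustration under assumptions, not the package's implementation: the synonym data is passed as hypothetical in-memory groups (the real synonyms file format is not described here), text is split on whitespace instead of using the Penn Treebank tokenizer, and `match_percent` is a made-up name.

```python
def ngrams(tokens, n):
    """Lazily yield each run of n consecutive tokens as a tuple."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


def match_percent(eval_text, source_text, synonym_groups, n=3):
    """Percentage of evaluation n-tuples found in the source, modulo synonyms."""
    # Map every word in a group to the group's first word (assumed convention).
    canonical = {}
    for group in synonym_groups:
        for word in group:
            canonical[word] = group[0]

    def normalize(text):
        # Whitespace split is a simplification of real tokenization.
        return [canonical.get(word, word) for word in text.split()]

    source_grams = set(ngrams(normalize(source_text), n))
    total = 0
    found = 0
    for gram in ngrams(normalize(eval_text), n):
        total += 1
        if gram in source_grams:
            found += 1
    return 100.0 * found / total if total else 0.0


# "jog" is rewritten to "run", so both 3-tuples of the evaluation text match.
print(match_percent("go for a run", "go for a jog", [["run", "jog"]]))  # 100.0
# Without synonym groups only ("go", "for", "a") matches.
print(match_percent("go for a run", "go for a jog", []))  # 50.0
```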

Source Distribution

PlagiarismDetector-0.2.2.tar.gz (6.7 kB)

Hashes for PlagiarismDetector-0.2.2.tar.gz
Algorithm Hash digest
SHA256 0c90c9c1bfbadc5f82773f7afeaf6b7e12b6164bc7446bfa4af120834c9a580e
MD5 71714c5c981eb6e1616b87045edf31ee
BLAKE2b-256 1da09c75367c9182ecb23e05e8ca90ddf32c664d6ed645aa9200f34b806722df
