===========
Plagiarism Detector
===========
Calculates the percentage of N-tuples from file1 that are found in file2.

Usage:

`from plagiarismdetector.detector import Detector`

`print(Detector.dectect(synonyms_file_name, eval_file_name, source_file_name, n_tuples_value))`
Assumptions & Overview
=========
* English-language text only: strings are split using English sentence separators and
  punctuation, so this may fail for languages such as Hindi, where separators and
  punctuation are entirely different.
* Uses the Penn Treebank tokenizer to split strings into tokens.
* The module is optimized to be as fast as possible; some of the optimizations are:

  * Only n-grams for file2 are generated and stored; file1 n-tuples are generated but not stored.
  * Generated n-grams are not held in memory; a generator is used instead.
  * A dictionary is built from the file2 n-grams for constant-time lookup of file1 tuples.
  * Keys in the file2 n-gram dictionary are hashes of the tuples rather than the tuples
    themselves, to reduce space complexity.
  * Since we are only concerned with the percentage of file1 n-tuples found in file2, no
    tuples need to be stored: we first generate the n-grams for file2, then count file1
    matches on the fly instead of generating all file1 n-tuples and cross-referencing
    them with those of file2.
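The optimizations above can be sketched roughly as follows. This is a minimal illustration, not the module's actual code; the names `ngrams` and `match_percentage` are hypothetical:

```python
from collections import deque
from typing import Iterable, Iterator, Tuple

def ngrams(tokens: Iterable[str], n: int) -> Iterator[Tuple[str, ...]]:
    """Lazily yield n-grams with a sliding window; nothing is accumulated."""
    window = deque(maxlen=n)
    for tok in tokens:
        window.append(tok)
        if len(window) == n:
            yield tuple(window)

def match_percentage(eval_tokens, source_tokens, n=3):
    # Store only hashes of the source n-grams, giving O(1) membership
    # tests without keeping the tuples themselves in memory.
    source_hashes = {hash(g) for g in ngrams(source_tokens, n)}
    total = found = 0
    # Evaluation n-grams are consumed on the fly, never stored.
    for g in ngrams(eval_tokens, n):
        total += 1
        if hash(g) in source_hashes:
            found += 1
    return 100.0 * found / total if total else 0.0
```

Here the evaluation text's n-grams stream through the generator once, while only the (hashed) source n-grams are resident in memory.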
Testing
=========
`python -m unittest discover tests`
Help
=========
`python main.py -h`
positional arguments
--------------------
synonym_file_path Path to file to be used for synonyms
evaluation_file_path Path to file to be evaluated
source_file_path Path to file to be used as source for matching
n-tuples                Number of N-tuples; optional, defaults to 3
optional arguments
------------------
-h, --help show this help message and exit
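A plausible reconstruction of the argument parser behind the help text above (the names match the help output, but the actual `main.py` may differ):

```python
import argparse

# Hypothetical sketch of main.py's CLI, inferred from the help text.
parser = argparse.ArgumentParser(description="Plagiarism Detector")
parser.add_argument("synonym_file_path", help="Path to file to be used for synonyms")
parser.add_argument("evaluation_file_path", help="Path to file to be evaluated")
parser.add_argument("source_file_path", help="Path to file to be used as source for matching")
parser.add_argument("n_tuples", nargs="?", type=int, default=3,
                    help="Number of N-tuples; optional, defaults to 3")

# The last positional is optional, so three paths alone are accepted.
args = parser.parse_args(["syn.txt", "eval.txt", "src.txt"])
```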
Example
=========
Synonyms          run jog spring
Evaluation File   go for a run
Source File       go for a jog
N-tuples          3

Returns 100.0
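Why this example returns 100.0 can be sketched as below. The synonym-normalization scheme here is an assumption for illustration, not the package's actual code:

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as a set of tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

# Assumed scheme: each synonym line maps every word in the group to one
# canonical representative, so "run" and "jog" compare as equal.
synonyms = [["run", "jog", "spring"]]
canon = {word: group[0] for group in synonyms for word in group}
normalize = lambda toks: [canon.get(t, t) for t in toks]

n = 3
eval_grams = ngrams(normalize("go for a run".split()), n)
source_grams = ngrams(normalize("go for a jog".split()), n)

# Both files normalize to "go for a run", so both of the evaluation
# file's 3-grams appear in the source: 2 of 2 matches.
pct = 100.0 * len(eval_grams & source_grams) / len(eval_grams)
```

After normalization both texts produce the same two 3-grams, `(go, for, a)` and `(for, a, run)`, hence the 100.0 result.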