mousse: plagiarism detection using MOSS with post-processing
mousse is a command-line tool to submit programming assignments to the MOSS plagiarism checker.
Beyond submitting, it also performs the following:
- download the report generated by MOSS
- extract amounts of copied lines from it
- compute distances between pairs of projects
- generate heatmaps to visually identify clusters of similar projects
The generated heatmaps are clickable, allowing one to open the MOSS report for a pair of projects directly. Before submission, the code is lightly cleaned to make reports easier to read, in particular:
- comments are removed
- multiple files are grouped into a unique file
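The two cleaning steps above can be illustrated with a minimal sketch. This is not mousse's actual implementation: the function names strip_comments and merge_project are hypothetical, and the comment removal is deliberately naive (it assumes comment markers never appear inside string literals).

```python
import re
from pathlib import Path


def strip_comments(text, markers):
    """Remove comments described by `markers`.

    A string is a single-line comment marker, a pair is an opening and a
    closing marker. Naive: string literals are NOT protected (assumption).
    """
    for marker in markers:
        if isinstance(marker, str):
            # single-line comment: from the marker up to the end of the line
            pattern = re.escape(marker) + r"[^\n]*"
        else:
            # enclosed comment: from opening to closing marker, across lines
            opening, closing = marker
            pattern = re.escape(opening) + r".*?" + re.escape(closing)
        text = re.sub(pattern, "", text, flags=re.DOTALL)
    return text


def merge_project(project_dir, patterns, markers):
    """Concatenate the cleaned content of all matching files of one project."""
    parts = []
    for pattern in patterns:
        for path in sorted(Path(project_dir).rglob(pattern)):
            source = path.read_text(errors="replace")
            parts.append(strip_comments(source, markers))
    return "\n".join(parts)
```

For a C project, one could call merge_project("projects/student_1", ["*.c", "*.h"], ["//", ("/*", "*/")]) and write the result to a single file.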
mousse may also be used as a Python library to use its features separately.
(By the way, mousse is the French word for moss.)
Installation
Run pip install pymousse.
mousse depends on:
- Python 3 (developed with 3.9, may work with earlier versions)
- requests
- beautifulsoup4
- chardet
- tqdm
- mosspy
- thefuzz
- pandas
- numpy
- scipy
- seaborn
- matplotlib
(By the way, pymousse is pronounced in French like Pimousse, a famous brand of candies, and as the name mousse was already used on PyPI, it was natural to name the project like this.)
Using mousse
Overview
mousse operates on a single directory root (specified on the command line) that should be organised as follows:
root/
+- projects/
| +- student_1/
| +- student_2/
| +- ...
+- base/
Directory projects (specified with option -s/--source) holds one directory for each student, in which the source code to be sent to MOSS is searched for.
Directory base (specified with -b/--base) is optional, and if used must contain the code that has been provided to students and will not be taken into account for plagiarism detection.
Assuming the current working directory is where root is located, mousse can be run as python -m mousse -b base root, or, if no base is provided, just as python -m mousse root.
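When mousse is driven from a script, the layout can be checked and the command line assembled beforehand. The sketch below is only an illustration: the helpers check_layout and mousse_command are hypothetical names, and only the options documented in this README are used.

```python
import sys
from pathlib import Path


def check_layout(root, source="projects", base=None):
    """Return the student directory names, raising if the layout is wrong."""
    root = Path(root)
    projects = root / source
    if not projects.is_dir():
        raise FileNotFoundError(f"missing directory: {projects}")
    if base is not None and not (root / base).is_dir():
        raise FileNotFoundError(f"missing directory: {root / base}")
    return sorted(p.name for p in projects.iterdir() if p.is_dir())


def mousse_command(root, base=None, source=None):
    """Build the argument list equivalent to 'python -m mousse [-s ..] [-b ..] root'."""
    cmd = [sys.executable, "-m", "mousse"]
    if source is not None:
        cmd += ["-s", source]
    if base is not None:
        cmd += ["-b", base]
    cmd.append(str(root))
    return cmd
```

The resulting list can be passed to subprocess.run(..., check=True), provided pymousse is installed and a MOSS user id is available (through option -u or the configuration file).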
After running mousse, directory root has been populated with new content:
root/
+- projects/ (untouched)
+- base/ (untouched)
+- moss/
| +- index.html
| +- match-0.html
| +- ...
+- dists.csv
+- dists.pdf
+- dists-pruned.pdf
Directory moss is where the MOSS report has been downloaded.
File dists.csv is a CSV file with the distances computed from MOSS' matches: the more source code two projects have in common, the closer they are with respect to this distance.
File dists.pdf is the heatmap computed to visually show the distances and cluster similar projects, and dists-pruned.pdf is a reduced version of the heatmap where only the most significant projects have been kept.
A heatmap is a matrix showing the distance between every pair of projects. It is clustered in such a way that close projects (i.e., those that share more source code) are displayed next to each other, forming visual clusters of similar projects. Each pair of projects is displayed as a square whose colour scales from red (very similar) to blue (very distinct). Each such square is clickable, which opens the corresponding page in the MOSS report directly. Above the heatmap (and on its right), a dendrogram is displayed to show how the clustering is organised, just like in phylogenetic trees. Note that the diagonal of such a heatmap is red by definition because it compares a project with itself.
Command line options
mousse supports the following options:
- -u USERID, --userid USERID: use USERID to authenticate with MOSS
- -c CONFIG, --config CONFIG: path of configuration file (see below)
- -s SOURCE, --source SOURCE: (default: projects) the directory within root where projects are located
- -b BASE, --base BASE: (default: none) the directory within root where base source code is located
- -r REPORT, --report REPORT: (default: moss) the directory where the downloaded MOSS report is saved
- -R, --raw: use raw distances without cleaning (see below)
- -p PRUNE, --prune PRUNE: (default: 0.5) how full heatmap is pruned (see below)
- -a, --absolute: (default: relative) use absolute scale for heatmap coloring (see below)
- -l LANG, --lang LANG: (default: c) programming language used for projects
- -L, --list-lang: list supported programming languages and exit
Then, unless -L was used, mousse expects a path for root.
Finally, any remaining option has to be formatted as KEY=VALUE and can be used to control heatmap rendering:
- lnk_OPT=VALUE: OPT=VALUE is passed to scipy.cluster.hierarchy.linkage when computing the clustering
- sns_OPT=VALUE: OPT=VALUE is passed to seaborn.clustermap when drawing the heatmap
- plt_OPT=VALUE: OPT=VALUE is passed to matplotlib.pylab.savefig when saving the heatmap
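To make the prefix convention concrete, here is a minimal sketch of how such KEY=VALUE arguments could be routed to three option dictionaries. The function split_render_options is a hypothetical name, not part of mousse, and values are kept as plain strings because how mousse converts them is not documented here.

```python
def split_render_options(args):
    """Route KEY=VALUE arguments to linkage, clustermap, and savefig options."""
    targets = {"lnk_": {}, "sns_": {}, "plt_": {}}
    for arg in args:
        key, sep, value = arg.partition("=")
        if not sep:
            raise ValueError(f"expected KEY=VALUE, got {arg!r}")
        for prefix, options in targets.items():
            if key.startswith(prefix):
                # drop the prefix, keep the option name expected by the library
                options[key[len(prefix):]] = value
                break
        else:
            raise ValueError(f"unknown option prefix in {arg!r}")
    return targets["lnk_"], targets["sns_"], targets["plt_"]
```

For instance, the arguments lnk_method=average sns_cmap=coolwarm plt_dpi=300 would yield {"method": "average"}, {"cmap": "coolwarm"}, and {"dpi": "300"} respectively.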
Configuration file
A configuration file ~/.config/mousse.ini can be used to store the MOSS user id, which avoids having to pass it with option -u every time:
[moss]
userid = 123456789
Additionally, this file may contain definitions for new languages (or redefinition of existing languages), for instance C is defined as:
[lang:c]
suffix = .c
search = *.c *.h
comment = ["//", ("/*", "*/")]
moss_lang = c
In this definition:
- lang:c means we (re)define a language called c
- suffix is the extension to use when creating a single file gathering all the source code from one project
- search is the list of file patterns searched in a project directory; all other files are ignored and not submitted to MOSS
- comment is a Python list of either strings or pairs of strings that define the comments supported by the programming language. A string defines a single-line comment starting with the given marker; a pair defines a single- or multi-line comment enclosed between the two given markers. If the language has only single-line comments with just one marker, it may be given directly instead of a Python list, e.g., for Python one may use comment = #
- moss_lang is the language name as MOSS knows it
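The format described above can be read with the standard library alone. The following is a sketch under assumptions, not mousse's own loader: the functions parse_comment and load_config are hypothetical, and the [lang:python] section is an illustrative definition that does not come from this README (only its comment = # line does).

```python
import ast
import configparser

EXAMPLE = """
[moss]
userid = 123456789

[lang:c]
suffix = .c
search = *.c *.h
comment = ["//", ("/*", "*/")]
moss_lang = c

[lang:python]
suffix = .py
search = *.py
comment = #
moss_lang = python
"""


def parse_comment(value):
    """Return a list of markers from either a Python list or a bare marker."""
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        # not a Python literal: treat the raw text as one single-line marker
        return [value]
    if isinstance(parsed, str):
        return [parsed]
    return list(parsed)


def load_config(text):
    """Return (userid, languages) from the text of a mousse.ini-like file."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    userid = parser.get("moss", "userid", fallback=None)
    languages = {}
    for section in parser.sections():
        if section.startswith("lang:"):
            spec = parser[section]
            languages[section[len("lang:"):]] = {
                "suffix": spec["suffix"],
                "search": spec["search"].split(),
                "comment": parse_comment(spec["comment"]),
                "moss_lang": spec["moss_lang"],
            }
    return userid, languages
```

Using ast.literal_eval rather than eval keeps the comment entry restricted to plain Python literals, which is a design choice of this sketch.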
Cleaning and pruning heatmaps, absolute/relative colour scale
When distances between projects are computed from MOSS reports, a heuristic is used to filter out matches that are not relevant.
For each pair of projects P1, P2, and for each match M1, M2 (i.e., two fragments of source code that MOSS reported as similar), a score is computed as:
- the length of the match (maximum of the length of M1 and M2)
- divided by the Levenshtein distance between M1 and M2
- multiplied by the number of lines copied between P1 and P2 as reported by MOSS
If all the matches for P1, P2 score below the average, then this pair of projects is assigned distance 1 (maximal distance). In practice, this hides pairs of projects with only irrelevant matches, that is, following how scores are computed:
- short matches
- with large edit distances
- from projects with few similarities
As this is only a heuristic, it may be disabled using option -R, in which case the distance as directly computed from MOSS matches is used.
Note that even if the heuristic is applied, the heatmap remains clickable and MOSS reports can be looked at, but the corresponding square is just drawn blue.
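The scoring heuristic described above can be sketched in pure Python. This is an illustration under several assumptions, not mousse's code: lengths are counted in characters, an edit distance of 0 is clamped to 1 to avoid dividing by zero, and "the average" is taken over the scores of all matches of all pairs. The names levenshtein, match_score, and relevant_pairs are hypothetical.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


def match_score(m1, m2, copied_lines):
    """Match length / edit distance * lines copied between the two projects."""
    length = max(len(m1), len(m2))
    # ASSUMPTION: identical fragments (distance 0) are clamped to distance 1
    distance = max(levenshtein(m1, m2), 1)
    return length / distance * copied_lines


def relevant_pairs(matches):
    """matches: {(p1, p2): (copied_lines, [(m1, m2), ...])}.

    Return the pairs with at least one match scoring at or above the
    average; the other pairs would be assigned distance 1.
    """
    scores = {
        pair: [match_score(m1, m2, copied) for m1, m2 in fragments]
        for pair, (copied, fragments) in matches.items()
    }
    all_scores = [s for pair_scores in scores.values() for s in pair_scores]
    if not all_scores:
        return set()
    # ASSUMPTION: the average is global, over every match of every pair
    average = sum(all_scores) / len(all_scores)
    return {pair for pair, pair_scores in scores.items()
            if any(s >= average for s in pair_scores)}
```

A long, nearly identical match between two projects with many copied lines scores high, while a short match with a large edit distance scores low, which is the intended filtering effect.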
Using option -p PRUNE allows one to control how the full heatmap can be pruned in order to keep only the pairs of most similar projects.
First, PRUNE < 0 cancels pruning: only the full heatmap is computed.
Then, if PRUNE > 1, at most PRUNE projects are kept, all the others are removed.
This is performed by descending the dendrogram on the heatmap progressively and removing isolated projects until we reach the expected count.
Finally, if 0 <= PRUNE <= 1, projects whose distance to any other is more than MEAN - STDEV * PRUNE are discarded, where MEAN is the average distance between every pair of projects and STDEV is the standard deviation.
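The threshold rule for 0 <= PRUNE <= 1 can be written as a short sketch. It rests on two assumptions that the text does not settle: a project is discarded only when all its distances to other projects exceed the threshold, and STDEV is the population standard deviation. The function prune_by_threshold is a hypothetical name.

```python
from statistics import mean, pstdev


def prune_by_threshold(dists, prune):
    """dists: {frozenset({p1, p2}): distance}; return the projects to keep."""
    if not 0 <= prune <= 1:
        raise ValueError("this sketch only covers 0 <= PRUNE <= 1")
    values = list(dists.values())
    # threshold = MEAN - STDEV * PRUNE (population STDEV is an assumption)
    threshold = mean(values) - pstdev(values) * prune
    keep = set()
    for pair, distance in dists.items():
        # a project is kept as soon as one of its distances is close enough
        if distance <= threshold:
            keep |= pair
    return keep
```

With this reading, a larger PRUNE lowers the threshold and therefore keeps fewer projects.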
Heatmaps are coloured by default with a relative colour scale, which means that the top of the dendrogram is blue and the bottom red.
When option -a is used, blue is for distance 1 and red is for 0, regardless of how the distances are actually distributed.
Relative heatmaps exhibit more contrasted colours but tend to suggest closer proximity between projects.
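The difference between the two scales can be illustrated numerically. The sketch below assumes that the relative scale stretches colours over the range of observed distances while the absolute scale always spans 0 to 1; how mousse actually configures its colour map is not documented here, and colour_position is a hypothetical helper.

```python
def colour_position(distance, all_distances, absolute=False):
    """Map a distance to a position in [0, 1], from red (0) to blue (1)."""
    if absolute:
        low, high = 0.0, 1.0
    else:
        # ASSUMPTION: the relative scale spans the observed distances
        low, high = min(all_distances), max(all_distances)
    if high == low:
        return 0.0
    return (distance - low) / (high - low)
```

With observed distances 0.6, 0.7, and 0.8, the pair at distance 0.6 sits at position 0 (fully red) on the relative scale but at 0.6 on the absolute scale, which is why relative heatmaps may suggest closer proximity than there really is.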
API
mousse may be used from Python programs.
TODO: write some documentation, including docstrings