plagiarism detection using MOSS with post-processing
mousse: plagiarism detection using MOSS with post-processing
mousse is a command-line tool to submit programming assignments to the MOSS plagiarism checker.
Beyond submitting, it also performs the following:
- download the report generated by MOSS
- extract amounts of copied lines from it
- compute distances between pairs of projects
- generate heatmaps to visually identify clusters of similar projects
The generated heatmaps are clickable, which gives direct access to the MOSS report for a pair of projects. Before submission, the code is lightly cleaned to make the reports easier to read, in particular:
- comments are removed
- multiple files are grouped into a unique file
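As an illustration of the comment-removal step, here is a minimal sketch. The helper name strip_comments is hypothetical, and the marker format is borrowed from the comment option of the configuration file described later; this is not mousse's own implementation, and it naively ignores comment markers that appear inside string literals.

```python
import re


def strip_comments(source, markers):
    """Remove comments from `source` (naive: string literals are not handled).

    `markers` is a list whose items are either a string (single-line comment
    marker) or a pair of strings (opening and closing markers).
    """
    patterns = []
    for marker in markers:
        if isinstance(marker, str):
            # single-line comment: from the marker up to the end of the line
            patterns.append(re.escape(marker) + r"[^\n]*")
        else:
            opening, closing = marker
            # enclosed comment: shortest span between the two markers
            patterns.append(re.escape(opening) + r".*?" + re.escape(closing))
    if not patterns:
        return source
    return re.sub("|".join(patterns), "", source, flags=re.DOTALL)


code = "int x = 1; // counter\n/* block\n comment */int y = 2;\n"
cleaned = strip_comments(code, ["//", ("/*", "*/")])
print(cleaned)
```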
mousse may also be used as a Python library to use its features separately.
(By the way, mousse is the French word for moss.)
Installation
Run pip install pymousse.
mousse depends on:
- Python 3 (developed with 3.9, may work with earlier versions)
- requests
- beautifulsoup4
- chardet
- tqdm
- mosspy
- thefuzz
- pandas
- numpy
- scipy
- seaborn
- matplotlib
(By the way, pymousse is pronounced in French like Pimousse, a famous brand of candies, and since the name mousse was already taken on PyPI, it was natural to name the project like this.)
Using mousse
Overview
mousse operates on a single directory root (specified on the command line) that should be organised as follows:
root/
+- projects/
| +- student_1/
| +- student_2/
| +- ...
+- base/
Directory projects (specified with option -s/--source) holds one directory for each student, in which source code is searched for and sent to MOSS.
Directory base (specified with -b/--base) is optional. If used, it must contain the code that was provided to students, which is not taken into account for plagiarism detection.
Assuming the current working directory is where root is located, mousse can be run as python -m mousse -b base root, or, if no base is provided, just as python -m mousse root.
After running mousse, directory root has been populated with new content:
root/
+- projects/ (untouched)
+- base/ (untouched)
+- moss/
| +- index.html
| +- match-0.html
| +- ...
+- dists.csv
+- dists.pdf
+- dists-pruned.pdf
Directory moss is where the MOSS report has been downloaded.
File dists.csv is a CSV file with the distances computed from MOSS' matches: the more source code two projects have in common, the closer they are with respect to this distance.
File dists.pdf is the heatmap computed to visually show the distances and cluster similar projects, and dists-pruned.pdf is a reduced version of the heatmap where only the most significant projects have been kept.
A heatmap is a matrix showing the distance between every pair of projects. It is clustered in such a way that close projects (i.e., those that share more source code) are displayed next to each other, forming visual clusters of similar projects. Each pair of projects is displayed as a square whose colour scales from red (very similar) to blue (very distinct). Each such square is clickable and opens the corresponding page of the MOSS report directly. Above the heatmap (and on its right), a dendrogram is displayed to show how the clustering is organised, just like in phylogenetic trees. Note that the diagonal of such a heatmap is red by definition because it compares a project with itself.
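To make the clustering idea concrete, here is a minimal sketch that reorders a small distance matrix so that close projects end up next to each other. The project names, the distances, and the choice of the "average" linkage method are assumptions for illustration; the sketch only covers the ordering computed with scipy and does not reproduce how mousse draws its heatmaps.

```python
import numpy as np
from scipy.cluster.hierarchy import leaves_list, linkage
from scipy.spatial.distance import squareform

# hypothetical projects and pairwise distances (0 = identical, 1 = unrelated)
names = ["alice", "bob", "carol", "dave"]
dists = np.array([
    [0.00, 0.10, 0.90, 0.80],
    [0.10, 0.00, 0.85, 0.90],
    [0.90, 0.85, 0.00, 0.20],
    [0.80, 0.90, 0.20, 0.00],
])


def clustered_order(matrix, method="average"):
    """Return the row/column order given by hierarchical clustering."""
    # linkage expects the condensed (upper-triangle) form of the matrix
    condensed = squareform(matrix, checks=True)
    tree = linkage(condensed, method=method)
    # leaves of the dendrogram, read from left to right
    return leaves_list(tree)


order = clustered_order(dists)
reordered = dists[np.ix_(order, order)]
ordered_names = [names[i] for i in order]
print(ordered_names)
print(reordered)
```

In the reordered matrix, the two similar pairs sit in adjacent rows and columns, which is what produces the visible blocks along the diagonal of a clustered heatmap.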
Command line options
mousse supports the following options:
- -u USERID, --userid USERID: use USERID to authenticate with MOSS
- -c CONFIG, --config CONFIG: path of configuration file (see below)
- -s SOURCE, --source SOURCE (default: projects): the directory within root where projects are located
- -b BASE, --base BASE (default: none): the directory within root where base source code is located
- -r REPORT, --report REPORT (default: moss): the directory where the downloaded MOSS report is saved
- -R, --raw: use raw distances without cleaning (see below)
- -p PRUNE, --prune PRUNE (default: 0.5): how full heatmap is pruned (see below)
- -a, --absolute (default: relative): use absolute scale for heatmap coloring (see below)
- -l LANG, --lang LANG (default: c): programming language used for projects
- -L, --list-lang: list supported programming languages and exit
Then, unless -L was used, mousse expects a path for root.
Finally, any remaining option has to be formatted as KEY=VALUE and can be used to control heatmap rendering:
- lnk_OPT=VALUE: OPT=VALUE is passed to scipy.cluster.hierarchy.linkage when computing the clustering
- sns_OPT=VALUE: OPT=VALUE is passed to seaborn.clustermap when drawing the heatmap
- plt_OPT=VALUE: OPT=VALUE is passed to matplotlib.pylab.savefig when saving the heatmap
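The prefix convention above can be sketched as follows. This is only an illustration of how such KEY=VALUE arguments could be routed to the three functions: the helper name split_render_options is hypothetical, and converting values with ast.literal_eval (falling back to plain strings) is an assumption, since the way mousse actually parses values is not documented here.

```python
import ast

# option prefix -> function that receives the option (as described above)
PREFIXES = {"lnk": "linkage", "sns": "clustermap", "plt": "savefig"}


def split_render_options(args):
    """Group KEY=VALUE arguments into keyword dictionaries, one per target."""
    groups = {target: {} for target in PREFIXES.values()}
    for arg in args:
        key, equals, raw = arg.partition("=")
        prefix, underscore, name = key.partition("_")
        if not equals or not underscore or not name or prefix not in PREFIXES:
            raise ValueError(f"unsupported option: {arg!r}")
        try:
            # assumption: numbers, booleans, etc. are given as Python literals
            value = ast.literal_eval(raw)
        except (ValueError, SyntaxError):
            # anything else is kept as a plain string
            value = raw
        groups[PREFIXES[prefix]][name] = value
    return groups


options = split_render_options(["lnk_method=average", "sns_cmap=coolwarm", "plt_dpi=300"])
print(options)
```

With such a split, each dictionary could be expanded as keyword arguments of the corresponding call, for instance linkage(..., **options["linkage"]).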
Configuration file
A configuration file ~/.config/mousse.ini can be used to store the MOSS user id, which avoids having to pass it with option -u every time:
[moss]
userid = 123456789
Additionally, this file may contain definitions for new languages (or redefinition of existing languages), for instance C is defined as:
[lang:c]
suffix = .c
search = *.c *.h
comment = ["//", ("/*", "*/")]
moss_lang = c
In this definition:
- lang:c means we (re)define a language called c
- suffix is the extension to use when creating a single file gathering all the source code from one project
- search is the list of file patterns searched in a project directory; all other files are ignored and not submitted to MOSS
- comment is a Python list of either strings or pairs of strings that define the comments supported by the programming language. A string defines a single-line comment starting with the given marker; a pair defines a single- or multi-line comment enclosed between the two given markers. If the language has only single-line comments with just one marker, it may be given directly instead of a Python list, e.g., for Python one may use comment = #
- moss_lang is the language name as MOSS knows it
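For readers who want to check a configuration file before running mousse, here is a minimal sketch that reads such definitions with the standard configparser module. The helper names are hypothetical, the lang:python section is an invented example (only its comment = # line comes from the text above), and parsing the comment option with ast.literal_eval is an assumption about how the value can be interpreted, not a description of mousse's internals.

```python
import ast
import configparser

SAMPLE = """
[moss]
userid = 123456789

[lang:c]
suffix = .c
search = *.c *.h
comment = ["//", ("/*", "*/")]
moss_lang = c

[lang:python]
suffix = .py
search = *.py
comment = #
moss_lang = python
"""


def parse_comment(raw):
    """Turn the `comment` option into a list of markers."""
    try:
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # a bare marker such as `#` is not a Python literal: use it as is
        return [raw]
    return value if isinstance(value, list) else [value]


def load_languages(parser):
    """Collect every [lang:NAME] section into a dictionary keyed by NAME."""
    languages = {}
    for section in parser.sections():
        if section.startswith("lang:"):
            options = parser[section]
            languages[section[len("lang:"):]] = {
                "suffix": options["suffix"],
                "search": options["search"].split(),
                "comment": parse_comment(options["comment"]),
                "moss_lang": options["moss_lang"],
            }
    return languages


parser = configparser.ConfigParser(interpolation=None)
parser.read_string(SAMPLE)
userid = parser["moss"]["userid"]
languages = load_languages(parser)
print(userid, languages)
```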
Cleaning and pruning heatmaps, absolute/relative colour scale
When distances between projects are computed from MOSS reports, a heuristic is used to filter out matches that are not relevant.
For each pair of projects P1, P2, and for each match M1, M2 (i.e., two fragments of source code that MOSS reported as similar), a score is computed as:
- the length of the match (maximum of the lengths of M1 and M2)
- divided by the Levenshtein distance between M1 and M2
- multiplied by the number of lines copied between P1 and P2 as reported by MOSS
If all the matches for P1, P2 score below the average, then this pair of projects is assigned distance 1 (maximal distance). In practice, this results in hiding pairs of projects with only irrelevant matches, that is, following how scores are computed:
- short matches
- with large edit distances
- from projects with few similarities
As this is only a heuristic, it may be disabled using option -R, in which case the distance as directly computed from MOSS matches is used.
Note that even if the heuristic is applied, the heatmap remains clickable and MOSS reports can be looked at, but the corresponding square is just drawn blue.
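The scoring heuristic described above can be sketched as follows. Several points are assumptions, because the text does not settle them: lengths are counted in characters, a Levenshtein distance of 0 is clamped to 1 to avoid dividing by zero, and "the average" is taken over the scores of all matches of all pairs. The function names and the sample data are hypothetical, and this is not mousse's actual code.

```python
from statistics import mean


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + (char_a != char_b)  # substitution
            ))
        previous = current
    return previous[-1]


def match_score(m1, m2, copied_lines):
    """Length of the match / edit distance * lines copied between the projects."""
    length = max(len(m1), len(m2))
    # assumption: identical fragments would divide by zero, so clamp to 1
    distance = max(levenshtein(m1, m2), 1)
    return length / distance * copied_lines


def irrelevant_pairs(matches):
    """Return the pairs whose matches all score below the overall average.

    `matches` maps a pair (p1, p2) to (copied_lines, [(m1, m2), ...]).
    """
    scores = {
        pair: [match_score(m1, m2, copied) for m1, m2 in fragments]
        for pair, (copied, fragments) in matches.items()
    }
    # assumption: the average is taken over every match of every pair
    average = mean(s for pair_scores in scores.values() for s in pair_scores)
    return {pair for pair, pair_scores in scores.items()
            if all(s < average for s in pair_scores)}


sample = {
    ("p1", "p2"): (40, [("for (i = 0; i < n; i++) total += values[i];",
                         "for (j = 0; j < n; j++) total += values[j];")]),
    ("p1", "p3"): (3, [("return 0;", "exit(1);")]),
}
hidden = irrelevant_pairs(sample)
print(hidden)
```

In this sample, the long, nearly identical loop from projects sharing many lines scores high, while the short and dissimilar fragment scores low, so only the second pair would be assigned distance 1.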
Using option -p PRUNE allows one to control how the full heatmap can be pruned in order to keep only the pairs of most similar projects.
First, PRUNE < 0 cancels pruning: only the full heatmap is computed.
Then, if PRUNE > 1, at most PRUNE projects are kept and all the others are removed.
This is performed by progressively descending the dendrogram of the heatmap and removing isolated projects until the expected count is reached.
Finally, if 0 <= PRUNE <= 1, projects whose distance to any other is more than MEAN - STDEV * PRUNE are discarded, where MEAN is the average distance between every pair of projects and STDEV is the standard deviation.
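The threshold rule for 0 <= PRUNE <= 1 can be sketched as follows. The reading used here is that a project is discarded when even its closest neighbour is farther than MEAN - STDEV * PRUNE; using the population standard deviation is a further assumption, since the text does not say which estimator is used. The function name and the sample distances are hypothetical.

```python
from statistics import mean, pstdev


def projects_to_keep(dists, prune):
    """Apply the threshold rule to `dists`, a dict {(a, b): distance}."""
    values = list(dists.values())
    # assumption: population standard deviation over all pairwise distances
    threshold = mean(values) - pstdev(values) * prune
    # distance from each project to its closest neighbour
    closest = {}
    for (a, b), distance in dists.items():
        closest[a] = min(distance, closest.get(a, distance))
        closest[b] = min(distance, closest.get(b, distance))
    return {project for project, distance in closest.items()
            if distance <= threshold}


sample = {
    ("a", "b"): 0.10, ("a", "c"): 0.90, ("a", "d"): 0.95,
    ("b", "c"): 0.90, ("b", "d"): 0.90, ("c", "d"): 0.85,
}
kept = projects_to_keep(sample, prune=0.5)
print(sorted(kept))
```

Larger values of PRUNE lower the threshold, so fewer projects survive; with PRUNE = 0 the threshold is simply the mean distance.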
Heatmaps are colored by default with a relative color scale, which means that the top of the dendrogram is blue and the bottom red.
When option -a is used, blue is for distance 1 and red is for 0, regardless of how the distances are actually distributed.
Relative heatmaps exhibit more contrasted colours but tend to suggest closer proximity between projects.
API
mousse may be used from Python programs.
TODO: write some documentation, including docstrings