mousse: plagiarism detection using MOSS with post-processing

mousse is a command-line tool to submit programming assignments to the MOSS plagiarism checker. Beyond submitting, it also performs the following:

  • download the report generated by MOSS
  • extract amounts of copied lines from it
  • compute distances between pairs of projects
  • generate heatmaps to visually identify clusters of similar projects

The generated heatmaps are clickable, allowing one to directly look at the MOSS report for a pair of projects. Before submission, code is cleaned a little bit to make reports easier to read, in particular:

  • comments are removed
  • multiple files are grouped into a unique file

mousse may also be used as a Python library to use its features separately.

(By the way, mousse is the French word for moss.)

Installation

Run pip install pymousse.

mousse depends on:

(By the way, pymousse is pronounced in French like a famous brand of candies called Pimousse, and as the name mousse was already used on PyPI, it was natural to name the project like this.)

Using mousse

Overview

mousse operates on a single directory root (specified on the command line) that should be organised as follows:

root/
+- projects/
|  +- student_1/
|  +- student_2/
|  +- ...
+- base/

Directory projects (specified with option -s/--source) holds one directory for each student, where source code will be searched for to be sent to MOSS. Directory base (specified with -b/--base) is optional, and if used must contain the code that has been provided to students and will not be taken into account for plagiarism detection.

Assuming the current working directory is where root is located, mousse can be run as python -m mousse -b base root, or, if no base is provided, just as python -m mousse root. After running mousse, directory root has been populated with new content:
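The two invocations above can be assembled programmatically, for instance when scripting runs over several assignments. The following is only a minimal sketch: build_command is a hypothetical helper that is not part of mousse, and it uses nothing beyond the -u and -b options and the root argument described in this document.

```python
import sys


def build_command(root, base=None, userid=None):
    """Return the argument list for running mousse on `root`.

    Hypothetical helper (not part of mousse): it only mirrors the
    command lines `python -m mousse -b base root` and
    `python -m mousse root` shown in the text.
    """
    cmd = [sys.executable, "-m", "mousse"]
    if userid is not None:
        cmd += ["-u", str(userid)]   # MOSS user id
    if base is not None:
        cmd += ["-b", base]          # directory holding the provided code
    cmd.append(root)                 # root directory always comes last
    return cmd


# Show the command that would be executed, without running it.
print(" ".join(build_command("root", base="base")))
```

The resulting list can be handed to subprocess.run(cmd, check=True) to actually start mousse.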

root/
+- projects/  (untouched)
+- base/      (untouched)
+- moss/
|  +- index.html
|  +- match-0.html
|  +- ...
+- dists.csv
+- dists.pdf
+- dists-pruned.pdf

Directory moss is where the MOSS report has been downloaded. File dists.csv is a CSV file with the distances computed from MOSS' matches: the more source code two projects have in common, the closer they are with respect to this distance. File dists.pdf is the heatmap computed to visually show the distances and cluster similar projects, and dists-pruned.pdf is a reduced version of the heatmap where only the most significant projects have been kept.

A heatmap is a matrix showing the distance between every pair of projects. It is clustered in such a way that close projects (ie, those that share more source code) are displayed next to each other, forming visual clusters of similar projects. Each pair of projects is displayed as a square whose colour scales from red (very similar) to blue (very distinct). Each such square is clickable, which directly opens the corresponding page in MOSS' report. Above the heatmap (and on its right), a dendrogram is displayed to show how the clustering is organised, just like in phylogenetic trees. Note that the diagonal of such a heatmap is red by definition because it compares a project with itself.

Command line options

mousse supports the following options:

  • -u USERID, --userid USERID use USERID to authenticate with MOSS
  • -c CONFIG, --config CONFIG path of configuration file (see below)
  • -i, --incremental (default: no) skip submitting the source and downloading the MOSS report if the report already exists
  • -s SOURCE, --source SOURCE (default: projects) the directory within root where projects are located
  • -b BASE, --base BASE (default: none) the directory within root where base source code is located
  • -r REPORT, --report REPORT (default: moss) the directory where the downloaded MOSS report is saved
  • -R, --raw use raw distances without cleaning (see below)
  • -p PRUNE, --prune PRUNE (default: 0.5) how full heatmap is pruned (see below)
  • -a, --absolute (default: relative) use absolute scale for heatmap coloring (see below)
  • -l LANG, --lang LANG (default: c) programming language used for projects
  • -L, --list-lang list supported programming languages and exit

Then, unless -L was used, mousse expects a path for root.

Finally, any remaining option has to be formatted as KEY=VALUE and can be used to control heatmap rendering:

  • lnk_OPT=VALUE OPT=VALUE is passed to scipy.cluster.hierarchy.linkage when computing the clustering
  • sns_OPT=VALUE OPT=VALUE is passed to seaborn.clustermap when drawing the heatmap
  • plt_OPT=VALUE OPT=VALUE is passed to matplotlib.pylab.savefig when saving the heatmap
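To make the prefix convention concrete, here is a minimal sketch of how such KEY=VALUE arguments could be dispatched to the three libraries. It is not mousse's actual code: split_render_options is a hypothetical name, and converting values with ast.literal_eval (falling back to the raw string) is an assumption, since the text does not say how values are interpreted.

```python
import ast

# Prefix -> name of the function that receives the option, as listed above.
PREFIXES = {"lnk": "linkage", "sns": "clustermap", "plt": "savefig"}


def split_render_options(args):
    """Group KEY=VALUE arguments by their lnk_/sns_/plt_ prefix (hypothetical helper)."""
    groups = {name: {} for name in PREFIXES.values()}
    for arg in args:
        key, sep, raw = arg.partition("=")
        prefix, _, opt = key.partition("_")
        if not sep or prefix not in PREFIXES or not opt:
            raise ValueError(f"unsupported option: {arg!r}")
        try:
            # Assumption: numbers, booleans, etc. are read as Python literals.
            value = ast.literal_eval(raw)
        except (ValueError, SyntaxError):
            value = raw  # anything else is kept as a plain string
        groups[PREFIXES[prefix]][opt] = value
    return groups


options = split_render_options(["lnk_method=average", "sns_cmap=RdBu", "plt_dpi=300"])
print(options)
```

Each resulting dictionary could then be expanded as keyword arguments, eg, linkage(..., **options["linkage"]).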

Configuration file

A configuration file ~/.config/mousse.ini can be used to store the MOSS user id, which avoids having to pass it with option -u every time:

[moss]
userid = 123456789

Additionally, this file may contain definitions for new languages (or redefinition of existing languages), for instance C is defined as:

[lang:c]
suffix = .c
search = *.c *.h
comment = ["//", ("/*", "*/")]
moss_lang = c

In this definition:

  • lang:c means we (re)define a language called c
  • suffix is the extension to use when creating a single file gathering all the source code from one project
  • search is the list of files patterns searched in a project directory, all other files will be ignored and not submitted to MOSS
  • comment is a Python list of either strings or pairs of strings that define the comments supported by the programming language. A string defines a single-line comment starting with the given marker, a pair defines a single- or multi-line comment enclosed between the two given markers. If the language has only single-line comments with just one marker, it may be given directly instead of a Python list, eg, for Python one may use comment = #
  • moss_lang is the language name as MOSS knows it
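As an illustration of the format described above, the following sketch reads such definitions with the standard configparser module. It is not mousse's own loader: load_languages and parse_comment are hypothetical names, the lang:python section is an illustrative addition built from the comment = # example, and parsing the comment field with ast.literal_eval is an assumption.

```python
import ast
import configparser

EXAMPLE = """
[moss]
userid = 123456789

[lang:c]
suffix = .c
search = *.c *.h
comment = ["//", ("/*", "*/")]
moss_lang = c

[lang:python]
suffix = .py
search = *.py
comment = #
moss_lang = python
"""


def parse_comment(raw):
    """Return a list of markers: strings (single-line) or pairs (enclosed)."""
    try:
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return [raw]  # a bare marker such as '#'
    return [value] if isinstance(value, str) else list(value)


def load_languages(text):
    """Collect every [lang:NAME] section into a dictionary keyed by NAME."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    langs = {}
    for section in parser.sections():
        if section.startswith("lang:"):
            sec = parser[section]
            langs[section[len("lang:"):]] = {
                "suffix": sec["suffix"],
                "search": sec["search"].split(),  # space-separated patterns
                "comment": parse_comment(sec["comment"]),
                "moss_lang": sec["moss_lang"],
            }
    return langs


for name, spec in load_languages(EXAMPLE).items():
    print(name, spec)
```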

Cleaning and pruning heatmaps, absolute/relative colour scale

When distances between projects are computed from MOSS reports, a heuristic is used to filter out matches that are not relevant. For each pair of projects P1, P2, and for each match M1, M2 (ie, two fragments of source code that MOSS reported as similar), a score is computed as:

  • the length of the match (maximum of the lengths of M1 and M2)
  • divided by the Levenshtein distance between M1 and M2
  • multiplied by the number of lines copied between P1 and P2 as reported by MOSS

If all the matches for P1, P2 score below the average, then this pair of projects is assigned distance 1 (maximal distance). In practice, this hides pairs of projects with only irrelevant matches, that is, following how scores are computed:

  • short matches
  • with large edit distances
  • from projects with few similarities

As this is only a heuristic, it may be disabled using option -R, in which case the distance as directly computed from MOSS matches is used. Note that even if the heuristic is applied, the heatmap remains clickable and MOSS reports can be looked at, but the corresponding square is just drawn blue.
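The cleaning heuristic can be sketched as follows. This is an illustration under stated assumptions rather than mousse's implementation: match lengths are counted in characters, an edit distance of 0 is replaced by 1 to avoid dividing by zero, "the average" is taken over the scores of all matches of all pairs, and the names levenshtein, match_score and clean_distances are hypothetical.

```python
from statistics import mean


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def match_score(m1, m2, copied_lines):
    """Length / edit distance * copied lines, as in the three bullets above."""
    length = max(len(m1), len(m2))
    edit = max(levenshtein(m1, m2), 1)  # assumption: avoid division by zero
    return length / edit * copied_lines


def clean_distances(pairs, raw):
    """pairs: {(P1, P2): (copied_lines, [(M1, M2), ...])}, raw: {(P1, P2): distance}."""
    scores = {pair: [match_score(m1, m2, copied) for m1, m2 in matches]
              for pair, (copied, matches) in pairs.items()}
    average = mean(s for values in scores.values() for s in values)
    # A pair whose matches all score below the average gets the maximal distance.
    return {pair: 1.0 if all(s < average for s in values) else raw[pair]
            for pair, values in scores.items()}


pairs = {("a", "b"): (50, [("int x = 0;", "int y = 0;")]),
         ("a", "c"): (2, [("foo", "bar")])}
print(clean_distances(pairs, {("a", "b"): 0.2, ("a", "c"): 0.9}))
```

In this toy input, the long, nearly identical match between a and b scores far above the average, while the short, entirely different match between a and c falls below it, so only the latter pair is pushed to distance 1.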

Using option -p PRUNE allows one to control how the full heatmap is pruned in order to keep only the pairs of most similar projects. First, PRUNE < 0 cancels pruning: only the full heatmap is computed. Then, if PRUNE > 1, at most PRUNE projects are kept and all the others are removed. This is performed by progressively descending the dendrogram on the heatmap and removing isolated projects until the expected count is reached. Finally, if 0 <= PRUNE <= 1, projects whose distance to any other is more than MEAN - STDEV * PRUNE are discarded, where MEAN is the average distance between every pair of projects and STDEV is the standard deviation.
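The case 0 <= PRUNE <= 1 can be illustrated with a short sketch. It rests on two assumptions that the text does not settle: a project is discarded when its distance to every other project exceeds the threshold (ie, it is kept if at least one other project is close enough), and STDEV is the population standard deviation. The function name prune_projects is hypothetical.

```python
from statistics import mean, pstdev


def prune_projects(dist, prune=0.5):
    """Return the projects kept for 0 <= prune <= 1.

    dist maps unordered pairs (p, q) to their distance.
    """
    if not 0 <= prune <= 1:
        raise ValueError("this sketch only covers 0 <= PRUNE <= 1")
    values = list(dist.values())
    threshold = mean(values) - pstdev(values) * prune
    nearest = {}  # project -> distance to its closest neighbour
    for (p, q), d in dist.items():
        nearest[p] = min(d, nearest.get(p, d))
        nearest[q] = min(d, nearest.get(q, d))
    return sorted(p for p, d in nearest.items() if d <= threshold)


dist = {("a", "b"): 0.1, ("a", "c"): 0.9, ("b", "c"): 0.8}
print(prune_projects(dist, 0.5))
```

Here the mean distance is 0.6, so with PRUNE = 0.5 the threshold lies a little above 0.42: projects a and b are kept because they are at distance 0.1 from each other, while c is discarded.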

Heatmaps are colored by default with a relative color scale, which means that the top of the dendrogram is blue and the bottom red. When option -a is used, blue is for distance 1 and red is for 0, regardless of how the distances are actually distributed. Relative heatmaps exhibit more contrasted colours but tend to suggest closer proximity between projects.
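The difference between the two scales can be sketched as a simple normalisation. This is an assumption about how the scales behave, not mousse's rendering code: the relative scale is modelled here as stretching the colours between the smallest and largest observed distances, and color_position is a hypothetical helper returning 0.0 for red and 1.0 for blue.

```python
def color_position(d, distances, absolute=False):
    """Position of distance d on the colour scale: 0.0 is red, 1.0 is blue."""
    if absolute:
        lo, hi = 0.0, 1.0                            # fixed bounds
    else:
        lo, hi = min(distances), max(distances)      # bounds taken from the data
    if hi == lo:
        return 0.0
    return (d - lo) / (hi - lo)


distances = [0.6, 0.7, 0.8]
print(color_position(0.6, distances))                 # relative scale
print(color_position(0.6, distances, absolute=True))  # absolute scale
```

With these distances, the closest pair sits at the red end of the relative scale although its distance is 0.6, which illustrates why relative heatmaps may suggest closer proximity than the absolute ones.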

API

mousse may be used from Python programs.

TODO: write some documentation, including docstrings
