A simple plagiarism detection tool for Python code

Project description

This is a simple plagiarism detection tool for Python code. The basic idea is to normalize the Python AST representation and use difflib to compute the modifications needed to turn the referenced code into the candidate code. pycode_similar defines plagiarism as how much of the referenced code is plagiarized by the candidate code, so the measure is asymmetric: swapping the referenced code and the candidate code gives a different result.
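The idea above can be illustrated with a minimal sketch that uses only the standard library. It is not the code of pycode_similar: the helper names `ast_lines` and `plagiarized_fraction` are hypothetical, and the normalization here is deliberately crude (each AST node is reduced to its type name, so identifiers and constants are dropped).

```python
import ast
import difflib


def ast_lines(source):
    """Return one indented line per AST node, keeping only the node type.

    Dropping identifiers and constants is an assumed, simplified form of
    normalization: renaming variables does not change the output.
    """
    lines = []

    def walk(node, depth):
        lines.append("  " * depth + type(node).__name__)
        for child in ast.iter_child_nodes(node):
            walk(child, depth + 1)

    walk(ast.parse(source), 0)
    return lines


def plagiarized_fraction(ref_source, cand_source):
    """Fraction of the reference AST lines that also appear in the candidate."""
    ref = ast_lines(ref_source)
    cand = ast_lines(cand_source)
    matcher = difflib.SequenceMatcher(None, ref, cand, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    # Dividing by the reference length is what makes the measure asymmetric.
    return matched / len(ref)


ref = "def add(a, b):\n    return a + b\n"
cand = (
    "def plus(x, y):\n    return x + y\n\n"
    "def extra(z):\n    print(z)\n    return z * 2\n"
)

print(plagiarized_fraction(ref, cand))  # all of ref is found in cand: 1.0
print(plagiarized_fraction(cand, ref))  # only part of cand is found in ref
```

Because the score is divided by the size of the reference, a candidate that copies the whole reference and adds more code still scores 1.0, while the reverse comparison scores lower.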

It took me only a couple of hours to implement this tool, so there is still a long way to go in improving its speed and accuracy, but it already performs well at detecting plagiarism in the homework of new recruits at our company.

Comparison with Moss

  • pure Python implementation

  • consists of a single source file

  • no third-party dependencies (except zss when using TreeDiff)

  • no need to register an account with Moss

  • no network access to Moss required

This tool was born before I knew about Moss (for a Measure Of Software Similarity), a system for determining the similarity of programs. I have tried many ways to register an account for Stanford Moss, but still cannot get a valid one, so I have no accurate comparison between pycode_similar and Moss.

Installation

If you don’t have much time, just run

$ pip install pycode_similar

which will install the module (without tests) on your system.

Alternatively, you can just copy and paste pycode_similar.py, which requires no third-party dependencies.

Usage

Use it as a standard command-line tool once pip has installed it properly.

$ pycode_similar
usage: pycode_similar [-h] [-l L] [-p P] [-k] [-m] files files

A simple plagiarism detection tool for python code

positional arguments:
  files       the input files

optional arguments:
  -h, --help          show this help message and exit
  -l L                if AST line of the function >= value then output detail (default: 4)
  -p P                if plagiarism percentage of the function >= value then output detail (default: 0.5)
  -k, --keep-prints   keep print nodes
  -m, --module-level  process module level nodes

pycode_similar: error: too few arguments
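The effect of the -l and -p thresholds described in the help text can be sketched as a simple filter. This is an illustration under assumptions, not the tool's code: `FuncResult` and `details_to_report` are hypothetical names, and sorting by descending percentage is assumed here for readability.

```python
from typing import List, NamedTuple


class FuncResult(NamedTuple):
    """Hypothetical per-function result; not a pycode_similar type."""
    name: str
    ast_line_count: int
    plagiarism: float  # fraction in the range 0.0 to 1.0


def details_to_report(results: List[FuncResult],
                      min_ast_lines: int = 4,      # mirrors the -l default
                      min_plagiarism: float = 0.5  # mirrors the -p default
                      ) -> List[FuncResult]:
    """Keep functions that meet both thresholds, highest percentage first."""
    kept = [
        r for r in results
        if r.ast_line_count >= min_ast_lines and r.plagiarism >= min_plagiarism
    ]
    return sorted(kept, key=lambda r: r.plagiarism, reverse=True)


sample = [
    FuncResult("tiny_helper", 3, 1.0),    # too few AST lines
    FuncResult("parse_args", 12, 0.92),
    FuncResult("main", 30, 0.2),          # percentage too low
    FuncResult("load_file", 8, 1.0),
]

for result in details_to_report(sample):
    print(result.plagiarism, ":", result.name)
```

Raising the line threshold suppresses trivial functions, such as one-line getters, that look identical in almost any code base.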

Of course, you can also use it as a Python library.

import pycode_similar
pycode_similar.detect([referenced_code_str, candidate_code_str1, candidate_code_str2, ...], diff_method=pycode_similar.UnifiedDiff, keep_prints=False, module_level=False)

Implementation

This tool implements two diff methods: a line-based diff (UnifiedDiff) and a tree-edit-distance-based diff (TreeDiff). Both operate at the function AST level.

  • UnifiedDiff: diffs the lines of the normalized function AST string; naive but efficient.

  • TreeDiff: diffs the function AST directly; very slow, and the results are poor for small functions (depends on zss).

So, when this tool is run from the command line, the default diff method is UnifiedDiff. You can switch to TreeDiff when using it as a library.
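To make "function AST level" concrete, here is a rough sketch of a line-based comparison in which each reference function is paired with the candidate function that covers it best. It uses only the standard library and is not the implementation of UnifiedDiff; `function_ast_lines` and `best_matches` are hypothetical helpers, and functions are keyed by bare name, so same-named methods in different classes would collide. A tree-edit-distance variant would need zss and is not shown.

```python
import ast
import difflib


def function_ast_lines(source):
    """Map each function name to its AST rendered as node-type lines."""
    result = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            lines = []
            stack = [(node, 0)]
            while stack:  # depth-first, preserving source order
                current, depth = stack.pop()
                lines.append("  " * depth + type(current).__name__)
                children = list(ast.iter_child_nodes(current))
                stack.extend((c, depth + 1) for c in reversed(children))
            result[node.name] = lines
    return result


def best_matches(ref_source, cand_source):
    """For every reference function, find the best-covering candidate function."""
    cand_funcs = function_ast_lines(cand_source)
    matches = {}
    for ref_name, ref_lines in function_ast_lines(ref_source).items():
        best = (0.0, None)
        for cand_name, cand_lines in cand_funcs.items():
            matcher = difflib.SequenceMatcher(
                None, ref_lines, cand_lines, autojunk=False)
            matched = sum(b.size for b in matcher.get_matching_blocks())
            score = matched / len(ref_lines)
            if score > best[0]:
                best = (score, cand_name)
        matches[ref_name] = best
    return matches


ref_code = (
    "def area(w, h):\n    return w * h\n\n"
    "def greet(name):\n    print('hi', name)\n"
)
cand_code = (
    "def noop():\n    pass\n\n"
    "def surface(a, b):\n    return a * b\n"
)

for name, (score, match) in best_matches(ref_code, cand_code).items():
    print(round(score, 2), ": ref", name, ", candidate", match)
```

In this sketch, `area` is matched to `surface` with a score of 1.0 even though every identifier differs, because only the structure of the function is compared.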

Testing

If you have the source code, you can run the tests with

$ python pycode_similar/tests/test_cases.py

Or run

$ python pycode_similar.py pycode_similar/tests/original_version.py pycode_similar.py

ref: tests/original_version.py
candidate: pycode_similar.py
80.14 % (803/1002) of ref code structure is plagiarized by candidate.
candidate function plagiarism details (AST lines >= 4 and plagiarism percentage >= 0.5):
1.0 : ref FuncNodeCollector._mark_docstring_sub_nodes<24:4>, candidate FuncNodeCollector._mark_docstring_sub_nodes<27:4>
1.0 : ref FuncNodeCollector._mark_docstring_nodes<54:8>, candidate FuncNodeCollector._mark_docstring_nodes<57:8>
1.0 : ref FuncNodeCollector.generic_visit<69:4>, candidate FuncNodeCollector.generic_visit<72:4>
1.0 : ref FuncNodeCollector.visit_Str<74:4>, candidate FuncNodeCollector.visit_Str<78:4>
1.0 : ref FuncNodeCollector.visit_Name<83:4>, candidate FuncNodeCollector.visit_Name<88:4>
1.0 : ref FuncNodeCollector.visit_Attribute<89:4>, candidate FuncNodeCollector.visit_Name<88:4>
1.0 : ref FuncNodeCollector.visit_ClassDef<95:4>, candidate FuncNodeCollector.visit_ClassDef<100:4>
1.0 : ref FuncNodeCollector.visit_FunctionDef<101:4>, candidate FuncNodeCollector.visit_FunctionDef<106:4>
1.0 : ref FuncInfo.__init__<141:4>, candidate FuncInfo.__init__<161:4>
1.0 : ref FuncInfo.__str__<151:4>, candidate FuncInfo.__str__<171:4>
1.0 : ref FuncInfo.func_code<162:4>, candidate FuncInfo.func_code<182:4>
1.0 : ref FuncInfo.func_code_lines<168:4>, candidate FuncInfo.func_code_lines<188:4>
1.0 : ref FuncInfo.func_ast<174:4>, candidate FuncInfo.func_ast<194:4>
1.0 : ref FuncInfo.func_ast_lines<180:4>, candidate FuncInfo.func_ast_lines<200:4>
1.0 : ref FuncInfo._retrieve_func_code_lines<186:4>, candidate FuncInfo._retrieve_func_code_lines<206:4>
1.0 : ref FuncInfo._iter_node<208:4>, candidate FuncInfo._iter_node<228:4>
1.0 : ref FuncInfo._dump<232:4>, candidate FuncInfo._dump<252:4>
1.0 : ref FuncInfo._inner_dump<242:8>, candidate FuncInfo._inner_dump<262:8>
1.0 : ref ArgParser.error<267:4>, candidate ArgParser.error<291:4>
0.95: ref unified_diff<281:0>, candidate UnifiedDiff._gen<339:8>
0.92: ref FuncNodeCollector.__init__<18:4>, candidate FuncNodeCollector.__init__<20:4>
0.92: ref FuncNodeCollector.visit_Compare<108:4>, candidate FuncNodeCollector._simple_nomalize<117:8>
0.89: ref FuncNodeCollector.visit_Expr<79:4>, candidate FuncNodeCollector.visit_Expr<83:4>

Repository

The project is hosted on GitHub. You can look at the source here:

https://github.com/fyrestone/pycode_similar

