
A simple plagiarism detection tool for python code

Project description

This is a simple plagiarism detection tool for Python code. The basic idea is to normalize the Python AST representation of the code and use difflib to compute the modifications that turn the referenced code into the candidate code. pycode_similar defines plagiarism as the fraction of the referenced code that is reproduced in the candidate code, so the measure is asymmetric: swapping the referenced code and the candidate code gives a different result.
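The idea can be illustrated with a minimal sketch that uses only the standard library. This is not the code of pycode_similar: the helper names `ast_lines` and `plagiarized_fraction` are hypothetical, and the normalization is assumed to be the crudest one possible (keep only the node types, drop every identifier and constant).

```python
import ast
import difflib


def ast_lines(source):
    """Return the AST of `source` as indented node-type lines.

    Assumption: normalization here just means dropping identifiers and
    constants, so renamed variables produce identical lines.
    """
    lines = []

    def walk(node, depth):
        lines.append("  " * depth + type(node).__name__)
        for child in ast.iter_child_nodes(node):
            walk(child, depth + 1)

    walk(ast.parse(source), 0)
    return lines


def plagiarized_fraction(ref_source, cand_source):
    """Fraction of the reference AST lines that also appear in the candidate."""
    ref = ast_lines(ref_source)
    cand = ast_lines(cand_source)
    matcher = difflib.SequenceMatcher(None, ref, cand, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref)


ref_code = "def add(a, b):\n    return a + b\n"
cand_code = (
    "def plus(x, y):\n    return x + y\n\n"
    "def extra(z):\n    return z * 2\n"
)

# The candidate contains a renamed copy of the reference plus extra code.
print(plagiarized_fraction(ref_code, cand_code))
# Swapping the roles asks a different question and gives a lower value.
print(plagiarized_fraction(cand_code, ref_code))
```

In this sketch the first call returns 1.0, because every reference line is found in the candidate, while the swapped call returns less than 1.0, because the extra function has no counterpart in the reference.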

It took me only a couple of hours to implement this tool, so there is still a long way to go in improving its speed and accuracy, but it already performs well at detecting plagiarism in the homework of new recruits at our company.

Compared to Moss

  • pure Python implementation

  • only one source file

  • no third-party dependencies (except zss when using TreeDiff)

  • no need to register an account for Moss

  • no network access to Moss required

This tool was written before I knew about Moss (Measure Of Software Similarity), a system for determining the similarity of programs. I have tried many ways to register an account for Stanford Moss but still cannot get a valid one, so I have no accurate comparison between pycode_similar and Moss.

Installation

If you don’t have much time, just run

$ pip install pycode_similar

which will install the module (without tests) on your system.

Alternatively, you can just copy pycode_similar.py into your project; it requires no third-party dependencies.

Usage

Once pip has installed it properly, use it as a standard command-line tool.

$ pycode_similar
usage: pycode_similar [-h] [-l L] [-p P] files files

A simple plagiarism detection tool for python code

positional arguments:
  files       the input files

optional arguments:
  -h, --help  show this help message and exit
  -l L        if AST line of the function >= value then output detail
              (default: 4)
  -p P        if plagiarism percentage of the function >= value then output
              detail (default: 0.5)

pycode_similar: error: too few arguments

Of course, you can use it as a Python library, too.

import pycode_similar
pycode_similar.detect([referenced_code_str, candidate_code_str1, candidate_code_str2, ...], diff_method=pycode_similar.UnifiedDiff)

Implementation

This tool implements two diff methods: a line-based diff (UnifiedDiff) and a tree-edit-distance-based diff (TreeDiff). Both operate at the function AST level.

  • UnifiedDiff diffs the lines of the normalized function AST string; naive but efficient.

  • TreeDiff diffs the function AST itself; very slow, and the results are poor for small functions (depends on zss).

So, when you run this tool from the command line, the default diff method is UnifiedDiff. You can switch to TreeDiff when using it as a library.
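A function-level, line-based comparison in the spirit of UnifiedDiff can be sketched as follows. This is an illustration under stated assumptions, not the tool's implementation: the helpers `function_ast_lines`, `removed_fraction`, and `best_matches` are hypothetical, each function is reduced to indented node-type lines, and every reference function is scored against every candidate function with `difflib.unified_diff`.

```python
import ast
import difflib


def function_ast_lines(source):
    """Map each function name to the indented node-type lines of its AST.

    Assumption: functions sharing a name overwrite each other in this dict.
    """
    def dump(node, depth=0):
        yield "  " * depth + type(node).__name__
        for child in ast.iter_child_nodes(node):
            yield from dump(child, depth + 1)

    return {
        node.name: list(dump(node))
        for node in ast.walk(ast.parse(source))
        if isinstance(node, ast.FunctionDef)
    }


def removed_fraction(ref_lines, cand_lines):
    """Fraction of reference lines that a unified diff marks as removed."""
    diff = difflib.unified_diff(ref_lines, cand_lines, lineterm="", n=0)
    removed = sum(
        1 for line in diff
        if line.startswith("-") and not line.startswith("---")
    )
    return removed / len(ref_lines)


def best_matches(ref_source, cand_source):
    """For each reference function, return (score, best candidate name)."""
    ref_funcs = function_ast_lines(ref_source)
    cand_funcs = function_ast_lines(cand_source)
    result = {}
    for ref_name, ref_lines in ref_funcs.items():
        scored = [
            (1.0 - removed_fraction(ref_lines, lines), name)
            for name, lines in cand_funcs.items()
        ]
        result[ref_name] = max(scored) if scored else (0.0, None)
    return result


ref_code = (
    "def add(a, b):\n    return a + b\n\n"
    "def greet(name):\n    print(name)\n"
)
cand_code = (
    "def plus(x, y):\n    return x + y\n\n"
    "def other(z):\n    for i in range(z):\n        print(i)\n"
)

for name, (score, match) in best_matches(ref_code, cand_code).items():
    print(name, round(score, 2), match)
```

Here `add` is matched to `plus` with a score of 1.0, since the two differ only in identifiers, while `greet` has no structurally identical counterpart and scores lower. A tree-edit-distance variant would replace `removed_fraction` with a distance computed on the trees themselves, which is where the cost of TreeDiff comes from.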

Testing

If you have the source code you can run the tests with

$ python pycode_similar/tests/test_cases.py

Or run

$ python pycode_similar.py pycode_similar/tests/original_version.py pycode_similar.py

ref: tests/original_version.py
candidate: pycode_similar.py
80.14 % (803/1002) of ref code structure is plagiarized by candidate.
candidate function plagiarism details (AST lines >= 4 and plagiarism percentage >= 0.5):
1.0 : ref FuncNodeCollector._mark_docstring_sub_nodes<24:4>, candidate FuncNodeCollector._mark_docstring_sub_nodes<27:4>
1.0 : ref FuncNodeCollector._mark_docstring_nodes<54:8>, candidate FuncNodeCollector._mark_docstring_nodes<57:8>
1.0 : ref FuncNodeCollector.generic_visit<69:4>, candidate FuncNodeCollector.generic_visit<72:4>
1.0 : ref FuncNodeCollector.visit_Str<74:4>, candidate FuncNodeCollector.visit_Str<78:4>
1.0 : ref FuncNodeCollector.visit_Name<83:4>, candidate FuncNodeCollector.visit_Name<88:4>
1.0 : ref FuncNodeCollector.visit_Attribute<89:4>, candidate FuncNodeCollector.visit_Name<88:4>
1.0 : ref FuncNodeCollector.visit_ClassDef<95:4>, candidate FuncNodeCollector.visit_ClassDef<100:4>
1.0 : ref FuncNodeCollector.visit_FunctionDef<101:4>, candidate FuncNodeCollector.visit_FunctionDef<106:4>
1.0 : ref FuncInfo.__init__<141:4>, candidate FuncInfo.__init__<161:4>
1.0 : ref FuncInfo.__str__<151:4>, candidate FuncInfo.__str__<171:4>
1.0 : ref FuncInfo.func_code<162:4>, candidate FuncInfo.func_code<182:4>
1.0 : ref FuncInfo.func_code_lines<168:4>, candidate FuncInfo.func_code_lines<188:4>
1.0 : ref FuncInfo.func_ast<174:4>, candidate FuncInfo.func_ast<194:4>
1.0 : ref FuncInfo.func_ast_lines<180:4>, candidate FuncInfo.func_ast_lines<200:4>
1.0 : ref FuncInfo._retrieve_func_code_lines<186:4>, candidate FuncInfo._retrieve_func_code_lines<206:4>
1.0 : ref FuncInfo._iter_node<208:4>, candidate FuncInfo._iter_node<228:4>
1.0 : ref FuncInfo._dump<232:4>, candidate FuncInfo._dump<252:4>
1.0 : ref FuncInfo._inner_dump<242:8>, candidate FuncInfo._inner_dump<262:8>
1.0 : ref ArgParser.error<267:4>, candidate ArgParser.error<291:4>
0.95: ref unified_diff<281:0>, candidate UnifiedDiff._gen<339:8>
0.92: ref FuncNodeCollector.__init__<18:4>, candidate FuncNodeCollector.__init__<20:4>
0.92: ref FuncNodeCollector.visit_Compare<108:4>, candidate FuncNodeCollector._simple_nomalize<117:8>
0.89: ref FuncNodeCollector.visit_Expr<79:4>, candidate FuncNodeCollector.visit_Expr<83:4>


Repository

The project is hosted on GitHub. You can look at the source here:

https://github.com/fyrestone/pycode_similar
