Code Similarity (csim) is a tool for detecting similarity between source code files

Code Similarity (csim)

Code Similarity (csim) provides a module designed to detect similarities between source code files, even when obfuscation techniques have been applied. It is particularly useful for programming instructors and students who need to verify code originality.

Key Features

  • Source Code Similarity Analysis: Compares source code files to determine their degree of similarity.
  • Advanced Analysis: Utilizes parse trees and the tree edit distance algorithm for in-depth analysis.
  • Parse Trees: Represents the syntactic structure of source code, enabling detailed comparisons.
  • Tree Edit Distance: Counts the node insertions, deletions, and relabelings needed to turn one parse tree into another; fewer edits means more similar code structures.
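
To illustrate the idea behind the parse tree and tree edit distance features, here is a minimal sketch. It is not csim's implementation: it swaps in Python's standard-library ast module for the ANTLR parse tree and a naive memoized recursion for the zss library, and it labels nodes by type only. The names to_tree, forest_distance, and tree_distance are hypothetical helpers introduced for this example.

```python
import ast
from functools import lru_cache


def to_tree(node):
    """Convert an ast node into a hashable (label, children) tuple.

    Only the node type is kept as the label, so identifier names and
    literal values are ignored (an assumption made for this sketch).
    """
    children = tuple(to_tree(child) for child in ast.iter_child_nodes(node))
    return (type(node).__name__, children)


@lru_cache(maxsize=None)
def forest_distance(f1, f2):
    """Unit-cost edit distance between two ordered forests (tuples of trees)."""
    if not f1 and not f2:
        return 0
    if not f1:
        # Insert the root of the rightmost tree of f2; its children remain.
        _, kids2 = f2[-1]
        return forest_distance(f1, f2[:-1] + kids2) + 1
    if not f2:
        # Delete the root of the rightmost tree of f1; its children remain.
        _, kids1 = f1[-1]
        return forest_distance(f1[:-1] + kids1, f2) + 1
    label1, kids1 = f1[-1]
    label2, kids2 = f2[-1]
    delete = forest_distance(f1[:-1] + kids1, f2) + 1
    insert = forest_distance(f1, f2[:-1] + kids2) + 1
    # Match the two rightmost roots (relabel cost 1 if their labels differ).
    match = (
        forest_distance(kids1, kids2)
        + forest_distance(f1[:-1], f2[:-1])
        + (label1 != label2)
    )
    return min(delete, insert, match)


def tree_distance(code_a, code_b):
    """Edit distance between the syntax trees of two Python snippets."""
    tree_a = to_tree(ast.parse(code_a))
    tree_b = to_tree(ast.parse(code_b))
    return forest_distance((tree_a,), (tree_b,))


print(tree_distance("a = 5", "c = 50"))        # same structure, different names
print(tree_distance("a = 5", "a = 5\nb = 6"))  # one extra assignment
```

Because labels carry only node types, renaming a variable or changing a constant leaves the distance unchanged, which is the property that makes structural comparison resistant to simple obfuscation. The recursion is only practical for small snippets; a dedicated algorithm such as Zhang-Shasha scales much better.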

Technologies Used

  • Python: The core programming language for the tool.
  • ANTLR: A parser generator for creating parse trees from source code.
  • zss: A library for calculating the tree edit distance.

Installation

  1. Clone the repository:
    git clone https://github.com/EdsonEddy/csim.git
    
  2. Navigate to the project directory:
    cd csim
    
  3. Install the package:
    pip install .
    

Usage

csim can be used from the command line as follows:

csim -f file1.py file2.py

Alternatively, you can use csim as a Python module:

from csim import Compare
code_a = "a = 5"
code_b = "c = 50"
similarity = Compare(code_a, code_b)
print(f"Similarity: {similarity}")
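
When checking a whole batch of submissions, every pair has to be compared. The sketch below shows one way to do that; rank_pairs is a hypothetical helper, not part of csim. It assumes only that the comparison function takes two code strings and returns a number where higher means more similar, as the example above suggests for Compare.

```python
from itertools import combinations


def rank_pairs(sources, compare):
    """Score every pair of named sources and sort by score, highest first.

    sources: dict mapping a name (for example a file name) to source text.
    compare: callable taking two code strings and returning a number.
             With csim installed, csim.Compare could be passed here
             (assumed to return a higher-is-more-similar score).
    """
    scored = [
        (compare(sources[a], sources[b]), a, b)
        for a, b in combinations(sorted(sources), 2)
    ]
    return sorted(scored, key=lambda item: item[0], reverse=True)


if __name__ == "__main__":
    # Stand-in comparison so the sketch runs without csim installed.
    def exact_match(x, y):
        return 1.0 if x == y else 0.0

    submissions = {"a.py": "x = 1", "b.py": "x = 1", "c.py": "y = 2"}
    for score, first, second in rank_pairs(submissions, exact_match):
        print(f"{first} vs {second}: {score}")
```

Passing the comparison function as a parameter keeps the helper independent of how the score is computed, so the same loop works with a stand-in during testing.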

Parser Generation

This section describes how to regenerate the parser files using ANTLR 4. You do not need to follow these steps unless you intend to modify the grammar.

The Python parser files (e.g., PythonLexer.py, PythonParser.py, PythonParserVisitor.py) located in the csim/ directory were generated using the ANTLR 4 tool. The grammar files (PythonLexer.g4 and PythonParser.g4) were sourced from the python/python3_13 directory of the antlr/grammars-v4 repository.

To regenerate the files, run the following command from the grammars/ directory:

antlr4 -Dlanguage=Python3 -visitor -o ../csim/ PythonLexer.g4 PythonParser.g4

This command instructs ANTLR to generate Python 3 code (-Dlanguage=Python3), create a visitor class (-visitor), and output the resulting files into the ../csim/ directory.

Additionally, download the PythonLexerBase.py file from the ANTLR grammars-v4 GitHub repository and move it to the csim/ directory:

curl -O https://raw.githubusercontent.com/antlr/grammars-v4/master/python/python3_13/Python3/PythonLexerBase.py 
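
After regenerating, it can be useful to confirm that the files named in this section ended up in the csim/ directory. The following is a small sketch of such a check; missing_parser_files is a hypothetical helper, and the file list contains only the names mentioned above, so it may not cover everything ANTLR emits.

```python
from pathlib import Path

# File names taken from this section; the generated set may include others.
EXPECTED_FILES = (
    "PythonLexer.py",
    "PythonParser.py",
    "PythonParserVisitor.py",
    "PythonLexerBase.py",
)


def missing_parser_files(directory):
    """Return the expected parser files that are absent from directory."""
    base = Path(directory)
    return [name for name in EXPECTED_FILES if not (base / name).is_file()]


if __name__ == "__main__":
    missing = missing_parser_files("csim")
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All expected parser files are present.")
```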

Contributing

Contributions are welcome! To contribute, please follow these steps:

  1. Fork the repository.
  2. Create a new branch (git checkout -b feature/new-feature).
  3. Make your changes and commit them (git commit -am 'Add new feature').
  4. Push to the branch (git push origin feature/new-feature).
  5. Open a Pull Request.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Links

Additional Resources

For more information on the techniques and tools used in this project, refer to the following resources:

Third-Party Licenses

This project utilizes the following third-party libraries:

ANTLR (ANother Tool for Language Recognition)

ANTLR4-parser-for-Python-3.14 by RobEin

zss (Zhang-Shasha)
