Code Similarity (csim) is a method designed to detect similarity between source code files

Code Similarity (csim)

Code Similarity (csim) provides a module designed to detect similarities between source code files, even when obfuscation techniques have been applied. It is particularly useful for programming instructors and students who need to verify code originality.

Key Features

  • Source Code Similarity Analysis: Compares source code files to determine their degree of similarity.
  • Advanced Analysis: Utilizes parse trees and the tree edit distance algorithm for in-depth analysis.
  • Parse Trees: Represents the syntactic structure of source code, enabling detailed comparisons.
  • Tree Edit Distance: Measures the similarity between different code structures.
  • Hash-Based Pruning: Optimizes the comparison process by reducing tree size while preserving essential structure.
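
To make the parse-tree and tree-edit-distance ideas above concrete, here is a minimal sketch. It is not csim's implementation: it swaps in Python's built-in ast module for the ANTLR parse tree, uses a simple memoized recursion in place of zss or apted, and the normalization into a similarity score is an assumption. The names to_tree, size, forest_distance, and similarity are hypothetical.

```python
import ast
from functools import lru_cache


def to_tree(node):
    """Convert an ast node into a hashable (label, children) tuple.

    Only node types are kept as labels, so identifier names and literal
    values do not affect the comparison (an assumption made for this sketch).
    """
    children = tuple(to_tree(child) for child in ast.iter_child_nodes(node))
    return (type(node).__name__, children)


def size(forest):
    """Count the nodes in a forest (a tuple of trees)."""
    return sum(1 + size(children) for _, children in forest)


@lru_cache(maxsize=None)
def forest_distance(f1, f2):
    """Ordered tree edit distance with unit costs, via the forest recursion."""
    if not f1:
        return size(f2)  # insert everything that remains in f2
    if not f2:
        return size(f1)  # delete everything that remains in f1
    (label1, kids1), (label2, kids2) = f1[-1], f2[-1]
    return min(
        # delete the root of the rightmost tree in f1
        forest_distance(f1[:-1] + kids1, f2) + 1,
        # insert the root of the rightmost tree in f2
        forest_distance(f1, f2[:-1] + kids2) + 1,
        # match the two roots (cost 1 if labels differ), then solve the rest
        forest_distance(kids1, kids2)
        + forest_distance(f1[:-1], f2[:-1])
        + (label1 != label2),
    )


def similarity(code_a, code_b):
    """Map the distance into [0, 1]; this normalization is an assumption."""
    tree_a, tree_b = to_tree(ast.parse(code_a)), to_tree(ast.parse(code_b))
    distance = forest_distance((tree_a,), (tree_b,))
    return 1 - distance / (size((tree_a,)) + size((tree_b,)))
```

Because only node types are compared, similarity("a = 5", "c = 50") treats the two snippets as structurally identical, which illustrates why a tree-based comparison can see through renamed variables. The plain recursion is only practical for small inputs; dedicated libraries such as zss and apted use far more efficient algorithms.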

Technologies Used

  • Python: The core programming language for the tool.
  • ANTLR: A parser generator for creating parse trees from source code.
  • zss: A library for calculating the tree edit distance.
  • apted: A library for computing the tree edit distance, available as an alternative to zss.

Installation

Installation requires pip. You can either clone the repository and install it locally, or install it directly from PyPI.

  1. Clone the repository:
    git clone https://github.com/EdsonEddy/csim.git
    
  2. Navigate to the project directory:
    cd csim
    
  3. Install the package:
    pip install .
    

Alternatively, you can install it directly from PyPI:

pip install csim

Version Compatibility

  • Python: 3.9–3.12 (recommended 3.11)
  • ANTLR4 Python Runtime: 4.13.2
  • zss: 1.2.0
  • apted: 1.0.3
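
If you want to check your environment against the versions listed above, a small script along these lines may help. It is a sketch, not part of csim: the helper names are hypothetical, and it assumes the dependencies are installed under the PyPI distribution names antlr4-python3-runtime, zss, and apted.

```python
import sys
from importlib import metadata

# Versions copied from the compatibility list above.
# The distribution names are assumptions about how the packages are published.
PINNED = {
    "antlr4-python3-runtime": "4.13.2",
    "zss": "1.2.0",
    "apted": "1.0.3",
}


def python_supported(version_info=sys.version_info):
    """Return True if the interpreter is within the listed 3.9 to 3.12 range."""
    return (3, 9) <= tuple(version_info[:2]) <= (3, 12)


def installed_versions(names=PINNED):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def report():
    """Return human-readable lines comparing installed and listed versions."""
    lines = [f"Python supported: {python_supported()}"]
    for name, installed in installed_versions().items():
        status = "ok" if installed == PINNED[name] else "differs or missing"
        lines.append(
            f"{name}: installed={installed}, listed={PINNED[name]} ({status})"
        )
    return lines
```

Calling print("\n".join(report())) prints one line per check.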

Usage

csim can be used from the command line. For now, only Python files are supported; more languages will be added in future versions.

The command-line options are described below.

Option --files (Specify Files)

This option will compare two specified files and output the similarity index.

csim --files file1.py file2.py

Output

file1.py is similar to file2.py with similarity index: X.XX
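
When the output needs to be processed by another script, each line can be parsed with a regular expression. The sketch below assumes the line format shown above, with a plain decimal number in place of X.XX; the names LINE_PATTERN and parse_similarity_line are hypothetical and not part of csim.

```python
import re

# Matches lines of the form shown above; the score is assumed to be a
# plain decimal number such as 0.85.
LINE_PATTERN = re.compile(
    r"^(?P<first>.+) is similar to (?P<second>.+) "
    r"with similarity index: (?P<score>\d+(?:\.\d+)?)\s*$"
)


def parse_similarity_line(line):
    """Return (first_file, second_file, score), or None if the line does not match."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        return None
    return match["first"], match["second"], float(match["score"])
```

The lines themselves could be captured with subprocess.run(["csim", "--files", "file1.py", "file2.py"], capture_output=True, text=True) and then passed to the parser one at a time.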

Option --path (Specify Directory)

This option will compare all the files in the specified directory and output the similarity index for each pair of files. Because every pair is compared, the number of comparisons grows quadratically with the number of files, so it is recommended to use this option with a small number of files.

csim --path /path/to/directory  

Output

file1.py is similar to file2.py with similarity index: X.XX
file1.py is similar to file3.py with similarity index: X.XX
...
fileN.py is similar to fileM.py with similarity index: X.XX

Notes:

  • Only .py files within the directory are considered.
  • The output uses full file paths when reporting similarities.
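
To get a feel for the cost of --path before running it, the number of file pairs can be estimated up front. This is a sketch under assumptions: it supposes each unordered pair is compared once and that only the top level of the directory is scanned, neither of which is confirmed here. The function names are hypothetical.

```python
from itertools import combinations
from pathlib import Path


def python_file_pairs(directory):
    """List every unordered pair of .py files directly inside the directory."""
    files = sorted(Path(directory).glob("*.py"))  # non-recursive (assumption)
    return list(combinations(files, 2))


def pair_count(n_files):
    """Number of unordered pairs among n files: n * (n - 1) / 2."""
    return n_files * (n_files - 1) // 2
```

For example, 10 files give 45 pairs, while 100 files already give 4950.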

Option --lang (Specify Language)

You can specify the input language. Currently, only python is supported and it is the default.

csim --files file1.py file2.py --lang python

Option --threshold (Specify Similarity Threshold)

You can specify a similarity threshold to group files based on their similarity. Only available when using the --path option. Files whose similarity index is above the threshold are grouped together in the output; the remaining files are listed as unique.

csim --path /path/to/directory --threshold 0.7

Output

Threshold: 0.7
Total files processed: N
Group 1 (Average similarity: X.XX):
  file1.py
  file2.py
Group 2 (Average similarity: X.XX):
  file3.py
  file4.py
...
Unique files (similarity below threshold):
  fileN.py
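
One way to picture how such groups could be formed is to link every pair of files whose similarity reaches the threshold and report the connected clusters. The sketch below illustrates that idea only. How csim actually builds its groups is not described here, so the strategy, the treatment of scores exactly equal to the threshold, and the name group_by_threshold are all assumptions.

```python
from itertools import combinations


def group_by_threshold(files, scores, threshold):
    """Group files linked by a pairwise score at or above the threshold.

    scores maps frozenset({file_a, file_b}) to a similarity value.
    Returns (groups, unique), where groups is a list of
    (members, average_similarity) tuples.
    """
    parent = {name: name for name in files}

    def find(name):
        # Follow parent links to the cluster root, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(files, 2):
        if scores.get(frozenset((a, b)), 0.0) >= threshold:
            parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for name in files:
        clusters.setdefault(find(name), []).append(name)

    groups, unique = [], []
    for members in clusters.values():
        if len(members) == 1:
            unique.extend(members)
            continue
        inside = [
            scores[frozenset(pair)]
            for pair in combinations(members, 2)
            if frozenset(pair) in scores
        ]
        groups.append((members, sum(inside) / len(inside)))
    return groups, unique
```

With four files where only one pair scores 0.9 and the threshold is 0.7, this returns a single two-file group with average similarity 0.9 and lists the other two files as unique.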

Option --talg (Specify Tree Edit Distance Algorithm)

You can specify the tree edit distance algorithm to use for comparisons. The available options are zss (default) and apted.

csim --files file1.py file2.py --talg apted

Alternatively, you can use csim as a Python module:

from csim import Compare
code_a = "a = 5"
code_b = "c = 50"
similarity = Compare(name_a = 'example A', content_a = code_a, name_b = 'example B', content_b = code_b)
print(f"Similarity: {similarity}") # Output: Similarity: X.XX
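
Building on the module example above, a small helper can read two files from disk and pass their contents to Compare. This is a sketch: compare_files is a hypothetical name, the keyword arguments are taken from the example above, and the return value is assumed to be whatever Compare returns. The compare parameter exists only so the helper can be tried with a stand-in function when csim is not installed.

```python
from pathlib import Path


def compare_files(path_a, path_b, compare=None):
    """Read two source files and compare their contents.

    If no compare function is supplied, csim's Compare is imported lazily
    and called with the keyword arguments shown in the example above.
    """
    if compare is None:
        from csim import Compare as compare  # requires csim to be installed
    path_a, path_b = Path(path_a), Path(path_b)
    return compare(
        name_a=path_a.name,
        content_a=path_a.read_text(encoding="utf-8"),
        name_b=path_b.name,
        content_b=path_b.read_text(encoding="utf-8"),
    )
```

A call such as compare_files("file1.py", "file2.py") would then mirror the csim --files command from Python code.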

ANTLR4 Installation and Parser/Lexer Generation

This installation is not required—the generated files are already included in the project. If you'd like to review the steps to generate them yourself, see grammars/parser_gen_guide.md.

Note: The included generated files were produced by ANTLR 4.13.2 and are compatible with the pinned runtime listed above.

Contributing

Contributions are welcome! To contribute, please follow these steps:

  1. Fork the repository.
  2. Create a new branch (git checkout -b feature/new-feature).
  3. Make your changes and commit them (git commit -am 'Add new feature').
  4. Push to the branch (git push origin feature/new-feature).
  5. Open a Pull Request.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Third-Party Licenses

This project utilizes the following third-party libraries:

ANTLR (ANother Tool for Language Recognition)

ANTLR4-parser-for-Python-3.14 by RobEin

zss (Zhang-Shasha)

apted (All Path Tree Edit Distance)
