Detect similarities between Python source files

pyastsim

Calculates the similarity between a batch of source files.

Installation

The program can be installed using pip:

pip3 install pyastsim

Usage

usage: pyastsim [-h] [--threshold THRESHOLD] [--show-diff]
                [--function FUNCTION]
                files [files ...]

Check source files for similarity

positional arguments:
  files                 List of files to compare

optional arguments:
  -h, --help            show this help message and exit
  --threshold THRESHOLD
                        Similarity threshold. Values below this are not
                        reported.
  --show-diff           Show entire diff when reporting results.
  --function FUNCTION   Specific function to compare (Python source only)

Examples

Check a group of files for similarity using default settings:

pyastsim *.py

Set a custom threshold to make the check more or less sensitive (the default threshold is 80% similarity):

pyastsim --threshold 90 *.py
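As a rough illustration of what the threshold does, the sketch below scores every pair of files and keeps only the pairs at or above the threshold. The names `report_similar`, `sources`, and `score` are hypothetical and not part of pyastsim; the scoring function is passed in by the caller, so nothing here reflects the tool's internals.

```python
from itertools import combinations


def report_similar(sources, score, threshold=80.0):
    """Return (name_a, name_b, similarity) for pairs at or above threshold.

    sources: dict mapping a file name to its source text (hypothetical input).
    score:   callable returning a similarity percentage for two source strings.
    """
    results = []
    # Compare each unordered pair of files exactly once.
    for (name_a, src_a), (name_b, src_b) in combinations(sorted(sources.items()), 2):
        value = score(src_a, src_b)
        # Values below the threshold are not reported.
        if value >= threshold:
            results.append((name_a, name_b, value))
    return results
```

Raising the threshold reports fewer pairs, and lowering it reports more.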

Show full diffs when reporting similar files:

pyastsim --show-diff *.py

Remove all but one function from the AST before performing comparison:

pyastsim --function my_func *.py
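One way to picture this option is a filter over the parsed module that keeps a single function definition. The following is a minimal sketch under assumptions, not pyastsim's own code: `keep_only_function` is a hypothetical helper, it only considers top-level functions, and it relies on `ast.unparse`, which requires Python 3.9 or later.

```python
import ast


def keep_only_function(source, name):
    """Return source text containing only the top-level function `name`."""
    tree = ast.parse(source)
    # Drop every top-level statement except the requested function definition.
    tree.body = [
        node for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        and node.name == name
    ]
    return ast.unparse(tree)
```

Restricting the comparison this way keeps unrelated helper code or imports from affecting the score.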

Language Support

  • Python (using internal AST for comparison)
  • C/C++ (using GCC assembly output for comparison)

Difference Calculation

For Python files, the difference is calculated by first converting each supplied file to an abstract syntax tree (AST). The AST is then normalized to remove comments and docstrings and to standardize identifier names. We then convert the AST back to Python source code and calculate the Damerau–Levenshtein distance between each pair of normalized source files. We further normalize this distance by dividing it by the mean number of Unicode code points in the two files being compared. This gives us a rough percentage similarity between the files. To summarize:

  1. Convert to AST
  2. Remove comments and docstrings
  3. Normalize identifiers
  4. Convert back to source
  5. Calculate Damerau–Levenshtein distance
  6. Convert the edit distance to a percentage
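The steps above can be sketched in plain Python. This is an illustrative approximation, not pyastsim's actual implementation. The names `normalize`, `osa_distance`, and `similarity` are hypothetical. The renaming scheme (`v0`, `v1`, ...) is an assumption. The distance is the optimal string alignment variant of Damerau–Levenshtein. The final formula, 100 × (1 − distance / mean length), is one reading of "dividing by the mean" in the description. `ast.unparse` requires Python 3.9 or later.

```python
import ast


class _Normalizer(ast.NodeTransformer):
    """Drop docstrings and rename identifiers to v0, v1, ... (assumed scheme)."""

    def __init__(self):
        self.names = {}

    def _alias(self, name):
        # The first distinct name becomes v0, the second v1, and so on.
        return self.names.setdefault(name, f"v{len(self.names)}")

    def _strip_docstring(self, node):
        body = node.body
        if (body and isinstance(body[0], ast.Expr)
                and isinstance(body[0].value, ast.Constant)
                and isinstance(body[0].value.value, str)):
            node.body = body[1:] or [ast.Pass()]

    def visit_Module(self, node):
        self._strip_docstring(node)
        self.generic_visit(node)
        return node

    def visit_FunctionDef(self, node):
        self._strip_docstring(node)
        node.name = self._alias(node.name)
        self.generic_visit(node)
        return node

    def visit_arg(self, node):
        node.arg = self._alias(node.arg)
        return node

    def visit_Name(self, node):
        node.id = self._alias(node.id)
        return node


def normalize(source):
    """Steps 1-4: parse (which discards comments), normalize, unparse."""
    tree = _Normalizer().visit(ast.parse(source))
    return ast.unparse(tree)


def osa_distance(a, b):
    """Step 5: edit distance with adjacent transpositions (OSA variant)."""
    prev2, prev = None, list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i] + [0] * len(b)
        for j, cb in enumerate(b, 1):
            cost = ca != cb
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
            # Count a swap of two adjacent characters as a single edit.
            if i > 1 and j > 1 and ca == b[j - 2] and a[i - 2] == cb:
                cur[j] = min(cur[j], prev2[j - 2] + 1)
        prev2, prev = prev, cur
    return prev[len(b)]


def similarity(source_a, source_b):
    """Step 6: scale the distance by the mean length into a percentage."""
    norm_a, norm_b = normalize(source_a), normalize(source_b)
    mean_len = (len(norm_a) + len(norm_b)) / 2
    if mean_len == 0:
        return 100.0
    return max(0.0, 100.0 * (1 - osa_distance(norm_a, norm_b) / mean_len))
```

With this sketch, two functions that differ only in identifier names, comments, and docstrings normalize to the same text and therefore score as fully similar. The pure-Python distance is quadratic in the length of the normalized source, so it is only practical for small inputs.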
