
Detect similarities between Python source files




Calculates the similarity between a batch of source files.


The program can be installed using pip:

pip3 install pyastsim


usage: pyastsim [-h] [--threshold THRESHOLD] [--show-diff]
                [--function FUNCTION]
                files [files ...]

Check source files for similarity

positional arguments:
  files                 List of files to compare

optional arguments:
  -h, --help            show this help message and exit
  --threshold THRESHOLD
                        Similarity threshold. Values below this are not
                        reported.
  --show-diff           Show entire diff when reporting results.
  --function FUNCTION   Specific function to compare (Python source only)


Check a group of files for similarity using the default settings:

pyastsim *.py

Set a custom threshold to make the check more or less sensitive (the default threshold is 80% similarity):

pyastsim --threshold 90 *.py
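
To illustrate how a threshold filters results, here is a minimal sketch, not pyastsim's own code. It scores every pair of sources and keeps only the pairs at or above the threshold. The function name `similar_pairs` is hypothetical, and `difflib.SequenceMatcher` is used purely as a stand-in score; the tool itself uses the edit-distance approach described under Difference Calculation.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similar_pairs(sources: dict[str, str], threshold: float = 80.0):
    """Yield (name_a, name_b, percent) for each pair at or above threshold.

    `sources` maps a file name to its text. The score here is a stand-in
    (difflib ratio), chosen only to keep the sketch dependency-free.
    """
    for (name_a, text_a), (name_b, text_b) in combinations(sources.items(), 2):
        percent = SequenceMatcher(None, text_a, text_b).ratio() * 100
        # Assumption: pairs scoring below the threshold are not reported.
        if percent >= threshold:
            yield name_a, name_b, percent


if __name__ == "__main__":
    files = {
        "a.py": "x = 1\ny = 2\n",
        "b.py": "x = 1\ny = 2\n",
        "c.py": "import sys\nprint(sys.argv)\n",
    }
    for name_a, name_b, percent in similar_pairs(files, threshold=90):
        print(f"{name_a} and {name_b}: {percent:.0f}% similar")
```

Raising the threshold reports fewer, closer matches; lowering it reports more pairs at the cost of more false positives.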

Show full diffs when reporting similar files:

pyastsim --show-diff *.py

Remove all but one function from the AST before performing comparison:

pyastsim --function my_func *.py
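
As a rough idea of what isolating one function can look like, the sketch below uses Python's standard `ast` module to drop everything except a named top-level function. The helper name `keep_only_function` is hypothetical, and restricting the search to top-level definitions is an assumption; the tool's actual behavior for methods or nested functions may differ. It requires Python 3.9 or later for `ast.unparse`.

```python
import ast


def keep_only_function(source: str, name: str) -> str:
    """Return source text containing only the top-level function `name`."""
    tree = ast.parse(source)
    # Assumption: only top-level definitions are considered.
    kept = [
        node
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        and node.name == name
    ]
    # Build a new module holding just the selected definitions.
    module = ast.Module(body=kept, type_ignores=[])
    return ast.unparse(module)


if __name__ == "__main__":
    example = (
        "import os\n"
        "\n"
        "def helper():\n"
        "    return 1\n"
        "\n"
        "def my_func(x):\n"
        "    return x + helper()\n"
    )
    print(keep_only_function(example, "my_func"))
```

Comparing only the selected function is useful when submissions share boilerplate but are expected to differ in one routine.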

Language Support

  • Python (using internal AST for comparison)
  • C/C++ (using GCC assembly output for comparison)

Difference Calculation

For Python sources, the difference is calculated by first converting each supplied file to an abstract syntax tree (AST). The AST is then normalized by removing comments and docstrings and standardizing identifier names. We then convert the AST back to Python source code and calculate the Damerau–Levenshtein distance between each pair of source files. We further normalize this distance by dividing it by the mean number of Unicode code points in the two files being compared. This gives us a rough percentage similarity between the files. To summarize:

  1. Convert to AST
  2. Remove comments and docstrings
  3. Normalize identifiers
  4. Convert back to source
  5. Calculate Damerau–Levenshtein distance
  6. Convert the edit distance to a percentage
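
The steps above can be sketched roughly as follows. This is an illustrative approximation, not pyastsim's implementation, and every name in it (`_Normalizer`, `normalize`, `osa_distance`, `similarity_percent`) is hypothetical. It makes several assumptions: identifiers are renamed to placeholders in order of first appearance, the distance is the optimal string alignment variant of Damerau–Levenshtein, lengths are measured on the normalized source, and similarity is taken as one minus the normalized distance. Comments disappear on their own because the AST does not retain them. Python 3.9 or later is needed for `ast.unparse`.

```python
import ast
from statistics import mean


class _Normalizer(ast.NodeTransformer):
    """Strip docstrings and rename identifiers to placeholders."""

    def __init__(self):
        self.names = {}

    def _placeholder(self, name):
        # The same original name always maps to the same placeholder.
        return self.names.setdefault(name, f"v{len(self.names)}")

    def _strip_docstring(self, node):
        body = node.body
        if (body and isinstance(body[0], ast.Expr)
                and isinstance(body[0].value, ast.Constant)
                and isinstance(body[0].value.value, str)):
            node.body = body[1:] or [ast.Pass()]

    def visit_Module(self, node):
        self._strip_docstring(node)
        return self.generic_visit(node)

    def _visit_definition(self, node):
        self._strip_docstring(node)
        node.name = self._placeholder(node.name)
        return self.generic_visit(node)

    visit_FunctionDef = visit_AsyncFunctionDef = visit_ClassDef = _visit_definition

    def visit_arg(self, node):
        node.arg = self._placeholder(node.arg)
        return node

    def visit_Name(self, node):
        node.id = self._placeholder(node.id)
        return node


def normalize(source):
    """Steps 1-4: parse, strip docstrings, rename identifiers, unparse."""
    tree = _Normalizer().visit(ast.parse(source))
    return ast.unparse(tree)


def osa_distance(a, b):
    """Step 5: optimal string alignment distance (restricted Damerau-Levenshtein)."""
    prev2, prev = None, list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cost = ca != cb
            best = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
            # Adjacent transposition, e.g. "ab" -> "ba", counts as one edit.
            if i > 1 and j > 1 and ca == b[j - 2] and a[i - 2] == cb:
                best = min(best, prev2[j - 2] + 1)
            cur.append(best)
        prev2, prev = prev, cur
    return prev[-1]


def similarity_percent(source_a, source_b):
    """Step 6: turn the edit distance into a rough percentage similarity."""
    norm_a, norm_b = normalize(source_a), normalize(source_b)
    mean_length = mean((len(norm_a), len(norm_b)))
    if mean_length == 0:
        return 100.0  # two empty sources are treated as identical
    distance = osa_distance(norm_a, norm_b)
    # Assumption: similarity = 1 - distance / mean length, floored at 0.
    return max(0.0, 1 - distance / mean_length) * 100
```

With this sketch, two files that differ only in identifier names, comments, and docstrings normalize to the same text and so score as fully similar, which is the point of normalizing before measuring edit distance.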

