CD4Py: Code De-Duplication for Python

Intro

CD4Py is a code de-duplication tool for the Python programming language. It detects exact and near-duplicate source code files. When training a machine learning model on source code, it is essential to identify and remove duplicate files from the dataset. Otherwise, code duplication can significantly reduce the practical usefulness of machine learning-based tools, especially on unseen data.

Quick Installation

$ git clone https://github.com/saltudelft/CD4Py.git && cd CD4Py
$ pip install .

Usage

$ cd4py --help
usage: cd4py [-h] --p P --od OD --ot OT [--d D] [--th TH] [--k K] [--tr TR]

Code De-Duplication for Python

optional arguments:
  -h, --help  show this help message and exit
  --p P       Path to Python projects
  --od OD     Output folder to store detected duplicate files.
  --ot OT     Output folder to store tokenized files.
  --d D       Dimension of TF-IDF vectors [default: 2048].
  --th TH     Threshold to identify duplicate files [default: 0.95].
  --k K       Number of nearest neighbor [default: 10].
  --tr TR     Number trees to build the index. More trees gives higher
              precision but slower [default: 20].

Examples

  • Run CD4Py to identify duplicate files for a Python dataset
$ cd4py --p $PYTHON_DATASET --ot $TOKENS --od py_dataset_duplicates.jsonl.gz --d 1024

Replace $PYTHON_DATASET with the path to the folder containing the Python projects and $TOKENS with the path where the tokenized project files should be stored. The detected duplicate files are written to py_dataset_duplicates.jsonl.gz.

  • The following code example shows the removal of duplicate files using the example file py_dataset_duplicates.jsonl.gz:
import random
from dpu_utils.utils.dataloading import load_jsonl_gz

# Load the clusters once; each entry is a list of duplicate file paths
clusters = list(load_jsonl_gz('py_dataset_duplicates.jsonl.gz'))
# Randomly select one file to keep from each cluster of duplicate files
clusters_rand_files = {random.choice(cluster) for cluster in clusters}
# All remaining files in the clusters are the duplicates to remove
duplicate_files = {f for cluster in clusters for f in cluster} - clusters_rand_files
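
If dpu_utils is not installed, the clusters file can probably be read with the standard library alone. The sketch below assumes that each line of the .jsonl.gz file is a JSON array of file paths forming one cluster, which is what the snippet above implies. The helper names read_clusters and split_keep_remove are hypothetical and not part of CD4Py.

```python
import gzip
import json
import random


def read_clusters(path):
    """Read clusters from a gzipped JSON Lines file.

    Assumption: every non-empty line is a JSON list of file paths.
    """
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def split_keep_remove(clusters, seed=None):
    """Keep one randomly chosen file per cluster; return (kept, to_remove) as sets."""
    rng = random.Random(seed)
    kept = {rng.choice(cluster) for cluster in clusters if cluster}
    to_remove = {f for cluster in clusters for f in cluster} - kept
    return kept, to_remove
```

Calling split_keep_remove(read_clusters('py_dataset_duplicates.jsonl.gz'), seed=42) returns the files to keep and the files to drop. Passing a seed makes the random selection reproducible across runs.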

Approach

The CD4Py code de-duplication tool uses the following procedure to identify duplicate files in a Python code corpus:

  1. Tokenize all the source code files in the code corpus using the tokenize module of the Python standard library.
  2. Preprocess the tokenized source files by keeping only identifier tokens and removing language keywords.
  3. Convert the preprocessed tokenized files to a vector representation using the TF-IDF method.
  4. Perform a k-nearest neighbor search to find k candidate duplicate files for each source code file. Then keep only the candidates that pass the threshold t (set with --th).
  5. Find clusters of duplicate source code files, assuming that similarity is transitive.
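
The five steps above can be illustrated with a small, standard-library-only sketch. It is not CD4Py's implementation: it uses an exact brute-force neighbor search instead of the tree-based index implied by the --tr option, it does not limit the TF-IDF dimension as --d does, and the smoothed IDF formula is an assumption. All function names are hypothetical.

```python
import io
import keyword
import math
import tokenize
from collections import Counter


def identifier_tokens(source):
    """Steps 1-2: tokenize with the stdlib and keep identifiers that are not keywords."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [t.string for t in tokens
            if t.type == tokenize.NAME and not keyword.iskeyword(t.string)]


def tfidf_vectors(docs):
    """Step 3: sparse, L2-normalized TF-IDF vectors (dicts) with an assumed smoothed IDF."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1)
               for t, c in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two L2-normalized sparse vectors."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def duplicate_clusters(sources, k=10, threshold=0.95):
    """Steps 4-5: exact k-NN search, threshold filter, then transitive clustering."""
    names = list(sources)
    vectors = tfidf_vectors([identifier_tokens(sources[name]) for name in names])
    parent = list(range(len(names)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, u in enumerate(vectors):
        sims = sorted(((cosine(u, v), j) for j, v in enumerate(vectors) if j != i),
                      reverse=True)
        for sim, j in sims[:k]:
            if sim >= threshold:
                parent[find(i)] = find(j)  # union: similarity is treated as transitive

    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)
    return [sorted(g) for g in groups.values() if len(g) > 1]
```

Here sources is a dict mapping file names to source text. The union-find structure is one way to encode the transitivity assumption of step 5: if A matches B and B matches C, all three end up in one cluster even when A and C fall below the threshold. The brute-force search is quadratic in the number of files, so it only suits small corpora.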

