CD4Py: Code De-Duplication for Python
Intro
CD4Py is a code de-duplication tool for the Python programming language. It detects exact and near-duplicate source code files. Before training a machine learning model on source code, it is essential to identify and remove duplicate files from the dataset. Otherwise, code duplication significantly reduces the practicality of machine learning-based tools, especially on unseen data.
Quick Installation
$ git clone https://github.com/saltudelft/CD4Py.git && cd CD4Py
$ pip install .
Usage
$ cd4py --help
usage: cd4py [-h] --p P --od OD --ot OT [--d D] [--th TH] [--k K] [--tr TR]
Code De-Duplication for Python
optional arguments:
-h, --help show this help message and exit
--p P Path to Python projects
--od OD Output folder to store detected duplicate files.
--ot OT Output folder to store tokenized files.
--d D Dimension of TF-IDF vectors [default: 2048].
--th TH Threshold to identify duplicate files [default: 0.95].
--k K Number of nearest neighbor [default: 10].
--tr TR Number trees to build the index. More trees gives higher
precision but slower [default: 20].
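When the tool is driven from a script, the options listed in the help text can be assembled programmatically. The following is a minimal sketch under stated assumptions: the helper name build_cd4py_command and its parameter names are hypothetical and not part of CD4Py; only the flags and their defaults are taken from the help output. The sketch prints the command instead of running it.

```python
import shlex


def build_cd4py_command(projects, duplicates_out, tokens_out,
                        dim=2048, threshold=0.95, k=10, trees=20):
    """Build the cd4py argument list (hypothetical helper, not part of CD4Py).

    The flags and default values mirror the `cd4py --help` text.
    """
    return [
        "cd4py",
        "--p", str(projects),         # path to Python projects
        "--od", str(duplicates_out),  # output for detected duplicate files
        "--ot", str(tokens_out),      # output folder for tokenized files
        "--d", str(dim),              # dimension of TF-IDF vectors
        "--th", str(threshold),       # duplicate threshold
        "--k", str(k),                # number of nearest neighbors
        "--tr", str(trees),           # number of trees in the index
    ]


# Placeholder paths; replace them with real locations before use.
cmd = build_cd4py_command("python_projects", "duplicates.jsonl.gz", "tokens")
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) once cd4py is installed.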
Examples
- Run CD4Py to identify duplicate files in a Python dataset:
$ cd4py --p $PYTHON_DATASET --ot $TOKENS --od py_dataset_duplicates.jsonl.gz --d 1024
Replace $PYTHON_DATASET with the path to the Python project folders and $TOKENS with the path where the tokenized project files should be stored. The detected duplicate files are written to the file py_dataset_duplicates.jsonl.gz.
- The following code example shows how to remove duplicate files using the example file py_dataset_duplicates.jsonl.gz:
import random
from dpu_utils.utils.dataloading import load_jsonl_gz
# Randomly select one file to keep from each cluster of duplicate files
clusters_rand_files = [l.pop(random.randrange(len(l))) for l in load_jsonl_gz('py_dataset_duplicates.jsonl.gz')]
# Collect every file that appears in any cluster
duplicate_files = [f for l in load_jsonl_gz('py_dataset_duplicates.jsonl.gz') for f in l]
# Files to remove: all clustered files except the one kept per cluster
duplicate_files = set(duplicate_files).difference(set(clusters_rand_files))
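If reproducible results matter, the random choice can be swapped for a deterministic one. The sketch below uses only the standard library and assumes that each line of the output file is a JSON array of file paths forming one cluster, which is what the example above implies but is not confirmed here. The function names load_clusters and files_to_drop are hypothetical.

```python
import gzip
import json


def load_clusters(path):
    """Read one cluster per line (assumed layout: a JSON array of file paths)."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def files_to_drop(clusters):
    """Keep the alphabetically first file of each cluster and return the rest."""
    drop = set()
    for cluster in clusters:
        drop.update(sorted(cluster)[1:])
    return drop
```

Keeping the first file in sorted order means that repeated runs over the same clusters remove the same files, which makes dataset splits easier to reproduce.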
Approach
The CD4Py code de-duplication tool uses the following procedure to identify duplicate files in a Python code corpus:
- Tokenize all the source code files in the code corpus using the tokenize module of the Python standard library.
- Preprocess the tokenized source files by selecting only identifier tokens and removing language keywords.
- Convert the preprocessed tokenized files to a vector representation using the TF-IDF method.
- Perform a k-nearest neighbor search to find k candidate duplicate files for each source code file. Next, filter the candidate duplicate files using the threshold t.
- Find clusters of duplicate source code files, assuming that similarity is transitive.
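The procedure above can be illustrated with a small, self-contained sketch. It is not CD4Py's implementation: it uses only the standard library, builds sparse TF-IDF vectors by hand with one common smoothed IDF formula, and replaces the k-nearest neighbor search with an exhaustive pairwise comparison, which is practical only for small corpora. All function names are hypothetical.

```python
import io
import keyword
import math
import tokenize
from collections import Counter


def identifier_tokens(source):
    """Tokenize with the stdlib and keep identifiers that are not keywords."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [t.string for t in tokens
            if t.type == tokenize.NAME and not keyword.iskeyword(t.string)]


def tfidf_vectors(docs):
    """Return L2-normalized sparse TF-IDF vectors as {token: weight} dicts."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    vectors = []
    for doc in docs:
        vec = {tok: cnt * (math.log((1 + n) / (1 + df[tok])) + 1)
               for tok, cnt in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({tok: w / norm for tok, w in vec.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity of two L2-normalized sparse vectors."""
    return sum(w * b.get(tok, 0.0) for tok, w in a.items())


def duplicate_clusters(sources, threshold=0.95):
    """Cluster files whose similarity reaches the threshold.

    `sources` maps a file name to its source text. Pairs are compared
    exhaustively (a stand-in for k-nearest neighbor search) and merged
    with union-find, which treats similarity as transitive.
    """
    names = list(sources)
    vectors = tfidf_vectors([identifier_tokens(sources[n]) for n in names])
    parent = list(range(len(names)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)
    return [sorted(g) for g in groups.values() if len(g) > 1]
```

Because only identifier tokens are compared, two files that differ solely in comments, literals, or formatting produce identical vectors and end up in the same cluster.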