A Python package to search for and remove duplicated files in messy datasets

Reason this release was yanked:

Incorrect versioning

Project description

deduplify

A Python tool to search for and remove duplicated files in messy datasets.

Overview

deduplify is a Python command line tool that searches a directory tree for duplicated files and optionally removes them. It generates an MD5 hash for every file under a target directory, recursing into subdirectories, and groups the filepaths according to whether their hash is unique or duplicated. When deleting duplicated files, it deletes those deepest in the directory tree first, so that a single copy is left in place.

Installation

deduplify has a minimum Python requirement of v3.7 but has been developed in v3.8.

Manual Installation

Begin by cloning this repository and changing into it.

git clone https://github.com/Living-with-machines/deduplify.git
cd deduplify

Now run the setup script. This will install any requirements and the CLI tool into your Python $PATH.

python setup.py install

Usage

deduplify has 3 commands: hash, compare and clean.

Hashing files

The hash command takes a path to a target directory as an argument. It walks this directory tree, generates an MD5 hash for every file, and writes two JSON files. The default filenames can be overridden using the --dupfile [-d] and --unfile [-u] flags.

One JSON file contains hashes that are considered "unique" since only one filepath generated this hash. This file is organised such that the keys are the hashes and the values are the filepaths that generated the hashes.

The second JSON file contains hashes that are considered "duplicated" since more than one filepath generated the same hash. This file is organised such that the keys are the hashes and the values are a list of the filepaths that generated the duplicated hashes.
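The grouping described above can be illustrated with a minimal sketch. This is not deduplify's own code: the function names md5_of_file and group_by_hash are hypothetical, and storing a single path string per unique hash is an assumption about the output layout.

```python
import hashlib
import os
from collections import defaultdict


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_by_hash(target_dir):
    """Walk target_dir and return (uniques, duplicates) keyed by MD5 hash.

    Hypothetical helper: uniques maps hash -> single filepath (assumed layout),
    duplicates maps hash -> list of filepaths.
    """
    paths_by_hash = defaultdict(list)
    for root, _dirs, files in os.walk(target_dir):
        for name in files:
            path = os.path.join(root, name)
            paths_by_hash[md5_of_file(path)].append(path)

    uniques = {h: p[0] for h, p in paths_by_hash.items() if len(p) == 1}
    duplicates = {h: p for h, p in paths_by_hash.items() if len(p) > 1}
    return uniques, duplicates
```

Each of the two dictionaries could then be written out with json.dump to produce files shaped like the ones described above.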

Command line usage:

usage: deduplify hash [-h] [-c COUNT] [-v] [-d DUPFILE] [-u UNFILE] dir

positional arguments:
  dir                   Path to directory to begin search from

optional arguments:
  -h, --help            show this help message and exit
  -c COUNT, --count COUNT
                        Number of threads to parallelise over. Default: 1
  -v, --verbose         Print logging messages to the console
  -d DUPFILE, --dupfile DUPFILE
                        Destination file for duplicated hashes. Must be a JSON file. Default: duplicates.json
  -u UNFILE, --unfile UNFILE
                        Destination file for unique hashes. Must be a JSON file. Default: uniques.json

Comparing files

The compare command reads in the JSON file of duplicates generated by running hash. If the data were saved under a different name, the filename can be set using the --infile [-i] flag. The command checks whether the stems of the filepaths are equivalent for all paths that generated a given hash. If they are, the file is a true duplicate, since both its name and content match. If they are not, the same content is saved under two different filenames. In this scenario, a ValueError is raised and the user is asked to manually investigate these files.

If all the filenames for a given hash match, then the shortest filepath is removed from the list and the rest are returned to be deleted. To delete files, the user needs to run compare with the --purge flag set.
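As a rough sketch of the selection logic described above, under stated assumptions: the function name select_files_to_delete is hypothetical, and the comparison here uses the final path component (os.path.basename), which may differ from the exact check deduplify performs.

```python
import os


def select_files_to_delete(duplicates):
    """Return the filepaths to delete, keeping the shortest path per hash.

    duplicates maps hash -> list of filepaths, as in the duplicates JSON file.
    Raises ValueError if paths sharing a hash have different filenames.
    """
    to_delete = []
    for file_hash, paths in duplicates.items():
        # Assumption: "filenames match" means identical final path components.
        names = {os.path.basename(p) for p in paths}
        if len(names) > 1:
            raise ValueError(
                f"Hash {file_hash} has differing filenames: {sorted(names)}. "
                "Please investigate these files manually."
            )
        keep = min(paths, key=len)  # the shortest filepath is kept
        to_delete.extend(p for p in paths if p != keep)
    return to_delete
```

Returning the list rather than deleting straight away mirrors the behaviour described above, where nothing is removed unless --purge is set.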

A recommended workflow to ensure that all duplicated files have been removed would be as follows:

deduplify hash target_dir  # First pass at hashing files
deduplify compare --purge  # Delete duplicated files
deduplify hash target_dir  # Second pass at hashing files
deduplify compare          # Compare the filenames again. The code should return nothing to compare

Command line usage:

usage: deduplify compare [-h] [-c COUNT] [-v] [-i INFILE] [--purge]

optional arguments:
  -h, --help            show this help message and exit
  -c COUNT, --count COUNT
                        Number of threads to parallelise over. Default: 1
  -v, --verbose         Print logging messages to the console
  -i INFILE, --infile INFILE
                        Filename to analyse. Must be a JSON file. Default: duplicates.json
  --purge               Deletes duplicated files. Default: False

Cleaning up

After purging duplicated files, the target directory may be left with empty sub-directories. Running the clean command will locate and delete these empty sub-directories.
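One way to sketch this kind of clean-up is a bottom-up walk, so that a directory emptied by removing its children is also caught. This is an illustration only; remove_empty_dirs is a hypothetical name, and leaving the target directory itself in place is an assumption.

```python
import os


def remove_empty_dirs(target_dir):
    """Delete empty sub-directories under target_dir and return their paths."""
    removed = []
    # topdown=False visits children before their parents.
    for root, _dirs, _files in os.walk(target_dir, topdown=False):
        if root == target_dir:
            continue  # assumption: the target directory itself is kept
        if not os.listdir(root):  # re-check on disk, children may be gone now
            os.rmdir(root)
            removed.append(root)
    return removed
```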

Command line usage:

usage: deduplify clean [-h] [-c COUNT] [-v] dir

positional arguments:
  dir                   Path to directory to begin search from

optional arguments:
  -h, --help            show this help message and exit
  -c COUNT, --count COUNT
                        Number of threads to parallelise over. Default: 1
  -v, --verbose         Print logging messages to the console

Global arguments

The following flags can be passed to any of the commands of deduplify.

  • --verbose [-v]: The flag will print verbose output to the console, as opposed to saving it to the deduplify.log file.
  • --count [-c]: Some processes within deduplify can be parallelised over multiple threads when working with larger datasets. To do this, include the --count flag with the (integer) number of threads you'd like to parallelise over. An error is raised if more threads are requested than there are CPUs available on the host machine.
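The check on --count could look something like the sketch below. The function name validate_thread_count and the use of ValueError are assumptions; the text above only says that an error is raised.

```python
import os


def validate_thread_count(count):
    """Return count if it does not exceed the CPUs available, else raise."""
    available = os.cpu_count() or 1  # os.cpu_count() may return None
    if count < 1:
        raise ValueError("Thread count must be a positive integer.")
    if count > available:
        # Assumption: the error type is illustrative, not deduplify's own.
        raise ValueError(
            f"Requested {count} threads but only {available} CPUs are available."
        )
    return count
```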

Contributing

Thank you for wanting to contribute to deduplify! 🎉 💖 To get you started, please read our Code of Conduct and Contributing Guidelines.

Download files

Source Distribution

deduplify-20.9.0.tar.gz (10.5 kB view details)

Uploaded Source

Built Distributions

deduplify-20.9.0-py3.8.egg (15.0 kB view details)

Uploaded Source

deduplify-20.9.0-py3-none-any.whl (10.1 kB view details)

Uploaded Python 3

File details

Details for the file deduplify-20.9.0.tar.gz.

File metadata

  • Download URL: deduplify-20.9.0.tar.gz
  • Upload date:
  • Size: 10.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.6.0.post20200814 requests-toolbelt/0.9.1 tqdm/4.48.2 CPython/3.8.5

File hashes

Hashes for deduplify-20.9.0.tar.gz
Algorithm Hash digest
SHA256 a1aa3ea5284aa4c4f7d7fba20ae73b7847e4440c33ddf6f98111cd5f5cfe1122
MD5 dd518c02ffb04d795b4c40605ddf0ecd
BLAKE2b-256 ce12028a521886e5fc2d8bce4066b1db9be47835dfeb2d0bf870d3e045389416

See more details on using hashes here.

File details

Details for the file deduplify-20.9.0-py3.8.egg.

File metadata

  • Download URL: deduplify-20.9.0-py3.8.egg
  • Upload date:
  • Size: 15.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.6.0.post20200814 requests-toolbelt/0.9.1 tqdm/4.48.2 CPython/3.8.5

File hashes

Hashes for deduplify-20.9.0-py3.8.egg
Algorithm Hash digest
SHA256 bd5928c74e273991b34aeb5d69bbf18fd47cba2475865fc42d555b0b2d612fbd
MD5 0f1287e17f93a1c8faf5d7e051f210ea
BLAKE2b-256 f6c22554b23122b0fe9e7b2180a9ce674daeeb735f0130c71827c2a61275bbc2

File details

Details for the file deduplify-20.9.0-py3-none-any.whl.

File metadata

  • Download URL: deduplify-20.9.0-py3-none-any.whl
  • Upload date:
  • Size: 10.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.6.0.post20200814 requests-toolbelt/0.9.1 tqdm/4.48.2 CPython/3.8.5

File hashes

Hashes for deduplify-20.9.0-py3-none-any.whl
Algorithm Hash digest
SHA256 cedf42de7a5c20f928eebc23799f19a9a296194f9861e3d913ef82326cf46f86
MD5 486441d798905ce0368437e12e0720dd
BLAKE2b-256 e54e3a950bf397a6b7d9ef139675c75854688af3d87665c5eea0a9a3ded3b54b
