deduplify

A Python tool to search for and remove duplicated files in messy datasets.

Table of Contents:

Overview
Installation
- From PyPI
- Manual Installation
Usage
Contributing

Overview

deduplify is a Python command line tool that searches a directory tree for duplicated files and optionally removes them. It generates an MD5 hash for each file recursively under a target directory and identifies the filepaths that generate unique and duplicated hashes. When deleting duplicated files, it deletes those deepest in the directory tree first, leaving the last remaining copy in place.
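
To make the idea concrete, here is a minimal sketch of hashing every file under a directory and splitting the results into unique and duplicated hashes. It is not deduplify's own code: the function names (md5_of_file, group_by_hash, split_unique_and_duplicated) and the chunk size are assumptions made for illustration, using only the Python standard library.

```python
import hashlib
import os
from collections import defaultdict


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large files are never loaded whole.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_by_hash(target_dir):
    """Map each MD5 hash to the list of filepaths that produce it."""
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(target_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            groups[md5_of_file(path)].append(path)
    return dict(groups)


def split_unique_and_duplicated(groups):
    """Separate hashes seen once from hashes seen more than once."""
    unique = {h: p for h, p in groups.items() if len(p) == 1}
    duplicated = {h: p for h, p in groups.items() if len(p) > 1}
    return unique, duplicated
```

Reading in chunks keeps memory use flat regardless of file size, which matters when a dataset contains large files.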

Installation

deduplify has a minimum Python requirement of v3.7 but has been developed in v3.8.

From PyPI

First, make sure your pip version is up-to-date.

python -m pip install --upgrade pip

Then install deduplify.

pip install deduplify

Manual Installation

Begin by cloning this repository and changing into it.

git clone https://github.com/Living-with-machines/deduplify.git
cd deduplify

Now run the setup script. This will install any requirements and the CLI tool into your Python $PATH.

python setup.py install

Usage

deduplify has 3 commands: hash, compare and clean.

Hashing files

The hash command takes a path to a target directory as an argument. It walks this directory tree, generates an MD5 hash for every file, and writes the results to a database stored as a JSON file. The name of this file can be changed using the --dbfile [-f] flag.

Each document in the generated database can be described as a dictionary with the following properties:

{
  "filepath": "",     # String. The full path to a given file.
  "hash": "",         # String. The MD5 hash of the given file.
  "duplicate": bool,  # Boolean. Whether this hash is repeated in the database (True) or not (False).
}
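
As a rough illustration of the structure above, the following sketch builds records of this shape from a mapping of filepaths to hashes and serialises them as JSON. The helper name build_records and the example paths and hash strings are hypothetical, and storing the records as a plain JSON list is an assumption rather than a description of deduplify's actual file layout.

```python
import json
from collections import Counter


def build_records(path_to_hash):
    """Turn a {filepath: md5} mapping into a list of database-style records."""
    # Count how often each hash occurs so repeated hashes can be flagged.
    counts = Counter(path_to_hash.values())
    return [
        {"filepath": path, "hash": digest, "duplicate": counts[digest] > 1}
        for path, digest in path_to_hash.items()
    ]


# Hypothetical input: two files share the placeholder hash "h1".
records = build_records({"a/x.txt": "h1", "b/x.txt": "h1", "c/y.txt": "h2"})
print(json.dumps(records, indent=2))
```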

By default, deduplify generates hashes for all files under a directory, but one or more specific file extensions to search for can be specified using the --exts flag.
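
One way such an extension filter could work is sketched below. How deduplify itself matches extensions (case sensitivity, leading dots) is not stated here, so the normalisation in this example is an assumption, and normalise_exts and iter_files are hypothetical names.

```python
import os


def normalise_exts(exts):
    """Lower-case extensions and ensure each starts with a dot (assumed convention)."""
    return {e.lower() if e.startswith(".") else "." + e.lower() for e in exts}


def iter_files(target_dir, exts=None):
    """Yield filepaths under target_dir, optionally restricted to some extensions."""
    wanted = normalise_exts(exts) if exts else None
    for dirpath, _dirnames, filenames in os.walk(target_dir):
        for name in filenames:
            # With no filter, every file is yielded.
            if wanted is None or os.path.splitext(name)[1].lower() in wanted:
                yield os.path.join(dirpath, name)
```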

Command line usage:

usage: deduplify hash [-h] [-c COUNT] [-v] [-f DBFILE] [--exts [EXTS]] [--restart] dir

positional arguments:
  dir                   Path to directory to begin search from

optional arguments:
  -h, --help            show this help message and exit
  -c COUNT, --count COUNT
                        Number of threads to parallelise over. Default: 1
  -v, --verbose         Print logging messages to the console
  -f DBFILE, --dbfile DBFILE
                        Destination database for file hashes. Must be a JSON file. Default: file_hashes.json
  --exts [EXTS]         A list of file extensions to search for.
  --restart             Restart a run of hashing files and skip over files that have already been hashed. Output file containing a database of
                        filenames and hashes must already exist.

Comparing files

The compare command reads in the JSON database generated by running hash. If the data were saved under a different name, pass that name using the --infile [-f] flag. The command checks whether the stems of the filepaths are equivalent for all paths that generated a given hash. If they are, the file is a true duplicate, since both its name and content match. If they do not match, the same content is saved under different filenames. In this scenario, a warning is raised asking the user to manually investigate these files.

If all the filenames for a given hash match, then the shortest filepath is removed from the list, so that copy is kept, and the rest are returned to be deleted. To delete files, the user needs to run compare with the --purge flag set.
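
The decision logic described above might be sketched as follows. This is an illustration under assumptions, not deduplify's implementation: plan_deletions is a hypothetical name, records are assumed to be a list of dictionaries with "filepath" and "hash" keys, and "shortest" is taken to mean the fewest characters.

```python
from collections import defaultdict
from pathlib import PurePath


def plan_deletions(records):
    """Return (to_delete, needs_review) for a list of database-style records."""
    by_hash = defaultdict(list)
    for rec in records:
        by_hash[rec["hash"]].append(rec["filepath"])

    to_delete, needs_review = [], {}
    for digest, paths in by_hash.items():
        if len(paths) < 2:
            continue  # unique content, nothing to do
        stems = {PurePath(p).stem for p in paths}
        if len(stems) > 1:
            # Same content under different names: flag for manual inspection.
            needs_review[digest] = paths
            continue
        keep = min(paths, key=len)  # shortest filepath is kept
        to_delete.extend(p for p in paths if p != keep)
    return to_delete, needs_review
```

Nothing is removed from disk in this sketch; it only returns the list that a purge step would act on, which mirrors the separation between running compare with and without --purge.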

A recommended workflow to ensure that all duplicated files have been removed would be as follows:

deduplify hash target_dir  # First pass at hashing files
deduplify compare --purge  # Delete duplicated files
deduplify hash target_dir  # Second pass at hashing files
deduplify compare          # Compare the filenames again. The code should return nothing to compare

Command line usage:

usage: deduplify compare [-h] [-c COUNT] [-v] [-f INFILE] [--list-files] [--purge]

optional arguments:
  -h, --help            show this help message and exit
  -c COUNT, --count COUNT
                        Number of threads to parallelise over. Default: 1
  -v, --verbose         Print logging messages to the console
  -f INFILE, --infile INFILE
                        Database to analyse. Must be a JSON file. Default: file_hashes.json
  --list-files          List duplicated files. Default: False
  --purge               Deletes duplicated files. Default: False

Cleaning up

After purging duplicated files, the target directory may be left with empty sub-directories. Running the clean command will locate and delete these empty sub-directories.
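
A minimal sketch of this kind of clean-up, using only the standard library, is shown below. The function name remove_empty_dirs is hypothetical, and leaving the top-level directory itself in place is an assumption made for the example.

```python
import os


def remove_empty_dirs(target_dir):
    """Delete empty sub-directories under target_dir and return their paths."""
    removed = []
    # Walk bottom-up so a parent emptied by removing its children is also caught.
    for dirpath, _dirnames, _filenames in os.walk(target_dir, topdown=False):
        if dirpath == target_dir:
            continue  # assumption: never remove the root of the search
        if not os.listdir(dirpath):
            os.rmdir(dirpath)
            removed.append(dirpath)
    return removed
```

The directory is re-listed with os.listdir rather than trusting the names reported by os.walk, because those names were collected before any child directories were removed.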

Command line usage:

usage: deduplify clean [-h] [-c COUNT] [-v] dir

positional arguments:
  dir                   Path to directory to begin search from

optional arguments:
  -h, --help            show this help message and exit
  -c COUNT, --count COUNT
                        Number of threads to parallelise over. Default: 1
  -v, --verbose         Print logging messages to the console

Global arguments

The following flags can be passed to any of the commands of deduplify.

--verbose [-v]: Prints verbose output to the console, as opposed to saving it to the deduplify.log file.
--count [-c]: Some processes within deduplify can be parallelised over multiple threads when working with larger datasets. To do this, include the --count flag with the (integer) number of threads you'd like to parallelise over. An error is raised if more threads are requested than there are CPUs available on the host machine.
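
The check on --count could look something like the sketch below. The function name validate_count, the exception type and the error messages are assumptions for illustration; the actual error deduplify raises is not specified here.

```python
import os


def validate_count(count, available=None):
    """Return count if it is a usable thread count, otherwise raise ValueError."""
    if available is None:
        # os.cpu_count() can return None, so fall back to a single CPU.
        available = os.cpu_count() or 1
    if count < 1:
        raise ValueError("count must be at least 1")
    if count > available:
        raise ValueError(
            f"Requested {count} threads but only {available} CPUs are available"
        )
    return count
```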

Contributing

Thank you for wanting to contribute to deduplify! :tada: :sparkling_heart: To get you started, please read our Code of Conduct and Contributing Guidelines.

Release history

20.9.0 (yanked; reason: incorrect versioning): Sep 11, 2020
0.5.0 (this version): Apr 24, 2022
0.4.2: Mar 9, 2022
0.4.1: Mar 6, 2022
0.4.0: Feb 28, 2022
0.3.0: Feb 28, 2022
0.2.0: Feb 26, 2022
0.1.5: Feb 26, 2022
0.1.4: Feb 26, 2022
0.1.3: Feb 26, 2022
0.1.2: Oct 1, 2020

Download files

Source Distribution

deduplify-0.5.0.tar.gz (12.2 kB)

Uploaded Apr 24, 2022 Source

Built Distribution

deduplify-0.5.0-py3-none-any.whl (11.4 kB)

Uploaded Apr 24, 2022 Python 3

File details

Details for the file deduplify-0.5.0.tar.gz.

File metadata

Download URL: deduplify-0.5.0.tar.gz
Upload date: Apr 24, 2022
Size: 12.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.0 CPython/3.9.12

File hashes

Hashes for deduplify-0.5.0.tar.gz
Algorithm	Hash digest
SHA256	`91c23f348bf4a5c46d33535388827e10872fb328d567f5c81a6f0629262ac94f`
MD5	`d6fa8011b2a1e459a8fd3ff2c5c80bb7`
BLAKE2b-256	`309518145ab4d547784bdd32df8775f281a761e66ad70a07237d76c9cbc2d315`

File details

Details for the file deduplify-0.5.0-py3-none-any.whl.

File metadata

Download URL: deduplify-0.5.0-py3-none-any.whl
Upload date: Apr 24, 2022
Size: 11.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.0 CPython/3.9.12

File hashes

Hashes for deduplify-0.5.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`6d4508981367b6c6947d945347323e062d769fe02602172f305f30e8f050c8c1`
MD5	`43a2efee719d0827aaa017ba3a73afe3`
BLAKE2b-256	`427f974127b0ea7a92d4ef6574567b3349b80c8cea2194f7293b8f0afe9939f1`
