Duplicate file finder

Info

Dedup is a cross-platform command-line Python application that efficiently detects and reports duplicate files on your system.

Requirements

  • Linux (fully tested), Windows, macOS
  • Python > 3.6
  • No third-party modules required ✅

Installation

git clone https://github.com/brightio/dedup
or
wget https://raw.githubusercontent.com/brightio/dedup/main/dedup.py

Modes

➤ Normal mode

It detects duplicate files in the given directories and/or files.

➤ Target mode

It detects whether the given directories and/or files exist in the target directories and/or files, which are specified with -t.
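
One way to picture target mode is as a membership check on content hashes. The sketch below is only an illustration under stated assumptions, not dedup's own code: `sha1_of` and `files_present_in_targets` are hypothetical names, and it handles flat lists of file paths rather than directories.

```python
import hashlib


def sha1_of(path, chunk_size=1 << 20):
    """Return the SHA1 hex digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_present_in_targets(items, targets):
    """Map each item path to the target paths that hold identical content."""
    targets_by_hash = {}
    for target in targets:
        targets_by_hash.setdefault(sha1_of(target), []).append(target)
    # An empty list means the item was not found among the targets.
    return {item: targets_by_hash.get(sha1_of(item), []) for item in items}
```

Hashing the targets once and looking each item up afterwards keeps the work proportional to the total number of files instead of comparing every item against every target.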

File treatment

  • Empty files are excluded.
  • Symbolic links are not followed.
  • Hard links are considered to be the same file.
  • Hidden files and directories are excluded (they can be included with -a, -hf, -hd).
  • The duplicate sets are sorted by the space that will be freed if the duplicate files are removed (use -s to sort by individual file size).
  • Duplicate files are detected with SHA1 hashes. For further verification, type 'v' in the interactive menu to re-check the results using MD5.
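
The rules above can be sketched roughly as follows. This is a minimal illustration, not dedup's actual implementation: `sha1_of` and `find_duplicates` are hypothetical names, grouping by size before hashing is an assumed optimisation, and "hidden" is assumed to mean a name starting with a dot.

```python
import hashlib
import os
from collections import defaultdict


def sha1_of(path, chunk_size=1 << 20):
    """Return the SHA1 hex digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root, include_hidden=False):
    """Return (size, paths) duplicate sets, largest potential saving first."""
    by_size = defaultdict(list)
    seen_inodes = set()
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        if not include_hidden:
            # Assumption: a leading dot marks a hidden file or directory.
            dirnames[:] = [d for d in dirnames if not d.startswith(".")]
            filenames = [f for f in filenames if not f.startswith(".")]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # symbolic links are not followed
            info = os.stat(path)
            if info.st_size == 0:
                continue  # empty files are excluded
            inode = (info.st_dev, info.st_ino)
            if inode in seen_inodes:
                continue  # hard links count as the same file
            seen_inodes.add(inode)
            by_size[info.st_size].append(path)

    duplicate_sets = []
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue  # a file with a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[sha1_of(path)].append(path)
        for same_hash in by_hash.values():
            if len(same_hash) > 1:
                duplicate_sets.append((size, sorted(same_hash)))

    # Space freed by keeping one copy per set: size * (copies - 1).
    duplicate_sets.sort(key=lambda s: s[0] * (len(s[1]) - 1), reverse=True)
    return duplicate_sets
```

Sorting by `size * (copies - 1)` mirrors the default ordering described above; sorting on `size` alone would correspond to what -s is described as doing.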

Item filtering

  • Use -min and -max for minimum and maximum file size respectively. The size can be specified like 500K, 2M, 10G etc.
  • Use -xf and -xd to exclude files and directories respectively. The value is treated as a regular expression. Note: More elaborate filtering can be achieved via external programs such as 'find', as 'dedup' accepts a newline-separated item list from stdin.
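
To make the filtering rules concrete, here is a small sketch of how size strings and exclusion patterns might be interpreted. It is not taken from dedup's source: `parse_size` and `keep_file` are hypothetical names, the suffixes are assumed to be multiples of 1024, and the regular expression is assumed to match anywhere in the name.

```python
import re

# Assumption: suffixes are powers of 1024; dedup may define them differently.
_UNITS = {"": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(text):
    """Convert a string such as '500K' or '2M' to a number of bytes."""
    match = re.fullmatch(r"(\d+)([KMG]?)", text.strip().upper())
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit]


def keep_file(name, size, min_size=None, max_size=None, exclude=None):
    """Return True if a file passes the size and regex filters."""
    if min_size is not None and size < min_size:
        return False  # smaller than the minimum size
    if max_size is not None and size > max_size:
        return False  # larger than the maximum size
    if exclude is not None and re.search(exclude, name):
        return False  # name matches the exclusion pattern
    return True
```

For example, `keep_file("backup.tmp", 2048, exclude=r"\.tmp$")` rejects the file because its name matches the pattern.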

Command line options

usage: dedup.py [-h] [-t TARGETS] [-s] [-u] [-S] [-I] [-V] [-xf EXCLUDE_FILES] [-xd EXCLUDE_DIRECTORIES] [-min MIN_SIZE] [-max MAX_SIZE] [-a] [-hf] [-hd] [-v]
                [ITEMS ...]

This program detects duplicate files.

positional arguments:
  ITEMS                 Files/Directories to detect duplicates

options:
  -h, --help            show this help message and exit
  -t TARGETS, --targets TARGETS
                        Files/Directories that we want to check if the ITEMS exist in there
  -s, --sort-size       Sort duplicates by size (Default: Saving size)
  -u, --show-unique     Show also unique files (Default: No)
  -S, --only-stats      Show only statistics
  -I, --non-interactive
                        Disable interactive prompts (Default: Enabled)
  -V, --verbose         Show files while they are being read
  -xf EXCLUDE_FILES, --exclude-files EXCLUDE_FILES
                        Files to exclude (regex)
  -xd EXCLUDE_DIRECTORIES, --exclude-directories EXCLUDE_DIRECTORIES
                        Directories to exclude (regex)
  -min MIN_SIZE, --min-size MIN_SIZE
                        Ommit files smaller than SIZE (Bytes).
  -max MAX_SIZE, --max-size MAX_SIZE
                        Ommit files larger than SIZE (Bytes).
  -a, --include-hidden  Include hidden files and directories (Default: No)
  -hf, --include-hidden-files
                        Include hidden files (Default: No)
  -hd, --include-hidden-directories
                        Include hidden directories (Default: No)
  -v, --version         Show version

TODO

  • Improve duplicate detection performance and interactive menu navigation.
  • Ability to save a session, produce TAB-delimited output, and write a file listing the files to be deleted.
  • Ability to look into archive/zipped files.
  • Detect duplicate directories.
  • Stop hashing candidate duplicate files as soon as their data differ. This will save time with large files (like disk images) whose sizes are the same but whose data differ.
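
The early-exit idea in the last item could look something like this. It is a sketch under assumptions, not a planned design: it compares two candidate files directly, chunk by chunk, instead of hashing them, and `same_content` is a hypothetical name.

```python
def same_content(path_a, path_b, chunk_size=1 << 20):
    """Compare two files chunk by chunk, stopping at the first difference."""
    with open(path_a, "rb") as file_a, open(path_b, "rb") as file_b:
        while True:
            chunk_a = file_a.read(chunk_size)
            chunk_b = file_b.read(chunk_size)
            if chunk_a != chunk_b:
                return False  # differing data: no need to read the rest
            if not chunk_a:
                return True  # both files ended without a mismatch
```

Python's standard library offers comparable behaviour through `filecmp.cmp(path_a, path_b, shallow=False)`, which also compares file contents rather than relying on metadata.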

Known Issues

  • Ctrl-C for stopping the program while searching for duplicates doesn't work on Windows yet.
  • Exiting the program on macOS produces a warning like the following, which I can't solve yet: /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 9 leaked semaphore objects to clean up at shutdown

Contribution

If you want to contribute to this project, please report bugs, unexpected program behaviours and/or new ideas.
