
Yet another file deduplicator

Project description

Deduplidog – Deduplicator that covers your back.


About

What are the use cases?

  • I have downloaded photos and videos from the cloud. Both Google Photos and YouTube shrink the files and change their format. Moreover, they shorten the file names to 47 characters and capitalize the extensions. So how am I supposed to know whether I have everything backed up offline when the copies are resized?
  • My disk is cluttered with several backups and I'd like to be sure these are all just copies.
  • I merge data from multiple sources. Some files in the backup might retain the original file modification date, which I might wish to restore.

What is compared?

  • The file name.

Works great when the files keep more or less the same name. (Photos downloaded from Google have their stems shortened to 47 characters, but that is enough.) Case sensitivity may be ignored.
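The name comparison can be pictured with a short sketch (the helper below is illustrative only, not Deduplidog's actual code; the 47-character truncation is modelled as one stem being a prefix of the other):

```python
from pathlib import Path

def names_match(a: str, b: str, case_sensitive: bool = False) -> bool:
    """Compare file stems, tolerating a truncated stem and
    capitalized extensions. Hypothetical helper, for illustration."""
    sa, sb = Path(a).stem, Path(b).stem
    if not case_sensitive:
        sa, sb = sa.lower(), sb.lower()
    # one stem may be a truncated (e.g. 47-char) prefix of the other
    shorter, longer = sorted((sa, sb), key=len)
    return longer.startswith(shorter)

print(names_match("IMG_20230802.JPG", "img_20230802.jpg"))  # True
```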

  • The file date.

You can require identical file mtimes, tolerate a difference of a few hours (to correct timezone confusion), or ignore the date altogether.

Note: differences smaller than one second are ignored.
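A hedged sketch of what the date check amounts to (the function below is hypothetical, not the tool's internals):

```python
def dates_match(mtime_a: float, mtime_b: float, tolerate_hours: int = 0) -> bool:
    """Compare file mtimes; sub-second differences are always ignored.
    Illustrative only."""
    diff = abs(int(mtime_a) - int(mtime_b))  # drop the sub-second part
    return diff <= tolerate_hours * 3600

print(dates_match(1714143536.2, 1714143536.9))            # True (sub-second)
print(dates_match(1714143536, 1714143536 + 2 * 3600, 2))  # True (timezone slack)
```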

  • The file size, the image hash or the video frame count.

The file must have the same size. Or take advantage of the media magic under the hood, which ignores the file size and instead compares the image or the video inside. This is great whenever some files have been converted to a different format.
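The idea behind the image hash can be illustrated with a toy average hash over a grayscale pixel grid (real tools rely on libraries such as imagehash; this only sketches the principle of why a re-encoded copy hashes alike):

```python
def average_hash(pixels):
    """Toy perceptual hash: each bit records whether a pixel is
    above the mean brightness. Illustrative only."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return sum(1 << i for i, p in enumerate(flat) if p > mean)

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

img = [[10, 200], [30, 220]]
shrunk = [[12, 198], [28, 224]]  # slightly re-encoded copy
print(hamming(average_hash(img), average_hash(shrunk)))  # 0 -> duplicate
```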

  • The contents?

You may use checksum=True to perform a CRC32 check. However, for byte-to-byte comparison (when file names might differ or you need to rule out byte corruption), another tool such as jdupes might be a better fit.
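A sketch of what checksum=True presumably computes: a chunked CRC32 via the standard library (an assumption; Deduplidog's exact reading strategy may differ):

```python
import tempfile
import zlib

def crc32_of(path: str) -> int:
    """CRC32 of a file, computed in chunks to keep memory use flat."""
    crc = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF

# sanity check against a whole-buffer CRC
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"hello deduplidog")
print(crc32_of(f.name) == zlib.crc32(b"hello deduplidog"))  # True
```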

Why not use standard sync tools like Meld?

These imply the folders have the same structure. Deduplidog is tolerant towards files scattered around.

Doubts?

The program does not write anything to the disk unless execute=True is set. Feel free to launch it just to inspect the recommended actions. Or set inspect=True to output bash commands you may launch after thorough examination.

Launch

Install with pip install deduplidog.

It works as a standalone program with CLI, TUI and GUI interfaces. Just launch the deduplidog command.

Examples

Media magic confirmation

Let's compare two folders.

deduplidog --work-dir folder1 --original-dir folder2  --media-magic --rename --execute

By default, --confirm-one-by-one is True, causing every change to be manually confirmed before it takes effect. So even though --execute is present, no change happens without confirmation. The change in question is the --rename: the file in the --work-dir will be prefixed with the ✓ character. The --media-magic mode considers an image a duplicate if it has the same name and a similar image hash, even if the files are of different sizes.

Confirmation

Note that the default button is 'No' as there are some warnings. First, the file in the folder we search for duplicates in is bigger than the one in the original folder. Second, it is also older, suggesting that it might be the actual original.

Duplicated files

Let's take a closer look at a use case.

deduplidog --work-dir /home/user/duplicates --original-dir /media/disk/origs --ignore-date --rename

This command produced the following output:

Find files by size, ignoring: date, crc32
Duplicates from the work dir at 'home' would be (if execute were True) renamed (prefixed with ✓).
Number of originals: 38
* /home/user/duplicates/foo.txt
  /media/disk/origs/foo.txt
  🔨home: renamable
  📄media: DATE WARNING + a day 🛟skipped on warning
Affectable: 37/38
Affected size: 56.9 kB
Warnings: 1

We found out that all the files in the duplicates folder seem to be useless but one. Its date is earlier than the original's, and the life buoy icon would prevent any action. To suppress this, let's turn on set_both_to_older_date and see the full log.

deduplidog --work-dir /home/user/duplicates --original-dir /media/disk/origs --ignore-date --rename --set-both-to-older-date --log-level=10
Find files by size, ignoring: date, crc32
Duplicates from the work dir at 'home' would be (if execute were True) renamed (prefixed with ✓).
Original file mtime date might be set backwards to the duplicate file.
Number of originals: 38
* /home/user/duplicates/foo.txt
  /media/disk/origs/foo.txt
  🔨home: renamable
  📄media: redatable 2022-04-28 16:58:56 -> 2020-04-26 16:58:00
* /home/user/duplicates/bar.txt
  /media/disk/origs/bar.txt
  🔨home: renamable
* /home/user/duplicates/third.txt
  /media/disk/origs/third.txt
  🔨home: renamable
  ...
Affectable: 38/38
Affected size: 59.9 kB

You see, the log is in its briefest yet transparent form. The files to be affected in the work folder are prefixed with the 🔨 icon, whereas those affected in the original folder use the 📄 icon. We might add the execute=True parameter to perform the actions, or use inspect=True to inspect them first.

deduplidog --work-dir /home/user/duplicates --original-dir /media/disk/origs --ignore-date --rename --set-both-to-older-date --inspect

With inspect=True, the program just prints the commands we might subsequently run.

touch -t 1524754680.0 /media/disk/origs/foo.txt
mv -n /home/user/duplicates/foo.txt /home/user/duplicates/✓foo.txt
mv -n /home/user/duplicates/bar.txt /home/user/duplicates/✓bar.txt
mv -n /home/user/duplicates/third.txt /home/user/duplicates/✓third.txt
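For illustration, the same two actions (setting both files to the older mtime, then prefixing the duplicate with ✓) can be reproduced in plain Python; the helper and the paths below are made up for the demo:

```python
import os
import tempfile
from pathlib import Path

def apply_pair(duplicate: Path, original: Path) -> None:
    """Set both files to the older mtime, then prefix the duplicate
    with ✓, skipping if the target already exists (like mv -n).
    Hypothetical helper mirroring the inspect output above."""
    older = min(os.path.getmtime(duplicate), os.path.getmtime(original))
    os.utime(duplicate, (older, older))
    os.utime(original, (older, older))
    target = duplicate.with_name("✓" + duplicate.name)
    if not target.exists():
        duplicate.rename(target)

# demo in a throwaway directory
root = Path(tempfile.mkdtemp())
dup, orig = root / "foo.txt", root / "orig_foo.txt"
dup.write_text("same"); orig.write_text("same")
os.utime(dup, (1000, 1000))   # the duplicate carries the older date
os.utime(orig, (2000, 2000))
apply_pair(dup, orig)
print((root / "✓foo.txt").exists())  # True
```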

Names shuffled

You face a directory that might contain some images twice. Let's analyze it. We turn on media_magic so that we find the scaled-down images. We ignore_name because the scaled images might have been renamed. We skip_bigger files because we examine a single folder, so every file pair would otherwise be matched twice; this way, we declare the bigger image to be the original. And we raise the log_level verbosity so that we get a list of the affected files.

$ deduplidog --work-dir ~/shuffled/ --media-magic --ignore-name --skip-bigger --log-level=20
Only files with media suffixes are taken into consideration. Nor the size nor the date is compared. Nor the name!
Duplicates from the work dir at 'shuffled' (only if smaller than the pair file) would be (if execute were True) left intact (because no action is selected, nothing will happen).

Number of originals: 9
Caching image hashes: 100%|████████████████████████████████| 9/9 [00:00<00:00, 16.63it/s]
Caching working files: 9it [00:00, 62497.91it/s]
* /home/user/shuffled/IMG_20230802_shrink.jpg
  /home/user/shuffled/IMG_20230802.jpg
Affectable: 1/9
Affected size: 636.4 kB

We see there is a single duplicated file, named IMG_20230802_shrink.jpg.
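The skip_bigger reasoning can be sketched in a few lines: within a single folder every unordered pair is visited twice, so acting only when the work file is the smaller of the two handles each pair exactly once (the file sizes below are invented):

```python
from itertools import permutations

sizes = {"IMG_20230802.jpg": 900_000, "IMG_20230802_shrink.jpg": 260_000}

# every ordered pair is visited, so without skip_bigger each
# unordered pair would be matched twice
affected = [work for work, orig in permutations(sizes, 2)
            if sizes[work] < sizes[orig]]  # skip_bigger: keep the bigger as original
print(affected)  # ['IMG_20230802_shrink.jpg']
```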

Documentation

See the docs overview at https://cz-nic.github.io/deduplidog/.

Project details


Download files

Download the file for your platform.

Source Distribution

deduplidog-0.7.0.tar.gz (20.4 kB)

Uploaded Source

Built Distribution

deduplidog-0.7.0-py3-none-any.whl (20.1 kB)

Uploaded Python 3

File details

Details for the file deduplidog-0.7.0.tar.gz.

File metadata

  • Download URL: deduplidog-0.7.0.tar.gz
  • Upload date:
  • Size: 20.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.12.8

File hashes

Hashes for deduplidog-0.7.0.tar.gz
Algorithm Hash digest
SHA256 8f12350ef5a326ad3191c3abe33c874f872313551c5236fd7bb6b416a0682f7f
MD5 f2c6a00fc42c50a0fa5ca24ddfe49d32
BLAKE2b-256 3b22f85c223d2c64c01cb7951e1172271639054f453af8dd04fdab58bcc1bb6e


File details

Details for the file deduplidog-0.7.0-py3-none-any.whl.

File metadata

  • Download URL: deduplidog-0.7.0-py3-none-any.whl
  • Upload date:
  • Size: 20.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.12.8

File hashes

Hashes for deduplidog-0.7.0-py3-none-any.whl
Algorithm Hash digest
SHA256 b49b77b1521fe13430e58022563442dc510aa8198314608a9fb18456a9dbf93e
MD5 de5d4ebf269cc74dc2e2bddc1bc52a47
BLAKE2b-256 72aff102a2910f15802a7b98ef45c11f803bce07b79f1bd259381c1584f8ccf7

