Skip to main content

Find identical files in subdirectories

Project description

duplicates

Build Status

Scan for identical files (duplicates) in subdirectories.

Requirements

  • Python >= 3.6
  • MS Windows is not supported

Description

To find files with identical content the given directories will be scanned and for files of same size their SHA-256 fingerprints are calculated and compared. Two files with identical fingerprints are considered to have the same content. There is a tiny chance for two files with same fingerprint to have different content, but this chance is very remote.

Symbolic links and hidden entries are ignored by default, this behaviour can be changed with CLI options --follow/--hidden and constructor options ignore_hidden/ignore_symlinks.

CLI examples

This one will give you a short command overview:

$ duplicates --help

Scan directories dirA, dirB and dirC for duplicates and report all found identical files:

$ duplicates dirA dirB dirC

dirA/file01
        dirA/file01.bak
        dirB/file.bak
dirA/file02
        dirB/file02~

The oldest file is printed without indent, all identical files are printed indented by a tab character. The oldest file is supposed to be the original.

If you are willing to take risks, you can delete all duplicates at once. I wouldn't dare, but you get the picture:

$ duplicates --dups-only dirA dirB | while read dups ; do xargs -0 rm $dups ; done

With --dups-only all duplicates for one original are output on one line, separated by \0 (ASCII code zero).

For fish shell it looks almost identical:

$ duplicates --dups-only dirA dirB | while read -la dups ; xargs -0 rm $dups ; end

Python examples

import duplicates

df = duplicates.DupFinder(verbose=True)
uniq, dups = df.scan(".")

uniq is a list of unique file objects. dups is a list of identical files, which in turn are lists of file objects, the first being the oldest element and thus the supposed original.

A file object is a dict consisting of the following elements:

  • path: a pathlib.Path object
  • age: modification time in seconds (Unix time)
  • size: file size in bytes
  • hash: the SHA-256 fingerprint (not calculated for unique files)

Project details


Release history Release notifications

This version

0.1.0

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Files for duplicates, version 0.1.0
Filename, size File type Python version Upload date Hashes
Filename, size duplicates-0.1.0-py3-none-any.whl (7.8 kB) File type Wheel Python version py3 Upload date Hashes View hashes
Filename, size duplicates-0.1.0.tar.gz (6.0 kB) File type Source Python version None Upload date Hashes View hashes

Supported by

Elastic Elastic Search Pingdom Pingdom Monitoring Google Google BigQuery Sentry Sentry Error logging AWS AWS Cloud computing DataDog DataDog Monitoring Fastly Fastly CDN DigiCert DigiCert EV certificate StatusPage StatusPage Status page