Find identical files in subdirectories

duplicates

Scan for identical files (duplicates) in subdirectories.

Requirements

  • Python >= 3.6
  • MS Windows is not supported

Description

To find files with identical content, the given directories are scanned. For files of the same size, SHA-256 fingerprints are calculated and compared. Two files with identical fingerprints are considered to have the same content. In principle, two files with different content could share a fingerprint (a hash collision), but the probability is negligible in practice.
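The size-then-hash strategy described above can be sketched in a few lines of standard-library Python. This is only an illustration of the idea, not the package's actual implementation; the names sha256_of and find_identical are made up for this example.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_identical(root):
    """Group regular files below root by content (hypothetical helper)."""
    # Step 1: bucket files by size; symbolic links are skipped.
    by_size = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file() and not path.is_symlink():
            by_size[path.stat().st_size].append(path)

    # Step 2: hash only files that share their size with another file.
    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[sha256_of(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups
```

Comparing sizes first means that most files are never read at all; only candidates that could possibly be identical are hashed.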

Symbolic links and hidden entries are ignored by default. This behaviour can be changed with the CLI options --follow/--hidden and the constructor options ignore_symlinks/ignore_hidden.
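What such a filter might look like is sketched below. The function is_ignored is hypothetical and not part of the package; only the two keyword names are taken from the constructor options mentioned above. The sketch assumes that "hidden" follows the Unix convention of a leading dot in the entry name.

```python
from pathlib import Path


def is_ignored(path, ignore_hidden=True, ignore_symlinks=True):
    """Decide whether a directory entry would be skipped (illustrative only)."""
    path = Path(path)
    # Assumption: "hidden" means the entry name starts with a dot.
    if ignore_hidden and path.name.startswith("."):
        return True
    # is_symlink() does not follow the link and is False for missing paths.
    if ignore_symlinks and path.is_symlink():
        return True
    return False
```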

CLI examples

Print a short overview of the command-line options:

$ duplicates --help

Scan directories dirA, dirB and dirC for duplicates and report all found identical files:

$ duplicates dirA dirB dirC

dirA/file01
        dirA/file01.bak
        dirB/file.bak
dirA/file02
        dirB/file02~

The oldest file is printed without indentation; all files identical to it are printed below it, indented by a tab character. The oldest file is assumed to be the original.
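If you want to post-process this report in a script, the layout is easy to parse. The following sketch uses a hypothetical helper parse_report; it assumes that every indented line belongs to the closest unindented line above it and that no file name starts with whitespace.

```python
def parse_report(text):
    """Turn the default report into {original: [duplicates]} (hypothetical helper)."""
    groups = {}
    original = None
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line[0] in "\t ":
            # Indented line: a duplicate of the most recent original.
            if original is not None:
                groups[original].append(line.lstrip("\t "))
        else:
            # Unindented line: the oldest file of a new group.
            original = line
            groups[original] = []
    return groups


report = (
    "dirA/file01\n"
    "\tdirA/file01.bak\n"
    "\tdirB/file.bak\n"
    "dirA/file02\n"
    "\tdirB/file02~\n"
)
for original, copies in parse_report(report).items():
    print(original, "has", len(copies), "duplicate(s)")
```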

If you are willing to take risks, you can delete all duplicates at once. I wouldn't dare, but you get the picture:

$ duplicates --dups-only dirA dirB | tr '\n' '\0' | xargs -0 rm --

With --dups-only, all duplicates of one original are output on one line, separated by \0 (ASCII code zero). tr turns the newlines between the lines into \0 as well, so that xargs -0 receives every duplicate as a separate argument.

The same pipeline works unchanged in fish shell.
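If you would rather inspect the --dups-only output before deleting anything, it can also be split in Python. This is a sketch under the assumption that the output is exactly as described above: one line per original, entries separated by \0. The helper parse_dups_only is made up for this example.

```python
import os


def parse_dups_only(data):
    """Split --dups-only output (bytes) into one list of paths per original."""
    groups = []
    for line in data.split(b"\n"):
        # Drop empty fields so a trailing \0 or newline does no harm.
        paths = [os.fsdecode(p) for p in line.split(b"\0") if p]
        if paths:
            groups.append(paths)
    return groups


# Hand-written sample in the assumed format, not real tool output.
sample = b"dirA/file01.bak\0dirB/file.bak\ndirB/file02~\n"
for group in parse_dups_only(sample):
    print(group)
```

The bytes could come from subprocess.run(["duplicates", "--dups-only", "dirA", "dirB"], capture_output=True).stdout. Printing the groups first and deleting later is a lot less risky than piping straight into rm.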

Python examples

import duplicates

df = duplicates.DupFinder(verbose=True)
uniq, dups = df.scan(".")

uniq is a list of unique file objects. dups is a list of groups of identical files; each group is itself a list of file objects, the first being the oldest and thus the supposed original.

A file object is a dict with the following keys:

  • path: a pathlib.Path object
  • age: modification time in seconds (Unix time)
  • size: file size in bytes
  • hash: the SHA-256 fingerprint (not calculated for unique files)
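To illustrate how the dups structure can be used, here is a small sketch that lists the redundant copies and sums up the space they occupy. The sample data is written by hand in the shape described above (it is not real scan output), and the helpers redundant_paths and reclaimable_bytes are hypothetical.

```python
from pathlib import Path


def redundant_paths(dups):
    """Return the paths of all files except the first (oldest) of each group."""
    return [f["path"] for group in dups for f in group[1:]]


def reclaimable_bytes(dups):
    """Sum the sizes of all files except the first (oldest) of each group."""
    return sum(f["size"] for group in dups for f in group[1:])


# Hand-written sample; the hash values are placeholders, not real digests.
sample_dups = [
    [
        {"path": Path("dirA/file01"), "age": 1500000000, "size": 120, "hash": "<sha256>"},
        {"path": Path("dirA/file01.bak"), "age": 1500000500, "size": 120, "hash": "<sha256>"},
        {"path": Path("dirB/file.bak"), "age": 1500000900, "size": 120, "hash": "<sha256>"},
    ],
    [
        {"path": Path("dirA/file02"), "age": 1500001000, "size": 40, "hash": "<sha256>"},
        {"path": Path("dirB/file02~"), "age": 1500002000, "size": 40, "hash": "<sha256>"},
    ],
]

for path in redundant_paths(sample_dups):
    print(path)
print(reclaimable_bytes(sample_dups), "bytes could be freed")
```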
