Find identical files in subdirectories
duplicates
Scan for identical files (duplicates) in subdirectories.
Requirements
- Python >= 3.6
- MS Windows is not supported
Description
To find files with identical content, the given directories are scanned, and for files of the same size the SHA-256 fingerprints are calculated and compared. Two files with identical fingerprints are considered to have the same content. Two files with different content could in principle share a fingerprint, but the chance of such a collision is negligible in practice.
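The size-then-fingerprint approach described above can be sketched roughly as follows. This is a minimal illustration, not the package's actual implementation: the function names `sha256_of` and `find_identical` are hypothetical, and the sketch skips symbolic links but does not filter hidden entries.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_identical(root):
    """Group regular files below `root` by size, then by SHA-256 digest."""
    by_size = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file() and not path.is_symlink():
            by_size[path.stat().st_size].append(path)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique size cannot have a duplicate, so skip hashing
        by_hash = defaultdict(list)
        for path in same_size:
            by_hash[sha256_of(path)].append(path)
        # Keep only fingerprints shared by at least two files.
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups
```

Comparing sizes first means that files with a unique size are never read at all, which is where most of the time is saved on large trees.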
Symbolic links and hidden entries are ignored by default. This behaviour can be changed with the CLI options --follow/--hidden and the constructor options ignore_hidden/ignore_symlinks.
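To illustrate what ignoring hidden entries and symbolic links can mean during a scan, here is a small sketch built on `os.walk`. The function `walk_files` is hypothetical; its keyword names merely mirror the constructor options mentioned above, and it assumes that "hidden" means a name starting with a dot.

```python
import os


def walk_files(root, ignore_hidden=True, ignore_symlinks=True):
    """Yield file paths below `root`, optionally skipping hidden entries and symlinks."""
    for dirpath, dirnames, filenames in os.walk(root, followlinks=not ignore_symlinks):
        if ignore_hidden:
            # Pruning dirnames in place stops os.walk from descending into them.
            dirnames[:] = [d for d in dirnames if not d.startswith(".")]
            filenames = [f for f in filenames if not f.startswith(".")]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if ignore_symlinks and os.path.islink(path):
                continue  # assumption: a symlinked file is skipped, not resolved
            yield path
```

Whether the real package treats dot-prefixed names this way is not stated in this description, so take the filter as an assumption.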
CLI examples
This one will give you a short command overview:
$ duplicates --help
Scan directories dirA, dirB and dirC for duplicates and report all identical files found:
$ duplicates dirA dirB dirC
dirA/file01
	dirA/file01.bak
	dirB/file.bak
dirA/file02
	dirB/file02~
The oldest file is printed without indent; all files identical to it are printed indented by a tab character. The oldest file is assumed to be the original.
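If you want to post-process this report in a script, the indentation rule is enough to recover the groups. The following sketch assumes exactly the format described here (original unindented, each duplicate prefixed by one tab); `parse_report` is a hypothetical helper, not part of the package.

```python
def parse_report(text):
    """Turn the tab-indented report into (original, [duplicates]) pairs."""
    groups = []
    for line in text.split("\n"):
        if not line:
            continue  # skip the empty string after a trailing newline
        if line.startswith("\t"):
            if not groups:
                raise ValueError("duplicate listed before any original")
            groups[-1][1].append(line[1:])  # strip the leading tab
        else:
            groups.append((line, []))
    return groups


report = "dirA/file01\n\tdirA/file01.bak\n\tdirB/file.bak\ndirA/file02\n\tdirB/file02~\n"
for original, duplicates in parse_report(report):
    print(original, "has", len(duplicates), "duplicate(s)")
```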
If you are willing to take risks, you can delete all duplicates at once. I wouldn't dare, but you get the picture:
$ duplicates --dups-only dirA dirB | while read dups ; do xargs -0 rm $dups ; done
With --dups-only, all duplicates of one original are output on one line, separated by \0 (ASCII code zero).
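For a more cautious workflow than piping straight into rm, the --dups-only output can be read into Python and inspected first. The sketch below assumes the format just described (one line per original, paths separated by NUL bytes, lines ended by a newline); `parse_dups_only` is a hypothetical helper and tolerates a trailing NUL on each line, since the exact line ending is not specified here.

```python
import os


def parse_dups_only(raw):
    """Split --dups-only output (bytes) into one list of paths per original."""
    groups = []
    for line in raw.split(b"\n"):
        # Drop empty fields so a trailing NUL does not produce an empty path.
        paths = [os.fsdecode(p) for p in line.split(b"\0") if p]
        if paths:
            groups.append(paths)
    return groups


sample = b"dirA/file01.bak\0dirB/file.bak\ndirB/file02~\n"
for paths in parse_dups_only(sample):
    print(paths)
```

In practice the bytes could come from `subprocess.run(["duplicates", "--dups-only", "dirA", "dirB"], stdout=subprocess.PIPE).stdout`, after which each path can be reviewed before anything is deleted.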
For fish shell it looks almost identical:
$ duplicates --dups-only dirA dirB | while read -la dups ; xargs -0 rm $dups ; end
Python examples
import duplicates
df = duplicates.DupFinder(verbose=True)
uniq, dups = df.scan(".")
uniq is a list of unique file objects. dups is a list of groups of identical files; each group is in turn a list of file objects, the first being the oldest element and thus the supposed original.
A file object is a dict consisting of the following elements:
- path: a pathlib.Path object
- age: modification time in seconds (Unix time)
- size: file size in bytes
- hash: the SHA-256 fingerprint (not calculated for unique files)
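As a sketch of how the dups structure might be consumed, the helpers below list the supposed originals and estimate how much space the duplicates occupy. The sample data is hand-built to match the dict layout described above rather than produced by a real scan, and `originals` and `reclaimable_bytes` are hypothetical names.

```python
import hashlib
from pathlib import Path


def originals(dups):
    """Return the path of the first (oldest) file in each group."""
    return [group[0]["path"] for group in dups]


def reclaimable_bytes(dups):
    """Sum the sizes of every file except the first (oldest) in each group."""
    return sum(f["size"] for group in dups for f in group[1:])


# Hand-built sample in the documented shape: path, age, size, hash.
fingerprint = hashlib.sha256(b"hello").hexdigest()
sample_dups = [
    [
        {"path": Path("dirA/file01"), "age": 1000, "size": 5, "hash": fingerprint},
        {"path": Path("dirA/file01.bak"), "age": 2000, "size": 5, "hash": fingerprint},
        {"path": Path("dirB/file.bak"), "age": 3000, "size": 5, "hash": fingerprint},
    ],
]

print(originals(sample_dups))
print(reclaimable_bytes(sample_dups))
```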