Find identical files in subdirectories

duplicates

Scan for identical files (duplicates) in subdirectories.

Requirements

  • Python >= 3.11
  • POSIX (Linux, macOS); MS Windows is not supported.

Installation

$ uv tool install duplicates

Or, if you prefer pipx:

$ pipx install duplicates

Description

To find files with identical content, the given directories are scanned and the SHA-256 fingerprints of files with the same size are compared. Two files with identical fingerprints are considered to have the same content. A collision (two files with the same fingerprint but different content) is theoretically possible, but so unlikely that it can be ignored in practice.
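The size-then-fingerprint idea can be sketched in a few lines of Python. This is a minimal illustration, not the package's actual code; the helper names sha256_of and find_identical and the 1 MiB chunk size are assumptions made for the example.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        while chunk := fh.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def find_identical(paths: list[Path]) -> list[list[Path]]:
    """Group paths by size first, then by SHA-256 within each size group."""
    by_size: dict[int, list[Path]] = defaultdict(list)
    for path in paths:
        by_size[path.stat().st_size].append(path)

    groups: list[list[Path]] = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique size cannot have a duplicate, so skip hashing
        by_hash: dict[str, list[Path]] = defaultdict(list)
        for path in same_size:
            by_hash[sha256_of(path)].append(path)
        groups.extend(g for g in by_hash.values() if len(g) > 1)
    return groups
```

Grouping by size first means that files with a unique size are never opened at all, which is usually the bulk of a directory tree.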

Large files (≥ 64 KiB) are first compared by a cheap "partial" SHA-256 over their first and last 4 KiB; only files that survive that prefilter are read in full. For collections of large near-duplicates (videos, archives) this avoids reading most of the data.

Symbolic links and hidden entries are ignored by default. This behavior can be changed with the CLI options --follow / --hidden or the constructor options ignore_symlinks / ignore_hidden.
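To make the default filtering concrete, here is a small sketch of a directory walk that skips symbolic links and hidden entries. It is an illustration only: iter_files is a hypothetical helper, and treating every name that starts with a dot as hidden is an assumption about what "hidden" means here.

```python
import os
from collections.abc import Iterator
from pathlib import Path


def iter_files(root: Path, ignore_symlinks: bool = True,
               ignore_hidden: bool = True) -> Iterator[Path]:
    """Yield regular files below root, optionally skipping symlinks and dotfiles."""
    # Note: following directory symlinks can loop forever on cyclic links.
    for dirpath, dirnames, filenames in os.walk(root, followlinks=not ignore_symlinks):
        if ignore_hidden:
            # Prune in place so os.walk does not descend into hidden directories.
            dirnames[:] = [d for d in dirnames if not d.startswith(".")]
            filenames = [f for f in filenames if not f.startswith(".")]
        for name in filenames:
            path = Path(dirpath, name)
            if ignore_symlinks and path.is_symlink():
                continue
            if path.is_file():
                yield path
```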

CLI examples

Print a short command overview:

$ duplicates --help

Scan directories dirA, dirB and dirC and report identical files:

$ duplicates dirA dirB dirC

dirA/file01
        dirA/file01.bak
        dirB/file.bak
dirA/file02
        dirB/file02~

The oldest file in each group (by modification time) is treated as the original and printed without indentation; its duplicates follow, each indented by a tab.
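If the report needs further processing, the indentation is enough to recover the groups. The following sketch uses a hypothetical helper named parse_report and assumes that no file name starts with whitespace.

```python
def parse_report(text: str) -> dict[str, list[str]]:
    """Map each original (unindented line) to its duplicates (indented lines)."""
    groups: dict[str, list[str]] = {}
    original: str | None = None
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line[0] in " \t":
            # Indented line: a duplicate of the most recent original.
            if original is not None:
                groups[original].append(line.lstrip(" \t"))
        else:
            original = line
            groups[original] = []
    return groups
```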

If you are willing to take risks, you can delete all duplicates at once. I wouldn't dare, but you get the picture:

$ duplicates --dups-only dirA dirB | tr '\n' '\0' | xargs -0 rm --

With --dups-only, all duplicates for one original are printed on a single line, separated by \0 (ASCII NUL). tr turns the line breaks into NUL as well, so xargs -0 receives every duplicate as a separate argument. A while read loop is not suitable here: bash variables cannot hold NUL bytes, and an xargs inside the loop would consume the rest of the piped output.

The same pipeline works unchanged in the fish shell.
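For a more cautious route, the --dups-only output can be read from Python and inspected before anything is deleted. This sketch assumes the format described above (one line per original, names separated by NUL); parse_dups_only is a hypothetical helper.

```python
import os


def parse_dups_only(raw: bytes) -> list[list[str]]:
    """Split --dups-only output into one list of paths per original."""
    groups: list[list[str]] = []
    for line in raw.split(b"\n"):
        # Drop empty pieces so a trailing NUL or newline does no harm.
        names = [os.fsdecode(name) for name in line.split(b"\0") if name]
        if names:
            groups.append(names)
    return groups
```

The raw bytes could come from subprocess.run(["duplicates", "--dups-only", "dirA", "dirB"], capture_output=True, check=True).stdout. Printing the groups first and removing files only after review is much safer than a blind rm.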

Python API

from duplicates import DupFinder

uniq, dups, unreadable = DupFinder().scan(".")

uniq is a list of unique FileEntry objects. dups is a list of duplicate groups, where each group is a list of FileEntry objects with identical content. Within a group, the entry with the smallest age (the earliest modification time) is the oldest file. unreadable collects files that could not be fingerprinted (permission denied, I/O error); they cannot be classified and are returned separately instead of being silently dropped.

A FileEntry is a dataclass with the following fields:

  • path: a pathlib.Path
  • size: file size in bytes
  • age: modification time in seconds (Unix time)
  • hash: the SHA-256 fingerprint (None for unique files where no hash was needed)

Progress messages are emitted via the logging module on the duplicates logger; configure logging in your application to see them.
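A minimal logging setup for an application might look like this. The logger name duplicates is taken from the text above; the DEBUG level is a guess, since the level at which progress messages are emitted is not stated here.

```python
import logging

# Send log records to stderr; other loggers stay at INFO.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)

# Assumption: DEBUG is low enough to show every progress message.
logging.getLogger("duplicates").setLevel(logging.DEBUG)
```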

Development

$ uv sync
$ uv run pytest
$ uv run ruff check .
$ uv run basedpyright
