Find identical files in subdirectories

duplicates

Scan for identical files (duplicates) in subdirectories.

Requirements

  • Python >= 3.11
  • POSIX (Linux, macOS); MS Windows is not supported.

Installation

$ uv tool install duplicates

Or, if you prefer pipx:

$ pipx install duplicates

Description

To find files with identical content, the given directories are scanned and files of the same size have their SHA-256 fingerprints compared. Two files with identical fingerprints are considered to have the same content. A SHA-256 collision (two different files with the same fingerprint) is theoretically possible, but so unlikely that it can be ignored in practice.
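The size-then-fingerprint idea described above can be sketched in a few lines of standard-library Python. This is only an illustration of the technique, not the package's actual implementation; the names sha256_of and find_identical are made up for this example.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        while chunk := fh.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def find_identical(root: Path) -> list[list[Path]]:
    """Group regular files below root by size first, then by SHA-256."""
    # Hypothetical helper: stage 1 buckets files by size, which is cheap.
    by_size: dict[int, list[Path]] = defaultdict(list)
    for path in root.rglob("*"):
        if path.is_file() and not path.is_symlink():
            by_size[path.stat().st_size].append(path)

    groups: list[list[Path]] = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique size cannot have a duplicate, so no hash is needed
        # Stage 2 hashes only the files that share their size with another file.
        by_hash: dict[str, list[Path]] = defaultdict(list)
        for path in same_size:
            by_hash[sha256_of(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups
```

Comparing sizes first means that most files are never read at all; only files that share their size with at least one other file are hashed.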

Symbolic links and hidden entries are ignored by default. This behavior can be changed with the CLI options --follow / --hidden or the constructor options ignore_symlinks / ignore_hidden.

CLI examples

Print a short command overview:

$ duplicates --help

Scan directories dirA, dirB and dirC and report identical files:

$ duplicates dirA dirB dirC

dirA/file01
        dirA/file01.bak
        dirB/file.bak
dirA/file02
        dirB/file02~

The oldest file in each group is treated as the original and printed without indentation; the files identical to it follow, each indented by a tab.
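If you want to post-process this report in a script, the layout is easy to parse. The following sketch assumes exactly the format shown above (original unindented, duplicates indented by one tab) and would be confused by file names that start with a tab or contain a newline. parse_report is a hypothetical helper, not part of the package.

```python
def parse_report(text: str) -> dict[str, list[str]]:
    """Map each original (unindented line) to its duplicates (tab-indented lines)."""
    groups: dict[str, list[str]] = {}
    original: str | None = None
    for line in text.split("\n"):
        if not line.strip():
            continue  # skip blank lines between or after groups
        if line.startswith("\t"):
            if original is None:
                raise ValueError("duplicate listed before any original")
            groups[original].append(line[1:])  # drop the leading tab
        else:
            original = line
            groups[original] = []
    return groups
```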

If you are willing to take risks, you can delete all duplicates at once. I wouldn't dare, but you get the picture:

$ duplicates --dups-only dirA dirB | tr '\n' '\0' | xargs -0 rm --

With --dups-only, all duplicates for one original are printed on a single line separated by \0 (ASCII NUL). tr turns the line breaks into NUL as well, so xargs -0 receives every path as a separate argument. A while read loop is not suitable here: in bash, read discards NUL bytes, so the separators would be lost. To preview what would be deleted, replace rm with ls -l first.

The pipeline uses no shell-specific syntax, so it works unchanged in the fish shell.
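The --dups-only format (one line per original, paths separated by NUL) can also be consumed from Python, which avoids shell quoting issues altogether. The sketch below is an assumption-laden illustration: parse_dups_only and run_dups_only are hypothetical helpers, and empty fields are dropped in case a line ends with a trailing NUL, which the description above does not specify.

```python
import subprocess


def parse_dups_only(raw: bytes) -> list[list[bytes]]:
    """Split --dups-only output into one list of paths per original."""
    groups: list[list[bytes]] = []
    for line in raw.split(b"\n"):
        # Drop empty fields (e.g. from a trailing NUL or the final newline).
        paths = [p for p in line.split(b"\0") if p]
        if paths:
            groups.append(paths)
    return groups


def run_dups_only(*directories: str) -> list[list[bytes]]:
    """Run the CLI (must be installed and on PATH) and parse its output."""
    result = subprocess.run(
        ["duplicates", "--dups-only", *directories],
        check=True,
        capture_output=True,
    )
    return parse_dups_only(result.stdout)
```

Paths are kept as bytes because file names on POSIX systems are not guaranteed to be valid text; os.fsdecode converts them to str when needed.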

Python API

from duplicates import DupFinder

uniq, dups, unreadable = DupFinder().scan(".")

uniq is a list of unique FileEntry objects. dups is a list of duplicate groups, where each group is a list of FileEntry objects with identical content. Use entry.age to identify the oldest file in a group. unreadable collects files that could not be fingerprinted (permission denied, I/O error); they cannot be classified and are returned separately instead of being silently dropped.
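As an illustration of working with the result, the sketch below picks the oldest entry of each group as the original and prints a report similar to the CLI output. split_original and report are hypothetical helpers; only DupFinder().scan() and the path and age attributes are taken from the description above. Since age is a modification time, the oldest file is the one with the smallest value.

```python
from collections.abc import Sequence
from typing import Any


def split_original(group: Sequence[Any]) -> tuple[Any, list[Any]]:
    """Return (oldest entry, remaining entries) for one duplicate group."""
    # age is a Unix modification time, so the minimum is the oldest file.
    original = min(group, key=lambda entry: entry.age)
    copies = [entry for entry in group if entry is not original]
    return original, copies


def report(directory: str = ".") -> None:
    """Print each original followed by its tab-indented duplicates."""
    # Imported here so split_original can be reused without the package installed.
    from duplicates import DupFinder

    uniq, dups, unreadable = DupFinder().scan(directory)
    for group in dups:
        original, copies = split_original(group)
        print(original.path)
        for entry in copies:
            print(f"\t{entry.path}")
    for entry in unreadable:
        # The element type of unreadable is not stated above, so print it as-is.
        print(f"unreadable: {entry}")
```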

A FileEntry is a dataclass with the following fields:

  • path: a pathlib.Path
  • size: file size in bytes
  • age: modification time in seconds (Unix time)
  • hash: the SHA-256 fingerprint (None for unique files where no hash was needed)

Progress messages are emitted via the logging module on the duplicates logger; configure logging in your application to see them.
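A minimal way to make these messages visible from an application might look like this. The log level at which the package emits progress messages is not stated above, so the sketch enables DEBUG for the duplicates logger to catch everything; adjust as needed.

```python
import logging

# Install a basic handler on the root logger (prints to stderr).
logging.basicConfig(format="%(levelname)s %(name)s: %(message)s")

# Assumption: DEBUG is low enough to show all progress messages.
logging.getLogger("duplicates").setLevel(logging.DEBUG)
```

Setting the level on the named logger only keeps other libraries at the root logger's default level.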

Development

$ uv sync
$ uv run pytest
$ uv run ruff check .
$ uv run basedpyright
