Find identical files in subdirectories

duplicates

Scan for identical files (duplicates) in subdirectories.

Requirements

  • Python >= 3.11
  • POSIX (Linux, macOS); MS Windows is not supported.

Installation

$ uv tool install duplicates

Or, if you prefer pipx:

$ pipx install duplicates

Description

To find files with identical content, the given directories are scanned and files of the same size have their SHA-256 fingerprints compared. Two files with identical fingerprints are considered to have the same content. In theory two files with different content could share a fingerprint (a hash collision), but for SHA-256 that chance is negligible in practice.
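The size-then-fingerprint idea described above can be sketched in a few lines of standard-library Python. This is an illustration only, not the package's actual implementation; the helper names sha256_of and find_identical are made up for this example.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB blocks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(1024 * 1024), b""):
            digest.update(block)
    return digest.hexdigest()


def find_identical(root: Path) -> list[list[Path]]:
    """Group regular files under root that share both size and SHA-256."""
    # Step 1: bucket files by size; symbolic links are skipped.
    by_size: dict[int, list[Path]] = defaultdict(list)
    for path in root.rglob("*"):
        if path.is_file() and not path.is_symlink():
            by_size[path.stat().st_size].append(path)

    # Step 2: hash only the files that share their size with another file.
    groups: list[list[Path]] = []
    for candidates in by_size.values():
        if len(candidates) < 2:
            continue  # a file with a unique size cannot have a duplicate
        by_hash: dict[str, list[Path]] = defaultdict(list)
        for path in candidates:
            by_hash[sha256_of(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups
```

Grouping by size first means that files with a unique size are never opened at all, which is where most of the time is saved on typical directory trees.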

Large files (≥ 64 KiB) are first compared by a cheap "partial" SHA-256 over their first and last 4 KiB; only files that survive that prefilter are read in full. For collections of large near-duplicates (videos, archives) this avoids reading most of the data.

Symbolic links and hidden entries are ignored by default. This behavior can be changed with the CLI options --follow / --hidden or the constructor options ignore_symlinks / ignore_hidden.

CLI examples

Print a short command overview:

$ duplicates --help

Scan directories dirA, dirB and dirC and report identical files:

$ duplicates dirA dirB dirC

dirA/file01
        dirA/file01.bak
        dirB/file.bak
dirA/file02
        dirB/file02~

The oldest file in each group is treated as the original and printed without indent; its duplicates follow, each indented by a tab.
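If you want to post-process this report in a script, the layout can be parsed roughly as follows. This is a sketch that assumes every line starting with whitespace is a duplicate of the most recent unindented line; parse_report is a hypothetical helper, not part of the package, and paths that themselves begin with whitespace would be misread.

```python
def parse_report(text: str) -> dict[str, list[str]]:
    """Map each original path to the list of its indented duplicates."""
    groups: dict[str, list[str]] = {}
    original = None
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line[0] in " \t":
            # Indented line: a duplicate of the current original.
            if original is not None:
                groups[original].append(line.lstrip())
        else:
            # Unindented line: starts a new group.
            original = line
            groups[original] = []
    return groups
```

For anything beyond a quick script, the --json output is the more robust format to consume.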

If you are willing to take risks, you can delete all duplicates at once. I wouldn't dare, but you get the picture:

$ duplicates --dups-only dirA dirB | tr '\n' '\0' | xargs -0 rm --

With --dups-only, all duplicates for one original are printed on a single line separated by \0 (ASCII NUL). tr turns the line breaks into NUL as well, so xargs -0 receives every path as a separate argument. A bash read loop would lose the separators, because bash variables cannot hold NUL bytes.

The pipeline contains no shell-specific syntax, so it works unchanged in the fish shell.

JSON output

For scripted consumption, --json emits the full result on stdout including a statistics block with counts and the scan's elapsed time:

$ duplicates --json dirA dirB
{
  "scanned_paths": ["dirA", "dirB"],
  "duplicates": [
    {
      "hash": "sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
      "size": 1234,
      "files": [
        {"path": "dirA/file01", "age": 1700000000.0},
        {"path": "dirA/file01.bak", "age": 1710000000.0}
      ]
    }
  ],
  "statistics": {
    "total_files": 12,
    "unique_files": 10,
    "duplicate_groups": 1,
    "duplicate_copies": 1,
    "duplicate_bytes": 1234,
    "unreadable_files": 0,
    "elapsed_seconds": 0.0123
  }
}

--json is mutually exclusive with --dups-only and --summary. Combine with --unique to also include the unique files in the output.
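As an illustration of scripted consumption, the following sketch reads a report shaped like the sample above and lists every path except the oldest one in each group. It relies only on the keys visible in the sample (duplicates, files, path, age); SAMPLE is an abridged copy of that output and removable_copies is a hypothetical helper.

```python
import json

# Abridged copy of the sample output shown above.
SAMPLE = """
{
  "scanned_paths": ["dirA", "dirB"],
  "duplicates": [
    {
      "size": 1234,
      "files": [
        {"path": "dirA/file01", "age": 1700000000.0},
        {"path": "dirA/file01.bak", "age": 1710000000.0}
      ]
    }
  ]
}
"""


def removable_copies(report: dict) -> list[str]:
    """Return every path except the oldest one in each duplicate group."""
    paths: list[str] = []
    for group in report["duplicates"]:
        # "age" is a modification time, so the smallest value is the oldest.
        files = sorted(group["files"], key=lambda f: f["age"])
        paths.extend(f["path"] for f in files[1:])
    return paths


report = json.loads(SAMPLE)
print(removable_copies(report))  # ['dirA/file01.bak']
```

In a real script the report would come from the command's stdout, for example via subprocess.run with capture_output=True, instead of an embedded string.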

Progress on long scans

--verbose writes phase markers and per-file log messages to stderr. This is useful when scanning a slow filesystem (SMB, large library), where the tool might otherwise look stuck:

INFO: Scanning 1 path(s)...
INFO: Scanned 1284 file(s) so far...
INFO: Discovered 4012 file(s) in 3987 size group(s)
INFO: Partial-hashing 12 file(s)...
INFO: Partial-hashing /films/movie.mp4 (5.2 GiB)
INFO: Full-hashing 4 file(s)...
INFO: Full-hashing /films/movie.mp4 (5.2 GiB)

--debug adds per-directory and per-ignored-entry messages on top.

Python API

from duplicates import DupFinder

uniq, dups, unreadable = DupFinder().scan(".")

uniq is a list of unique FileEntry objects. dups is a list of duplicate groups, where each group is a list of FileEntry objects with identical content. Use entry.age to identify the oldest file in a group. unreadable collects files that could not be fingerprinted (permission denied, I/O error); they cannot be classified and are returned separately instead of being silently dropped.

A FileEntry is a dataclass with the following fields:

  • path: a pathlib.Path
  • size: file size in bytes
  • age: modification time in seconds (Unix time)
  • hash: the SHA-256 fingerprint (None for unique files where no hash was needed)

Progress messages are emitted via the logging module on the duplicates logger; configure logging in your application to see them.
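A minimal way to make those messages visible in your own application, assuming the logger name is exactly "duplicates" as stated above and that progress is logged at INFO level (as the --verbose sample output suggests):

```python
import logging

# Keep other libraries quiet, but let the "duplicates" logger through at INFO.
logging.basicConfig(level=logging.WARNING, format="%(levelname)s: %(message)s")
logging.getLogger("duplicates").setLevel(logging.INFO)
```

Setting the level on the named logger rather than on the root logger keeps unrelated libraries from flooding the output.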

Development

$ uv sync
$ uv run pytest
$ uv run ruff check .
$ uv run basedpyright
