Find, review, and safely trash duplicate and near-duplicate files

tdupes

Smartly find, review, and safely trash exact and near-duplicate files on Linux.

tdupes detects exact duplicates (byte-identical, via fdupes) and optionally near-duplicates (same basename, scored by content similarity, via plocate/locate). Results are written to a TSV that you review and edit with your favourite spreadsheet tool before any files are touched; confirmed deletions go to gio trash and remain recoverable until the bin is emptied.

Key features:

  • Accepts any mix of individual files and directories as arguments
  • Near-duplicate detection with -L: for files given as arguments it finds same-basename files across the filesystem, with a similarity score (text %, binary same/different size)
  • Preferred-directory protection: by default, files inside configured directories are never proposed for deletion
  • Preferred directories and exclusion patterns can be set in the config file or at runtime with the -p/-x flags
  • Writes a proposed action plan to a TSV table that you can edit interactively in your favourite spreadsheet tool (the TSV is opened with xdg-open)
  • Automated batch mode is also available (the TSV then serves as a log)

Install

pip install tdupes

System dependencies (Ubuntu/Debian):

sudo apt install fdupes plocate gvfs-bin xdg-utils

Usage

tdupes [OPTIONS] PATH [PATH ...]

Positional arguments:
  PATH               Files or directories to scan for duplicates

Options:
  -l, --locate       Expand file arguments via locatedb (exact basename matches)
  -L, --locate-all   Like -l, but also tabulate near-duplicates (same basename,
                     not byte-identical) with real similarity codes
  -t FILE, --tsv FILE
                     Path for the output TSV (default: temp file)
  -p DIR, --prefer DIR
                     Mark DIR as preferred at runtime (files inside are never
                     proposed for deletion). Additive with config. Repeatable.
  -x PATTERN, --exclude PATTERN
                     Shell glob to exclude files by full path. Additive with
                     config. Repeatable: -x '*.tmp' -x '/mnt/*'
  -b, --batch        Batch mode: no prompts; execute DELETE actions immediately
  -v, --verbose      Increase output verbosity
  -q, --quiet        Reduce output verbosity
  -c, --config FILE  Config file path (default: $XDG_CONFIG_HOME/tdupes.yml)
  -V, --version      Show version and exit
  -h, --help         Show this help message and exit
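
To illustrate how a full-path shell glob such as those passed to -x behaves, here is a minimal Python sketch. It assumes fnmatch-style matching, where `*` also matches `/`; whether tdupes uses exactly this matcher is an assumption, and `is_excluded` is a hypothetical helper name.

```python
from fnmatch import fnmatchcase


def is_excluded(path: str, patterns: list[str]) -> bool:
    """Return True if the full path matches any of the glob patterns.

    Assumption: fnmatch semantics, so '*' may span directory separators.
    """
    return any(fnmatchcase(path, pattern) for pattern in patterns)


patterns = ["*.tmp", "/mnt/*"]
for candidate in (
    "/home/user/cache/build.tmp",
    "/mnt/backup/photo.jpg",
    "/home/user/Pictures/photo.jpg",
):
    print(candidate, is_excluded(candidate, patterns))
```

Under these semantics the first two paths are excluded and the third is not.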

Examples

# Scan two directories interactively
tdupes ~/Pictures ~/Downloads

# Use locate to also find exact-duplicate copies of a specific file
tdupes --locate ~/Downloads/photo.jpg ~/Pictures

# Use locate and also include near-duplicates (same basename, different content)
tdupes -L ~/Downloads/photo.jpg ~/Pictures

# Batch mode (good for scripting / cron)
tdupes --batch ~/Documents

# Write the TSV to a specific path
tdupes -t /tmp/dupes.tsv ~/Music ~/Videos

Config

On first run tdupes creates $XDG_CONFIG_HOME/tdupes.yml (defaults to ~/.config/tdupes.yml):

preferred_directories: []   # files here are never proposed to be deleted
verbosity: 1                # 0=quiet, 1=normal, 2=verbose
tsv_output: null            # null = temp file each run
exclusion_patterns: []      # shell glob patterns to skip
batch_mode: false

preferred_directories — any file whose path begins with one of these directories will be marked keep regardless of group ordering.
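
The preferred-directory check can be sketched as follows. This is an illustration, not the tdupes source: `in_preferred_dir` is a hypothetical name, and the sketch compares whole path components, which is stricter than a plain string-prefix test (so /home/user/Pictures2 is not treated as being inside /home/user/Pictures). It also assumes absolute paths with `~` already expanded.

```python
from pathlib import PurePosixPath


def in_preferred_dir(path: str, preferred_dirs: list[str]) -> bool:
    """Return True if path lies inside any preferred directory.

    Assumption: component-wise comparison of absolute POSIX paths.
    """
    target = PurePosixPath(path)
    return any(target.is_relative_to(directory) for directory in preferred_dirs)


preferred = ["/home/user/Pictures"]
print(in_preferred_dir("/home/user/Pictures/2024/photo.jpg", preferred))
print(in_preferred_dir("/home/user/Pictures2/photo.jpg", preferred))
```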

TSV format

Action  Similarity  Size_KB  Modified              Path                              Comment
keep    100         2048.0   2024-11-01T14:22:10   /home/user/Pictures/photo.jpg     in preferred folder
DELETE  100         2048.0   2024-09-15T08:01:55   /home/user/Downloads/photo.jpg

Column      Values
Action      keep or DELETE; edit freely before confirming
Similarity  100 exact · XXX binary same size · NNN text % match · !!! binary different size
Size_KB     File size in kilobytes
Modified    Last-modified timestamp (ISO 8601)
Path        Absolute file path
Comment     Reason for the proposed action (see below); informational, ignored on re-read

Groups are separated by blank lines. The first entry in each group is either the file given as a CLI argument, or the newest copy.

Near-duplicate groups (found with -L) are written in a separate section after the exact-duplicate groups, preceded by a # comment line.
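
If you want to post-process the TSV yourself, the layout described above can be read with a few lines of Python. This is a sketch under stated assumptions: the first line is the header, fields are separated by single tabs without quoting, and both blank lines and # comment lines are treated as group boundaries. `read_groups` is a hypothetical helper, not part of tdupes.

```python
def read_groups(tsv_text: str) -> list[list[dict[str, str]]]:
    """Split a tdupes-style TSV into groups of row dictionaries."""
    lines = tsv_text.splitlines()
    header = lines[0].split("\t")
    groups: list[list[dict[str, str]]] = []
    current: list[dict[str, str]] = []
    for line in lines[1:]:
        # Blank lines separate groups; '#' lines introduce a new section.
        if not line.strip() or line.startswith("#"):
            if current:
                groups.append(current)
                current = []
            continue
        current.append(dict(zip(header, line.split("\t"))))
    if current:
        groups.append(current)
    return groups


SAMPLE = (
    "Action\tSimilarity\tSize_KB\tModified\tPath\tComment\n"
    "keep\t100\t2048.0\t2024-11-01T14:22:10\t/home/user/Pictures/photo.jpg\tin preferred folder\n"
    "DELETE\t100\t2048.0\t2024-09-15T08:01:55\t/home/user/Downloads/photo.jpg\t\n"
)

groups = read_groups(SAMPLE)
to_delete = [row["Path"] for group in groups for row in group if row["Action"] == "DELETE"]
print(to_delete)
```

Only the Action and Path columns matter for deciding what gets trashed; the Comment column is carried along but never consulted.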

Default Action logic

Exact-duplicate groups (byte-identical per fdupes):

Comment tag          Rule
in preferred folder  File is inside a preferred_directories path → keep
last in group        Last file in the group (tiebreaker) → keep
(no tag)             All other copies → DELETE

CLI argument files are listed first in each group so they are never the last-in-group tiebreaker and therefore receive DELETE by default (unless they also fall under a preferred folder rule).
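
The exact-duplicate rules can be sketched in Python as below. Two points are assumptions rather than documented behaviour: the last-in-group tiebreaker is applied only when no file in the group is kept for being in a preferred folder (this is inferred from the sample TSV above, where the last row is DELETE), and the preferred check is component-wise. `default_actions` is a hypothetical name.

```python
from pathlib import PurePosixPath


def default_actions(paths: list[str], preferred_dirs: list[str]) -> list[tuple[str, str]]:
    """Return an (action, comment) pair for each path, in group order."""

    def preferred(path: str) -> bool:
        return any(PurePosixPath(path).is_relative_to(d) for d in preferred_dirs)

    plan = [
        ("keep", "in preferred folder") if preferred(p) else ("DELETE", "")
        for p in paths
    ]
    # Tiebreaker (assumed): if nothing is kept yet, keep the last file.
    if plan and all(action == "DELETE" for action, _ in plan):
        plan[-1] = ("keep", "last in group")
    return plan


group = ["/home/user/Downloads/photo.jpg", "/home/user/Pictures/photo.jpg"]
print(default_actions(group, preferred_dirs=[]))
print(default_actions(group, preferred_dirs=["/home/user/Pictures"]))
```

In both calls the first path, standing in for a CLI argument, ends up as DELETE, which matches the ordering rule described above.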

Near-duplicate groups (-L, same basename, not byte-identical):

Comment tag                Rule
in preferred folder        File is inside a preferred_directories path → keep
largest in basename group  Overall largest file in the group, only if no preferred file is larger → keep
newest in basename group   Overall newest file in the group, only if no preferred file is newer → keep
(no tag)                   Everything else → DELETE

CLI argument files are listed first and may receive DELETE if they are neither the largest nor the newest (and not in a preferred folder).

If a preferred-folder file is already the overall largest (or newest), no extra non-preferred copy is kept for that reason — the preferred file already covers it.

Multiple tags are comma-separated (e.g. largest in basename group, newest in basename group). The Comment column is informational only: tdupes ignores it when it re-reads the TSV after you edit it.
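
A possible reading of the near-duplicate rules, written as a Python sketch. The `Candidate` class and `near_duplicate_actions` function are hypothetical, and two details are assumptions: ties for largest or newest go to the first such file in the group, and a preferred file that is itself the largest or newest is tagged only as in preferred folder.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass
class Candidate:
    path: str
    size: int      # bytes
    mtime: float   # seconds since the epoch


def near_duplicate_actions(
    group: list[Candidate], preferred_dirs: list[str]
) -> list[tuple[str, str]]:
    """Return an (action, comment) pair for each candidate, in group order."""

    def preferred(c: Candidate) -> bool:
        return any(PurePosixPath(c.path).is_relative_to(d) for d in preferred_dirs)

    tags: dict[str, list[str]] = {c.path: [] for c in group}
    for c in group:
        if preferred(c):
            tags[c.path].append("in preferred folder")

    largest = max(group, key=lambda c: c.size)
    newest = max(group, key=lambda c: c.mtime)
    # A non-preferred copy is kept only if it beats every preferred file.
    if not preferred(largest):
        tags[largest.path].append("largest in basename group")
    if not preferred(newest):
        tags[newest.path].append("newest in basename group")

    return [
        ("keep" if tags[c.path] else "DELETE", ", ".join(tags[c.path]))
        for c in group
    ]


group = [
    Candidate("/home/user/Downloads/notes.txt", size=100, mtime=1.0),
    Candidate("/home/user/Documents/notes.txt", size=200, mtime=2.0),
    Candidate("/mnt/old/notes.txt", size=300, mtime=0.0),
]
for row in near_duplicate_actions(group, preferred_dirs=["/home/user/Documents"]):
    print(row)
```

Here the Documents copy is kept as preferred and already covers the newest rule, the /mnt/old copy is kept as the largest, and the Downloads copy is proposed for deletion.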

Interactive flow

  1. tdupes scans paths and prints the duplicate table.
  2. The TSV is opened with xdg-open for manual review.
  3. You edit Action cells (change DELETE to keep or vice versa), save, and return to tdupes.
  4. tdupes re-reads the TSV and asks for confirmation.
  5. On confirmation, all DELETE files are sent to the trash via gio trash.
  6. A summary shows how many files were trashed and how much space was freed.

Files trashed with gio trash remain recoverable from the system trash until the bin is emptied.
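
The final trashing step can be approximated from Python by calling `gio trash` through subprocess. This is a rough sketch rather than the tdupes implementation: `trash_command` and `trash` are hypothetical helpers, paths are assumed to be absolute (as in the Path column), and the example defaults to a dry run so nothing is moved unless you opt in.

```python
import shutil
import subprocess


def trash_command(paths: list[str]) -> list[str]:
    """Build the argument list for a single `gio trash` call."""
    return ["gio", "trash", *paths]


def trash(paths: list[str], dry_run: bool = True) -> list[str]:
    """Send paths to the trash; with dry_run=True only return the command."""
    cmd = trash_command(paths)
    if dry_run or not paths:
        return cmd
    if shutil.which("gio") is None:
        raise RuntimeError("gio executable not found on PATH")
    # check=True raises CalledProcessError if gio reports a failure.
    subprocess.run(cmd, check=True)
    return cmd


print(trash(["/home/user/Downloads/photo.jpg"]))
```

Passing the argument list directly, without a shell, avoids quoting problems with spaces or special characters in file names.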

License

MIT
