
Find duplicates / copy and restructure file layout command-line tool


DedupeCopy

A multi-threaded command-line tool for finding duplicate files and copying/restructuring file layouts while eliminating duplicates.



[!WARNING] Security Notice: DedupeCopy uses Python's pickle module for serializing manifest files. Do not load manifest files from untrusted sources, as this could lead to arbitrary code execution. Only use manifests that you have generated yourself or obtained from a trusted source.

Overview

DedupeCopy is designed for consolidating and restructuring sprawling file systems, particularly useful for:

  • Backup consolidation: Merge multiple backup sources while eliminating duplicates
  • Photo/media library organization: Consolidate photos from various devices and organize by date
  • File system cleanup: Identify and remove duplicate files
  • Server migration: Copy files to new storage while preserving structure
  • Duplicate detection: Generate reports of duplicate files without copying
  • Deleting duplicates: Reclaim disk space by removing redundant files

The good bits:

  • Uses MD5 checksums for accurate duplicate detection
  • Multi-threaded for fast processing
  • Manifest system for resuming interrupted operations
  • Flexible path restructuring rules
  • Can compare against multiple file systems without full re-scans
  • Configurable logging with verbosity levels (quiet, normal, verbose, debug)
  • Colored output for better readability (optional)
  • Helpful error messages with actionable suggestions
  • Real-time progress with processing rates

Note: This is not a replacement for rsync or Robocopy for incremental synchronization. Those are good tools that might work for you, so do try them.

Architecture

DedupeCopy uses a multi-threaded pipeline architecture to maximize performance when processing large file systems. Understanding this architecture helps explain performance characteristics and tuning options.

High-Level Design

┌─────────────────────────────────────────────────────────────────┐
│                         MAIN THREAD                             │
│  • Orchestrates the entire operation                            │
│  • Manages thread lifecycle and coordination                    │
│  • Handles manifest loading/saving                              │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                     THREAD POOLS (Queues)                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌────────────────┐    ┌────────────────┐    ┌───────────────┐  │
│  │  Walk Threads  │───▶│  Read Threads  │───▶│ Copy/Delete   │  │
│  │  (4 default)   │    │  (8 default)   │    │   Threads     │  │
│  │                │    │                │    │  (8 default)  │  │
│  └────────────────┘    └────────────────┘    └───────────────┘  │
│         │                      │                      │         │
│         ▼                      ▼                      ▼         │
│   Walk Queue            Work Queue            Copy/Delete       │
│   (directories)         (files to hash)         Queue           │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                      PROGRESS THREAD                            │
│  • Collects status updates from all worker threads              │
│  • Displays progress, rates, and statistics                     │
│  • Logs errors and warnings                                     │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                    RESULT PROCESSOR                             │
│  • Processes file hashes from read threads                      │
│  • Detects duplicate files                                      │
│  • Updates manifest and collision dictionaries                  │
│  • Performs incremental saves every 50,000 files                │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                    PERSISTENT STORAGE                           │
│  • Manifest: Maps hash → list of files with that hash           │
│  • Collision DB: Tracks duplicate files                         │
│  • SQLite-backed with disk caching (LRU-like eviction)          │
│  • Auto-saves every 50,000 files for crash recovery             │
└─────────────────────────────────────────────────────────────────┘

Pipeline Stages

1. Walk Stage (WalkThread)

  • Purpose: Discover files and directories in the source paths
  • Thread Count: 4 by default (configurable with --walk-threads)
  • Output: Adds directories to walk_queue and files to work_queue
  • Filtering: Applies extension filters and ignore patterns

2. Read Stage (ReadThread)

  • Purpose: Hash file contents to detect duplicates
  • Thread Count: 8 by default (configurable with --read-threads)
  • Input: Files from work_queue
  • Output: Tuple of (hash, size, mtime, filepath) to result_queue
  • Algorithm: md5 or xxh64 (configurable with --hash-algo)
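
The hashing step can be pictured as a chunked read, so even very large files never need to fit in memory. Below is a minimal sketch of that idea, not the project's actual ReadThread code; the chunk size and the optional xxhash fallback are assumptions:

import hashlib
import os

try:
    import xxhash  # optional, provided by the fast_hash extra
except ImportError:
    xxhash = None

def hash_file(path, algo="md5", chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks; return (hexdigest, size, mtime)."""
    hasher = xxhash.xxh64() if (algo == "xxh64" and xxhash) else hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    stat = os.stat(path)
    return hasher.hexdigest(), stat.st_size, stat.st_mtime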

3. Result Processing (ResultProcessor)

  • Purpose: Aggregate hashes and detect collisions
  • Thread Count: 1 (single-threaded for data consistency)
  • Input: Hash results from result_queue
  • Output: Updates manifest and collisions dictionaries
  • Auto-save: Incremental saves every 50,000 processed files

4. Copy/Delete Stage (CopyThread/DeleteThread)

  • Purpose: Perform file operations based on duplicate analysis
  • Thread Count: 8 by default (configurable with --copy-threads)
  • Input: Files from copy_queue or delete_queue
  • Operations:
    • Copy unique files to destination (with optional path rules)
    • Delete duplicate files (keeping first occurrence)

5. Progress Monitoring (ProgressThread)

  • Purpose: Centralized logging and status updates
  • Thread Count: 1 (collects events from all other threads)
  • Features:
    • Real-time progress with processing rates
    • Queue size monitoring (to detect bottlenecks)
    • Error aggregation and reporting
    • Final summary statistics
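
Put together, the stages form a classic producer/consumer pipeline built on thread-safe queues. The sketch below shows the shape of that flow with simplified worker loops; the real WalkThread/ReadThread classes, sentinel handling, and error reporting are more involved:

import hashlib
import os
import queue
import threading

walk_queue = queue.Queue()    # directories still to list
work_queue = queue.Queue()    # files waiting to be hashed
result_queue = queue.Queue()  # (digest, size, mtime, path) tuples

def walk_worker():
    """Discover entries: directories go back on walk_queue, files on work_queue."""
    while True:
        directory = walk_queue.get()
        if directory is None:          # sentinel: shut this worker down
            break
        for entry in os.scandir(directory):
            if entry.is_dir(follow_symlinks=False):
                walk_queue.put(entry.path)
            elif entry.is_file(follow_symlinks=False):
                work_queue.put(entry.path)

def read_worker():
    """Hash each file and hand the result to the single result processor."""
    while True:
        path = work_queue.get()
        if path is None:
            break
        with open(path, "rb") as handle:
            digest = hashlib.md5(handle.read()).hexdigest()
        stat = os.stat(path)
        result_queue.put((digest, stat.st_size, stat.st_mtime, path))

# Default-sized pools, mirroring --walk-threads=4 and --read-threads=8.
walkers = [threading.Thread(target=walk_worker, daemon=True) for _ in range(4)]
readers = [threading.Thread(target=read_worker, daemon=True) for _ in range(8)]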

Data Structures

Manifest

  • Storage: Disk-backed dictionary (SQLite + cache layer)
  • Format: hash → [(filepath, size, mtime), ...]
  • Cache: In-memory LRU cache (10,000 items default)
  • Persistence: Auto-saved during operation and on completion

Disk Cache Dictionary

  • Purpose: Handle datasets larger than available RAM
  • Backend: SQLite with Write-Ahead Logging (WAL)
  • Optimization: Batched commits for performance
  • Eviction: LRU or random eviction when cache is full
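
A dictionary like this can be approximated with the standard-library sqlite3 module. The class below is a simplified stand-in, not the project's implementation; the table layout, pickled values, and crude eviction policy are assumptions made for illustration:

import pickle
import sqlite3

class DiskBackedDict:
    """Minimal disk-backed mapping: a small hot cache in RAM, everything in SQLite."""

    def __init__(self, path, cache_size=10_000):
        self.db = sqlite3.connect(path)
        self.db.execute("PRAGMA journal_mode=WAL")  # write-ahead logging
        self.db.execute("CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v BLOB)")
        self.cache = {}
        self.cache_size = cache_size

    def __setitem__(self, key, value):
        if len(self.cache) >= self.cache_size:
            self.cache.pop(next(iter(self.cache)))  # crude eviction when full
        self.cache[key] = value
        self.db.execute("INSERT OR REPLACE INTO kv VALUES (?, ?)",
                        (key, pickle.dumps(value)))

    def __getitem__(self, key):
        if key in self.cache:
            return self.cache[key]
        row = self.db.execute("SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
        if row is None:
            raise KeyError(key)
        self.cache[key] = pickle.loads(row[0])
        return self.cache[key]

    def save(self):
        self.db.commit()  # batched commit, analogous to the incremental saves

In the real tool, a mapping of this kind holds the manifest entries described above (hash → list of (filepath, size, mtime) tuples).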

Performance Characteristics

Queue-Based Throttling: When queues exceed 100,000 items, the system introduces deliberate delays to prevent memory exhaustion.
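
As a rough illustration of that back-pressure, a producer can simply pause before enqueuing more work; the 100,000-item threshold comes from the paragraph above, while the sleep interval here is an assumed value:

import time

QUEUE_SOFT_LIMIT = 100_000  # threshold described above
THROTTLE_SECONDS = 0.5      # assumed delay for the sketch

def put_with_backpressure(q, item):
    # Wait while consumers catch up, then enqueue the item.
    while q.qsize() > QUEUE_SOFT_LIMIT:
        time.sleep(THROTTLE_SECONDS)
    q.put(item)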

Bottleneck Detection: Progress thread monitors queue sizes:

  • Walk queue growing: Too many directories, consider reducing --walk-threads
  • Work queue growing: Hashing is slower than discovery, increase --read-threads
  • Copy queue growing: I/O is slower than hashing, increase --copy-threads

I/O Patterns:

  • Walk threads: Mostly metadata operations (directory listings)
  • Read threads: Sequential reads of entire files
  • Copy threads: Large sequential writes

Thread Safety

  • Queues: Python's queue.Queue provides thread-safe operations
  • Manifest: Uses database-level locking (SQLite RLock)
  • Save operations: Coordinated via save_event to pause workers during saves
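
The save coordination can be pictured as a shared threading.Event that workers check between items. This is only a sketch of the idea; the real save_event semantics may differ:

import threading

work_allowed = threading.Event()  # set = workers may run, cleared = paused for a save
work_allowed.set()

def worker_step(process_next_item):
    # Each worker blocks here whenever a manifest save is in progress.
    work_allowed.wait()
    process_next_item()

def save_manifest(manifest):
    work_allowed.clear()    # pause workers at their next checkpoint
    try:
        manifest.save()     # flush to SQLite while workers are idle
    finally:
        work_allowed.set()  # resume workers even if the save fails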

Installation

Via pip (recommended)

pip install DedupeCopy

With color support (optional)

For colored console output (errors in red, warnings in yellow, etc.):

pip install DedupeCopy[color]

With fast hashing (optional)

For faster file hashing using xxhash:

pip install DedupeCopy[fast_hash]

From source

git clone https://github.com/othererik/dedupe_copy.git
cd dedupe_copy
pip install -e .
# Or with color support:
pip install -e .[color]

Requirements

  • Python 3.11 or later
  • Sufficient disk space for manifest files (typically small, but can grow for very large file sets)
  • Optional: colorama for colored console output (installed with [color] extra)

Quick Start

Find duplicates in a directory

dedupecopy -p /path/to/search -r duplicates.csv

This scans /path/to/search and creates a CSV report of all duplicate files.

Copy files while removing duplicates

dedupecopy -p /source/path -c /destination/path

This copies all files from source to destination, skipping any duplicates.

Copy with manifest (recommended for large operations)

dedupecopy -p /source/path -c /destination/path -m manifest.db

Creates a manifest file that allows you to resume if interrupted. For example, if the operation is stopped, you can resume it with:

dedupecopy -p /source/path -c /destination/path -i manifest.db -m manifest.db

Delete Duplicates

dedupecopy -p /path/to/search --delete --dry-run

This will scan the specified path and show you which files would be deleted. Once you are sure, you can run the command again without --dry-run to perform the deletion.

Key Concepts

Manifests

Manifests are database files that store:

  • MD5 checksums of processed files
  • File metadata (size, modification time, path)
  • Which files have been scanned

Benefits:

  • Resumability: If an operation is interrupted, you can resume it by loading the manifest. The tool will skip any files that have already been processed.
  • Comparison: Manifests can be used to compare file systems without re-scanning. For example, you can compare a source and a destination to see which files are missing from the destination.
  • Incremental Backups: By loading a manifest from a previous backup, you can process only new or modified files.
  • Tracking: Manifests keep a record of which files have been processed, which is useful for auditing and tracking.

Manifest Options:

  • -m manifest.db - Save manifest after processing (output)
  • -i manifest.db - Load existing manifest before processing (input)
  • --compare manifest.db - Load manifest for duplicate checking only (does not copy those files)

Important Safety Rule: You cannot use the same file path for both -i (input) and -m (output). This prevents accidental manifest corruption during operations.

Understanding -i vs --compare

Both -i and --compare use existing manifests to determine which files to skip, but they serve different purposes.

-i / --manifest-read-path (Input Manifest)

  • Purpose: Resume an interrupted operation or continue from a previous run.
  • Behavior: Files in this manifest are considered "already processed" and are skipped. They are included in the output manifest.
  • Use Case: You started a large copy, it was interrupted, and you want to resume without re-scanning everything.
# Initial run (interrupted)
dedupecopy -p /source -c /dest -m progress.db

# Resume (skips files in progress.db)
dedupecopy -p /source -c /dest -i progress.db -m progress_new.db

--compare (Comparison Manifest)

  • Purpose: Deduplicate against another location.
  • Behavior: Files in this manifest are treated as duplicates and are not copied. They are not included in the output manifest.
  • Use Case: You want to back up new photos from your phone, but you want to skip any photos that are already in your main archive.
# Incremental backup (skip files already in the main backup)
dedupecopy -p /phone_backup -c /main_archive --compare main_archive.db -m phone_backup_new.db

Key Differences

Feature | -i (Input Manifest) | --compare (Comparison Manifest)
Files copied? | No (already processed) | No (treated as duplicates)
Included in output manifest? | Yes | No
Primary use case | Resume operations | Deduplicate across sources
Can share a path with the -m output? | No (safety rule) | Yes

Note on --no-walk: When using -i or --compare, you can also use --no-walk to prevent the tool from scanning the source file system. This is useful when you want to operate only on the files listed in the manifests.

Duplicate Detection

Files are considered duplicates when:

  1. They have identical MD5 checksums
  2. They have the same file size

Special case: Empty (zero-byte) files are treated as unique by default. Use --dedupe-empty to treat them as duplicates.
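
In other words, the dedupe key is effectively the (checksum, size) pair, with zero-byte files opted out unless --dedupe-empty is given. A small sketch of that grouping logic (function and argument names are illustrative, not the tool's API):

from collections import defaultdict

def find_duplicates(records, dedupe_empty=False):
    """records: iterable of (md5, size, path) tuples; returns groups with >1 member."""
    groups = defaultdict(list)
    for md5, size, path in records:
        if size == 0 and not dedupe_empty:
            continue  # empty files stay unique by default
        groups[(md5, size)].append(path)
    return {key: paths for key, paths in groups.items() if len(paths) > 1}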

Usage Examples

Basic Operations

1. Generate a duplicate file report

dedupecopy -p /Users/johndoe -r dupes.csv -m manifest.db

Creates a CSV report of all duplicates and saves a manifest for future use.

With quiet output (minimal):

dedupecopy -p /Users/johndoe -r dupes.csv -m manifest.db --quiet

With verbose output (detailed progress):

dedupecopy -p /Users/johndoe -r dupes.csv -m manifest.db --verbose

2. Copy specific file types

dedupecopy -p /source -c /backup -e jpg -e png -e gif

Copy only image files (jpg, png, gif) to the backup directory.

Photo Organization

Organize photos by date

dedupecopy -p C:\pics -p D:\pics -e jpg -R "jpg:mtime" -c X:\organized_photos

Copies all JPG files from C: and D: drives, organizing them into folders by year/month (e.g., 2024_03/). See the "Path and Extension Rules" section for more on combining rules.
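
The mtime rule derives each folder name from the file's modification time formatted as YYYY_MM; roughly:

import os
import time

def mtime_folder(path):
    # Return the YYYY_MM folder name the mtime rule would produce, e.g. "2024_03".
    return time.strftime("%Y_%m", time.localtime(os.path.getmtime(path)))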

Multi-Source Consolidation

Copy from multiple sources to single destination

dedupecopy -p /source1 -p /source2 -p /source3 -c /backup -m backup_manifest.db

Scans all three source paths and copies unique files to backup.

Resume an interrupted copy

dedupecopy -p /source -c /destination -i manifest.db -m manifest.db

Loads the previous manifest and resumes where it left off.

Advanced Pattern Matching

Ignore specific patterns

dedupecopy -p /source -c /backup --ignore "*.tmp" --ignore "*.cache" --ignore "**/Thumbs.db"

Excludes temporary files and thumbnails from processing.

Extension-specific rules

dedupecopy -p /media -c /organized \
  -R "*.jpg:mtime" \
  -R "*.mp4:extension" \
  -R "*.doc*:no_change"

Different organization rules for different file types.

Command-Line Options

Required Options (one of)

Option | Description
-p PATH, --read-path PATH | Source path(s) to scan. Can be specified multiple times.
--no-walk | Skip the file system walk; use paths from loaded manifests only.

Core Options

Option | Description
-c PATH, --copy-path PATH | Destination path for copying files. Cannot be used with --delete.
--delete | Deletes duplicate files, keeping the first-seen file. Cannot be used with --copy-path.
-r PATH, --result-path PATH | Output path for the CSV duplicate report.
-m PATH, --manifest-dump-path PATH | Path to save the manifest file.
-i PATH, --manifest-read-path PATH | Path to load an existing manifest. Can be specified multiple times.
-e EXT, --extensions EXT | File extension(s) to include (e.g., jpg, *.png). Can be specified multiple times.
--ignore PATTERN | File pattern(s) to exclude (supports wildcards). Can be specified multiple times.
-R RULE, --path-rules RULE | Path restructuring rule(s) in the format extension:rule. Can be specified multiple times.

Special Options

Option | Description
--compare PATH | Load a manifest but don't copy its files (comparison only). Useful for excluding files already present in a destination or another source. Can be specified multiple times.
--copy-metadata | Preserve file timestamps and permissions (uses shutil.copy2 instead of copyfile).
--dedupe-empty | Treat empty (0-byte) files as duplicates rather than unique.
--ignore-old-collisions | Only detect new duplicates (ignore duplicates already in the loaded manifest).
--dry-run | Simulate operations without making any changes to the filesystem.
--min-delete-size BYTES | Minimum size of a file to be considered for deletion (e.g., 1048576 for 1 MB). Default: 0.
--delete-on-copy | Delete source files after a successful copy. Requires --copy-path and -m. WARNING: duplicates are treated as already copied, so they are removed from the source as well.

Output Control Options

Option | Description
-q, --quiet | Show only warnings and errors (minimal output).
-v, --verbose | Show detailed progress information (same as normal, kept for compatibility).
--debug | Show debug information, including queue states and internal diagnostics.
--no-color | Disable colored output (useful when logging to files or non-terminal output).

Output Verbosity Levels:

  • Normal (default): Standard progress updates, errors, and summaries.
  • Quiet (--quiet): Only warnings, errors, and the final summary.
  • Verbose (--verbose): More frequent progress updates that include processing rates and timing details.
  • Debug (--debug): All output including queue states and internal operations for troubleshooting.

Performance Options

Option | Default | Description
--walk-threads N | 4 | Number of threads for file system traversal.
--read-threads N | 8 | Number of threads for reading and hashing files.
--copy-threads N | 8 | Number of threads for copying files.
--hash-algo ALGO | md5 | Hashing algorithm to use (md5 or xxh64). xxh64 requires the fast_hash extra.

Path Conversion Options

Option | Description
--convert-manifest-paths-from PREFIX | Original path prefix in the manifest to replace.
--convert-manifest-paths-to PREFIX | New path prefix (useful when drive letters or mount points change).

Path and Extension Rules

This section explains how to control file selection and organization using extension filters (-e), ignore patterns (--ignore), and path restructuring rules (-R).

Filtering Files by Extension

Use the -e or --extensions option to specify which file types to process.

  • jpg: Matches .jpg files.
  • *.jp*g: Matches .jpg, .jpeg, .jpng, etc.
  • *: Matches all extensions.

If no -e option is provided, all files are processed by default.

Ignoring Files and Directories

Use --ignore to exclude files or directories that match a specific pattern.

  • "*.tmp": Ignores all files with a .tmp extension.
  • "**/Thumbs.db": Ignores Thumbs.db files in any directory.
  • "*.cache": Ignores all files ending in .cache.

Restructuring Destination Paths

Path rules (-R or --path-rules) determine how files are organized in the destination directory. The format is pattern:rule.

Default Behavior: If no -R flag is specified, the original directory structure is preserved (equivalent to -R "*:no_change"). This is the most intuitive behavior for backup and copy operations.

Available Rules

Rule | Description | Example Output
no_change | [DEFAULT] Preserves the original directory structure | /dest/original/path/file.jpg
mtime | Organizes by modification date (YYYY_MM) | /dest/2024_03/file.jpg
extension | Organizes into folders by file extension | /dest/jpg/file.jpg

Combining Rules

Rules are applied in the order they are specified, creating nested directories.

# Organize first by extension, then by date
dedupecopy -p /source -c /backup -R "*:extension" -R "*:mtime"

Result: /backup/jpg/2024_03/photo.jpg
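
Conceptually, each rule contributes one path component, applied in the order given. A simplified sketch of how a destination path could be assembled under those two rules (an assumed helper, not the tool's code):

import os
import time

def destination_for(src_path, dest_root, rules):
    """Assemble a destination path by applying ordered rules ('extension', 'mtime')."""
    parts = [dest_root]
    for rule in rules:
        if rule == "extension":
            ext = os.path.splitext(src_path)[1].lstrip(".").lower()
            parts.append(ext or "no_extension")
        elif rule == "mtime":
            parts.append(time.strftime("%Y_%m",
                                       time.localtime(os.path.getmtime(src_path))))
    parts.append(os.path.basename(src_path))
    return os.path.join(*parts)

# destination_for("/source/pics/photo.jpg", "/backup", ["extension", "mtime"])
# -> "/backup/jpg/2024_03/photo.jpg" (the month depends on the file's mtime)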

Pattern Matching for Rules

The pattern part of the rule determines which files the rule applies to. It supports wildcards, just like the -e filter.

  • "*.jpg:mtime": Applies the mtime rule only to JPG files.
  • "*.jp*g:mtime": Applies to .jpg, .jpeg, etc.
  • "*:no_change": Applies the no_change rule to all files.

The most specific pattern wins if multiple patterns could match a file.

Example: Different Rules for Different Files

dedupecopy -p /media -c /organized \
  -R "*.jpg:mtime" \
  -R "*.mp4:extension" \
  -R "*.pdf:no_change"
  • JPG files are organized by date.
  • MP4 files are organized into an mp4 folder.
  • PDF files keep their original directory structure.

Advanced Workflows

Sequential Multi-Source Backup

When consolidating from multiple sources to a single target while avoiding duplicates between sources:

Step 1: Create manifests for all locations

# Scan target (if it has existing files)
dedupecopy -p /backup/target -m target_manifest.db

# Scan each source
dedupecopy -p /source1 -m source1_manifest.db
dedupecopy -p /source2 -m source2_manifest.db

Step 2: Copy each source sequentially

# Copy source1 (skip files already in target or source2)
dedupecopy -p /source1 -c /backup/target \
  -i source1_manifest.db \
  --compare target_manifest.db \
  --compare source2_manifest.db \
  -m target_v1.db \
  --no-walk

# Copy source2 (skip files already in target or source1)
dedupecopy -p /source2 -c /backup/target \
  -i source2_manifest.db \
  --compare target_v1.db \
  --compare source1_manifest.db \
  -m target_v2.db \
  --no-walk

How it works:

  • --no-walk skips re-scanning the filesystem; the files to copy come from the manifest loaded with -i
  • --compare loads manifests for duplicate checking but doesn't copy those files
  • Each source is copied only if files aren't already in target or other sources
  • Each step creates a new manifest tracking what's been copied so far

Note: We use --compare (rather than -i) for the target and other-source manifests because:

  • -i + -m cannot use the same file path (safety rule)
  • --compare is designed for exactly this use case (deduplication across sources)
  • The source manifests are used with --no-walk to avoid re-scanning

Manifest Path Conversion

If drive letters or mount points change between runs:

dedupecopy -i old_manifest.db -m new_manifest.db \
  --convert-manifest-paths-from "/Volumes/OldDrive" \
  --convert-manifest-paths-to "/Volumes/NewDrive" \
  --no-walk

Updates all paths in the manifest without re-scanning files.

Incremental Backup

The most common use case for incremental backups is to copy new files from a source to a destination, skipping files that are already in the destination.

Step 1: Create a manifest of the destination

First, create a manifest of your destination directory. This gives you a record of what's already there.

dedupecopy -p /path/to/backup -m backup.db

Step 2: Run the incremental copy

Now, you can copy new files from your source, using --compare to skip duplicates that are already in the backup.

dedupecopy -p /path/to/source -c /path/to/backup --compare backup.db -m backup_new.db

How it works:

  • --compare efficiently checks for duplicates without re-scanning the entire destination.
  • Only new files from the source are copied.
  • A new manifest (backup_new.db) is created, which includes both the old and new files. You can use this for the next incremental backup.

Example: Golden Directory Backup

This is useful for maintaining a "golden" directory with unique files from multiple sources.

# 1. Create a manifest of the golden directory
dedupecopy -p /golden_dir -m golden.db

# 2. Copy new, unique files from multiple sources
dedupecopy -p /source1 -p /source2 -c /golden_dir --compare golden.db -m golden_new.db

Comparison Without Copying

Compare two directories to find what's different:

# Scan both locations
dedupecopy -p /location1 -m manifest1.db
dedupecopy -p /location2 -m manifest2.db

# Generate report of files in location1 not in location2
dedupecopy -p /location1 -i manifest1.db --compare manifest2.db -r unique_files.csv --no-walk

Performance Tips

Thread Count Tuning

Default settings (4/8/8) work well for most scenarios.

For SSDs/NVMe:

--walk-threads 8 --read-threads 16 --copy-threads 16

For HDDs:

--walk-threads 2 --read-threads 4 --copy-threads 4

For network shares:

--walk-threads 2 --read-threads 4 --copy-threads 2

Network latency makes more threads counterproductive.

Large File Sets

For very large directories (millions of files):

  1. Use manifests - Essential for resumability
  2. Process in batches - Use --ignore to exclude subdirectories, process separately
  3. Monitor memory - Manifests use disk-based caching to minimize memory usage
  4. Incremental saves - Manifests auto-save every 50,000 files

Network Considerations

  • Network paths may time out - the tool retries after 3 seconds
  • SMB/CIFS shares - Use lower thread counts
  • Bandwidth limits - Reduce copy threads to avoid saturation
  • VPN connections - May need much lower thread counts

Manifest Storage

  • Manifest files are stored as SQLite database files
  • Size is proportional to number of unique files (typically a few MB per 100k files)
  • Keep manifests on fast storage (SSD) for best performance
  • Manifests are incrementally saved every 50,000 processed files

Logging and Output Control

Verbosity Levels

DedupeCopy provides flexible output control to suit different use cases:

Normal Mode (Default)

Standard output with progress updates every 1,000 files:

dedupecopy -p /source -c /destination

Output includes:

  • Pre-flight configuration summary
  • Progress updates with file counts and processing rates
  • Error messages with helpful suggestions
  • Final summary statistics

Quiet Mode

Minimal output - only warnings, errors, and final results:

dedupecopy -p /source -c /destination --quiet

Best for:

  • Cron jobs and automated scripts
  • When you only care about problems
  • Reducing log file sizes

Verbose Mode

Detailed progress information:

dedupecopy -p /source -c /destination --verbose

Output includes:

  • All normal mode output
  • More frequent progress updates
  • Detailed timing and rate information

Debug Mode

Comprehensive diagnostic information:

dedupecopy -p /source -c /destination --debug

Output includes:

  • All verbose mode output
  • Queue sizes and internal state
  • Thread activity details
  • Useful for troubleshooting performance issues

Color Output

By default, DedupeCopy uses colored output when writing to a terminal (if colorama is installed):

  • Errors: Red text
  • Warnings: Yellow text
  • Info messages: Default color
  • Debug messages: Cyan text

To disable colors (e.g., when logging to a file):

dedupecopy -p /source -c /destination --no-color

Colors are automatically disabled when output is redirected to a file or pipe.

Enhanced Features

Pre-Flight Summary

Before starting operations, you'll see a summary of configuration:

======================================================================
DEDUPE COPY - Operation Summary
======================================================================
Source path(s): 2 path(s)
  - /Volumes/Source1
  - /Volumes/Source2
Destination: /Volumes/Backup
Extension filter: jpg, png, gif
Path rules: *.jpg:mtime
Threads: walk=4, read=8, copy=8
Options: dedupe_empty=False, preserve_stat=True, no_walk=False
======================================================================

Progress with Rates

During operation, you'll see processing rates:

Discovered 5000 files (dirs: 250), accepted 4850. Rate: 142.3 files/sec
Work queue has 234 items. Progress queue has 12 items. Walk queue has 5 items.
...
Copied 4800 items. Skipped 50 items. Rate: 125.7 files/sec

Helpful Error Messages

Errors include context and suggestions:

Error processing '/path/to/file.txt': [PermissionError] Permission denied
  Suggestions: Check file permissions; Ensure you have read access to source files

Proactive Warnings

The tool warns about potential issues before they become problems:

WARNING: Work queue is large (42000 items). Consider reducing thread counts to avoid memory issues.
WARNING: Progress queue is backing up (12000 items). This may indicate slow processing.

Examples

Silent operation for scripts

dedupecopy -p /source -c /backup --quiet 2>&1 | tee backup.log

Maximum detail for troubleshooting

dedupecopy -p /source -c /backup --debug --no-color > debug.log 2>&1

Normal operation with color

dedupecopy -p /source -c /backup --verbose

Troubleshooting

Common Issues

"Directory disappeared during walk"

Cause: Network path timeout or files deleted during scan.

Solution:

  • Reduce --walk-threads for network paths
  • Ensure stable network connection
  • Exclude volatile directories with --ignore

Out of Memory Errors

Cause: Very large queue sizes.

Solution:

  • Reduce thread counts
  • Process in smaller batches
  • Ensure sufficient swap space

Permission Errors

Cause: Insufficient permissions on source or destination.

Solution:

# Check permissions
ls -la /source/path
ls -la /destination/path

# Run with appropriate user or use sudo (carefully!)

Resuming Interrupted Runs

If a run is interrupted:

# Resume using the manifest
dedupecopy -p /source -c /destination -i manifest.db -m manifest.db

Files already processed (in manifest) are skipped.

Manifest Corruption

If manifest files become corrupted:

Solution:

  • Delete manifest files and restart
  • Manifest files: manifest.db and manifest.db.read
  • Consider keeping backup copies of manifests for very long operations

Getting Help

Check the output during run:

  • Progress updates every 1,000 files with processing rates
  • Error messages show problematic files with helpful suggestions
  • Warnings alert you to potential issues proactively
  • Final summary shows counts and errors

For debugging, use --debug mode:

dedupecopy -p /source -c /destination --debug --no-color > debug.log 2>&1

Debug output includes:

  • File counts and progress with timing
  • Queue sizes and internal state (useful if growing unbounded)
  • Thread activity and performance metrics
  • Specific error messages with file paths and suggestions

Safety and Best Practices

⚠️ Important Warnings

  1. Test first: Run with -r (report only) before using -c (copy) on important data
  2. Backup important data: Always have backups before restructuring
  3. Use manifests: They provide a record of what was processed
  4. Verify results: Check file counts and spot-check files after copy operations
  5. Watch disk space: Ensure sufficient space on destination

Manifest Safety Rules

To prevent accidental data loss, DedupeCopy enforces the following rules for manifest usage:

  1. Destructive Operations Require an Output Manifest: Any operation that modifies the set of files being tracked (e.g., --delete, --delete-on-copy) requires the -m/--manifest-dump-path option. This ensures that the results of the operation are saved to a new manifest, preserving the original.

  2. Input and Output Manifests Must Be Different: To protect your original manifest, you cannot use the same file path for both -i/--manifest-read-path and -m/--manifest-dump-path. This prevents the input manifest from being overwritten.

Example of a safe delete operation:

# Load an existing manifest and save the changes to a new one
dedupecopy --no-walk --delete -i manifest_original.db -m manifest_after_delete.db

Recommended Workflow

# Step 1: Generate report to understand what will happen
dedupecopy -p /source -r preview.csv -m preview_manifest.db

# Step 2: Review the CSV report
# Check duplicate counts, file types, sizes

# Step 3: Run the actual copy with manifest
dedupecopy -p /source -c /destination -i preview_manifest.db -m final_manifest.db

# Step 4: Verify
# Check file counts, spot-check files, verify important files copied

What Gets Copied / Deleted

  • The first occurrence of each unique file (by MD5 hash) is kept and, in copy mode, copied to the destination.
  • Subsequent identical files (duplicates) are skipped during a copy, or removed from the source when --delete is used.
  • Files are considered unique if their MD5 hashes differ.
  • By default, empty files are treated as unique. Use --dedupe-empty to treat them as duplicates.
  • Ignored patterns (--ignore) are never copied or deleted.

What Doesn't Get Copied / Deleted

  • The first-seen version of a file is never deleted.
  • Files matching --ignore patterns.
  • Files listed in --compare manifests (used for comparison only).
  • Extensions not matching the -e filter (if specified).
  • Symbolic links are not followed and are ignored by the tool.

Preserving Metadata

By default, only file contents are copied. To preserve timestamps and permissions:

dedupecopy -p /source -c /destination --copy-metadata

This uses Python's shutil.copy2() which preserves:

  • Modification time
  • Access time
  • File mode (permissions)

Note: Not all metadata may transfer across different file systems.
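
The difference is simply which shutil helper performs the copy; a minimal sketch of the choice (mirroring standard-library behaviour, with an illustrative flag name):

import shutil

def copy_one(src, dst, copy_metadata=False):
    if copy_metadata:
        shutil.copy2(src, dst)     # contents plus mtime/atime/permissions
    else:
        shutil.copyfile(src, dst)  # contents only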

Output Files

CSV Duplicate Report

Format: Collision #, MD5, Path, Size (bytes), mtime

Src: ['/source/path']
Collision #, MD5, Path, Size (bytes), mtime
1, d41d8cd98f00b204e9800998ecf8427e, '/path/file1.jpg', 1024, 1633024800.0
1, d41d8cd98f00b204e9800998ecf8427e, '/path/file2.jpg', 1024, 1633024800.0
2, a3d5c12f8b9e4a1c2d3e4f5a6b7c8d9e, '/path/doc1.pdf', 2048, 1633111200.0
2, a3d5c12f8b9e4a1c2d3e4f5a6b7c8d9e, '/path/doc2.pdf', 2048, 1633111200.0
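
Because the report is plain CSV after the one-line Src header, it is easy to post-process. A small sketch that sums the space reclaimable by keeping one copy per collision group (column layout assumed from the example above; paths containing commas would need a more careful parser):

import csv
from collections import defaultdict

def reclaimable_bytes(report_path):
    """Sum the sizes of all but one file in each collision group of a dupes CSV."""
    groups = defaultdict(list)
    with open(report_path, newline="") as handle:
        reader = csv.reader(handle)
        next(reader)  # skip the "Src: [...]" line
        next(reader)  # skip the column header line
        for row in reader:
            collision_id, _md5, _path, size, _mtime = (c.strip() for c in row)
            groups[collision_id].append(int(size))
    return sum(sum(sizes) - max(sizes) for sizes in groups.values())

print(reclaimable_bytes("dupes.csv"))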

Manifest Files

Binary database files (not human-readable):

  • manifest.db - MD5 hashes and file metadata
  • manifest.db.read - List of processed file paths

These enable resuming and incremental operations.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the Simplified BSD License.


Author

Erik Schweller (othererik@gmail.com)


Status: Tested and seems to work, but use with caution and always backup important data!
