
Find duplicate files in Google Drive using MD5 checksums

Project description

dedrive


🔍 Find and manage duplicate files in Google Drive using MD5 checksums 🗑️

Features · Quick Start · Configuration · Usage


Features

  • Fast MD5-based detection - Identifies duplicates by comparing file checksums, not just names
  • Two interfaces - CLI for quick scans, Web UI for interactive review with file previews
  • Non-destructive - Moves duplicates to /_dupes folder instead of deleting them
  • Preserves structure - Original folder hierarchy is maintained under the dupes folder
  • Resumable sessions - Decisions auto-save and persist across sessions
  • Flexible filtering - Scan specific paths and exclude folders from analysis
  • Multi-account profiles - Target different Google Drive accounts with named profiles

Quick Start

# Install with uv
uv sync

# Launch the web UI
uv run main.py

First run: A browser window will open for Google OAuth authentication. Grant access to your Google Drive.

Installation

Prerequisites

  • Python 3.10+
  • uv package manager
  • Google Cloud OAuth credentials (setup guide)

Google Cloud Setup

  1. Go to Google Cloud Console
  2. Create a project (or select existing)
  3. Enable the Google Drive API
  4. Create OAuth 2.0 Client ID (choose "Desktop app")
  5. Download the JSON file and save as credentials.json in the project root

Usage

CLI Tool

# Scan entire drive
uv run main.py

# Scan specific folder
uv run main.py --path "/Photos"

# Exclude folders
uv run main.py --exclude "/Backup/Old" --exclude "/tmp"

# Custom output location
uv run main.py --output results.csv

# Validate credentials
uv run main.py --validate

# Debug logging
uv run main.py --verbose --log-file debug.log

# Profiles (multiple Google accounts)
uv run main.py --init-profile work
uv run main.py --list-profiles
uv run main.py --profile work

Web UI

uv run main.py

The web interface provides three tabs:

Tab     Purpose
Scan    Run scans with path filtering and progress feedback
Review  Side-by-side comparison with file previews; make keep/skip decisions
Export  Preview moves (dry run), execute moves, export decisions to JSON

Note: PDF previews require poppler (on macOS: brew install poppler)

Moving Duplicates

Instead of being deleted, duplicates are moved to /_dupes at the Drive root:

/Photos/2024/IMG.jpg  →  /_dupes/Photos/2024/IMG.jpg

  1. Scan - Find duplicates
  2. Review - Mark which files to keep
  3. Preview - Dry run to see what would move
  4. Execute - Move duplicates to /_dupes
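The path mapping above can be sketched in a few lines of Python (an illustrative helper, not dedrive's actual internals):

```python
def dupes_destination(original_path: str, dupes_folder: str = "/_dupes") -> str:
    """Map a file's Drive path to its destination under the dupes folder,
    preserving the original folder hierarchy."""
    return dupes_folder.rstrip("/") + "/" + original_path.lstrip("/")

print(dupes_destination("/Photos/2024/IMG.jpg"))  # /_dupes/Photos/2024/IMG.jpg
```

Because only the prefix changes, the original hierarchy under /_dupes mirrors the source exactly, which makes a move easy to undo by hand.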

Configuration

Settings can be configured via environment variables, config.json, or CLI arguments.

Precedence: CLI > Profile config.yaml > Environment > Config file > Defaults
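That precedence order could be resolved along these lines (an illustrative sketch; the function and parameter names are hypothetical, not dedrive's actual API):

```python
import os

def resolve_setting(name, cli_value=None, profile_cfg=None, env_var=None,
                    config_file=None, default=None):
    """Return the first value found, in precedence order:
    CLI > profile config > environment > config file > defaults."""
    if cli_value is not None:
        return cli_value
    if profile_cfg and name in profile_cfg:
        return profile_cfg[name]
    if env_var and os.environ.get(env_var):
        return os.environ[env_var]
    if config_file and name in config_file:
        return config_file[name]
    return default

# A CLI flag wins over everything else; with no flag, the environment
# value is used (note that environment values arrive as strings).
os.environ["GDRIVE_BATCH_SIZE"] = "250"
print(resolve_setting("batch_size", env_var="GDRIVE_BATCH_SIZE", default=100))  # 250
```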

Environment Variables

Variable                 Default                Description
GDRIVE_CREDENTIALS_PATH  credentials.json       OAuth credentials file
GDRIVE_TOKEN_PATH        (next to credentials)  OAuth token file
GDRIVE_OUTPUT_DIR        .output                Output directory
GDRIVE_DUPES_FOLDER      /_dupes                Folder for duplicates
GDRIVE_BATCH_SIZE        100                    Batch size for API operations
GDRIVE_MAX_PREVIEW_MB    10                     Max file size (MB) for previews
GDRIVE_EXCLUDE_PATHS     (none)                 Comma-separated paths to exclude

Config File

Create config.json in the project root:

{
  "credentials_path": "~/.config/dedrive/credentials.json",
  "output_dir": "~/.local/share/dedrive",
  "dupes_folder": "/_dupes",
  "batch_size": 100,
  "exclude_paths": ["/Backup/Old", "/tmp"]
}
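A config file like the one above could be loaded with ~ expansion roughly as follows (an illustrative sketch; the field names follow the example, but the loader itself is hypothetical):

```python
import json
from pathlib import Path

def load_config(path: str = "config.json") -> dict:
    """Load config.json if present, expanding ~ in path-valued settings."""
    p = Path(path)
    cfg = json.loads(p.read_text()) if p.exists() else {}
    for key in ("credentials_path", "output_dir"):
        if key in cfg:
            # Expand ~ so values like "~/.config/dedrive/..." become absolute
            cfg[key] = str(Path(cfg[key]).expanduser())
    return cfg
```

Expanding ~ at load time means the same config file works unchanged across user accounts and machines.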

Profiles

Use profiles to manage multiple Google Drive accounts:

# Create a profile
uv run main.py --init-profile work

# Copy credentials into the profile
cp ~/Downloads/credentials.json profiles/work/

# Use the profile
uv run main.py --profile work

Each profile stores its own credentials.json, token.json, config.yaml, and .output/ under profiles/<name>/.

Output Files

File                         Description
.output/duplicates.csv       Scan results with duplicate pairs
.output/decisions.json       User decisions (auto-saved)
.output/execution_log.json   Move operation results
.output/scan_results.json    Cached scan data for session resume

How It Works

  1. OAuth authentication - Cached in token.json after first login
  2. Single listing pass - Fetches all file metadata (including MD5 checksums) in one paginated listing request
  3. In-memory path resolution - Builds paths from parent IDs with memoization
  4. MD5 grouping - Groups files by checksum to identify duplicates
  5. Size validation - Files with the same MD5 but different sizes are flagged as "uncertain"

Note: Google Workspace files (Docs, Sheets, Slides) are skipped as they don't have MD5 checksums.
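The grouping and size-validation steps above can be sketched as follows (illustrative Python, assuming file records shaped like the Drive API's file resource; this is not dedrive's actual code):

```python
from collections import defaultdict

def group_duplicates(files):
    """Group Drive file records by MD5 checksum.

    Google Workspace files carry no md5Checksum and are skipped.
    Groups whose members disagree on size are flagged as uncertain
    rather than treated as confirmed duplicates.
    """
    by_md5 = defaultdict(list)
    for f in files:
        md5 = f.get("md5Checksum")
        if md5:  # skip Docs/Sheets/Slides, which lack checksums
            by_md5[md5].append(f)

    confirmed, uncertain = [], []
    for group in by_md5.values():
        if len(group) < 2:
            continue  # unique checksum: not a duplicate
        sizes = {f.get("size") for f in group}
        (confirmed if len(sizes) == 1 else uncertain).append(group)
    return confirmed, uncertain

files = [
    {"id": "1", "name": "a.jpg", "size": "10", "md5Checksum": "abc"},
    {"id": "2", "name": "b.jpg", "size": "10", "md5Checksum": "abc"},
    {"id": "3", "name": "report"},  # Workspace file: no checksum, skipped
]
confirmed, uncertain = group_duplicates(files)
print(len(confirmed), len(uncertain))  # 1 0
```

Grouping by checksum rather than filename is what lets the tool catch renamed copies, while the size check guards against the (rare) case of an MD5 collision.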

Re-authentication

If you previously used this tool with read-only access, delete token.json and re-authenticate to grant move permissions.

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dedrive-0.1.0.tar.gz (366.5 kB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

dedrive-0.1.0-py3-none-any.whl (29.3 kB)

Uploaded Python 3

File details

Details for the file dedrive-0.1.0.tar.gz.

File metadata

  • Download URL: dedrive-0.1.0.tar.gz
  • Upload date:
  • Size: 366.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for dedrive-0.1.0.tar.gz
Algorithm    Hash digest
SHA256       5540b1143bb9e1f053c90daca48603233dba057caab67b12f85d76bdcac1860e
MD5          3857af8457c7ab576524324a93b549b5
BLAKE2b-256  7c2ffbccf194fe12abc54b34c6579c018154c903a9871db49f09c2ff076b3db9


Provenance

The following attestation bundles were made for dedrive-0.1.0.tar.gz:

Publisher: release.yml on tsilva/dedrive

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file dedrive-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: dedrive-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 29.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for dedrive-0.1.0-py3-none-any.whl
Algorithm    Hash digest
SHA256       a6609ee02ff95db09ad7b28e87333ba1020e5034fa05ddec9a2483e6265be5ea
MD5          a54c33f9740f08c3f89a5d73c7945bfe
BLAKE2b-256  2086e6b7a7361d5e25fd0e8697860afb15401ca9140cf9e8d261b113a1612662


Provenance

The following attestation bundles were made for dedrive-0.1.0-py3-none-any.whl:

Publisher: release.yml on tsilva/dedrive

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
