Find duplicate files in Google Drive using MD5 checksums
🔍 Find and manage duplicate files in Google Drive using MD5 checksums 🗑️
Features
- Fast MD5-based detection - Identifies duplicates by comparing file checksums, not just names
- Two interfaces - CLI for quick scans, Web UI for interactive review with file previews
- Non-destructive - Moves duplicates to the /_dupes folder instead of deleting them
- Preserves structure - Original folder hierarchy is maintained under the dupes folder
- Resumable sessions - Decisions auto-save and persist across sessions
- Flexible filtering - Scan specific paths and exclude folders from analysis
- Multi-account profiles - Target different Google Drive accounts with named profiles
Quick Start
# Install with uv
uv sync
# Launch the web UI
uv run main.py
First run: A browser window will open for Google OAuth authentication. Grant access to your Google Drive.
Installation
Prerequisites
- Python 3.10+
- uv package manager
- Google Cloud OAuth credentials (setup guide)
Google Cloud Setup
- Go to Google Cloud Console
- Create a project (or select existing)
- Enable the Google Drive API
- Create OAuth 2.0 Client ID (choose "Desktop app")
- Download the JSON file and save it as credentials.json in the project root
Usage
CLI Tool
# Scan entire drive
uv run main.py
# Scan specific folder
uv run main.py --path "/Photos"
# Exclude folders
uv run main.py --exclude "/Backup/Old" --exclude "/tmp"
# Custom output location
uv run main.py --output results.csv
# Validate credentials
uv run main.py --validate
# Debug logging
uv run main.py --verbose --log-file debug.log
# Profiles (multiple Google accounts)
uv run main.py --init-profile work
uv run main.py --list-profiles
uv run main.py --profile work
Web UI
uv run main.py
The web interface provides three tabs:
| Tab | Purpose |
|---|---|
| Scan | Run scans with path filtering and progress feedback |
| Review | Side-by-side comparison with file previews; mark keep/skip decisions |
| Export | Preview moves (dry run), execute moves, export decisions to JSON |
Note: PDF preview requires poppler: brew install poppler (macOS)
Moving Duplicates
Instead of deleting, duplicates are moved to /_dupes at Drive root:
/Photos/2024/IMG.jpg → /_dupes/Photos/2024/IMG.jpg
- Scan - Find duplicates
- Review - Mark which files to keep
- Preview - Dry run to see what would move
- Execute - Move duplicates to /_dupes
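The path mirroring above amounts to prefixing the original Drive path with the dupes root. A minimal sketch of that mapping (the function name `dupe_destination` is illustrative, not part of the tool's API):

```python
def dupe_destination(original_path: str, dupes_root: str = "/_dupes") -> str:
    """Mirror an original Drive path under the dupes folder, preserving hierarchy."""
    # "/Photos/2024/IMG.jpg" -> "/_dupes/Photos/2024/IMG.jpg"
    return dupes_root + original_path
```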
Configuration
Settings can be configured via CLI arguments, a profile's config.yaml, environment variables, or config.json.
Precedence: CLI > Profile config.yaml > Environment > Config file > Defaults
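The precedence chain behaves like a first-match lookup across layers. A sketch of that resolution, assuming each layer is a plain dict (the function name is illustrative):

```python
def resolve_setting(key, cli, profile, env, config, defaults):
    """Return the first non-None value for `key`, checking layers in precedence
    order: CLI > profile config > environment > config file > defaults."""
    for layer in (cli, profile, env, config, defaults):
        if layer.get(key) is not None:
            return layer[key]
    return None
```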
Environment Variables
| Variable | Default | Description |
|---|---|---|
| GDRIVE_CREDENTIALS_PATH | credentials.json | OAuth credentials file |
| GDRIVE_TOKEN_PATH | (next to credentials) | OAuth token file |
| GDRIVE_OUTPUT_DIR | .output | Output directory |
| GDRIVE_DUPES_FOLDER | /_dupes | Folder for duplicates |
| GDRIVE_BATCH_SIZE | 100 | Batch size for API operations |
| GDRIVE_MAX_PREVIEW_MB | 10 | Max file size (MB) for previews |
| GDRIVE_EXCLUDE_PATHS | (none) | Comma-separated paths to exclude |
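GDRIVE_EXCLUDE_PATHS takes a comma-separated list. A sketch of how such a value might be parsed, trimming whitespace and dropping empty entries (an illustration, not the tool's exact parser):

```python
import os


def parse_exclude_paths(value: str) -> list[str]:
    """Split a comma-separated path list into clean entries."""
    return [p.strip() for p in value.split(",") if p.strip()]


# e.g. export GDRIVE_EXCLUDE_PATHS="/Backup/Old, /tmp"
excludes = parse_exclude_paths(os.environ.get("GDRIVE_EXCLUDE_PATHS", ""))
```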
Config File
Create config.json in the project root:
{
"credentials_path": "~/.config/dedrive/credentials.json",
"output_dir": "~/.local/share/dedrive",
"dupes_folder": "/_dupes",
"batch_size": 100,
"exclude_paths": ["/Backup/Old", "/tmp"]
}
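Since the example config uses `~` in paths, a loader would need to expand them. A sketch assuming the field names shown above (the `load_config` helper is illustrative):

```python
import json
import os


def load_config(path: str = "config.json") -> dict:
    """Load config.json, expanding ~ in path-valued fields."""
    with open(path) as f:
        cfg = json.load(f)
    for key in ("credentials_path", "output_dir"):
        if key in cfg:
            cfg[key] = os.path.expanduser(cfg[key])
    return cfg
```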
Profiles
Use profiles to manage multiple Google Drive accounts:
# Create a profile
uv run main.py --init-profile work
# Copy credentials into the profile
cp ~/Downloads/credentials.json profiles/work/
# Use the profile
uv run main.py --profile work
Each profile stores its own credentials.json, token.json, config.yaml, and .output/ under profiles/<name>/.
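The per-profile layout described above can be expressed as a small helper (a sketch based on the directory listing, not the tool's own API):

```python
from pathlib import Path


def profile_paths(name: str, root: str = "profiles") -> dict[str, Path]:
    """Locate the files a named profile keeps under profiles/<name>/."""
    base = Path(root) / name
    return {
        "credentials": base / "credentials.json",
        "token": base / "token.json",
        "config": base / "config.yaml",
        "output": base / ".output",
    }
```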
Output Files
| File | Description |
|---|---|
| .output/duplicates.csv | Scan results with duplicate pairs |
| .output/decisions.json | User decisions (auto-saved) |
| .output/execution_log.json | Move operation results |
| .output/scan_results.json | Cached scan data for session resume |
How It Works
- OAuth authentication - Cached in token.json after first login
- Single API call - Fetches all files with MD5 metadata in one paginated request
- In-memory path resolution - Builds paths from parent IDs with memoization
- MD5 grouping - Groups files by checksum to identify duplicates
- Size validation - Files with the same MD5 but different sizes are flagged as "uncertain"
Note: Google Workspace files (Docs, Sheets, Slides) are skipped as they don't have MD5 checksums.
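The grouping and size-validation steps can be sketched in a few lines, assuming each file record carries the Drive API's md5Checksum and size fields (the function itself is illustrative, not the tool's code):

```python
from collections import defaultdict


def find_duplicates(files: list[dict]) -> dict[str, list[dict]]:
    """Group files by MD5; only groups with 2+ members are duplicates."""
    groups = defaultdict(list)
    for f in files:
        md5 = f.get("md5Checksum")
        if md5 is None:
            # Google Workspace files (Docs, Sheets, Slides) carry no checksum
            continue
        groups[md5].append(f)
    dupes = {h: fs for h, fs in groups.items() if len(fs) > 1}
    # Size validation: same MD5 but differing sizes is suspicious
    for fs in dupes.values():
        uncertain = len({f.get("size") for f in fs}) > 1
        for f in fs:
            f["uncertain"] = uncertain
    return dupes
```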
Re-authentication
If you previously used this tool with read-only access, delete token.json and re-authenticate to grant move permissions.
License
MIT