SHA256 hash-based file renaming for privacy and deduplication

namecrawler

Rename files using their SHA256 content hash, creating deterministic, collision-resistant, privacy-preserving filenames.

Installation

pip install namecrawler

Quick Start

# Rename single file
namecrawler document.pdf

# Rename multiple files
namecrawler *.jpg

# Rename files in a directory
namecrawler ~/Documents/*.pdf

Features

  • Deterministic: Same content = same filename (every time)
  • Collision-Resistant: SHA256 makes accidental collisions virtually impossible
  • Privacy-Preserving: Original filenames not exposed
  • Deduplication-Friendly: Identical files get same hash (easy to find duplicates)
  • Format-Preserving: Original file extensions maintained
  • Fast: Efficient chunk-based hashing (8KB chunks)
  • Safe: Only renames files that exist

Use Cases

1. Privacy Protection

Hide sensitive information in original filenames:

# Before: SSN_123-45-6789_tax_return_2024.pdf
# After:  <64-character SHA256 hex digest>.pdf
namecrawler SSN_123-45-6789_tax_return_2024.pdf

2. Deduplication

Find duplicate files easily:

namecrawler ~/Downloads/*.jpg
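# Identical files get the same hash name, so copies stored in
# different directories show up as repeated filenames.
# Within one directory, identical files map to the same target name,
# so check how collisions are handled before running this on files you care about.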
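
If you would rather list duplicates before renaming anything, the same idea can be sketched with the standard library alone. This is an illustrative sketch, not part of namecrawler: `file_digest` and `find_duplicates` are hypothetical helper names, and the `*.jpg` default pattern is only an example.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 8192) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(directory: Path, pattern: str = "*.jpg") -> dict:
    """Map each digest to the files sharing it, keeping only groups of two or more."""
    groups = defaultdict(list)
    for path in sorted(directory.glob(pattern)):
        if path.is_file():
            groups[file_digest(path)].append(path)
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}
```

Calling `find_duplicates(Path.home() / "Downloads")` returns one entry per set of byte-identical files, which you can review before deciding which copies to keep.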

3. Content-Based Organization

Files with the same content are grouped automatically:

namecrawler backup_folder/*
# Byte-identical copies all get the same hash; versions whose content differs get different hashes

4. Archival Storage

Create immutable, content-addressed archives:

namecrawler archive/*.* 
# Filenames never change if content doesn't change
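
Because each name is derived from the content, an archive can later be checked for files that were modified or corrupted after renaming. The sketch below assumes every file in the directory is named `{hash}{extension}` with a single extension; `file_digest` and `find_modified` are hypothetical helpers, not namecrawler functions.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 8192) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_modified(archive: Path) -> list:
    """Return files whose name stem no longer matches their content hash."""
    mismatched = []
    for path in sorted(archive.iterdir()):
        # path.stem is the filename without its final extension
        if path.is_file() and path.stem != file_digest(path):
            mismatched.append(path)
    return mismatched
```

An empty result means every file still matches the hash in its name.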

How It Works

  1. Reads file content in 8KB chunks (memory efficient)
  2. Computes SHA256 hash of the entire content
  3. Preserves file extension from original filename
  4. Renames file to {hash}{extension}

Example:

# Original file: "meeting_notes_2024.txt"
# Content hash: "a1b2c3d4e5f6..."
# New filename: "a1b2c3d4e5f6....txt" (the full 64-character hash followed by ".txt")
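
The four steps above can be sketched with `hashlib` and `pathlib`. This is a minimal illustration of the described approach, not namecrawler's actual source: the function names are hypothetical, and refusing to overwrite an existing target is an assumed policy.

```python
import hashlib
from pathlib import Path

CHUNK_SIZE = 8 * 1024  # 8 KB, matching the chunk size described above


def sha256_of_file(path: Path) -> str:
    """Return the hex SHA256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # iter() with a sentinel stops when read() returns b"" at end of file
        for chunk in iter(lambda: handle.read(CHUNK_SIZE), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hashed_name(path: Path) -> Path:
    """Return the {hash}{extension} path next to the original, without renaming."""
    return path.with_name(sha256_of_file(path) + path.suffix)


def rename_to_hash(path: Path) -> Path:
    """Rename a file to its content hash and return the new path."""
    target = hashed_name(path)
    if target != path and target.exists():
        # Assumed policy for this sketch: never overwrite an existing file
        raise FileExistsError(target)
    return path.rename(target)
```

Reading in chunks keeps memory use constant regardless of file size, and `hashed_name` gives a way to preview the result before committing to a rename.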

API Usage

Use as a Python library:

from namecrawler.cli import sha256sum, rename_file
from pathlib import Path

# Get hash of a file
file_path = Path("document.pdf")
file_hash = sha256sum(file_path)
print(f"SHA256: {file_hash}")

# Rename using hash
new_path = rename_file(file_path)
print(f"Renamed to: {new_path}")

Comparison with Other Tools

Tool             Method        Reversible  Privacy  Speed
namecrawler      SHA256 hash   No          High     Fast
Manual rename    User input    Yes         ❌ Low   ❌ Slow
UUID tools       Random UUID   No          High     Fast
Timestamp tools  Current time  No          ❌ Low   Fast

Advantages over alternatives:

  • More meaningful than UUIDs (hash reveals if content changed)
  • More private than timestamps (no metadata leakage)
  • Deterministic (unlike random UUIDs)
  • Built-in deduplication (same content = same hash)
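
The difference between a deterministic hash and a random UUID can be seen in a few lines of standard-library Python. The sample content here is made up purely for illustration.

```python
import hashlib
import uuid

content = b"quarterly report, final draft"

# Hashing the same bytes twice always yields the same digest
first = hashlib.sha256(content).hexdigest()
second = hashlib.sha256(content).hexdigest()

# Changing the content, even by one byte, yields a different digest
edited = hashlib.sha256(content + b"!").hexdigest()

# Random UUIDs differ on every call, regardless of content
uuid_a = uuid.uuid4().hex
uuid_b = uuid.uuid4().hex

print(first == second)   # True
print(first == edited)   # False
print(uuid_a == uuid_b)  # False (with overwhelming probability)
```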

Requirements

  • Python 3.8+
  • No external dependencies (uses stdlib only)

Limitations

  • Not reversible: You cannot recover the original filename from the hash
  • Same content = same name: Files with identical content get identical names
  • No metadata preservation: Original filename lost (keep a mapping if needed)

Advanced Usage

Keep a rename log

# Create a simple mapping log
# (assumes namecrawler prints the new filename to stdout)
for file in *.pdf; do
  echo "$file -> $(namecrawler "$file")" >> rename_log.txt
done

Undo by using a log

namecrawler doesn't include undo (by design: hashes are one-way), but you can build your own from a saved mapping:

import json
from pathlib import Path

from namecrawler.cli import sha256sum

# Before renaming, record hashed name -> original name.
# Note: files with identical content share one hashed name,
# so only the last original name is kept for each hash.
log = {}
for file in Path('.').glob('*.pdf'):
    hash_name = sha256sum(file) + file.suffix
    log[hash_name] = str(file)

with open('rename_map.json', 'w') as f:
    json.dump(log, f, indent=2)

# Later, restore using the log
with open('rename_map.json') as f:
    log = json.load(f)

for hash_name, original in log.items():
    hashed = Path(hash_name)
    if hashed.exists():  # skip entries that were never renamed
        hashed.rename(original)

Security Note

SHA256 hashes are cryptographically secure but not secret. If someone has the original file, they can compute the same hash. Use namecrawler for:

  • Privacy (hiding original filenames)
  • Deduplication (finding identical files)
  • Content-addressing (organizing by content)

Don't use for:

  • Security (anyone with the original file can compute the same hash and confirm a match)
  • Encryption (filenames are not encrypted)
  • Authentication (hashes alone don't prove ownership)

License

MIT License - see LICENSE file

Author

Luke Steuber


Fun fact: The name "namecrawler" reflects how the tool "crawls" through file content to generate a name, rather than using metadata or user input.
