Command-line tool to find and optionally delete duplicate files by content (SHA-256)

psamfinder — File duplicate finder

psamfinder is a lightweight CLI tool that recursively scans directories for files with identical content (using SHA-256 hashing) and helps you manage duplicates interactively.

## Requirements

- Python 3.8+
- hatchling (for building; referenced in pyproject.toml)

## Installation

From PyPI (recommended):

pip install psamfinder
# or for isolated CLI install (recommended)
pipx install psamfinder


# For development / from source
git clone https://github.com/psam-717/psamfinder.git
cd psamfinder
pip install -e .


## Running
- As a CLI (installed entry point):
  psamfinder scan <DIRECTORY> [--delete] [--dry-run] [-q]

- From source:
  python -m psamfinder scan <DIRECTORY> [--delete] [--dry-run] [-q]

Examples:
- Scan a directory and list duplicates:
  psamfinder scan C:\path\to\dir

- Scan and interactively delete duplicates (asks which file to keep per group):
  psamfinder scan C:\path\to\dir --delete

- Preview deletion without actually removing files (dry-run mode):
  psamfinder scan C:\path\to\dir --delete --dry-run
  # Shows "Would have deleted: ..." for each file that would be removed

- Show installed version:
  psamfinder --version

- Quiet scan (suppresses the scanning line):
  psamfinder scan C:\path\to\dir -q

## What the code does (line-level summary)

Files of interest:
- pyproject.toml
  - Project metadata: name `psamfinder`, version `0.3.2`, description "Command-line tool to find and optionally delete duplicate files by content (SHA-256)".
  - Entry point: `psamfinder = "psamfinder.cli:app"` (Typer app). The CLI supports a `--version` / `-V` option that prints the installed psamfinder version and exits.
  - Build system: hatchling.

- psamfinder/__main__.py
  - Imports `app` from psamfinder.cli and calls `sys.exit(app())` so `python -m psamfinder` runs the CLI.

- psamfinder/cli.py
  - Uses Typer to create a CLI app named `psamfinder` with help text.
  - Exposes a `scan` command that accepts:
    - directory: pathlib.Path (must exist, resolved, must be a directory)
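    - --delete / -d: boolean option to enable interactive deletion after listing
    - --dry-run: boolean option that, combined with --delete, previews deletions without removing any files
    - --quiet / -q: boolean option to suppress the scanning message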
  - Behavior:
    - Prints "Scanning: <directory> ..." unless -q is used (lines 51-52).
    - Calls `find_duplicates(str(directory))` from psamfinder.finder (line 54).
    - If no duplicates are found, prints "No duplicates found" and exits with code 0 (lines 56-58).
    - Otherwise calls `print_duplicates(duplicates)`, and if `--delete` was passed, asks for confirmation and calls `delete_duplicates(duplicates)`.

- psamfinder/finder.py
  - compute_hash(filepath)
    - Computes SHA-256 digest of file content in 4 KiB chunks (line 12 uses 4096 bytes).
    - Returns the hex digest string, or `None` if a PermissionError or FileNotFoundError occurs (lines 15-17). Errors are printed to stderr.
  - find_duplicates(directory)
    - Walks the directory recursively with os.walk (line 25).
    - For every file, builds its absolute path and computes its SHA-256 hash using `compute_hash`.
    - Collects paths in a dict mapping hash -> list of file paths (lines 24, 30-32).
    - Returns a dictionary of only the hashes that have 2 or more files (duplicates) (line 34).
    - Return type: dict[str, list[str]] where keys are hex SHA-256 strings and values are lists of file paths.
  - print_duplicates(duplicates)
    - Prints a header and then each duplicate group, showing the shared hash and the file paths (lines 42-47).
    - If duplicates is empty or falsy, prints "No duplicates found" and returns (lines 39-41).
  - delete_duplicates(duplicates)
    - For each duplicate group, lists the files with indices and prompts the user to enter the number of the file to keep (line 55), or type `skip` to keep all.
    - If a valid index is provided, removes all other files in that group using `os.remove` and prints the path deleted (lines 61-64).
    - If input is invalid (non-integer or out of bounds) it prints a message and skips deletion for that group (lines 65-68).
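
The hashing and grouping steps described for `compute_hash` and `find_duplicates` can be sketched as follows. This is a minimal illustration written from the summary above, not the project's actual source: the `chunk_size` parameter, the wording of the stderr message, and the use of `defaultdict` are assumptions.

```python
import hashlib
import os
import sys
from collections import defaultdict
from typing import Dict, List, Optional


def compute_hash(filepath: str, chunk_size: int = 4096) -> Optional[str]:
    """Return the hex SHA-256 digest of a file, or None if it cannot be read."""
    digest = hashlib.sha256()
    try:
        with open(filepath, "rb") as handle:
            # Read fixed-size chunks so large files are never held fully in memory.
            for chunk in iter(lambda: handle.read(chunk_size), b""):
                digest.update(chunk)
    except (PermissionError, FileNotFoundError) as exc:
        # Message wording is an assumption; the README only says errors go to stderr.
        print(f"Skipping {filepath}: {exc}", file=sys.stderr)
        return None
    return digest.hexdigest()


def find_duplicates(directory: str) -> Dict[str, List[str]]:
    """Map each SHA-256 digest to the paths sharing it, keeping groups of 2 or more."""
    by_hash: Dict[str, List[str]] = defaultdict(list)
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.abspath(os.path.join(root, name))
            file_hash = compute_hash(path)
            if file_hash is not None:  # unreadable files are skipped
                by_hash[file_hash].append(path)
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}
```

Calling `find_duplicates(".")` on a directory would return a dictionary such as `{"<hex digest>": ["/abs/a.txt", "/abs/sub/b.txt"]}`, with one entry per group of identical files.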

## Notes, gotchas, and suggestions
- compute_hash reads files in 4 KiB chunks, so memory use stays constant regardless of file size; a larger buffer may be faster on big files.
- Files that cannot be read due to permissions or that disappear during scanning are skipped and reported to stderr by compute_hash (the hash function returns None on these errors).
- Deletion is interactive and destructive; always use --dry-run first when testing.
- Make backups or use version control before running --delete without --dry-run.
- --dry-run only affects deletion; scanning and listing always happen normally.
- With `--delete`, the CLI first asks for confirmation, then prompts for the single file to keep in each group.
- The tool identifies duplicates strictly by file content hash. Files with identical content but different metadata (timestamps, permissions, names) are considered duplicates.
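
The difference between a real deletion and a dry run, as described in the notes above, can be sketched with a small non-interactive helper. The function name `remove_all_but`, its parameters, and the "Deleted:" message are hypothetical; only the "Would have deleted:" wording comes from this README. Unlike the interactive command, which skips a group on invalid input, this sketch raises `IndexError` to keep the example short.

```python
import os
from typing import List


def remove_all_but(paths: List[str], keep_index: int, dry_run: bool = False) -> List[str]:
    """Remove every path except paths[keep_index]; return the affected paths."""
    if not 0 <= keep_index < len(paths):
        raise IndexError(f"keep_index {keep_index} is out of range")
    affected = []
    for index, path in enumerate(paths):
        if index == keep_index:
            continue  # this is the copy the caller chose to keep
        if dry_run:
            # Report only; the file system is left untouched.
            print(f"Would have deleted: {path}")
        else:
            os.remove(path)
            print(f"Deleted: {path}")
        affected.append(path)
    return affected
```

Running it first with `dry_run=True` and comparing the returned list against expectations mirrors the recommended `--delete --dry-run` workflow.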

## Packaging
- The project is configured with pyproject.toml; the package includes the `psamfinder` module and exposes a console script in [project.scripts]. To build a wheel, run `python -m build` (requires the `build` package) or `hatch build`.

## Extending / Contributing
- Possible improvements:
  - Add unit tests for compute_hash and find_duplicates.
  - Add options to automatically pick which files to keep (e.g., keep newest, keep oldest, keep by pattern) for non-interactive deletion. Size cannot serve as a criterion, because files with identical content always have the same size.
  - Add progress reporting for large scans, or parallel hashing for performance.
- Contributions via pull requests are welcome. Add tests and update the README with usage examples for new features.
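
As a starting point for the suggested non-interactive keep policies, the sketch below shows two possible selectors that return the index of the file to keep within one duplicate group. Both function names are hypothetical and nothing like them is claimed to exist in psamfinder today; they only illustrate the idea using the standard library.

```python
import fnmatch
import os
from typing import List, Optional


def pick_newest(paths: List[str]) -> int:
    """Return the index of the most recently modified path."""
    return max(range(len(paths)), key=lambda i: os.path.getmtime(paths[i]))


def pick_by_pattern(paths: List[str], pattern: str) -> Optional[int]:
    """Return the index of the first path matching a glob pattern, else None."""
    for index, path in enumerate(paths):
        if fnmatch.fnmatch(path, pattern):
            return index
    return None
```

A selector that returns `None` leaves the decision open, so a caller could fall back to the interactive prompt or skip the group.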

## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Contact
Author:
- Marvinphil Annorbah (psam) (GitHub: [@psam-717](https://github.com/psam-717))
