Command-line tool to find and optionally delete duplicate files by content (SHA-256)

psamfinder — File duplicate finder

psamfinder is a lightweight CLI tool that recursively scans directories for files with identical content (using SHA-256 hashing) and helps you manage duplicates interactively.

Requirements

  • Python 3.8+
  • hatchling (for building, referenced in pyproject.toml)

Installation

From PyPI (recommended):

pip install psamfinder
# or, for an isolated CLI install:
pipx install psamfinder


# For development / from source
git clone https://github.com/psam-717/psamfinder.git
cd psamfinder
pip install -e .


## Running
- As a CLI (installed entry point):
  psamfinder scan <DIRECTORY> [--delete] [--dry-run] [-q]

- From source:
  python -m psamfinder scan <DIRECTORY> [--delete] [--dry-run] [-q]

Examples:
- Scan a directory and list duplicates:
  psamfinder scan C:\path\to\dir

- Scan and interactively delete duplicates (asks which file to keep per group):
  psamfinder scan C:\path\to\dir --delete

- Preview deletion without actually removing files (dry-run mode):
  psamfinder scan C:\path\to\dir --delete --dry-run
  # Shows "Would have deleted: ..." for each file that would be removed

- Quiet scan (suppresses the scanning line):
  psamfinder scan C:\path\to\dir -q

## What the code does (line-level summary)

Files of interest:
- pyproject.toml
  - Project metadata: name `psamfinder`, version `0.1.0`, description "File duplicate finder".
  - Entry point: `psamfinder = "psamfinder.cli:app"` (Typer app).
  - Build system: hatchling.

- psamfinder/__main__.py
  - Imports `app` from psamfinder.cli and calls `sys.exit(app())` so `python -m psamfinder` runs the CLI.

- psamfinder/cli.py
  - Uses Typer to create a CLI app named `psamfinder` with help text.
  - Exposes a `scan` command that accepts:
    - directory: pathlib.Path (must exist, resolved, must be a directory)
    - --delete / -d: boolean option to enable interactive deletion after listing
    - --quiet / -q: boolean option to suppress the scanning message
  - Behavior:
    - Prints "Scanning: <directory> ..." unless -q is used (lines 51-52).
    - Calls `find_duplicates(str(directory))` from psamfinder.finder (line 54).
    - If no duplicates are found, prints "No duplicates found" and exits with code 0 (lines 56-58).
    - Otherwise calls `print_duplicates(duplicates)`, and if `--delete` was passed, asks for confirmation and calls `delete_duplicates(duplicates)`.

- psamfinder/finder.py
  - compute_hash(filepath)
    - Computes SHA-256 digest of file content in 4 KiB chunks (line 12 uses 4096 bytes).
    - Returns the hex digest string, or `None` if a PermissionError or FileNotFoundError occurs (lines 15-17). Errors are printed to stderr.
  - find_duplicates(directory)
    - Walks the directory recursively with os.walk (line 25).
    - For every file, builds its absolute path and computes its SHA-256 hash using `compute_hash`.
    - Collects paths in a dict mapping hash -> list of file paths (lines 24, 30-32).
    - Returns a dictionary of only the hashes that have 2 or more files (duplicates) (line 34).
    - Return type: dict[str, list[str]] where keys are hex SHA-256 strings and values are lists of file paths.
  - print_duplicates(duplicates)
    - Nicely prints a header and then each duplicate group showing the shared hash and the file paths (lines 42-47).
    - If duplicates is empty or falsy, prints "No duplicates found" and returns (lines 39-41).
  - delete_duplicates(duplicates)
    - For each duplicate group, lists the files with indices and prompts the user to enter the number of the file to keep (line 55), or type `skip` to keep all.
    - If a valid index is provided, removes all other files in that group using `os.remove` and prints the path deleted (lines 61-64).
    - If input is invalid (non-integer or out of bounds) it prints a message and skips deletion for that group (lines 65-68).
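
The hashing and grouping steps described above can be sketched as follows. This is a minimal reconstruction from the summary, not the package's actual source: the function names `compute_hash` and `find_duplicates` come from the description, while the `chunk_size` parameter and the wording of the error message are assumptions made for illustration.

```python
import hashlib
import os
import sys
from collections import defaultdict


def compute_hash(filepath, chunk_size=4096):
    """Return the hex SHA-256 digest of a file, or None if it cannot be read.

    chunk_size is an assumed parameter; the summary only states that
    4096-byte chunks are used.
    """
    sha256 = hashlib.sha256()
    try:
        with open(filepath, "rb") as f:
            # Read fixed-size chunks so large files never sit fully in memory.
            for chunk in iter(lambda: f.read(chunk_size), b""):
                sha256.update(chunk)
    except (PermissionError, FileNotFoundError) as exc:
        # The message text is an assumption; the summary only says errors go to stderr.
        print(f"Skipping {filepath}: {exc}", file=sys.stderr)
        return None
    return sha256.hexdigest()


def find_duplicates(directory):
    """Map each SHA-256 digest to the paths sharing it, keeping groups of 2+."""
    by_hash = defaultdict(list)
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.abspath(os.path.join(root, name))
            digest = compute_hash(path)
            if digest is not None:  # unreadable files are skipped
                by_hash[digest].append(path)
    # Only digests shared by two or more files count as duplicates.
    return {h: paths for h, paths in by_hash.items() if len(paths) >= 2}
```

Under these assumptions, `find_duplicates(".")` would return a dict whose keys are hex digests and whose values are lists of absolute paths, matching the `dict[str, list[str]]` shape noted above.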

## Notes, gotchas, and suggestions
- compute_hash uses a 4 KiB read buffer; this is a reasonable trade-off between memory usage and speed. 
- Files that cannot be read due to permissions or that disappear during scanning are skipped and reported to stderr by compute_hash (the hash function returns None on these errors).
- Deletion is interactive and destructive; always use --dry-run first when testing.
- Use backups or version control before running --delete without --dry-run.
- --dry-run only affects deletion; scanning and listing always happen normally.
- With `--delete`, the CLI first asks for confirmation, then asks which single file to keep in each group.
- Duplicates are detected strictly by content hash: files with identical content but different metadata (timestamps, permissions, names) are still considered duplicates.
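
To illustrate how a dry-run guard can sit in front of a destructive step, here is a small sketch. `remove_others` is a hypothetical helper written for this example and is not part of psamfinder; the "Would have deleted:" wording follows the dry-run output mentioned earlier, while the "Deleted:" message is an assumption.

```python
import os


def remove_others(paths, keep_index, dry_run=True):
    """Delete every path except paths[keep_index]; only report when dry_run is set.

    Hypothetical helper for illustration; returns the paths that were
    (or would have been) removed.
    """
    if not 0 <= keep_index < len(paths):
        raise IndexError("keep_index is out of range for this group")
    affected = []
    for i, path in enumerate(paths):
        if i == keep_index:
            continue  # this is the copy the user chose to keep
        if dry_run:
            print(f"Would have deleted: {path}")
        else:
            os.remove(path)
            print(f"Deleted: {path}")  # assumed message wording
        affected.append(path)
    return affected
```

Defaulting `dry_run` to `True` in this sketch means a caller has to opt in explicitly before anything is removed, which mirrors the advice above to preview first.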

## Packaging
- The project is configured with pyproject.toml; the package includes the `psamfinder` module and exposes a console script in [project.scripts]. Use `python -m build` or `hatch build` in a properly configured environment to build a wheel.

## Extending / Contributing
- Possible improvements:
  - Add unit tests for compute_hash and find_duplicates.
  - Add options to automatically pick which files to keep (e.g., keep newest, keep oldest, keep by pattern) for non-interactive deletion. Size is not a useful criterion, since files with identical content always have the same size.
  - Add progress reporting for large scans, or parallel hashing for performance.
- Contributions via pull requests are welcome. Add tests and update the README with usage examples for new features.
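
As one possible direction for the automatic keep policy suggested above, the sketch below chooses which file to keep by modification time. `pick_keeper` and its `policy` values are hypothetical names for illustration; nothing like this exists in psamfinder today as far as this README describes.

```python
import os


def pick_keeper(paths, policy="newest"):
    """Return the path to keep from one duplicate group, by modification time.

    Hypothetical helper: policy names "newest" and "oldest" are assumptions.
    """
    if not paths:
        raise ValueError("a duplicate group must contain at least one path")
    if policy == "newest":
        return max(paths, key=os.path.getmtime)
    if policy == "oldest":
        return min(paths, key=os.path.getmtime)
    raise ValueError(f"unknown policy: {policy!r}")
```

A non-interactive deletion mode could call a function like this once per group and then remove every other path, replacing the per-group prompt.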

## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Contact
Author:
- Marvinphil Annorbah (psam) (GitHub: [@psam-717](https://github.com/psam-717))
