Command-line tool to find and optionally delete duplicate files by content (SHA-256)

psamfinder — File duplicate finder

psamfinder is a lightweight CLI tool that recursively scans directories for exact duplicate files (using SHA-256 hashing) and near-duplicate images (using perceptual hashing when enabled).

Requirements

  • Python 3.8+
  • hatchling (for building, referenced in pyproject.toml)

Installation

From PyPI (recommended):

pip install psamfinder
# or for isolated CLI install (recommended)
pipx install psamfinder

# With fuzzy (perceptual) image duplicate detection support:
pip install "psamfinder[fuzzy]"
# or
pipx install "psamfinder[fuzzy]"


# For development / from source
git clone https://github.com/psam-717/psamfinder.git
cd psamfinder
pip install -e .
pip install -e ".[fuzzy]" # with fuzzy image support


## Running
- Basic scan (exact duplicates only)
  psamfinder scan <DIRECTORY>

- Scan + interactive deletion
  psamfinder scan <DIRECTORY> --delete

- Dry-run deletion preview
  psamfinder scan <DIRECTORY> --delete --dry-run

- Quiet mode (no "Scanning..." message)
  psamfinder scan <DIRECTORY> -q

- Fuzzy/perceptual image duplicate detection (near-duplicates such as resized or cropped copies)
  psamfinder scan <DIRECTORY> --fuzzy-images --similarity-threshold 0.82

- Help choose a good similarity threshold by analyzing your images
  psamfinder threshold <DIRECTORY> [--max-images 300] [--verbose]

Examples:
- List exact duplicates
  psamfinder scan ~/Photos

- Find near-duplicate photos (good for resized/edited versions)
  psamfinder scan ~/Photos --fuzzy-images --similarity-threshold 0.80

- Analyze similarity distribution to pick a threshold
  psamfinder threshold ~/Photos --max-images 500 --verbose

- Dry-run deletion of exact duplicates
  psamfinder scan ~/Downloads --delete --dry-run

- Show version
  psamfinder --version

## How the code works (high-level overview)

**Key files & responsibilities**

- `pyproject.toml`
  - Project metadata, version (now 0.3.6), MIT license
  - Console entry point: `psamfinder = "psamfinder.cli:app"`
  - Optional `[fuzzy]` extra: `imagehash` + `pillow` for perceptual image detection

- `psamfinder/cli.py`
  - Typer-based CLI with two commands:
    - `scan`: finds duplicates (exact or fuzzy), lists them, offers interactive deletion
      Flags: `--delete`, `--dry-run`, `--quiet`, `--fuzzy-images`, `--similarity-threshold`
    - `threshold`: analyzes pairwise image similarities to help choose a good fuzzy threshold
      Flags: `--max-images`, `--quiet`, `--verbose`
  - `--version` / `-V` shows package version

- `psamfinder/finder.py`
  - `compute_hash()`: SHA-256 of file content (read in 4 KiB chunks); files that raise permission/IO errors are skipped
  - `find_duplicates(directory, fuzzy_images=False, similarity_threshold=0.80)`
    - **Exact mode** (default): groups files by identical SHA-256 hash and returns `List[List[str]]`
    - **Fuzzy mode** (`--fuzzy-images`): uses perceptual hashing (`phash`) on images only
      - Groups near-duplicates using union-find + Hamming distance threshold
      - Returns `List[List[str]]` of similar-image groups
  - `print_duplicates(dupe_groups: List[List[str]])`: clean grouped output
  - `delete_duplicates(dupe_groups: List[List[str]], dry_run=False)`: interactive keep/skip per group
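
The exact-mode flow described above (chunked SHA-256, then grouping by digest) can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the actual contents of `finder.py`: the helper name `find_exact_duplicates` is hypothetical, and the real `compute_hash()` may differ in signature and error reporting.

```python
import hashlib
import os
from collections import defaultdict
from typing import Dict, List, Optional

CHUNK_SIZE = 4096  # 4 KiB, the chunk size mentioned in the overview


def compute_hash(path: str) -> Optional[str]:
    """Return the SHA-256 hex digest of a file, or None if it cannot be read."""
    digest = hashlib.sha256()
    try:
        with open(path, "rb") as handle:
            # Read fixed-size chunks so large files never sit fully in memory.
            for chunk in iter(lambda: handle.read(CHUNK_SIZE), b""):
                digest.update(chunk)
    except OSError:  # covers PermissionError and other I/O failures
        return None
    return digest.hexdigest()


def find_exact_duplicates(directory: str) -> List[List[str]]:
    """Hypothetical helper: group files under `directory` by identical content."""
    by_hash: Dict[str, List[str]] = defaultdict(list)
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            file_hash = compute_hash(path)
            if file_hash is not None:  # unreadable files are skipped
                by_hash[file_hash].append(path)
    # Only digests shared by two or more files count as duplicate groups.
    return [sorted(paths) for paths in by_hash.values() if len(paths) > 1]
```

Because grouping is keyed on the digest alone, two files with different names or timestamps still land in the same group when their bytes match.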

**Main behavioral changes**
- Duplicate groups are now consistently returned and handled as `List[List[str]]` (no more hash dict)
- Fuzzy mode requires `pip install "psamfinder[fuzzy]"` and only processes common image formats
- New `threshold` command helps tune `--similarity-threshold` by showing similar pairs and distribution
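
To give a feel for what a pairwise similarity analysis like the `threshold` command's might involve, here is a small sketch. It assumes 64-bit perceptual hashes represented as plain integers and defines similarity as the fraction of matching bits; the function names, the decile bucketing, and that similarity formula are illustrative assumptions, not the tool's actual implementation.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple

HASH_BITS = 64  # assumption: 64-bit perceptual hashes


def similarity(hash_a: int, hash_b: int, bits: int = HASH_BITS) -> float:
    """Fraction of matching bits between two hashes (1.0 means identical)."""
    hamming = bin(hash_a ^ hash_b).count("1")
    return 1.0 - hamming / bits


def pairwise_similarities(hashes: Dict[str, int]) -> List[Tuple[str, str, float]]:
    """Score every pair of images, most similar pairs first."""
    pairs = [
        (a, b, similarity(hashes[a], hashes[b]))
        for a, b in combinations(sorted(hashes), 2)
    ]
    return sorted(pairs, key=lambda item: item[2], reverse=True)


def decile_distribution(pairs: List[Tuple[str, str, float]]) -> Dict[int, int]:
    """Count pairs per similarity decile (bucket 9 covers 0.9 through 1.0)."""
    return dict(Counter(min(int(score * 10), 9) for _a, _b, score in pairs))
```

Looking at where the pair counts thin out between the high-similarity buckets and the rest is one way to pick a value for `--similarity-threshold`. Note that the number of pairs grows quadratically with the number of images, which is presumably why the command offers `--max-images`.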

## Important notes & gotchas
- Always test with `--dry-run` first: deletion is interactive and permanent
- Make backups before using `--delete` without `--dry-run`
- Exact mode ignores metadata (only content matters)
- Fuzzy mode is perceptual: good for resized/cropped/recompressed images, but may include false positives depending on threshold
- `threshold` command is read-only (no deletion)
- Skipped files (permissions, corrupt images, etc.) are logged to stderr
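
The false-positive note for fuzzy mode is easier to reason about with a concrete picture of union-find grouping over Hamming distances. The sketch below is an assumption-laden illustration: it uses integers as stand-ins for 64-bit perceptual hashes, treats similarity as the fraction of matching bits, and uses the hypothetical name `group_similar`. With the `imagehash` library, real hashes would come from `imagehash.phash(Image.open(path))`, and subtracting two hashes yields their Hamming distance.

```python
from typing import Dict, List

HASH_BITS = 64  # assumption: 64-bit perceptual hashes


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def group_similar(
    hashes: Dict[str, int],
    similarity_threshold: float = 0.80,
    bits: int = HASH_BITS,
) -> List[List[str]]:
    """Hypothetical helper: union-find grouping of images above the threshold."""
    paths = sorted(hashes)
    parent = {path: path for path in paths}

    def find(path: str) -> str:
        # Walk up to the root, halving the path as we go.
        while parent[path] != path:
            parent[path] = parent[parent[path]]
            path = parent[path]
        return path

    def union(a: str, b: str) -> None:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for index, a in enumerate(paths):
        for b in paths[index + 1:]:
            score = 1.0 - hamming(hashes[a], hashes[b]) / bits
            if score >= similarity_threshold:
                union(a, b)

    groups: Dict[str, List[str]] = {}
    for path in paths:
        groups.setdefault(find(path), []).append(path)
    # Singletons are not duplicates, so only multi-member groups are returned.
    return [members for members in groups.values() if len(members) > 1]
```

One consequence of this style of grouping is chaining: if A is close to B and B is close to C, all three end up in one group even when A and C are below the threshold against each other. Raising the threshold reduces that effect at the cost of missing more heavily edited copies.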

## Packaging
Configured with `pyproject.toml` + hatchling.  
Build: `hatch build` or `python -m build`

## Contributing & future ideas
- Add tests (hashing, grouping, fuzzy logic, deletion flows)
- Auto-keep rules (newest/largest/shortest-path/regex)
- Progress bar or parallel processing for large directories
- JSON/CSV report export
- Better error handling & summary stats

Pull requests are welcome. Please include tests and update the README examples for new features.

## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Contact
Author:
- Marvinphil Annorbah (psam) (GitHub: [@psam-717](https://github.com/psam-717))
