Command-line tool to find and optionally delete duplicate files by content (SHA-256)
psamfinder — File duplicate finder
psamfinder is a lightweight CLI tool that recursively scans directories for exact duplicate files (using SHA-256 hashing) and near-duplicate images (using perceptual hashing when enabled).
## Requirements
- Python 3.8+
- hatchling (for building, referenced in pyproject.toml)
## Installation
From PyPI (recommended):
pip install psamfinder
# or for isolated CLI install (recommended)
pipx install psamfinder
# With fuzzy (perceptual) image duplicate detection support:
pip install "psamfinder[fuzzy]"
# or
pipx install "psamfinder[fuzzy]"
# For development / from source
git clone https://github.com/psam-717/psamfinder.git
cd psamfinder
pip install -e .
pip install -e ".[fuzzy]" # with fuzzy image support
## Running
- Basic scan (exact duplicates only)
psamfinder scan <DIRECTORY>
- Scan + interactive deletion
psamfinder scan <DIRECTORY> --delete
- Dry-run deletion preview
psamfinder scan <DIRECTORY> --delete --dry-run
- Quiet mode (no "Scanning..." message)
psamfinder scan <DIRECTORY> -q
- Fuzzy/perceptual image duplicate detection (near-duplicates, resized/cropped, etc.)
psamfinder scan <DIRECTORY> --fuzzy-images --similarity-threshold 0.82
- Help choose a good similarity threshold by analyzing your images
psamfinder threshold <DIRECTORY> [--max-images 300] [--verbose]
Examples:
- List exact duplicates
psamfinder scan ~/Photos
- Find near-duplicate photos (good for resized/edited versions)
psamfinder scan ~/Photos --fuzzy-images --similarity-threshold 0.80
- Analyze similarity distribution to pick a threshold
psamfinder threshold ~/Photos --max-images 500 --verbose
- Dry-run deletion of exact duplicates
psamfinder scan ~/Downloads --delete --dry-run
- Show version
psamfinder --version
## How the code works (high-level overview)
**Key files & responsibilities**
- `pyproject.toml`
  - Project metadata, version (now 0.3.6), MIT license
  - Console entry point: `psamfinder = "psamfinder.cli:app"`
  - Optional `[fuzzy]` extra: `imagehash` + `pillow` for perceptual image detection
- `psamfinder/cli.py`
  - Typer-based CLI with two commands:
    - `scan`: finds duplicates (exact or fuzzy), lists them, offers interactive deletion.
      Flags: `--delete`, `--dry-run`, `--quiet`, `--fuzzy-images`, `--similarity-threshold`
    - `threshold`: analyzes pairwise image similarities to help choose a good fuzzy threshold.
      Flags: `--max-images`, `--quiet`, `--verbose`
  - `--version` / `-V` shows package version
- `psamfinder/finder.py`
  - `compute_hash()`: SHA-256 of file content (read in 4 KiB chunks); skips files on permission/IO errors
  - `find_duplicates(directory, fuzzy_images=False, similarity_threshold=0.80)`
    - **Exact mode** (default): groups files by identical SHA-256 hash → `List[List[str]]`
    - **Fuzzy mode** (`--fuzzy-images`): uses perceptual hashing (`phash`) on images only
      - Groups near-duplicates using union-find + Hamming distance threshold
      - Returns `List[List[str]]` of similar-image groups
  - `print_duplicates(dupe_groups: List[List[str]])`: clean grouped output
  - `delete_duplicates(dupe_groups: List[List[str]], dry_run=False)`: interactive keep/skip per group
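The exact-mode flow described above (chunked SHA-256, then grouping by hash) can be sketched roughly as follows. This is a minimal illustration, not the code in `psamfinder/finder.py`: the names `sha256_of_file` and `group_exact_duplicates` are hypothetical, and the real `compute_hash()` / `find_duplicates()` may differ in details such as logging and ordering.

```python
import hashlib
import os
from collections import defaultdict
from typing import Dict, List, Optional


def sha256_of_file(path: str, chunk_size: int = 4096) -> Optional[str]:
    """Return the hex SHA-256 of a file, or None if it cannot be read."""
    digest = hashlib.sha256()
    try:
        with open(path, "rb") as handle:
            # Read in fixed-size chunks so large files never sit fully in memory.
            for chunk in iter(lambda: handle.read(chunk_size), b""):
                digest.update(chunk)
    except OSError:
        # Covers permission and I/O errors; the caller skips this file.
        return None
    return digest.hexdigest()


def group_exact_duplicates(directory: str) -> List[List[str]]:
    """Walk a directory tree and group files with identical content."""
    by_hash: Dict[str, List[str]] = defaultdict(list)
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            file_hash = sha256_of_file(path)
            if file_hash is not None:
                by_hash[file_hash].append(path)
    # Only hashes shared by two or more files count as duplicates.
    return [sorted(paths) for paths in by_hash.values() if len(paths) > 1]
```

Because only file content is hashed, two files with different names or timestamps still land in the same group.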
**Main behavioral changes**
- Duplicate groups are now consistently returned and handled as `List[List[str]]` (no more hash dict)
- Fuzzy mode requires `pip install "psamfinder[fuzzy]"` (quoted so shells such as zsh do not expand the brackets) and only processes common image formats
- New `threshold` command helps tune `--similarity-threshold` by showing similar pairs and distribution
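To illustrate how fuzzy grouping with union-find and a Hamming-distance threshold can work, here is a dependency-free sketch. It uses plain 64-bit integers as stand-ins for perceptual hashes. The function names and the mapping `similarity = 1 - distance / hash_bits` are assumptions made for this example; psamfinder's own conversion from `--similarity-threshold` to a distance cutoff may differ.

```python
from typing import Dict, List


def hamming_distance(a: int, b: int) -> int:
    """Count the bits that differ between two integer hashes."""
    return bin(a ^ b).count("1")


def group_similar(
    hashes: Dict[str, int],
    similarity_threshold: float = 0.80,
    hash_bits: int = 64,
) -> List[List[str]]:
    """Group paths whose hashes are linked by pairs above the threshold."""
    paths = sorted(hashes)
    parent = {path: path for path in paths}

    def find(path: str) -> str:
        # Path halving keeps the trees shallow.
        while parent[path] != path:
            parent[path] = parent[parent[path]]
            path = parent[path]
        return path

    def union(first: str, second: str) -> None:
        root_first, root_second = find(first), find(second)
        if root_first != root_second:
            parent[root_second] = root_first

    # Compare every pair once; similar pairs are merged into one set.
    for i, first in enumerate(paths):
        for second in paths[i + 1:]:
            distance = hamming_distance(hashes[first], hashes[second])
            # Assumed mapping: similarity = 1 - distance / hash_bits.
            if 1.0 - distance / hash_bits >= similarity_threshold:
                union(first, second)

    groups: Dict[str, List[str]] = {}
    for path in paths:
        groups.setdefault(find(path), []).append(path)
    return [members for members in groups.values() if len(members) > 1]
```

Note that union-find makes grouping transitive: if A is similar to B and B to C, all three end up in one group even when A and C alone fall below the threshold. This is one way a low threshold can produce surprisingly large groups.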
## Important notes & gotchas
- Always test with `--dry-run` — deletion is interactive and permanent
- Make backups before using `--delete` without `--dry-run`
- Exact mode ignores metadata (only content matters)
- Fuzzy mode is perceptual — good for resized/cropped/recompressed images, but may include false positives depending on threshold
- `threshold` command is read-only (no deletion)
- Skipped files (permissions, corrupt images, etc.) are logged to stderr
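Since the false-positive rate in fuzzy mode depends on the threshold, it can help to see how many pairs each candidate value would match before deleting anything. The sketch below shows the general idea behind that kind of read-only analysis. It is not the implementation of the `threshold` command: the function names are hypothetical, hashes are represented as plain 64-bit integers, and similarity is again assumed to be `1 - distance / hash_bits`.

```python
from itertools import combinations
from typing import Dict, Iterable, List, Tuple

Pair = Tuple[float, str, str]


def pairwise_similarities(hashes: Dict[str, int], hash_bits: int = 64) -> List[Pair]:
    """Return (similarity, path_a, path_b) for every pair, most similar first."""
    pairs: List[Pair] = []
    for (path_a, hash_a), (path_b, hash_b) in combinations(sorted(hashes.items()), 2):
        distance = bin(hash_a ^ hash_b).count("1")  # Hamming distance
        pairs.append((1.0 - distance / hash_bits, path_a, path_b))
    return sorted(pairs, reverse=True)


def count_matches(pairs: Iterable[Pair], thresholds: Iterable[float]) -> Dict[float, int]:
    """Count how many pairs each candidate threshold would treat as near-duplicates."""
    pair_list = list(pairs)
    return {
        threshold: sum(1 for similarity, _a, _b in pair_list if similarity >= threshold)
        for threshold in thresholds
    }
```

A sharp jump in the match count between two neighboring thresholds is a hint that the lower one is starting to pull in unrelated images.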
## Packaging
Configured with `pyproject.toml` + hatchling.
Build: `hatch build` or `python -m build`
## Contributing & future ideas
- Add tests (hashing, grouping, fuzzy logic, deletion flows)
- Auto-keep rules (newest/largest/shortest-path/regex)
- Progress bar or parallel processing for large directories
- JSON/CSV report export
- Better error handling & summary stats
Pull requests welcome — include tests and update README examples for new features.
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Contact
Author:
- Marvinphil Annorbah (psam) (GitHub: [@psam-717](https://github.com/psam-717))