Workspace-based ISCC content inventory and similarity clustering tool
Reason this release was yanked:
noise
Project description
kmapper-iscc-scan
Developed by kmapper GmbH. Not related to KeplerMapper (the kmapper Python library), which implements the Mapper algorithm from topological data analysis.
How it works
This tool is built on the ISCC (International Standard Content Code), an ISO standard (ISO 24138) for content-derived, decentralized media identifiers.
Scanning walks a directory recursively and generates an ISCC for each supported file. The ISCC is derived from the file's content rather than its name, so two files with different names but identical content produce the same code. Each result is written to the workspace as a sidecar .iscc.json file whose path mirrors that of the original file.
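The mapping from a scanned file to its sidecar can be sketched as follows. This is a minimal illustration based only on the workspace layout shown under "Workspace structure"; the function names `sidecar_path` and `iter_files` are hypothetical and not part of the tool's API, and the ISCC generation itself is left out.

```python
from pathlib import Path


def sidecar_path(content_root: Path, workspace: Path, batch: str, file: Path) -> Path:
    """Map a scanned file to an assumed sidecar location inside the workspace."""
    relative = file.relative_to(content_root)  # e.g. subdir/file.pdf
    target = workspace / "sidecars" / batch / relative
    # Append the suffix instead of replacing the original extension.
    return target.with_name(target.name + ".iscc.json")


def iter_files(content_root: Path):
    """Yield every regular file below content_root, in sorted order."""
    for path in sorted(content_root.rglob("*")):
        if path.is_file():
            yield path
```

For example, `sidecar_path(Path("/data/content"), Path("/data/ws"), "my-batch", Path("/data/content/subdir/file.pdf"))` yields `/data/ws/sidecars/my-batch/subdir/file.pdf.iscc.json`.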
Compiling aggregates all sidecar files from all scans into a single CSV inventory. It uses the Content-Code unit embedded in each ISCC to cluster files by similarity, measured as the Hamming distance between codes. The result lets you identify:
- Exact duplicates — same content, regardless of filename
- Near-duplicates and similar content, e.g. a .pptx presentation and its .pdf handout will appear in the same cluster
Installation
1. Check your Python version
python3 --version
You need Python 3.12 or higher. If the command is not found or the version is too old, see below.
Installing Python
macOS / Linux: Download from python.org or use your system package manager.
Windows: The easiest option is uv, a fast Python manager written in Rust:
# Install uv
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
# Install Python 3.12
uv python install 3.12
2. Install the package
pip install kmapper-iscc-scan
Usage
# Scan a directory into a workspace
kmapper-iscc-scan scan /path/to/content /path/to/workspace
# Optionally scan with a custom named batch
kmapper-iscc-scan scan /path/to/content /path/to/workspace --batch myBatch
# Compile an inventory CSV with similarity clustering
kmapper-iscc-scan compile /path/to/workspace # This uses a default Hamming distance of 10 (i.e. approx. 84.38% similarity)
# Optionally compile with your own threshold for the Hamming distance
kmapper-iscc-scan compile /path/to/workspace --threshold 15
# Optionally compile with your own similarity threshold given in percent
kmapper-iscc-scan compile /path/to/workspace --similarity 90
# Show workspace status
kmapper-iscc-scan status /path/to/workspace
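The relation between a Hamming distance threshold and a similarity percentage can be sketched as below. The 64-bit code length is an assumption inferred from the figures above (a distance of 10 corresponding to about 84.38%); how the tool rounds a percentage to a distance is not documented, so rounding down is also an assumption. The function names are hypothetical.

```python
import math

CODE_BITS = 64  # assumption: similarity is measured over a 64-bit code


def similarity_percent(distance: int, bits: int = CODE_BITS) -> float:
    """Share of matching bits, in percent, for a given Hamming distance."""
    return (1 - distance / bits) * 100


def max_distance(similarity: float, bits: int = CODE_BITS) -> int:
    """Largest Hamming distance that still meets `similarity` percent."""
    return math.floor(bits * (100 - similarity) / 100)
```

Under these assumptions, a threshold of 15 corresponds to about 76.56% similarity, and a similarity of 90% corresponds to a maximum distance of 6 bits.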
Workspace structure
workspace/
  scan_log.json
  sidecars/
    my-batch/
      subdir/
        file.pdf.iscc.json
  iscc_inventory.csv
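Given this layout, aggregating sidecars into an inventory could look roughly like the sketch below. The sidecar JSON schema and the CSV columns are not documented here, so the `iscc` key and the `batch`, `file`, `iscc` columns are assumptions, as are the function names `collect_sidecars` and `write_inventory`.

```python
import csv
import json
from pathlib import Path


def collect_sidecars(workspace: Path) -> list[dict[str, str]]:
    """Read every sidecar below workspace/sidecars into one row per file."""
    rows = []
    sidecar_root = workspace / "sidecars"
    for sidecar in sorted(sidecar_root.rglob("*.iscc.json")):
        data = json.loads(sidecar.read_text(encoding="utf-8"))
        # The first path component is taken to be the batch name.
        batch, *rest = sidecar.relative_to(sidecar_root).parts
        rows.append({
            "batch": batch,
            "file": "/".join(rest).removesuffix(".iscc.json"),
            "iscc": data.get("iscc", ""),  # assumed key name
        })
    return rows


def write_inventory(rows: list[dict[str, str]], target: Path) -> None:
    """Write the collected rows to a CSV file with a header line."""
    with target.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["batch", "file", "iscc"])
        writer.writeheader()
        writer.writerows(rows)
```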
License
MIT
File details
Details for the file kmapper_iscc_scan-0.1.1.tar.gz.
File metadata
- Download URL: kmapper_iscc_scan-0.1.1.tar.gz
- Upload date:
- Size: 7.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4a8c56d4bca8eb14d7452facd66de01deb19606c8f3d8e2b72f2e36fce24c4d2 |
| MD5 | 910c2ff95b3ac319e690da00983967d7 |
| BLAKE2b-256 | 3f0d24d74d85e482f4efd33b871189af409c6149d941f35283ca8b02ce5f34ed |
File details
Details for the file kmapper_iscc_scan-0.1.1-py3-none-any.whl.
File metadata
- Download URL: kmapper_iscc_scan-0.1.1-py3-none-any.whl
- Upload date:
- Size: 8.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 93607c4a5343c1edb3e877fc371bc29bbbd48e63786fbbea9b69b0cbc8ff10d7 |
| MD5 | 7e575c0f8a053675a8980b77a6e4f7a5 |
| BLAKE2b-256 | c9cd045a3a6b3590e68ea56f8b6c4adc41b57990496fd8af7f8b7c22a051279b |