Workspace-based ISCC content inventory and similarity clustering tool

Reason this release was yanked:

noise

kmapper-iscc-scan

Workspace-based ISCC content inventory and similarity clustering tool.

Developed by kmapper GmbH. Not related to KeplerMapper (the kmapper Python library), which implements the Mapper algorithm from topological data analysis.

How it works

This tool is built on the ISCC (International Standard Content Code), an ISO standard (ISO 24138) for content-derived, decentralized media identifiers.

Scanning walks a directory recursively and generates an ISCC for each supported file. The ISCC is derived from the file's content rather than its name, so two files with different names but identical content produce the same code. Each result is written as a sidecar .iscc.json file in the workspace, under a path that mirrors the original file's location in the scanned directory (see Workspace structure).

Compiling aggregates the sidecar files from all scans into a single CSV inventory. It then clusters files by similarity, comparing the Content Unit embedded in each ISCC by Hamming distance (the number of differing bits) against a configurable threshold. The result lets you identify:

  • Exact duplicates — same content, regardless of filename
  • Near-duplicates and similar content — e.g. a .pptx presentation and its .pdf handout will appear in the same cluster
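
The clustering idea can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual implementation: it assumes each Content Unit is available as a 64-bit integer, and it assumes single-linkage grouping (two files share a cluster if a chain of pairs within the threshold connects them). The names hamming and cluster and the sample values are hypothetical.

```python
from itertools import combinations


def hamming(a: int, b: int) -> int:
    """Count the bits that differ between two bit strings stored as ints."""
    return bin(a ^ b).count("1")


def cluster(codes: dict[str, int], threshold: int) -> list[set[str]]:
    """Group names connected by a chain of pairs within `threshold` bits.

    Assumption: single-linkage grouping via union-find; the real tool
    may group files differently.
    """
    parent = {name: name for name in codes}

    def find(name: str) -> str:
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    for x, y in combinations(codes, 2):
        if hamming(codes[x], codes[y]) <= threshold:
            parent[find(x)] = find(y)

    groups: dict[str, set[str]] = {}
    for name in codes:
        groups.setdefault(find(name), set()).add(name)
    return list(groups.values())


# Made-up 64-bit values for illustration, not real ISCC units.
codes = {
    "slides.pptx": 0xFFFF0000FFFF0000,
    "handout.pdf": 0xFFFF0000FFFF0003,  # differs from slides.pptx in 2 bits
    "photo.jpg": 0x0123456789ABCDEF,
}
clusters = cluster(codes, threshold=10)
```

With these sample values, slides.pptx and handout.pdf end up in one cluster and photo.jpg in a cluster of its own. A distance of 0 corresponds to the exact-duplicate case.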

Installation

1. Check your Python version

python3 --version

You need Python 3.12 or higher. If the command is not found or the version is too old, see below.

Installing Python

macOS / Linux: Download from python.org or use your system package manager.

Windows: The easiest option is uv, a fast Python manager written in Rust:

# Install uv
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"

# Install Python 3.12
uv python install 3.12

2. Install the package

pip install kmapper-iscc-scan

Usage

# Scan a directory into a workspace
kmapper-iscc-scan scan /path/to/content /path/to/workspace

# Optionally scan with a custom-named batch
kmapper-iscc-scan scan /path/to/content /path/to/workspace --batch my-batch

# Compile an inventory CSV with similarity clustering
kmapper-iscc-scan compile /path/to/workspace # This uses a default Hamming distance of 10 (i.e. approx. 84.38% similarity)

# Optionally compile with your own threshold for the Hamming distance
kmapper-iscc-scan compile /path/to/workspace --threshold 15

# Optionally compile with your own similarity threshold given in percent
kmapper-iscc-scan compile /path/to/workspace --similarity 90

# Show workspace status
kmapper-iscc-scan status /path/to/workspace
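
The relationship between the two threshold options can be worked out by hand. The sketch below assumes the compared bit string is 64 bits long, which is consistent with the default quoted above (10 of 64 bits differing gives 84.375%). The function names are hypothetical, and rounding the distance down is an assumption; the tool may round differently.

```python
BITS = 64  # assumed length of the compared bit string


def distance_to_similarity(distance: int, bits: int = BITS) -> float:
    """Similarity in percent for a given Hamming distance."""
    return 100.0 * (1 - distance / bits)


def similarity_to_distance(similarity: float, bits: int = BITS) -> int:
    """Largest Hamming distance that still meets the given similarity.

    Assumption: the distance is rounded down, so the resulting
    threshold is never looser than the requested percentage.
    """
    return int(((100.0 - similarity) * bits) // 100)


default_similarity = distance_to_similarity(10)  # 84.375
distance_for_90 = similarity_to_distance(90)     # 6
```

Under these assumptions, a similarity of 90 corresponds to a Hamming distance of at most 6 bits, and a distance of 15 corresponds to about 76.56%.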

Workspace structure

workspace/
  scan_log.json
  sidecars/
    my-batch/
      subdir/
        file.pdf.iscc.json
  iscc_inventory.csv
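
Given this layout, the sidecars of a workspace can be enumerated with the standard library alone. The following is a small sketch, not part of the package: the helper name list_sidecars is hypothetical, and it relies only on the directory names and the .iscc.json suffix shown above.

```python
from pathlib import Path

SUFFIX = ".iscc.json"


def list_sidecars(workspace: Path) -> dict[str, list[str]]:
    """Map each batch name to the relative paths of the files it describes."""
    root = workspace / "sidecars"
    batches: dict[str, list[str]] = {}
    for sidecar in sorted(root.rglob(f"*{SUFFIX}")):
        batch, *rest = sidecar.relative_to(root).parts
        if not rest:
            continue  # a sidecar outside any batch directory is ignored
        # Drop the sidecar suffix to recover the original relative path.
        original = "/".join(rest)[: -len(SUFFIX)]
        batches.setdefault(batch, []).append(original)
    return batches
```

For the tree above, list_sidecars(Path("workspace")) would return a dictionary with the single key my-batch, listing subdir/file.pdf.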

License

MIT
