Reason this release was yanked:

noise

kmapper-iscc-scan

Workspace-based ISCC content inventory and similarity clustering tool.

Developed by kmapper GmbH. Not related to KeplerMapper (the kmapper Python package), an implementation of the Mapper algorithm from topological data analysis.

How it works

This tool is built on the ISCC (International Standard Content Code), an ISO standard (ISO 24138) for content-derived, decentralized media identifiers.

Scanning walks a directory recursively and generates an ISCC for each supported file. The ISCC is derived from the file's content, not from its name, so two files with different names but identical content produce the same code. Each result is written as a sidecar .iscc.json file in the workspace, in a directory tree that mirrors the scanned directory.

Compiling aggregates the sidecar files from all scans into a single CSV inventory. It clusters files by similarity, measured as the Hamming distance between the Content-Code units embedded in their ISCCs. The result lets you identify:

Exact duplicates — same content, regardless of filename
Near-duplicates and similar content — e.g. a .pptx presentation and its .pdf handout will appear in the same cluster
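To make the clustering idea concrete, here is a minimal sketch of grouping files by Hamming distance. It assumes the codes are available as 64-bit integers and links files with single-linkage clustering via union-find; both are assumptions for illustration, not a description of how kmapper-iscc-scan is implemented. The file names and code values are made up.

```python
from collections import defaultdict


def hamming(a: int, b: int) -> int:
    """Count the bits that differ between two codes held as integers."""
    return bin(a ^ b).count("1")


def cluster(codes: dict[str, int], threshold: int) -> list[list[str]]:
    """Single-linkage clustering: link files whose codes differ in at most `threshold` bits."""
    parent = {name: name for name in codes}

    def find(x: str) -> str:
        # Follow parent links to the cluster representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    names = list(codes)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if hamming(codes[a], codes[b]) <= threshold:
                parent[find(a)] = find(b)

    groups = defaultdict(list)
    for name in names:
        groups[find(name)].append(name)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical 64-bit codes, not real ISCC output.
codes = {
    "slides.pptx": 0xFFFF0000FFFF0000,
    "handout.pdf": 0xFFFF0000FFFF0007,  # differs from slides.pptx in 3 bits
    "copy-of-slides.pptx": 0xFFFF0000FFFF0000,  # identical code
    "unrelated.txt": 0x0123456789ABCDEF,
}
clusters = cluster(codes, threshold=10)
for group in clusters:
    print(group)
```

With a threshold of 0 only the two files with identical codes end up together; raising the threshold pulls the near-duplicate into the same group. The pairwise loop is quadratic in the number of files, which is fine for a sketch but would need an index for large inventories.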

Installation

1. Check your Python version

python3 --version

You need Python 3.12 or higher. If the command is not found or the version is too old, see below.
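If you prefer to check from inside the interpreter that will run the tool, the same requirement can be tested with a few lines of Python. This is only a convenience sketch; the helper name `meets_minimum` is made up for this example.

```python
import sys

MINIMUM = (3, 12)  # minimum version stated in this README


def meets_minimum(version=sys.version_info, minimum=MINIMUM) -> bool:
    """Compare the (major, minor) part of a version tuple with the minimum."""
    return tuple(version[:2]) >= minimum


if not meets_minimum():
    print(f"Python {sys.version_info.major}.{sys.version_info.minor} is too old; 3.12 or higher is required.")
```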

Installing Python

macOS / Linux: Download from python.org or use your system package manager.

Windows: The easiest option is uv, a fast Python manager written in Rust:

# Install uv
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"

# Install Python 3.12
uv python install 3.12

2. Install the package

pip install kmapper-iscc-scan

Usage

Scanning

Scan a directory recursively (i.e. including all subdirectories) to generate an ISCC for each supported file and store the resulting metadata files in the given workspace directory:

# Scan a directory into a workspace
kmapper-iscc-scan scan /path/to/content /path/to/workspace

The above command creates a default batch name for the scanned directory. To set your own batch name, use:

kmapper-iscc-scan scan /path/to/content /path/to/workspace --batch my-batch

Compiling

Compile all the metadata files from all your different scans into one inventory CSV file, including clustering of identical or similar files:

# Compile an inventory CSV with similarity clustering
kmapper-iscc-scan compile /path/to/workspace

The above command uses a default Hamming distance threshold of 10 (approx. 84.38% similarity). This means files whose codes differ by a Hamming distance of at most 10 are placed in the same cluster of files with similar content. You can optionally set your own Hamming distance threshold, or give a similarity threshold in percent instead:

# Optionally compile with your own threshold for the Hamming distance
kmapper-iscc-scan compile /path/to/workspace --threshold 15 # Hamming distance of at most 15

# Optionally compile with your own similarity threshold given in percent
kmapper-iscc-scan compile /path/to/workspace --similarity 90 # Files with a similarity of at least 90% will be in the same content cluster
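The relation between the two kinds of threshold can be sketched as follows. The 64-bit code length is an assumption inferred from the figures above (10 differing bits out of 64 gives 84.375%), and rounding a percentage down to a whole number of bits is also an assumption; the tool may convert differently.

```python
import math

CODE_BITS = 64  # assumption: inferred from "distance 10 is approx. 84.38%"


def similarity_percent(distance: int, bits: int = CODE_BITS) -> float:
    """Share of matching bits, in percent, for a given Hamming distance."""
    return 100.0 * (1 - distance / bits)


def max_distance(similarity: float, bits: int = CODE_BITS) -> int:
    """Largest Hamming distance whose similarity is still at least `similarity` percent."""
    # The small epsilon guards against floating-point error at exact boundaries.
    return math.floor(bits * (1 - similarity / 100.0) + 1e-9)


print(similarity_percent(10))  # default threshold
print(similarity_percent(15))
print(max_distance(90))
```

Under these assumptions a similarity threshold of 90% corresponds to a Hamming distance of at most 6, since 6 differing bits give 90.625% and 7 would already fall below 90%.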

The inventory CSV

The CSV contains one row per file. The three grouping columns form a hierarchy, listed here from narrowest to broadest; every group at one level is fully contained in a group at the next level:

instance_group — files that are byte-for-byte identical (exact copies, regardless of filename or location). Files in the same instance group are always also in the same data group.
data_group — files with the same data structure (e.g. the same PDF re-saved with slightly different metadata, causing the raw bytes to differ). Files in the same data group are always also in the same content cluster.
content_cluster — files with similar content regardless of format or encoding (e.g. a .pptx presentation and its .pdf handout). This is the broadest grouping.
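The hierarchy described above can be checked on an inventory with a short script. The three column names come from this README; the `path` column, the group identifiers, and the treatment of empty cells as "not grouped" are assumptions made for the example.

```python
import csv
import io

# Grouping columns, narrowest first.
LEVELS = ("instance_group", "data_group", "content_cluster")


def hierarchy_violations(rows):
    """Return (column, value) pairs for groups whose rows span more than one broader group."""
    violations = []
    for child, parent in zip(LEVELS, LEVELS[1:]):
        parents_seen = {}
        for row in rows:
            key = row[child]
            if not key:
                continue  # assumption: an empty cell means the file has no group at this level
            parents_seen.setdefault(key, set()).add(row[parent])
        violations += [(child, key) for key, parents in parents_seen.items() if len(parents) > 1]
    return violations


# Made-up inventory excerpt; real files may contain more columns.
sample = """path,instance_group,data_group,content_cluster
a/report.pdf,I1,D1,C1
b/report-copy.pdf,I1,D1,C1
c/report-resaved.pdf,I2,D1,C1
d/report.pptx,I3,D2,C1
"""
rows = list(csv.DictReader(io.StringIO(sample)))
print(hierarchy_violations(rows))
```

In the sample, the two byte-identical copies share an instance group, the re-saved PDF joins them only at the data group level, and the presentation joins only at the content cluster level, so no violations are reported.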

Checking the scans

Check which directories have already been scanned:

# Show workspace status
kmapper-iscc-scan status /path/to/workspace

Workspace structure

workspace/
  scan_log.json
  sidecars/
    my-batch/
      subdir/
        file.pdf.iscc.json
  iscc_inventory.csv
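Given this layout, the sidecars in a workspace can be enumerated and mapped back to their batch and original relative path. The sketch below relies only on the directory structure shown above and does not read the sidecar contents, whose JSON fields are not documented here; the function name `list_sidecars` is made up for this example.

```python
import tempfile
from pathlib import Path

SUFFIX = ".iscc.json"


def list_sidecars(workspace):
    """Yield (batch, original_relative_path) for every sidecar under workspace/sidecars."""
    root = Path(workspace) / "sidecars"
    for sidecar in sorted(root.rglob("*" + SUFFIX)):
        batch, *rest = sidecar.relative_to(root).parts
        if not rest:
            continue  # a sidecar directly under sidecars/ has no batch directory
        original = Path(*rest)
        # Strip the sidecar suffix to recover the original file name.
        yield batch, original.with_name(original.name[: -len(SUFFIX)])


# Build a throwaway workspace that mirrors the structure shown above.
with tempfile.TemporaryDirectory() as tmp:
    sidecar = Path(tmp) / "sidecars" / "my-batch" / "subdir" / "file.pdf.iscc.json"
    sidecar.parent.mkdir(parents=True)
    sidecar.write_text("{}")
    found = list(list_sidecars(tmp))

print(found)
```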

License

MIT

Release history

0.2.1 (Mar 25, 2026)
0.2.0 (Mar 24, 2026)
0.1.3 (Mar 23, 2026), yanked: noise
0.1.2 (Mar 23, 2026), yanked: noise (this version)
0.1.1 (Mar 23, 2026), yanked: noise
0.1.0 (Mar 23, 2026), yanked: noise

Download files

Source distribution: kmapper_iscc_scan-0.1.2.tar.gz (7.7 kB), uploaded Mar 23, 2026
Uploaded via: twine/6.2.0 CPython/3.12.3
Uploaded using Trusted Publishing? No
SHA256: `e956c86a8567690cf756852c29d6ab11ed0626a79d7c0dc1275a89f10f67e2f0`
MD5: `69d35107fa4e8e860bad7e6b3f6fcbf4`
BLAKE2b-256: `f9f4303776a662e7a959ef64c0581deccca97c75153cb94d06c980e93d1e841b`

Built distribution: kmapper_iscc_scan-0.1.2-py3-none-any.whl (9.4 kB, Python 3), uploaded Mar 23, 2026
Uploaded via: twine/6.2.0 CPython/3.12.3
Uploaded using Trusted Publishing? No
SHA256: `47bd84fbfc97473d9b16f2dc777031ae483b0417d16b4c8c6b8364a9edb69d0d`
MD5: `05041dbeff049ebcdae5db5f0187f5a0`
BLAKE2b-256: `cea7f9a6370890235260498ce51c5fdb6294f1444e1b723332c6e995e494889e`
