High-performance ISCC Data-Code and Instance-Code hashing

iscc-sum

A blazing-fast ISCC Data-Code and Instance-Code hashing tool built in Rust with Python bindings. Delivers 50-130x faster performance than reference implementations, processing data at over 1 GB/s.

Originally created to handle massive microscopic imaging datasets where existing tools were too slow.

Project Status

Version 0.1.0 — Initial release for Data-Code and Instance-Code generation.

Note: By default, this tool creates ISCC-CODEs of SubType WIDE, which was introduced to support large-scale secure checksums with data similarity matching. This SubType is not yet part of the ISO 24138:2024 standard, but it is supported by the latest version of the iscc-core reference implementation. For ISCC-CODEs that conform to ISO 24138:2024, use the --narrow flag in the CLI tool.

Performance

  • 950-1050 MB/s processing speed (vs 7-8 MB/s reference)
  • 50-130x faster than existing implementations
  • Consistent performance on multi-GB files

Ideal for large-scale data processing: microscopic imaging, video files, scientific datasets.
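
To put the throughput figures above in perspective, here is a rough back-of-the-envelope sketch. The 500 GB dataset size is a hypothetical example, the helper name is illustrative, and the calculation assumes decimal megabytes and the midpoints of the quoted ranges; real timings depend on storage and hardware.

```python
def hashing_seconds(size_bytes: int, throughput_mb_per_s: float) -> float:
    """Estimate wall-clock seconds to hash `size_bytes` at a given throughput."""
    megabytes = size_bytes / 1_000_000  # decimal megabytes, as in "MB/s"
    return megabytes / throughput_mb_per_s


dataset = 500 * 1_000_000_000  # hypothetical 500 GB dataset

fast = hashing_seconds(dataset, 1000.0)  # midpoint of the 950-1050 MB/s range
slow = hashing_seconds(dataset, 7.5)     # midpoint of the 7-8 MB/s reference range

print(f"iscc-sum:  {fast / 60:.1f} minutes")   # iscc-sum:  8.3 minutes
print(f"reference: {slow / 3600:.1f} hours")   # reference: 18.5 hours
```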

Installation

Python Package

The recommended way to install the iscc-sum CLI tool is using uv:

uv tool install iscc-sum

Note: To install uv, run: curl -LsSf https://astral.sh/uv/install.sh | sh (or see other installation methods)

Alternatively, install from PyPI:

pip install iscc-sum

Rust CLI Tool

Install from crates.io:

cargo install iscc-sum

Or download pre-built binaries from the releases page.

Usage

Command Line Interface

The iscc-sum command provides checksum generation and verification functionality similar to standard tools like md5sum or sha256sum, but using ISCC (International Standard Content Code) checksums.

Basic Usage

# Generate checksum for a file
iscc-sum document.pdf
# Output: ISCC:KACYPXW445FTYNJ3CYSXHAFJMA2HUWULUNRFE3BLHRSCXYH2XHGQY *document.pdf

# Generate checksums for multiple files
iscc-sum *.txt

# Read from standard input
echo "Hello, World!" | iscc-sum
cat document.txt | iscc-sum

Checksum Verification

# Create a checksum file
iscc-sum *.txt > checksums.txt

# Verify checksums
iscc-sum -c checksums.txt
# Output:
# file1.txt: OK
# file2.txt: OK

# Verify with quiet mode (only show failures)
iscc-sum -c -q checksums.txt
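
Conceptually, verification recomputes each file's checksum and compares it with the recorded one. The sketch below illustrates that idea only; it is not the tool's implementation. The `compute` callable is a hypothetical stand-in for actual hashing, the checksum strings are made-up placeholders, and the "FAILED" label is an assumption borrowed from md5sum-style tools.

```python
from typing import Callable


def verify_checksums(listing: str, compute: Callable[[str], str]) -> dict[str, bool]:
    """Check each "<checksum> *<path>" line against a freshly computed checksum."""
    results: dict[str, bool] = {}
    for line in listing.splitlines():
        if not line.strip():
            continue  # skip blank lines
        expected, _, path = line.partition(" *")
        results[path] = compute(path) == expected
    return results


# Hypothetical stand-in for hashing; a real check would hash each file instead.
fake_checksums = {"file1.txt": "ISCC:AAAA", "file2.txt": "ISCC:BBBB"}

listing = "ISCC:AAAA *file1.txt\nISCC:CCCC *file2.txt\n"
for path, ok in verify_checksums(listing, fake_checksums.__getitem__).items():
    print(f"{path}: {'OK' if ok else 'FAILED'}")
```

With these placeholder values, file1.txt matches and file2.txt does not.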

Output Formats

# Default format (GNU style)
iscc-sum file.txt
# ISCC:KACYPXW445FTYNJ3CYSXHAFJMA2HUWULUNRFE3BLHRSCXYH2XHGQY *file.txt

# BSD-style format
iscc-sum --tag file.txt
# ISCC (file.txt) = ISCC:KACYPXW445FTYNJ3CYSXHAFJMA2HUWULUNRFE3BLHRSCXYH2XHGQY

# Narrow format (128-bit)
iscc-sum --narrow file.txt
# ISCC:KACYPXW445FTYNJ3CYSXHAFJMA2HU *file.txt

# Show component codes
iscc-sum --units file.txt
# ISCC:KACYPXW445FTYNJ3CYSXHAFJMA2HUWULUNRFE3BLHRSCXYH2XHGQY *file.txt
#   ISCC:EAAW4BQTJSTJSHAI27AJSAGMGHNUKSKRTK3E6OZ5CXUS57SWQZXJQ
#   ISCC:IABXF3ZHYL6O6PM5P2HGV677CS3RBHINZSXEJCITE3WNOTQ2CYXRA

# Process entire directory as single unit
iscc-sum --tree /path/to/project
# ISCC:KACYPXW445FTYNJ3CYSXHAFJMA2HUWULUNRFE3BLHRSCXYH2XHGQY */path/to/project/
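
Scripts that consume this output may need to handle both the GNU and BSD layouts. The following is a minimal parsing sketch based only on the example lines above; the regular expressions and the `parse_line` helper are illustrative assumptions, including the assumption that the checksum body uses only the A-Z and 2-7 characters seen in the examples. Indented component lines from --units are not handled.

```python
import re

# GNU style: "<checksum> *<path>"; BSD style: "ISCC (<path>) = <checksum>"
GNU_LINE = re.compile(r"^(?P<iscc>ISCC:[A-Z2-7]+) \*(?P<path>.+)$")
BSD_LINE = re.compile(r"^ISCC \((?P<path>.+)\) = (?P<iscc>ISCC:[A-Z2-7]+)$")


def parse_line(line: str) -> tuple[str, str]:
    """Return (path, checksum) for one GNU- or BSD-style output line."""
    match = GNU_LINE.match(line) or BSD_LINE.match(line)
    if match is None:
        raise ValueError(f"unrecognized checksum line: {line!r}")
    return match["path"], match["iscc"]


checksum = "ISCC:KACYPXW445FTYNJ3CYSXHAFJMA2HUWULUNRFE3BLHRSCXYH2XHGQY"
print(parse_line(f"{checksum} *file.txt"))
print(parse_line(f"ISCC (file.txt) = {checksum}"))
```

Both calls return the same (path, checksum) pair, so downstream code does not need to care which format was used.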

Similarity Matching

Find files with similar content:

# Find similar files (default threshold: 12 bits)
iscc-sum --similar *.jpg
# Output:
# photo1.jpg
#   ~8  photo2.jpg
#   ~12 photo3.jpg

# Adjust similarity threshold
iscc-sum --similar --threshold 6 *.pdf
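
The threshold is expressed in bits of Hamming distance, that is, the number of bit positions at which two codes differ. The sketch below shows that measure on raw bytes. The digests are hypothetical values, decoding an ISCC string into bytes is not shown, and treating "at most threshold bits" as similar is an assumption about how the threshold is applied.

```python
def hamming_distance(a: bytes, b: bytes) -> int:
    """Count the bit positions at which two equal-length byte strings differ."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def is_similar(a: bytes, b: bytes, threshold: int = 12) -> bool:
    """Treat two digests as similar when they differ in at most `threshold` bits."""
    return hamming_distance(a, b) <= threshold


# Hypothetical 32-bit digests that differ in 5 bit positions
first = bytes.fromhex("00ff00ff")
second = bytes.fromhex("01ff00f0")

print(hamming_distance(first, second))         # 5
print(is_similar(first, second, threshold=6))  # True
print(is_similar(first, second, threshold=4))  # False
```

A lower threshold therefore demands closer matches, which is why --threshold 6 is stricter than the default of 12.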

Complete Options

iscc-sum --help  # Show all available options

Options:
-c, --check      Read checksums from files and check them
--narrow         Generate shorter 128-bit checksums
--tag            Create a BSD-style checksum
--units          Show Data-Code and Instance-Code components
-z, --zero       End each output line with NUL
--similar        Find files with similar Data-Codes
--threshold      Hamming distance threshold for similarity (default: 12)
-t, --tree       Process directory as single unit with combined checksum
-q, --quiet      Don't print OK for each verified file
--status         Don't output anything, exit code shows success
-w, --warn       Warn about improperly formatted lines
--strict         Exit non-zero for improperly formatted lines

Examples

See the examples directory for practical scripts demonstrating:

  • Backup verification workflows
  • Duplicate file detection
  • File integrity monitoring
  • Download verification

Rust CLI Tool

A standalone Rust binary is also available. It installs an executable named isum:

# Install from crates.io
cargo install iscc-sum

# Run the Rust CLI
isum

Python API

Quick Start

Generate ISCC-SUM codes for files:

>>> from iscc_sum import code_iscc_sum
>>> 
>>> # Generate extended ISCC-SUM for a file
>>> result = code_iscc_sum("LICENSE", wide=True)
>>> result.iscc
'ISCC:K4AA2G6UMXGFJAO6ZOMIFZIYO6LYMOBT7Q6JDI3Z75IJWQY5WH372QA'
>>> result.datahash
'1e203833fc3c91a379ff509b431db1f7fd40dea69a6614249f420ec62398957087b1'
>>> result.filesize
11357
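
The datahash value above is 34 bytes long and begins with 1e20. The sketch below splits it on the assumption that these two leading bytes follow the multihash convention, a hash-function code followed by the digest length; under that assumption, 0x1e is the multicodec code for BLAKE3 and 0x20 declares a 32-byte digest. The `split_datahash` helper is illustrative and not part of the iscc_sum API.

```python
def split_datahash(datahash: str) -> tuple[int, int, bytes]:
    """Split a hex datahash into (prefix code, declared length, digest bytes)."""
    raw = bytes.fromhex(datahash)
    code, length, digest = raw[0], raw[1], raw[2:]
    if len(digest) != length:
        raise ValueError("digest length does not match the declared length")
    return code, length, digest


code, length, digest = split_datahash(
    "1e203833fc3c91a379ff509b431db1f7fd40dea69a6614249f420ec62398957087b1"
)
print(hex(code), length, len(digest))  # 0x1e 32 32
```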

Streaming API

For large files or streaming data, use the processor classes:

from iscc_sum import IsccSumProcessor

processor = IsccSumProcessor()
with open("large_file.bin", "rb") as f:
    while chunk := f.read(1024 * 1024):  # Read in 1 MiB chunks
        processor.update(chunk)

result = processor.result(wide=False, add_units=True)
print(f"ISCC: {result.iscc}")
print(f"Units: {result.units}")  # Individual Data-Code and Instance-Code

Development

Prerequisites

  • Rust (latest stable) - Install from rustup.rs
  • Python 3.10+
  • uv (for Python dependency management) - Install from astral.sh/uv

Quick Setup

# Clone the repository
git clone https://github.com/bio-codes/iscc-sum.git
cd iscc-sum

# Install Python dependencies
uv sync --all-extras

# Setup Rust development components
uv run poe setup

# Build Python extension and run all checks
uv run poe all

Development Commands

All development tasks are managed through poethepoet:

# One-time setup (installs Rust components)
uv run poe setup

# Pre-commit checks (format, lint, test everything)
uv run poe all

# Individual commands
uv run poe format        # Format all code (Rust + Python)
uv run poe test          # Run all tests (Rust + Python)
uv run poe typecheck     # Run Python type checking
uv run poe rust-build    # Build Rust binary
uv run poe build-ext     # Build Python extension

# Check if Rust toolchain is properly installed
uv run poe check-rust

Manual Setup (if needed)

# Install all dependencies including dev dependencies
uv sync --all-extras

# Install Rust components manually
rustup component add rustfmt clippy

# Build Rust extension for Python
uv run maturin develop

# Run tests manually
cargo test        # Rust tests
uv run pytest     # Python tests

Building

# Build Rust binary (creates isum executable)
cargo build --release

# Build Python wheels
maturin build --release

Funding

This project has received funding from the European Commission's Horizon Europe Research and Innovation programme under grant agreement No. 101129751 as part of the BIO-CODES project (Enhancing AI-Readiness of Bioimaging Data with Content-Based Identifiers).

License

This project is licensed under the Apache License, Version 2.0 - see the LICENSE file for details.
