ISCC - Semantic Text-Code

iscc-sct is a proof-of-concept implementation of a semantic Text-Code for the ISCC (International Standard Content Code). Semantic Text-Codes are designed to capture and represent the language-agnostic semantic content of text for improved similarity detection.

Caution: This is an early proof of concept. All releases with release numbers below v1.0.0 may break backward compatibility and produce incompatible Semantic Text-Codes.

What is ISCC Semantic Text-Code

The ISCC framework already comes with a Text-Code that is based on lexical similarity and can match near-duplicates. The ISCC Semantic Text-Code is planned as an additional ISCC-UNIT focused on capturing a more abstract and broader semantic similarity. As such, the Semantic Text-Code is engineered to be robust against a wider range of variations, including translations, that cannot be matched based on lexical similarity.

Features

  • Semantic Similarity: Leverages deep learning models to generate codes that reflect the semantic content of text.
  • Bit-Length Flexibility: Supports generating codes of various bit lengths (up to 256 bits), allowing for adjustable granularity in similarity detection.
  • ISCC Compatible: Generates codes that are fully compatible with the ISCC specification, facilitating integration with existing ISCC-based systems.
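
To illustrate how bit length relates to granularity in similarity detection, here is a minimal sketch that is not part of iscc-sct. It assumes that codes are compared by Hamming distance over their raw bits and that a coarser comparison simply uses fewer leading bits. The helper names and the example byte values are hypothetical.

```python
def hamming_distance(a: bytes, b: bytes) -> int:
    """Count the differing bits between two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("codes must have the same length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def similarity(a: bytes, b: bytes, bits: int) -> float:
    """Fraction of matching bits within the first `bits` bits (hypothetical helper)."""
    if bits % 8 != 0 or not 0 < bits <= 8 * min(len(a), len(b)):
        raise ValueError("bits must be a positive multiple of 8 within the code length")
    n = bits // 8
    return 1.0 - hamming_distance(a[:n], b[:n]) / bits


# Made-up 64-bit code bodies for illustration only (not real Semantic Text-Codes).
code_a = bytes.fromhex("5d9cde49f7725622")
code_b = bytes.fromhex("5d9cde49f7725e23")

print(hamming_distance(code_a, code_b))     # number of differing bits
print(similarity(code_a, code_b, bits=64))  # compare all 64 bits
print(similarity(code_a, code_b, bits=32))  # coarser: first 32 bits only
```

Under these assumptions, a longer code gives a finer-grained similarity score, while a shorter prefix is cheaper to store and compare.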

Installation

Before you can install iscc-sct, you need to have Python 3.8 or newer installed on your system. Install the library as follows:

pip install iscc-sct

If your system supports CUDA, you can improve performance by installing with GPU support:

pip install iscc-sct[gpu]

Usage

To generate a Semantic Text-Code, use the create function:

>>> import iscc_sct as sci
>>> text = "This is some sample text. It can be a longer document or even an entire book."
>>> sci.create(text)
{
  "iscc": "ISCC:CAAVZHGOJH3XUFRF",
  "characters": 89
}

You can also generate granular (per chunk) feature outputs:

>>> import iscc_sct as sci
>>> text = "This is some sample text. It can be a longer document or even an entire book."
>>> sci.create(text, granular=True)
{
  "iscc": "ISCC:CAAV3GG6JH3XEVRN",
  "characters": 77,
  "features": [
    {
      "feature": "LWMN4SPXOJLC2",
      "offset": 0,
      "size": 77,
      "text": "This is some sample text. It can be a longer document or even an entire book."
    }
  ]
}
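
The granular output can be mapped back to the source text. The sketch below is an illustration rather than documented API behavior: it copies the granular result shown above into a plain dictionary and assumes that offset and size are character positions in the original string, which is consistent with the example (offset 0 and size 77 for a 77-character text). The function name chunk_spans is hypothetical.

```python
text = "This is some sample text. It can be a longer document or even an entire book."

# Result copied from the granular example above, written as a plain dict.
result = {
    "iscc": "ISCC:CAAV3GG6JH3XEVRN",
    "characters": 77,
    "features": [
        {"feature": "LWMN4SPXOJLC2", "offset": 0, "size": 77, "text": text},
    ],
}


def chunk_spans(source, features):
    """Yield (feature, chunk_text) pairs by slicing `source` with offset/size.

    Assumption: offsets and sizes count characters, not bytes.
    """
    for item in features:
        start = item["offset"]
        end = start + item["size"]
        yield item["feature"], source[start:end]


for feature, chunk in chunk_spans(text, result["features"]):
    print(feature, repr(chunk))
```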

Installation also creates a simple sct command line tool in your Python bin/Scripts folder:

sct --help
usage: sct [-h] [-b BITS] [-g] [-d] [path]

Generate Semantic Text-Codes for text files.

positional arguments:
  path                  Path to text files (supports glob patterns).

options:
  -h, --help            show this help message and exit
  -b BITS, --bits BITS  Bit-Length of Code (default 256)
  -g, --granular        Activate granular processing.
  -d, --debug           Show debugging messages.

How It Works

iscc-sct splits the text into chunks and uses a pre-trained deep learning model for text embedding. The model generates a feature vector that captures the essential characteristics of the chunks. These vectors are aggregated and then binarized to produce a Semantic Text-Code that is robust to variations/translations of the text.
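
The pipeline described above (chunk, embed, aggregate, binarize) can be sketched in a few lines. This is a toy illustration under stated assumptions, not the iscc-sct implementation: the fixed-size chunking, the mean aggregation, and the sign-based binarization are all assumptions, and toy_embed is a hash-based stand-in for the pre-trained model that does not capture meaning. All function names are hypothetical.

```python
import hashlib
from typing import List

DIM = 64  # one output bit per embedding dimension in this sketch


def split_chunks(text: str, size: int = 40) -> List[str]:
    """Split text into fixed-size character chunks (assumed strategy)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def toy_embed(chunk: str) -> List[float]:
    """Stand-in for the pre-trained model: a deterministic pseudo-embedding.

    Hashes the chunk and maps each digest byte to a value in [-1, 1).
    It does NOT capture semantics; a real embedding model would.
    """
    digest = hashlib.sha512(chunk.encode("utf-8")).digest()  # 64 bytes
    return [(b - 128) / 128.0 for b in digest[:DIM]]


def aggregate(vectors: List[List[float]]) -> List[float]:
    """Average the chunk vectors element-wise (assumed aggregation)."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def binarize(vector: List[float]) -> bytes:
    """Set a bit to 1 where the component is non-negative (assumed rule)."""
    bits = "".join("1" if v >= 0 else "0" for v in vector)
    return int(bits, 2).to_bytes(len(vector) // 8, "big")


def sketch_code(text: str) -> bytes:
    """Run the four steps and return the raw code bits as bytes."""
    if not text:
        raise ValueError("text must not be empty")
    vectors = [toy_embed(chunk) for chunk in split_chunks(text)]
    return binarize(aggregate(vectors))


sample = "This is some sample text. It can be a longer document or even an entire book."
print(sketch_code(sample).hex())
```

Because the stand-in embedding is hash-based, similar texts do not yield similar codes here. The robustness to variations and translations comes from the embedding model, which this sketch deliberately replaces.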

Development

This project is a proof of concept, and contributions that enhance its capabilities, efficiency, and compatibility with the broader ISCC ecosystem are welcome. For development, install the project in development mode using Poetry:

git clone https://github.com/iscc/iscc-sct.git
cd iscc-sct
poetry install

Contributing

Contributions are welcome! If you have suggestions for improvements or bug fixes, please open an issue or pull request. For major changes, please open an issue first to discuss what you would like to change.

License

This project is licensed under the CC-BY-NC-SA-4.0 International License.
