ISCC - Semantic Text-Code

iscc-sct is a proof of concept implementation of a semantic Text-Code for the ISCC (International Standard Content Code). Semantic Text-Codes are designed to capture and represent the language agnostic semantic content of text for improved similarity detection.

Caution: This is an early proof of concept. All releases with version numbers below v1.0.0 may break backward compatibility and produce incompatible Semantic Text-Codes.

What is ISCC Semantic Text-Code?

The ISCC framework already includes a Text-Code based on lexical similarity for near-duplicate matching. The ISCC Semantic Text-Code is a planned additional ISCC-UNIT focused on capturing a more abstract and broader semantic similarity. It is engineered to be robust against a wide range of variations and, most remarkably, translations of text that cannot be matched based on lexical similarity alone.

Translation Matching

One of the most interesting aspects of the Semantic Text-Code is its ability to generate (near-)identical codes for translations of the same text. The same content, expressed in different languages, can therefore be identified and linked, which opens up new possibilities for cross-lingual content identification and similarity detection.
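One way to quantify how "near-identical" two codes are is to count the bits in which they differ (Hamming distance). The following is a minimal sketch using only the standard library, not part of the iscc-sct API. It assumes that the code is base32 encoded without padding and starts with a 2-byte header, which matches the 55-character, 256-bit example code shown in this README; the helper names `decode_body` and `hamming_distance` are hypothetical.

```python
import base64

HEADER_BYTES = 2  # assumption: 2-byte header in front of the code body


def decode_body(iscc: str) -> bytes:
    """Return the raw body bytes of an ISCC string (header stripped)."""
    code = iscc.removeprefix("ISCC:")  # str.removeprefix needs Python 3.9+
    padded = code + "=" * (-len(code) % 8)  # restore base32 padding
    raw = base64.b32decode(padded)
    return raw[HEADER_BYTES:]


def hamming_distance(iscc_a: str, iscc_b: str) -> int:
    """Count differing bits between two codes of equal bit length."""
    a, b = decode_body(iscc_a), decode_body(iscc_b)
    if len(a) != len(b):
        raise ValueError("codes must have the same bit length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


code = "ISCC:CADV3GG6JH3XEVRNSVYGCLJ7AAV3BOT5J7EHEZKPFXEGRJ2CTWACGZI"
print(hamming_distance(code, code))  # 0
```

A small distance relative to the code length suggests similar content. Which threshold counts as a match is application specific and not stated here.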

Key Features

  • Semantic Similarity: Utilizes deep learning models to generate codes that reflect the semantic essence of text.
  • Translation Matching: Creates nearly identical codes for text translations, enabling cross-lingual content identification.
  • Bit-Length Flexibility: Supports generating codes of various bit lengths (up to 256 bits), allowing for adjustable granularity in similarity detection.
  • ISCC Compatible: Generates codes fully compatible with the ISCC specification, facilitating seamless integration with existing ISCC-based systems.

Installation

Ensure you have Python 3.9 or newer installed on your system. Install the library using:

pip install iscc-sct

On systems with a CUDA-capable GPU, install the gpu extra for better performance (the quotes keep shells such as zsh from interpreting the brackets):

pip install "iscc-sct[gpu]"

Usage

Generate a Semantic Text-Code using the create function:

>>> import iscc_sct as sct
>>> text = "This is some sample text. It can be a longer document or even an entire book."
>>> sct.create(text, bits=256)
{
  "iscc": "ISCC:CADV3GG6JH3XEVRNSVYGCLJ7AAV3BOT5J7EHEZKPFXEGRJ2CTWACGZI",
  "characters": 77
}

For granular (per chunk) feature outputs:

>>> import iscc_sct as sct
>>> text = "This is some sample text. It can be a longer document or even an entire book."
>>> sct.create(text, bits=256, granular=True)
{
  "iscc": "ISCC:CADV3GG6JH3XEVRNSVYGCLJ7AAV3BOT5J7EHEZKPFXEGRJ2CTWACGZI",
  "characters": 77,
  "features": [
    {
      "feature": "LWMN4SPXOJLC2",
      "offset": 0,
      "size": 77,
      "text": "This is some sample text. It can be a longer document or even an entire book."
    }
  ]
}
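The granular output can be mapped back onto the input text. The sketch below copies the result shown above into a plain dict and slices the source string with each feature's offset and size. It assumes that both values are character positions into the original string, which is consistent with this single-chunk example; `feature_spans` is a hypothetical helper, not part of the library.

```python
text = "This is some sample text. It can be a longer document or even an entire book."

# Result copied from the granular example above, treated as a plain dict.
result = {
    "iscc": "ISCC:CADV3GG6JH3XEVRNSVYGCLJ7AAV3BOT5J7EHEZKPFXEGRJ2CTWACGZI",
    "characters": 77,
    "features": [
        {
            "feature": "LWMN4SPXOJLC2",
            "offset": 0,
            "size": 77,
            "text": text,
        }
    ],
}


def feature_spans(source, result):
    """Yield (feature, snippet) pairs by slicing the source text."""
    for item in result.get("features", []):
        start = item["offset"]  # assumed to be a character offset
        end = start + item["size"]
        yield item["feature"], source[start:end]


for feature, snippet in feature_spans(text, result):
    print(feature, repr(snippet[:40]))
```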

The installation also provides an sct command-line tool:

sct --help
usage: sct [-h] [-b BITS] [-g] [-d] [path]

Generate Semantic Text-Codes for text files.

positional arguments:
  path                  Path to text files (supports glob patterns).

options:
  -h, --help            show this help message and exit
  -b BITS, --bits BITS  Bit-Length of Code (default 256)
  -g, --granular        Activate granular processing.
  -d, --debug           Show debugging messages.

How It Works

iscc-sct employs the following process:

  1. Splits the text into semantically coherent chunks.
  2. Uses a pre-trained deep learning model for text embedding.
  3. Generates feature vectors capturing essential characteristics of the chunks.
  4. Aggregates these vectors and binarizes them to produce a Semantic Text-Code.

This process is designed to make the resulting codes robust to textual variations and translations, which is what enables cross-lingual matching.
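The four steps can be illustrated with a toy pipeline. This is a sketch under loud assumptions and not the iscc-sct implementation: the chunker is a naive sentence splitter, `toy_embed` is a hash-based stand-in with no semantic meaning (the real library uses a pre-trained model), and both mean aggregation and sign-based binarization are assumed choices. All function names are hypothetical.

```python
import hashlib
import re

DIMS = 64  # toy vector size, chosen only for illustration


def split_chunks(text):
    """Step 1 (stand-in): split on sentence-ending punctuation."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def toy_embed(chunk):
    """Step 2/3 (stand-in): deterministic pseudo-vector, NOT semantic."""
    digest = hashlib.sha256(chunk.encode("utf-8")).digest()
    data = digest + hashlib.sha256(digest).digest()  # 64 bytes in total
    return [b / 128.0 - 1.0 for b in data[:DIMS]]  # values in [-1, 1)


def aggregate(vectors):
    """Step 4a (assumed): element-wise mean over all chunk vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def binarize(vector):
    """Step 4b (assumed): one bit per dimension, set if value >= 0."""
    bits = 0
    for value in vector:
        bits = (bits << 1) | (1 if value >= 0 else 0)
    return bits.to_bytes(len(vector) // 8, "big")


def toy_code(text):
    chunks = split_chunks(text)
    if not chunks:
        raise ValueError("text must not be empty")
    return binarize(aggregate([toy_embed(c) for c in chunks]))


sample = "This is some sample text. It can be a longer document or even an entire book."
print(toy_code(sample).hex())
```

Because the embedding here is only a hash, this toy code will not match across translations. It shows the data flow from chunks to bits, nothing more.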

Development and Contributing

We welcome contributions to enhance the capabilities, efficiency, and compatibility of this proof of concept with the broader ISCC ecosystem. For development, install the project in development mode using Poetry:

git clone https://github.com/iscc/iscc-sct.git
cd iscc-sct
poetry install

If you have suggestions for improvements or bug fixes, please open an issue or pull request. For major changes, please open an issue first to discuss your ideas.

Acknowledgements

License

This project is licensed under the CC-BY-NC-SA-4.0 International License.
