ISCC - Semantic Text-Code
iscc-sct is a proof-of-concept implementation of a semantic Text-Code for the ISCC (International Standard Content Code). Semantic Text-Codes are designed to capture and represent the language-agnostic semantic content of text for improved similarity detection.
[!CAUTION] This is an early proof of concept. All releases with release numbers below v1.0.0 may break backward compatibility and produce incompatible Semantic Text-Codes.
What is ISCC Semantic Text-Code
The ISCC framework already comes with a Text-Code that is based on lexical similarity and can match near duplicates. The ISCC Semantic Text-Code is planned as an additional ISCC-UNIT focused on capturing a more abstract and broad semantic similarity. As such, the Semantic Text-Code is engineered to be robust against a broader range of variations and translations of text that cannot be matched based on lexical similarity.
Features
- Semantic Similarity: Leverages deep learning models to generate codes that reflect the semantic content of text.
- Bit-Length Flexibility: Supports generating codes of various bit lengths (up to 256 bits), allowing for adjustable granularity in similarity detection.
- ISCC Compatible: Generates codes that are fully compatible with the ISCC specification, facilitating integration with existing ISCC-based systems.
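Similarity-preserving codes of this kind are typically compared by Hamming distance: the fewer bits differ, the more similar the content is assumed to be. The following is a minimal sketch, not part of iscc-sct, that assumes the usual ISCC-UNIT layout of a 2-byte header followed by the code body, with the whole unit base32-encoded. The helper names decode_body and hamming_distance are hypothetical. The two code strings are the ones printed in the Usage section of this README.

```python
import base64

HEADER_BYTES = 2  # assumption: a 2-byte ISCC-UNIT header precedes the code body


def decode_body(iscc: str) -> bytes:
    """Return the raw code body of an ISCC-UNIT string (hypothetical helper)."""
    code = iscc[5:] if iscc.startswith("ISCC:") else iscc
    # ISCC strings omit base32 padding, so restore it before decoding.
    padded = code + "=" * (-len(code) % 8)
    raw = base64.b32decode(padded)
    return raw[HEADER_BYTES:]


def hamming_distance(a: bytes, b: bytes) -> int:
    """Count the bit positions in which two equal-length bodies differ."""
    if len(a) != len(b):
        raise ValueError("codes must have the same bit length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


body_a = decode_body("ISCC:CAAVZHGOJH3XUFRF")
body_b = decode_body("ISCC:CAAV3GG6JH3XEVRN")
distance = hamming_distance(body_a, body_b)
print(f"{distance} of {len(body_a) * 8} bits differ")
```

Which distance counts as a match depends on the bit length and the use case. This sketch does not suggest a threshold.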
Installation
Before you can install iscc-sct, you need to have Python 3.8 or newer installed on your system. Install the library as follows:
pip install iscc-sct
If your system has a CUDA-capable GPU, you can improve performance by installing with GPU support:
pip install iscc-sct[gpu]
Usage
To generate a Semantic Text-Code, use the create function.
>>> import iscc_sct as sci
>>> text = "This is some sample text. It can be a longer document or even an entire book."
>>> sci.create(text)
{
  "iscc": "ISCC:CAAVZHGOJH3XUFRF",
  "characters": 89
}
You can also generate granular (per chunk) feature outputs:
>>> import iscc_sct as sci
>>> text = "This is some sample text. It can be a longer document or even an entire book."
>>> sci.create(text, granular=True)
{
  "iscc": "ISCC:CAAV3GG6JH3XEVRN",
  "characters": 77,
  "features": [
    {
      "feature": "LWMN4SPXOJLC2",
      "offset": 0,
      "size": 77,
      "text": "This is some sample text. It can be a longer document or even an entire book."
    }
  ]
}
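The offset and size fields make it possible to map each feature back to the passage it was computed from. The sketch below works on the granular result shown above, copied in as a plain dict so that it runs without iscc-sct installed. It assumes that offset and size are character positions in the input text, which is consistent with the example but not confirmed here. The helper name chunk_spans is hypothetical.

```python
# Granular result copied from the example output above.
result = {
    "iscc": "ISCC:CAAV3GG6JH3XEVRN",
    "characters": 77,
    "features": [
        {
            "feature": "LWMN4SPXOJLC2",
            "offset": 0,
            "size": 77,
            "text": "This is some sample text. It can be a longer document or even an entire book.",
        }
    ],
}

text = "This is some sample text. It can be a longer document or even an entire book."


def chunk_spans(result: dict, text: str):
    """Yield (feature, chunk_text) pairs by slicing text with offset and size."""
    for item in result["features"]:
        start = item["offset"]
        end = start + item["size"]  # assumption: size counts characters
        yield item["feature"], text[start:end]


for feature, chunk in chunk_spans(result, text):
    print(feature, repr(chunk))
```

Slicing the original text this way avoids having to store the chunk text alongside every feature.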
Installation also creates a simple sct command line tool in your Python bin/Scripts folder:
sct --help
usage: sct [-h] [-b BITS] [-g] [-d] [path]

Generate Semantic Text-Codes for text files.

positional arguments:
  path                  Path to text files (supports glob patterns).

options:
  -h, --help            show this help message and exit
  -b BITS, --bits BITS  Bit-Length of Code (default 256)
  -g, --granular        Activate granular processing.
  -d, --debug           Show debugging messages.
How It Works
iscc-sct splits the text into chunks and uses a pre-trained deep learning model for text embedding. For each chunk, the model generates a feature vector that captures the chunk's essential characteristics. These vectors are aggregated and then binarized to produce a Semantic Text-Code that is robust to variations and translations of the text.
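The chunk, embed, aggregate, binarize pipeline can be illustrated with a toy sketch. This is not the iscc-sct implementation: the fixed-size chunking, the hash-based toy_embed function (a stand-in for the real model that carries no semantic meaning), the mean aggregation, and the sign threshold are all assumptions chosen to keep the example self-contained. All function names are hypothetical.

```python
import hashlib
import struct


def split_chunks(text: str, size: int = 40) -> list:
    """Split text into fixed-size character chunks (simplistic stand-in)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def toy_embed(chunk: str, dims: int = 64) -> list:
    """Stand-in for the embedding model: deterministic values in [-1, 1).

    The values are derived from a hash and carry NO semantic meaning.
    """
    values = []
    counter = 0
    while len(values) < dims:
        digest = hashlib.sha256(f"{counter}:{chunk}".encode("utf-8")).digest()
        # Interpret the 32-byte digest as eight unsigned 32-bit integers.
        for (n,) in struct.iter_unpack(">I", digest):
            values.append(n / 2**31 - 1.0)
        counter += 1
    return values[:dims]


def aggregate(vectors: list) -> list:
    """Average the chunk vectors element-wise (assumed aggregation)."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def binarize(vector: list) -> bytes:
    """Map each dimension to one bit: 1 if the value is >= 0, else 0."""
    bits = "".join("1" if v >= 0 else "0" for v in vector)
    return int(bits, 2).to_bytes(len(vector) // 8, "big")


text = "This is some sample text. It can be a longer document or even an entire book."
chunks = split_chunks(text)
code = binarize(aggregate([toy_embed(c) for c in chunks]))
print(code.hex())
```

With a real embedding model in place of toy_embed, texts with similar meaning would yield nearby vectors and therefore codes that differ in only a few bits. The hash-based stand-in has no such property and only demonstrates the data flow.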
Development
This is a proof of concept and welcomes contributions to enhance its capabilities, efficiency, and compatibility with the broader ISCC ecosystem. For development, you'll need to install the project in development mode using Poetry.
git clone https://github.com/iscc/iscc-sct.git
cd iscc-sct
poetry install
Contributing
Contributions are welcome! If you have suggestions for improvements or bug fixes, please open an issue or pull request. For major changes, please open an issue first to discuss what you would like to change.
License
This project is licensed under the CC-BY-NC-SA-4.0 International License.