Indexify Extractor SDK to build new extractors for extraction from unstructured data

Reason this release was yanked:

Bad release

Project description

Indexify Extractor SDK

The Indexify Extractor SDK is for developing new extractors that extract information from any unstructured data source.

A few extractors are already available at https://github.com/tensorlakeai/indexify. If none of them fits your use case, use this SDK to build one.

Install the SDK

Install the SDK from PyPI:

virtualenv ve
source ve/bin/activate
pip install indexify-extractor-sdk
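
To confirm that the install landed in the active virtualenv, a small check like the one below can help. It is a minimal sketch that relies only on the standard library's importlib.metadata; the helper name installed_sdk_version is made up for this example and is not part of the SDK.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_sdk_version(package: str = "indexify-extractor-sdk") -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        # The distribution is not visible to this interpreter.
        return None


found = installed_sdk_version()
print(found if found else "indexify-extractor-sdk is not installed")
```

If this prints that the package is not installed, the virtualenv is probably not activated in the current shell.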

Implement the extractor SDK

There are two ways to implement an extractor. If you don't need setup/teardown or other additional functionality, use the decorator:

import json
from typing import List

from indexify_extractor_sdk import Content, Feature, extractor

@extractor()
def my_extractor(content: Content, params: dict) -> List[Content]:
    return [
        Content.from_text(
            text="Hello World",
            features=[
                Feature.embedding(values=[1, 2, 3]),
                Feature.metadata(json.loads('{"a": 1, "b": "foo"}')),
            ],
            labels={"url": "test.com"},
        ),
        Content.from_text(
            text="Pipe Baz",
            features=[Feature.embedding(values=[1, 2, 3])],
            labels={"url": "test.com"},
        ),
    ]

Note: @extractor() accepts many parameters; see the documentation for details.
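
For readers unfamiliar with parameterized decorators, the general pattern can be sketched without the SDK. Everything below is a stand-in: ToyContent, toy_extractor, and the parameters name and input_mime_types are hypothetical names chosen for illustration, and this is not how the SDK's @extractor() is implemented or what options it accepts.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class ToyContent:
    """Stand-in for the SDK's Content type (hypothetical, illustration only)."""
    text: str
    labels: Dict[str, str] = field(default_factory=dict)


def toy_extractor(name: str = "", input_mime_types: Optional[List[str]] = None):
    """Decorator factory that attaches metadata to a plain extraction function.

    The parameter names are assumptions, not the SDK's actual options.
    """
    def wrap(fn: Callable[[ToyContent, dict], List[ToyContent]]):
        # Record the metadata on the function object and return it unchanged.
        fn.extractor_name = name or fn.__name__
        fn.input_mime_types = input_mime_types or ["text/plain"]
        return fn
    return wrap


@toy_extractor(input_mime_types=["text/plain"])
def shout(content: ToyContent, params: dict) -> List[ToyContent]:
    # Upper-case the input text and carry the labels through.
    return [ToyContent(text=content.text.upper(), labels=dict(content.labels))]
```

The point of the pattern is that the decorated function stays an ordinary callable, so it can be exercised directly in a test before it is wired into anything else.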

For more advanced use cases, subclass Extractor:

import json
from typing import List

from indexify_extractor_sdk import Content, Extractor, Feature
from pydantic import BaseModel

class InputParams(BaseModel):
    pass

class MyExtractor(Extractor):
    input_mime_types = ["text/plain", "application/pdf", "image/jpeg"]

    def __init__(self):
        super().__init__()

    def extract(self, content: Content, params: InputParams) -> List[Content]:
        return [
            Content.from_text(
                text="Hello World",
                features=[
                    Feature.embedding(values=[1, 2, 3]),
                    Feature.metadata(json.loads('{"a": 1, "b": "foo"}')),
                ],
                labels={"url": "test.com"},
            ),
            Content.from_text(
                text="Pipe Baz",
                features=[Feature.embedding(values=[1, 2, 3])],
                labels={"url": "test.com"},
            ),
        ]

    def sample_input(self) -> Content:
        return Content.from_text("hello world")

Test the extractor

You can run the extractor locally with the command-line tool that ships with the SDK, passing it some arbitrary text or a file:

indexify-extractor local my_extractor:MyExtractor --text "hello"
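
The my_extractor:MyExtractor argument follows the common "module:attribute" convention. As a rough sketch of how such a string can be resolved to a class, assuming nothing about the SDK's own loader (which may differ), the hypothetical helper load_object below uses only importlib:

```python
import importlib
from typing import Any


def load_object(spec: str) -> Any:
    """Resolve a "module:attribute" string to the object it names."""
    module_name, sep, attr_name = spec.partition(":")
    if not sep or not module_name or not attr_name:
        raise ValueError(f"expected 'module:attribute', got {spec!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)


# Demonstrated with a standard-library class; for your own extractor the
# spec would name the module file and class you wrote.
ordered_dict_cls = load_object("collections:OrderedDict")
print(ordered_dict_cls.__name__)
```

Under this convention the module must be importable from the directory where the command is run, which is a frequent cause of "module not found" errors.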

Deploy the extractor

Once you are ready to deploy the new extractor and build pipelines with it, package the extractor, deploy as many copies as you want, and point them at the Indexify server. The Indexify server has two addresses: one for sending extraction tasks to your extractor, and another to which your extractor writes the extracted content.

indexify-extractor join-server --coordinator-addr localhost:8950 --ingestion-addr localhost:8900
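
When launching several copies from a script, it can be convenient to build the command as an argument list. The sketch below only assembles and prints the command; the helper name join_server_command is made up here, and the ingestion address localhost:8900 is an assumption based on the port shown above.

```python
import shlex
from typing import List


def join_server_command(coordinator_addr: str, ingestion_addr: str) -> List[str]:
    """Build the argument list for joining an Indexify server."""
    return [
        "indexify-extractor", "join-server",
        "--coordinator-addr", coordinator_addr,
        "--ingestion-addr", ingestion_addr,
    ]


# Assumed addresses: coordinator on 8950, ingestion on 8900, both on localhost.
cmd = join_server_command("localhost:8950", "localhost:8900")
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd) or subprocess.Popen(cmd) once per copy you want to start.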

Download files

Source Distribution

indexify_extractor_sdk-0.0.88.tar.gz (49.4 kB view details)

Uploaded Source

Built Distribution

indexify_extractor_sdk-0.0.88-py3-none-any.whl (61.0 kB view details)

Uploaded Python 3

File details

Details for the file indexify_extractor_sdk-0.0.88.tar.gz.

File hashes

Hashes for indexify_extractor_sdk-0.0.88.tar.gz
Algorithm Hash digest
SHA256 701a610a8d5dd3487ce2f2223f3738e72605cf43ab952368551a68b28d9980c3
MD5 6f4bdee67d5b4607c445ca1678dba704
BLAKE2b-256 d033e145d3a81420cc9084bfd3a6939de21775b9a597010c1f2b2060161691a4

File details

Details for the file indexify_extractor_sdk-0.0.88-py3-none-any.whl.

File hashes

Hashes for indexify_extractor_sdk-0.0.88-py3-none-any.whl
Algorithm Hash digest
SHA256 d3f00857f4c49dc22cf7b73a2234324aa9138e0f49c634df2cc04ce3be74bb8e
MD5 1ff04c840edfd6d47ab7dd75b5c7dfee
BLAKE2b-256 b7aa270be9c2a1893ea1234a181755e82a6cc01f6c83dafe55f6342a2ca1dfb4
