Indexify Extractor SDK to build new extractors for extraction from unstructured data

These details have not been verified by PyPI

Project description

Indexify Extractor SDK

The Indexify Extractor SDK lets you develop new extractors that extract information from any unstructured data source.

A number of extractors are already available at https://github.com/tensorlakeai/indexify. If none of them fits your use case, use this SDK to build one.

Install the SDK

Install the SDK from PyPI:

virtualenv ve
source ve/bin/activate
pip install indexify-extractor-sdk
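
After installing, it can be handy to confirm that the package is visible in the active virtual environment. The following is a minimal sketch using only the standard library; the helper name `installed_version` is made up for this illustration and is not part of the SDK.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # The distribution name matches the one passed to `pip install` above.
    version = installed_version("indexify-extractor-sdk")
    print(version or "indexify-extractor-sdk is not installed in this environment")
```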

Implement the extractor

There are two ways to implement an extractor. If you don't need any setup/teardown or additional functionality, check out the decorator:

import json
from typing import List

from indexify_extractor_sdk import Content, Feature, extractor

@extractor()
def my_extractor(content: Content, params: dict) -> List[Content]:
    return [
        Content.from_text(
            text="Hello World",
            features=[
                Feature.embedding(values=[1, 2, 3]),
                Feature.metadata(json.loads('{"a": 1, "b": "foo"}')),
            ],
            labels={"url": "test.com"},
        ),
        Content.from_text(
            text="Pipe Baz",
            features=[Feature.embedding(values=[1, 2, 3])],
            labels={"url": "test.com"},
        ),
    ]

Note: @extractor() takes many parameters; see the documentation for details.
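
To illustrate the difference between the two styles without depending on the SDK, here is a self-contained sketch built on stand-in types. Everything in it (`SimpleContent`, `SimpleExtractor`, `simple_extractor`) is hypothetical and is not how the SDK implements `@extractor()`; it only shows why a plain function is enough when there is no setup, and why a class helps when there is.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SimpleContent:
    """Hypothetical stand-in for the SDK's Content type (text plus labels only)."""
    text: str
    labels: Dict[str, str] = field(default_factory=dict)


class SimpleExtractor:
    """Hypothetical stand-in for a class-based extractor."""

    def extract(self, content: SimpleContent, params: dict) -> List[SimpleContent]:
        raise NotImplementedError


def simple_extractor() -> Callable:
    """Hypothetical decorator: wraps a plain function in a SimpleExtractor subclass."""

    def wrap(fn: Callable[[SimpleContent, dict], List[SimpleContent]]) -> type:
        class FunctionExtractor(SimpleExtractor):
            def extract(self, content, params):
                # No setup or teardown: just delegate to the wrapped function.
                return fn(content, params)

        FunctionExtractor.__name__ = fn.__name__
        return FunctionExtractor

    return wrap


@simple_extractor()
def upper_case(content: SimpleContent, params: dict) -> List[SimpleContent]:
    # Function style: stateless, everything happens per call.
    return [SimpleContent(text=content.text.upper(), labels=content.labels)]


class PrefixExtractor(SimpleExtractor):
    def __init__(self):
        # Class style: one-time setup (for example loading a model) goes here.
        self.prefix = ">> "

    def extract(self, content, params):
        return [SimpleContent(text=self.prefix + content.text, labels=content.labels)]
```

In this sketch both forms end up exposing the same `extract` method, so a caller can treat them alike; the class form simply gives you a place to keep state between calls.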

For more advanced use cases, check out the class:

import json
from typing import List

from indexify_extractor_sdk import Content, Extractor, Feature
from pydantic import BaseModel

class InputParams(BaseModel):
    pass

class MyExtractor(Extractor):
    input_mime_types = ["text/plain", "application/pdf", "image/jpeg"]

    def __init__(self):
        super().__init__()

    def extract(self, content: Content, params: InputParams) -> List[Content]:
        return [
            Content.from_text(
                text="Hello World",
                features=[
                    Feature.embedding(values=[1, 2, 3]),
                    Feature.metadata(json.loads('{"a": 1, "b": "foo"}')),
                ],
                labels={"url": "test.com"},
            ),
            Content.from_text(
                text="Pipe Baz",
                features=[Feature.embedding(values=[1, 2, 3])],
                labels={"url": "test.com"},
            ),
        ]

    def sample_input(self) -> Content:
        return Content.from_text("hello world")

Test the extractor

You can run the extractor locally with the command-line tool bundled with the SDK, passing it some arbitrary text or a file:

indexify-extractor local my_extractor:MyExtractor --text "hello"
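
If you want to drive the same local run from a script (for example in a smoke test), one option is to call the CLI through `subprocess`. This is a minimal sketch: the helper names `local_command` and `run_local` are made up here, and only the `local` subcommand and `--text` flag shown above are used.

```python
import shlex
import subprocess
from typing import List


def local_command(extractor_path: str, text: str) -> List[str]:
    """Build the argv for `indexify-extractor local <module:Class> --text <text>`."""
    return ["indexify-extractor", "local", extractor_path, "--text", text]


def run_local(extractor_path: str, text: str) -> subprocess.CompletedProcess:
    """Run the extractor locally and capture its output (raises on a non-zero exit)."""
    return subprocess.run(
        local_command(extractor_path, text),
        capture_output=True,
        text=True,
        check=True,
    )


if __name__ == "__main__":
    # Print the command instead of running it, so this works without the SDK installed.
    cmd = local_command("my_extractor:MyExtractor", "hello")
    print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell quoting issues when the input text contains spaces or quotes.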

Deploy the extractor

Once you are ready to deploy the new extractor and build pipelines with it, package the extractor, deploy as many copies as you want, and point them at the Indexify server. The Indexify server exposes two addresses: one it uses to send extraction tasks to your extractor, and another to which your extractor writes the extracted content.

indexify-extractor join-server --coordinator-addr localhost:8950 --ingestion-addr localhost:8900
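
Before joining, it can save some debugging to check that both server addresses accept connections. The sketch below assumes that a plain TCP connection attempt is a meaningful reachability check for these ports, which the source does not state; the helpers `parse_addr` and `is_reachable` are hypothetical and not part of the SDK.

```python
import socket
from typing import Tuple


def parse_addr(addr: str) -> Tuple[str, int]:
    """Split 'host:port' into (host, port); reject addresses missing either part."""
    host, _, port = addr.rpartition(":")
    if not host or not port.isdigit():
        raise ValueError(f"expected host:port, got {addr!r}")
    return host, int(port)


def is_reachable(addr: str, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to addr can be opened within the timeout."""
    try:
        with socket.create_connection(parse_addr(addr), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Addresses taken from the join-server example; adjust them for your deployment.
    for name, addr in [("coordinator", "localhost:8950"), ("ingestion", "localhost:8900")]:
        status = "reachable" if is_reachable(addr) else "unreachable"
        print(f"{name} {addr}: {status}")
```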

Release history

Version	Release date
0.0.92 (this version)	Aug 28, 2024
0.0.91	Aug 24, 2024
0.0.90	Aug 21, 2024
0.0.89	Aug 20, 2024
0.0.88 (yanked; reason: Bad release)	Aug 20, 2024
0.0.87	Jul 24, 2024
0.0.86	Jul 19, 2024
0.0.84	Jul 4, 2024
0.0.83	Jul 1, 2024
0.0.82	Jun 15, 2024
0.0.81	Jun 14, 2024
0.0.80	Jun 9, 2024
0.0.79	Jun 9, 2024
0.0.78	Jun 9, 2024
0.0.77	Jun 9, 2024
0.0.76	Jun 9, 2024
0.0.75	Jun 8, 2024
0.0.74	Jun 5, 2024
0.0.73	Jun 5, 2024
0.0.72	Jun 5, 2024
0.0.71	May 31, 2024
0.0.70	May 31, 2024
0.0.69	May 30, 2024
0.0.66	May 22, 2024
0.0.65	May 21, 2024
0.0.64	May 21, 2024
0.0.63	May 15, 2024
0.0.62	May 14, 2024
0.0.61	May 13, 2024
0.0.60	May 10, 2024
0.0.58	May 7, 2024
0.0.57	Apr 29, 2024
0.0.56	Apr 28, 2024
0.0.55	Apr 28, 2024
0.0.54	Apr 26, 2024
0.0.53	Apr 26, 2024
0.0.52	Apr 26, 2024
0.0.51	Apr 26, 2024
0.0.50	Apr 26, 2024
0.0.49	Apr 26, 2024
0.0.48	Apr 25, 2024
0.0.47	Apr 24, 2024
0.0.46	Apr 22, 2024
0.0.45	Apr 12, 2024
0.0.44	Apr 11, 2024
0.0.43	Apr 10, 2024
0.0.42	Apr 4, 2024
0.0.41	Apr 3, 2024
0.0.40	Apr 2, 2024
0.0.39	Mar 30, 2024
0.0.38	Mar 27, 2024
0.0.37	Mar 26, 2024
0.0.36	Mar 24, 2024
0.0.35	Mar 24, 2024
0.0.34	Mar 23, 2024
0.0.33	Mar 23, 2024
0.0.32	Mar 22, 2024
0.0.31	Mar 22, 2024
0.0.30	Mar 20, 2024
0.0.29	Mar 13, 2024
0.0.28	Mar 12, 2024
0.0.27	Mar 7, 2024
0.0.26	Feb 26, 2024
0.0.25	Feb 21, 2024
0.0.23	Feb 18, 2024
0.0.22	Feb 17, 2024
0.0.21	Feb 16, 2024
0.0.20	Feb 15, 2024
0.0.19	Feb 14, 2024

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

indexify_extractor_sdk-0.0.92.tar.gz (49.4 kB)

Uploaded Aug 28, 2024 Source

Built Distribution

indexify_extractor_sdk-0.0.92-py3-none-any.whl (61.7 kB)

Uploaded Aug 28, 2024 Python 3

File details

Details for the file indexify_extractor_sdk-0.0.92.tar.gz.

File metadata

Download URL: indexify_extractor_sdk-0.0.92.tar.gz
Upload date: Aug 28, 2024
Size: 49.4 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/5.1.0 CPython/3.12.5

File hashes

Hashes for indexify_extractor_sdk-0.0.92.tar.gz
Algorithm	Hash digest
SHA256	`ed569429ecd95902fb77393e6c7f121a4e3ab40a5d018796f7472f9e82ec26b8`
MD5	`7489f4a58cc779e56a6212f7b4b73008`
BLAKE2b-256	`8aeb30938c69108035827cb42444adf5ff4af2edaba7120fd4669764c1f34458`
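
A downloaded file can be checked against the SHA256 digest listed above with the standard library. This is a minimal sketch; the helper names `sha256_of` and `matches` are made up for illustration, and the file is assumed to sit in the current directory.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: Path, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with an expected hex string."""
    return sha256_of(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    sdist = Path("indexify_extractor_sdk-0.0.92.tar.gz")
    # SHA256 digest copied from the table above.
    expected = "ed569429ecd95902fb77393e6c7f121a4e3ab40a5d018796f7472f9e82ec26b8"
    if sdist.exists():
        print("hash OK" if matches(sdist, expected) else "hash MISMATCH")
    else:
        print(f"{sdist} not found; download it first")
```

`pip` can also enforce hashes at install time through its hash-checking mode, which avoids a manual check when installing from a requirements file.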

File details

Details for the file indexify_extractor_sdk-0.0.92-py3-none-any.whl.

File metadata

Download URL: indexify_extractor_sdk-0.0.92-py3-none-any.whl
Upload date: Aug 28, 2024
Size: 61.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/5.1.0 CPython/3.12.5

File hashes

Hashes for indexify_extractor_sdk-0.0.92-py3-none-any.whl
Algorithm	Hash digest
SHA256	`4c72adaa43cd30edae806499cf7c1845b8ea40a90a1165f7d0ead7d1706e42d9`
MD5	`5e4934410f7f166d8f0723050ab44570`
BLAKE2b-256	`20c57ad8f07df37460077fff79c4974331fa88ed8742adfacecc74d4b58e2079`