Indexify Extractor SDK to build new extractors for extraction from unstructured data
Indexify Extractor SDK
The Indexify Extractor SDK is for developing new extractors that extract information from any unstructured data source.
We already have a few extractors here: https://github.com/tensorlakeai/indexify. If you don't find one that works for your use case, use this SDK to build one.
Install the SDK
Install the SDK from PyPI:
virtualenv ve
source ve/bin/activate
pip install indexify-extractor-sdk
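To confirm from Python that the install succeeded, a small check like the following can help. It is a sketch that uses only the standard library; the helper name `installed_version` is hypothetical, and the distribution name is taken from the pip command above.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("indexify-extractor-sdk")
    print(version or "indexify-extractor-sdk is not installed in this environment")
```

Run it inside the activated virtualenv so that it inspects the same environment pip installed into.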
Implement the extractor
Implement the extractor interface:
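import json
from typing import List

# Assumed import path for the SDK classes used below.
from indexify_extractor_sdk import Content, Extractor, Feature

# InputParams is the extractor's parameter type; define or import it
# before this class.


class MyExtractor(Extractor):
    # MIME types this extractor accepts as input.
    input_mime_types = ["text/plain", "application/pdf", "image/jpeg"]

    def __init__(self):
        super().__init__()

    def extract(self, content: Content, params: InputParams) -> List[Content]:
        # Return a list of new content items, each with features and labels.
        return [
            Content.from_text(
                text="Hello World",
                features=[
                    Feature.embedding(values=[1, 2, 3]),
                    Feature.metadata(json.loads('{"a": 1, "b": "foo"}')),
                ],
                labels={"url": "test.com"},
            ),
            Content.from_text(
                text="Pipe Baz",
                features=[Feature.embedding(values=[1, 2, 3])],
                labels={"url": "test.com"},
            ),
        ]

    def sample_input(self) -> Content:
        # Example input used when exercising the extractor.
        return Content.from_text("hello world")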
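The contract above is that `extract` takes one piece of content and returns a list of new content items, each optionally carrying labels and features. The sketch below illustrates that shape without the SDK installed. `SketchContent` and `LineSplitter` are hypothetical stand-ins defined here, not the SDK's `Content` or `Extractor` classes; a real extractor should subclass the SDK types instead.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SketchContent:
    """Hypothetical stand-in for the SDK's Content: text plus labels only."""
    text: str
    labels: Dict[str, str] = field(default_factory=dict)


class LineSplitter:
    """Hypothetical extractor-shaped class: one input, many outputs."""

    def extract(self, content: SketchContent) -> List[SketchContent]:
        # Emit one output per non-empty line, copying the input's labels
        # and recording the line position as an extra label.
        results = []
        for index, line in enumerate(content.text.splitlines()):
            if line.strip():
                labels = dict(content.labels, line=str(index))
                results.append(SketchContent(text=line.strip(), labels=labels))
        return results

    def sample_input(self) -> SketchContent:
        return SketchContent(text="hello\n\nworld", labels={"url": "test.com"})


if __name__ == "__main__":
    extractor = LineSplitter()
    for item in extractor.extract(extractor.sample_input()):
        print(item.text, item.labels)
```

Keeping `sample_input` next to `extract` gives a ready-made input for quick checks like the loop above.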
Test the extractor
You can run the extractor locally with the command-line tool that ships with the SDK, passing it some arbitrary text or a file:
indexify-extractor local my_extractor.py:MyExtractor --text "hello"
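The `my_extractor.py:MyExtractor` argument names a file and a class inside it. How the CLI resolves that reference internally is not described here; the following is only a sketch of one way such a reference could be loaded with the standard library, with `load_class` as a hypothetical helper.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_class(reference: str) -> type:
    """Load a class from a 'path/to/file.py:ClassName' reference."""
    # Split on the last colon so paths that contain colons still work.
    file_part, sep, class_name = reference.rpartition(":")
    if not sep or not file_part or not class_name:
        raise ValueError(f"expected 'file.py:ClassName', got {reference!r}")
    path = Path(file_part)
    spec = importlib.util.spec_from_file_location(path.stem, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load module from {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the file's top-level code
    return getattr(module, class_name)


if __name__ == "__main__":
    # Write a throwaway module to disk, then load a class from it.
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "my_extractor.py"
        source.write_text("class MyExtractor:\n    pass\n")
        print(load_class(f"{source}:MyExtractor").__name__)
```

Because loading executes the file's top-level code, an extractor module is best kept free of side effects at import time.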
Deploy the extractor
Once you are ready to deploy the new extractor and build pipelines with it, package the extractor, deploy as many copies as you want, and point them at the Indexify server. The Indexify server has two addresses: one for sending extraction tasks to your extractor, and another to which your extractor writes the extracted content.
indexify-extractor join-server my_extractor.py:MyExtractor --coordinator-addr localhost:8950 --ingestion-addr localhost:8900
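Running several copies means launching the same command more than once. The sketch below builds that command from its parts and can start multiple workers as child processes. It assumes both flags take a `host:port` value passed as a separate argument; `build_join_command` and `start_workers` are hypothetical helpers, not part of the SDK.

```python
import subprocess
from typing import List


def build_join_command(extractor_ref: str, coordinator_addr: str,
                       ingestion_addr: str) -> List[str]:
    """Build the argument list for one extractor worker."""
    # Reject addresses that are not of the form host:port before launching.
    for name, addr in (("coordinator", coordinator_addr),
                       ("ingestion", ingestion_addr)):
        host, sep, port = addr.rpartition(":")
        if not sep or not host or not port.isdigit():
            raise ValueError(f"{name} address must look like host:port, got {addr!r}")
    return [
        "indexify-extractor", "join-server", extractor_ref,
        "--coordinator-addr", coordinator_addr,
        "--ingestion-addr", ingestion_addr,
    ]


def start_workers(command: List[str], copies: int) -> List[subprocess.Popen]:
    """Start several copies of the same worker command as child processes."""
    return [subprocess.Popen(command) for _ in range(copies)]


if __name__ == "__main__":
    cmd = build_join_command(
        "my_extractor.py:MyExtractor", "localhost:8950", "localhost:8900"
    )
    print(" ".join(cmd))
    # start_workers(cmd, copies=2)  # enable where the CLI and a server are available
```

Validating the addresses up front surfaces a malformed flag value before any process is started.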