Indexify Extractor SDK to build new extractors for extraction from unstructured data
Indexify Extractor SDK
The Indexify Extractor SDK is for developing new extractors that extract information from any unstructured data source.
A few extractors are already available at https://github.com/tensorlakeai/indexify. If none of them fits your use case, use this SDK to build one.
Install the SDK
Install the SDK from PyPI:
virtualenv ve
source ve/bin/activate
pip install indexify-extractor-sdk
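To confirm that the install landed in the active virtualenv, the installed version can be read from Python. This is a minimal sketch using only the standard library; the helper name `installed_version` is illustrative and not part of the SDK.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str = "indexify-extractor-sdk") -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("indexify-extractor-sdk is not installed in this environment")
    else:
        print(f"indexify-extractor-sdk {found} is installed")
```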
Implement the extractor
There are two ways to implement an extractor. If you don't need any setup/teardown or additional functionality, check out the decorator:
import json
from typing import List

from indexify_extractor_sdk import Content, Feature, extractor

@extractor()
def my_extractor(content: Content, params: dict) -> List[Content]:
    # Return two pieces of content, each with its own features and labels.
    return [
        Content.from_text(
            text="Hello World",
            features=[
                Feature.embedding(values=[1, 2, 3]),
                Feature.metadata(json.loads('{"a": 1, "b": "foo"}')),
            ],
            labels={"url": "test.com"},
        ),
        Content.from_text(
            text="Pipe Baz",
            features=[Feature.embedding(values=[1, 2, 3])],
            labels={"url": "test.com"},
        ),
    ]
Note: @extractor() takes many parameters; check out the documentation for more details.
For more advanced use cases, check out the class:
import json
from typing import List

from indexify_extractor_sdk import Content, Extractor, Feature
from pydantic import BaseModel

class InputParams(BaseModel):
    pass

class MyExtractor(Extractor):
    input_mime_types = ["text/plain", "application/pdf", "image/jpeg"]

    def __init__(self):
        super().__init__()

    def extract(self, content: Content, params: InputParams) -> List[Content]:
        # Return two pieces of content, each with its own features and labels.
        return [
            Content.from_text(
                text="Hello World",
                features=[
                    Feature.embedding(values=[1, 2, 3]),
                    Feature.metadata(json.loads('{"a": 1, "b": "foo"}')),
                ],
                labels={"url": "test.com"},
            ),
            Content.from_text(
                text="Pipe Baz",
                features=[Feature.embedding(values=[1, 2, 3])],
                labels={"url": "test.com"},
            ),
        ]

    def sample_input(self) -> Content:
        return Content.from_text("hello world")
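The `extract` method above returns hard-coded content. In a real extractor the output would usually be derived from the input and shaped by the params. As a rough sketch of that idea, the helper below splits text into fixed-size chunks; `chunk_text` and the `chunk_size` parameter are hypothetical names chosen for illustration and are not part of the SDK.

```python
from typing import List


def chunk_text(text: str, chunk_size: int) -> List[str]:
    """Split text into consecutive pieces of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


# Inside extract(), each piece could then be wrapped in a Content object, e.g.:
#     return [Content.from_text(text=piece)
#             for piece in chunk_text(source_text, params.chunk_size)]
# Here source_text and params.chunk_size are assumptions: the first stands for
# the text decoded from the incoming content, the second for a field you would
# add to InputParams.

if __name__ == "__main__":
    print(chunk_text("hello world", 4))
```

Keeping the transformation in a plain function like this makes it testable without the SDK installed.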
Test the extractor
You can run the extractor locally with the command-line tool that ships with the SDK, passing it some arbitrary text or a file:
indexify-extractor local my_extractor:MyExtractor --text "hello"
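If you want to drive the same local run from a script (for example in a smoke test), the command can be assembled and invoked with `subprocess`. This is a sketch only: the helper names are made up for illustration, it uses just the `local` subcommand and `--text` flag shown above, and it assumes `indexify-extractor` is on your `PATH`.

```python
import shlex
import subprocess
from typing import List


def build_local_command(extractor: str, text: str) -> List[str]:
    """Build the argument list for a local run of `module:ClassName` on some text."""
    return ["indexify-extractor", "local", extractor, "--text", text]


def run_local(extractor: str, text: str) -> subprocess.CompletedProcess:
    """Run the extractor locally and capture its output (requires the CLI to be installed)."""
    return subprocess.run(
        build_local_command(extractor, text),
        capture_output=True,
        text=True,
        check=False,
    )


if __name__ == "__main__":
    # Only print the command here; call run_local() once the SDK is installed.
    print(shlex.join(build_local_command("my_extractor:MyExtractor", "hello")))
```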
Deploy the extractor
Once you are ready to deploy the new extractor and build pipelines with it, package the extractor, deploy as many copies as you want, and point them at the Indexify server. The Indexify server has two addresses: one for sending extraction tasks to your extractor, and another for your extractor to write the extracted content to.
indexify-extractor join-server --coordinator-addr localhost:8950 --ingestion-addr localhost:8900
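Running several copies can be scripted by starting the same `join-server` command more than once. The sketch below is one possible way to do that on a single machine; the helper names are hypothetical, the flags are the two shown above (assumed to take a `host:port` value each), and it assumes the command is run from a directory where the packaged extractor is available.

```python
import shlex
import subprocess
from typing import List


def build_join_command(coordinator_addr: str, ingestion_addr: str) -> List[str]:
    """Build the argument list that joins one extractor copy to the server."""
    return [
        "indexify-extractor",
        "join-server",
        "--coordinator-addr",
        coordinator_addr,
        "--ingestion-addr",
        ingestion_addr,
    ]


def launch_copies(
    count: int, coordinator_addr: str, ingestion_addr: str
) -> List[subprocess.Popen]:
    """Start `count` extractor processes that all point at the same server."""
    command = build_join_command(coordinator_addr, ingestion_addr)
    return [subprocess.Popen(command) for _ in range(count)]


if __name__ == "__main__":
    # Only print the command here; call launch_copies() against a running server.
    print(shlex.join(build_join_command("localhost:8950", "localhost:8900")))
```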