The client library for Aryn services.

aryn-sdk is a simple client library for interacting with Aryn DocParse.

Partition (Parse) files

Partition PDF files with Aryn DocParse through aryn-sdk:

from aryn_sdk.partition import partition_file

with open("partition-me.pdf", "rb") as f:
    data = partition_file(
        f,
        use_ocr=True,
        extract_table_structure=True,
        extract_images=True
    )
elements = data['elements']
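
The examples in this README treat `data['elements']` as a list of dictionaries, each with a `type` key. As a minimal sketch under that assumption, the elements can be grouped by type before further processing. The sample payload below is hand-written for illustration, and `group_by_type` is a hypothetical helper, not part of aryn-sdk.

```python
from collections import defaultdict


def group_by_type(data):
    """Group partitioned elements by their 'type' field."""
    groups = defaultdict(list)
    for element in data.get("elements", []):
        groups[element.get("type", "unknown")].append(element)
    return dict(groups)


# Hand-written stand-in for a partition_file() response (illustrative only).
sample = {
    "elements": [
        {"type": "table"},
        {"type": "Image"},
        {"type": "table"},
    ]
}

groups = group_by_type(sample)
counts = {kind: len(items) for kind, items in groups.items()}
print(counts)
```

With a real response, pass `data` in place of `sample`.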

Convert a partitioned table element to a pandas dataframe for easier use:

from aryn_sdk.partition import partition_file, table_elem_to_dataframe

with open("partition-me.pdf", "rb") as f:
    data = partition_file(
        f,
        use_ocr=True,
        extract_table_structure=True,
        extract_images=True
    )

# Find the first table and convert it to a dataframe
df = None
for element in data['elements']:
    if element['type'] == 'table':
        df = table_elem_to_dataframe(element)
        break

Or convert all partitioned tables to pandas dataframes in one shot:

from aryn_sdk.partition import partition_file, tables_to_pandas

with open("partition-me.pdf", "rb") as f:
    data = partition_file(
        f,
        use_ocr=True,
        extract_table_structure=True,
        extract_images=True
    )
elements_and_tables = tables_to_pandas(data)
dataframes = [table for (element, table) in elements_and_tables if table is not None]

Visualize partitioned documents by drawing the bounding boxes on each page:

from aryn_sdk.partition import partition_file, draw_with_boxes

with open("partition-me.pdf", "rb") as f:
    data = partition_file(
        f,
        use_ocr=True,
        extract_table_structure=True,
        extract_images=True
    )
page_pics = draw_with_boxes("partition-me.pdf", data, draw_table_cells=True)

from IPython.display import display
display(page_pics[0])

Note: Visualizing documents requires poppler, a PDF processing library, to be installed. See the poppler installation instructions for your platform.

Convert image elements to more useful types, such as PIL images or byte strings in a specific image format:

from aryn_sdk.partition import partition_file, convert_image_element

with open("my-favorite-pdf.pdf", "rb") as f:
    data = partition_file(
        f,
        extract_images=True
    )
image_elts = [e for e in data['elements'] if e['type'] == 'Image']

pil_img = convert_image_element(image_elts[0])
jpg_bytes = convert_image_element(image_elts[1], format='JPEG')
png_str = convert_image_element(image_elts[2], format="PNG", b64encode=True)

Document storage

The DocParse storage APIs provide a simple interface to interact with documents processed and stored by DocParse.

DocSets

The DocSet APIs allow you to create, list, and delete DocSets to store your documents in.

from aryn_sdk.client.client import Client

client = Client()

# Create a new DocSet and get the ID.
new_docset = client.create_docset(name="My DocSet")
docset_id = new_docset.value.docset_id

# Retrieve a specific DocSet by ID.
docset = client.get_docset(docset_id=docset_id).value

# List all of the DocSets in your account.
docsets = client.list_docsets().get_all()

# Delete the DocSet you created
client.delete_docset(docset_id=docset_id)

Documents

The document APIs let you interact with individual documents, including the ability to retrieve the original file.

from aryn_sdk.client.client import Client

client = Client()

# Iterate through the documents in a single DocSet
docset_id = None # my docset id
paginator = client.list_docs(docset_id = docset_id)
for doc in paginator:
    print(f"Doc {doc.name} has id {doc.doc_id}")

# Get a single document
doc_id = None # my doc id
doc = client.get_doc(docset_id=docset_id, doc_id=doc_id).value

# Get the original pdf of a document and write to a file.
with open("/path/to/outfile", "wb") as out:
    client.get_doc_binary(docset_id=docset_id, doc_id=doc_id, file=out)

# Delete a document by id.
client.delete_doc(docset_id=docset_id, doc_id=doc_id)

Query

You can run vector and keyword search queries on the documents stored in DocParse storage.

from aryn_sdk.client.client import Client
from aryn_sdk.types.search import SearchRequest  # import path assumed; adjust to your SDK version

client = Client()
docset_id = None # my docset id

# Search by query
search_request = SearchRequest(query="test_query")
results = client.search(docset_id=docset_id, query=search_request)

# Search by filter
filter_request = SearchRequest(query="test_filter_query", properties_filter="(properties.entity.name='test')")
results = client.search(docset_id=docset_id, query=filter_request)

Extract additional properties (metadata) from your documents

You can use LLMs to extract additional metadata from your documents in DocParse storage. These are stored as properties, and are extracted from every document in your DocSet.

from aryn_sdk.client.client import Client
from aryn_sdk.types.schema import Schema, SchemaField

client = Client()
docset_id = None # my docset id
schema_field = SchemaField(name="name", field_type="string")
schema = Schema(fields=[schema_field])

# Extract properties
client.extract_properties(docset_id=docset_id, schema=schema)

# Delete extracted properties
client.delete_properties(docset_id=docset_id, schema=schema)

Async APIs

Partitioning - Single Task Example

import time
from aryn_sdk.partition import partition_file_async_submit, partition_file_async_result

with open("my-favorite-pdf.pdf", "rb") as f:
    response = partition_file_async_submit(
        f,
        use_ocr=True,
        extract_table_structure=True,
    )

task_id = response["task_id"]

# Poll for the results
while True:
    result = partition_file_async_result(task_id)
    if result["task_status"] != "pending":
        break
    time.sleep(5)

Optionally, you can also set a webhook for Aryn to call when your task is completed:

partition_file_async_submit("path/to/my/file.docx", webhook_url="https://example.com/alert")

Aryn will POST a request containing a body like the below:

{"done": [{"task_id": "aryn:t-47gpd3604e5tz79z1jro5fc"}]}
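
A webhook receiver needs to pull the task IDs out of that body before fetching results. The sketch below assumes the body has exactly the shape shown above; `completed_task_ids` is a hypothetical helper, not part of aryn-sdk.

```python
import json


def completed_task_ids(body):
    """Return the task IDs listed under 'done' in a webhook request body."""
    payload = json.loads(body)
    return [entry["task_id"] for entry in payload.get("done", []) if "task_id" in entry]


# Example body copied from the README.
body = '{"done": [{"task_id": "aryn:t-47gpd3604e5tz79z1jro5fc"}]}'
ids = completed_task_ids(body)
print(ids)
```

Each returned ID can then be passed to `partition_file_async_result`.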

Multi-Task Example

import logging
import time
from aryn_sdk.partition import partition_file_async_submit, partition_file_async_result

files = [open("file1.pdf", "rb"), open("file2.docx", "rb")]
task_ids = [None] * len(files)
for i, f in enumerate(files):
    try:
        task_ids[i] = partition_file_async_submit(f)["task_id"]
    except Exception as e:
        logging.warning(f"Failed to submit {f}: {e}")

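results = [None] * len(files)
for i, task_id in enumerate(task_ids):
    if task_id is None:
        continue  # submission failed above; nothing to poll
    while True:
        result = partition_file_async_result(task_id)
        if result["task_status"] != "pending":
            break
        time.sleep(5)
    results[i] = result

# Close the files once all tasks have been collected
for f in files:
    f.close()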

Cancelling an async task

from aryn_sdk.partition import partition_file_async_submit, partition_file_async_cancel

task_id = partition_file_async_submit(
    "path/to/file.pdf",
    use_ocr=True,
    extract_table_structure=True,
    extract_images=True,
)["task_id"]

partition_file_async_cancel(task_id)

List pending tasks

from aryn_sdk.partition import partition_file_async_list
partition_file_async_list()

Async Properties (Extract and Delete) example

from aryn_sdk.client.client import Client
from aryn_sdk.types.schema import Schema, SchemaField

client = Client()
docset_id = None # my docset id

# Run extract_properties and delete_properties asynchronously
schema_field = SchemaField(name="name", field_type="string")
schema = Schema(fields=[schema_field])
client.extract_properties_async(docset_id=docset_id, schema=schema) # async implementation
client.delete_properties_async(docset_id=docset_id, schema=schema) # async implementation

# Check the status and get the task result
task_id = None # my task id
async_result = client.get_async_result(task=task_id)

# List all outstanding async tasks.
client.list_async_tasks()
