The client library for Aryn services

Reason this release was yanked:

Hardcoded self-reported version number was 0.1.11

Project description

aryn-sdk is a simple client library for interacting with Aryn cloud services.

Aryn DocParse

Partition PDF files with Aryn DocParse through aryn-sdk:

from aryn_sdk.partition import partition_file

with open("partition-me.pdf", "rb") as f:
    data = partition_file(
        f,
        use_ocr=True,
        extract_table_structure=True,
        extract_images=True
    )
elements = data['elements']
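
Once `elements` is available, a quick tally by type can help decide what to process next. The sketch below assumes only what the examples in this README show, namely that each element is a dict with a 'type' key. The `count_element_types` helper and the sample list are hypothetical and stand in for a real `data['elements']`.

```python
from collections import Counter


def count_element_types(elements):
    """Return a Counter mapping element type to number of occurrences.

    Hypothetical helper, not part of aryn-sdk. Elements without a
    'type' key are counted under "unknown".
    """
    return Counter(e.get("type", "unknown") for e in elements)


# Made-up stand-in for data['elements']; real output carries more fields.
sample_elements = [
    {"type": "Text"},
    {"type": "table"},
    {"type": "Image"},
    {"type": "Text"},
]

type_counts = count_element_types(sample_elements)
print(type_counts.most_common())
```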

Convert a partitioned table element to a pandas dataframe for easier use:

from aryn_sdk.partition import partition_file, table_elem_to_dataframe

with open("partition-me.pdf", "rb") as f:
    data = partition_file(
        f,
        use_ocr=True,
        extract_table_structure=True,
        extract_images=True
    )

# Find the first table and convert it to a dataframe
df = None
for element in data['elements']:
    if element['type'] == 'table':
        df = table_elem_to_dataframe(element)
        break

Or convert all partitioned tables to pandas dataframes in one shot:

from aryn_sdk.partition import partition_file, tables_to_pandas

with open("partition-me.pdf", "rb") as f:
    data = partition_file(
        f,
        use_ocr=True,
        extract_table_structure=True,
        extract_images=True
    )
elements_and_tables = tables_to_pandas(data)
dataframes = [table for (element, table) in elements_and_tables if table is not None]

Visualize partitioned documents by drawing the detected bounding boxes on each page:

from aryn_sdk.partition import partition_file, draw_with_boxes

with open("partition-me.pdf", "rb") as f:
    data = partition_file(
        f,
        use_ocr=True,
        extract_table_structure=True,
        extract_images=True
    )
page_pics = draw_with_boxes("partition-me.pdf", data, draw_table_cells=True)

from IPython.display import display
display(page_pics[0])

Note: visualizing documents requires poppler, a PDF rendering library, to be installed on your system. Follow the poppler installation instructions for your platform.

Convert image elements to more useful types, such as PIL images or byte strings encoded in a specific image format:

from aryn_sdk.partition import partition_file, convert_image_element

with open("my-favorite-pdf.pdf", "rb") as f:
    data = partition_file(
        f,
        extract_images=True
    )
image_elts = [e for e in data['elements'] if e['type'] == 'Image']

pil_img = convert_image_element(image_elts[0])
jpg_bytes = convert_image_element(image_elts[1], format='JPEG')
png_str = convert_image_element(image_elts[2], format="PNG", b64encode=True)
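
If you need raw bytes again from a base64 string such as `png_str`, the standard library can decode it. This is a minimal sketch that assumes the `b64encode=True` result is a plain base64 string of the encoded image bytes, which this README does not confirm. Both helper names are hypothetical, and the sample data is a fake payload rather than SDK output.

```python
import base64

# The fixed 8-byte signature that starts every PNG file.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def b64_image_to_bytes(b64_str):
    """Decode a base64 string back into raw image bytes (hypothetical helper)."""
    return base64.b64decode(b64_str)


def looks_like_png(raw):
    """Cheap sanity check: does the byte string start with the PNG signature?"""
    return raw.startswith(PNG_SIGNATURE)


# Fake payload standing in for real image bytes.
fake_png = PNG_SIGNATURE + b"not-a-real-image"
encoded = base64.b64encode(fake_png).decode("ascii")

decoded = b64_image_to_bytes(encoded)
print(looks_like_png(decoded))
```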

Async Aryn DocParse

Single Job Example

import time
from aryn_sdk.partition import partition_file_async_submit, partition_file_async_result

with open("my-favorite-pdf.pdf", "rb") as f:
    response = partition_file_async_submit(
        f,
        use_ocr=True,
        extract_table_structure=True,
    )

job_id = response["job_id"]

# Poll for the results
while True:
    result = partition_file_async_result(job_id)
    if result["status"] != "pending":
        break
    time.sleep(5)

Optionally, you can also set a webhook for Aryn to call when your job is completed:

partition_file_async_submit("path/to/my/file.docx", webhook_url="https://example.com/alert")

Aryn will POST a request containing a body like the below:

{"done": [{"job_id": "aryn:j-47gpd3604e5tz79z1jro5fc"}]}
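
On the receiving side, a webhook handler needs to pull the job IDs out of that body. The sketch below parses only the shape shown above. The `extract_done_job_ids` name is hypothetical, and any fields beyond "done" and "job_id" are unknown here, so the code ignores them.

```python
import json


def extract_done_job_ids(body):
    """Return the job IDs listed under "done" in a webhook request body.

    Hypothetical helper. Accepts the raw body as str or bytes and
    tolerates a missing "done" key or entries without "job_id".
    """
    payload = json.loads(body)
    return [entry["job_id"] for entry in payload.get("done", []) if "job_id" in entry]


example_body = '{"done": [{"job_id": "aryn:j-47gpd3604e5tz79z1jro5fc"}]}'
done_ids = extract_done_job_ids(example_body)
print(done_ids)
```

Each returned ID can then be passed to partition_file_async_result to fetch the finished job.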

Multi-Job Example

import logging
import time
from aryn_sdk.partition import partition_file_async_submit, partition_file_async_result

paths = ["file1.pdf", "file2.docx"]
job_ids = [None] * len(paths)
for i, path in enumerate(paths):
    try:
        # Open each file only for the duration of its submission
        with open(path, "rb") as f:
            job_ids[i] = partition_file_async_submit(f)["job_id"]
    except Exception as e:
        logging.warning(f"Failed to submit {path}: {e}")

results = [None] * len(paths)
for i, job_id in enumerate(job_ids):
    if job_id is None:
        continue  # Submission failed, so there is nothing to poll
    while True:
        result = partition_file_async_result(job_id)
        if result["status"] != "pending":
            break
        time.sleep(5)
    results[i] = result
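
The polling loops above wait indefinitely if a job never leaves the "pending" state. One way to bound the wait is a small helper with a timeout, sketched below. `poll_until_done` and its parameters are hypothetical and not part of aryn-sdk. The fetch function is injected so the sketch runs without network access, and the "done" status in the fake is made up, since this README only shows that a finished job has a status other than "pending".

```python
import time


def poll_until_done(fetch_result, job_id, interval=5.0, timeout=600.0, sleep=time.sleep):
    """Call fetch_result(job_id) until its status is no longer "pending".

    Raises TimeoutError if the job is still pending once `timeout`
    seconds have elapsed. Hypothetical helper, not an SDK function.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = fetch_result(job_id)
        if result["status"] != "pending":
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job {job_id} still pending after {timeout} seconds")
        sleep(interval)


def make_fake_fetch(pending_polls):
    """Build a stand-in fetch function that reports "pending" a fixed number of times."""
    calls = {"n": 0}

    def fetch(job_id):
        calls["n"] += 1
        status = "pending" if calls["n"] <= pending_polls else "done"
        return {"status": status, "job_id": job_id}

    return fetch


final = poll_until_done(make_fake_fetch(2), "job-123", interval=0.0)
print(final["status"])
```

With the real SDK, passing partition_file_async_result as `fetch_result` should work in the same way, assuming it returns a dict with a "status" key as in the examples above.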

Cancelling an async job

from aryn_sdk.partition import partition_file_async_submit, partition_file_async_cancel

job_id = partition_file_async_submit(
    "path/to/file.pdf",
    use_ocr=True,
    extract_table_structure=True,
    extract_images=True,
)["job_id"]

partition_file_async_cancel(job_id)

List pending jobs

from aryn_sdk.partition import partition_file_async_list
partition_file_async_list()


Download files

Source Distribution

aryn_sdk-0.1.12.tar.gz (1.3 MB view details)

Uploaded Source

Built Distribution

aryn_sdk-0.1.12-py3-none-any.whl (1.3 MB view details)

Uploaded Python 3

File details

Details for the file aryn_sdk-0.1.12.tar.gz.

File metadata

  • Download URL: aryn_sdk-0.1.12.tar.gz
  • Size: 1.3 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for aryn_sdk-0.1.12.tar.gz
Algorithm Hash digest
SHA256 8c83a50b72a1c772936f4259d86020db154a6d845530c3dd90d3c956b22fbd87
MD5 26a585559bd276e00c77cbdb6d09829f
BLAKE2b-256 4a5c780e7de5b916a77ebff54a462add31ae74bc0479e9edb8625e62b37c18a9

Provenance

The following attestation bundles were made for aryn_sdk-0.1.12.tar.gz:

Publisher: aryn-sdk_release.yml on aryn-ai/sycamore

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file aryn_sdk-0.1.12-py3-none-any.whl.

File metadata

  • Download URL: aryn_sdk-0.1.12-py3-none-any.whl
  • Size: 1.3 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for aryn_sdk-0.1.12-py3-none-any.whl
Algorithm Hash digest
SHA256 8a53a780ef4244c36993109906b4a240bb92ff78c2722dcbebe7f80baf7c225e
MD5 5de4d75ba616650516ff88c51e529998
BLAKE2b-256 36348efcd4c5ce6ff9734b5737399febf96cadbcde3d411e5c5f3f3086c29a6a

Provenance

The following attestation bundles were made for aryn_sdk-0.1.12-py3-none-any.whl:

Publisher: aryn-sdk_release.yml on aryn-ai/sycamore

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
