The client library for Aryn services.

aryn-sdk is a simple client library for interacting with Aryn DocParse.

Partition (Parse) files

Partition PDF files with Aryn DocParse through aryn-sdk:

from aryn_sdk.partition import partition_file

with open("partition-me.pdf", "rb") as f:
    data = partition_file(
        f,
        use_ocr=True,
        extract_table_structure=True,
        extract_images=True
    )
elements = data['elements']
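The examples in this README treat `data['elements']` as a list of dictionaries with a `type` key. A minimal sketch of summarizing a result by element type follows. It uses a hand-written stand-in for `data` instead of a real DocParse response, and the helper names `count_element_types` and `first_element_of_type` are hypothetical, not part of aryn-sdk:

```python
from collections import Counter

# Hand-written stand-in for a partition_file() result. Real responses
# carry more keys per element than the single "type" key shown here.
sample_data = {
    "elements": [
        {"type": "Image"},
        {"type": "table"},
        {"type": "table"},
    ]
}


def count_element_types(data):
    """Return a Counter mapping each element type to its number of occurrences."""
    return Counter(element["type"] for element in data["elements"])


def first_element_of_type(data, element_type):
    """Return the first element whose type matches, or None if there is none."""
    return next(
        (e for e in data["elements"] if e["type"] == element_type), None
    )


type_counts = count_element_types(sample_data)
first_table = first_element_of_type(sample_data, "table")
```

Returning `None` when nothing matches lets callers check for a missing table or image before indexing into the list.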

Convert a partitioned table element to a pandas dataframe for easier use:

from aryn_sdk.partition import partition_file, table_elem_to_dataframe

with open("partition-me.pdf", "rb") as f:
    data = partition_file(
        f,
        use_ocr=True,
        extract_table_structure=True,
        extract_images=True
    )

# Find the first table and convert it to a dataframe
df = None
for element in data['elements']:
    if element['type'] == 'table':
        df = table_elem_to_dataframe(element)
        break

Or convert all partitioned tables to pandas dataframes in one shot:

from aryn_sdk.partition import partition_file, tables_to_pandas

with open("partition-me.pdf", "rb") as f:
    data = partition_file(
        f,
        use_ocr=True,
        extract_table_structure=True,
        extract_images=True
    )
elements_and_tables = tables_to_pandas(data)
dataframes = [table for (element, table) in elements_and_tables if table is not None]

Visualize partitioned documents by drawing the element bounding boxes on each page:

from aryn_sdk.partition import partition_file, draw_with_boxes

with open("partition-me.pdf", "rb") as f:
    data = partition_file(
        f,
        use_ocr=True,
        extract_table_structure=True,
        extract_images=True
    )
page_pics = draw_with_boxes("partition-me.pdf", data, draw_table_cells=True)

from IPython.display import display
display(page_pics[0])

Note: visualizing documents requires poppler, a PDF processing library, to be installed. See the poppler installation instructions for your platform.
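Because visualization depends on poppler, it can help to check for it before calling `draw_with_boxes`. The sketch below assumes poppler is located through its command-line utilities on `PATH` and looks for `pdftoppm`, one of the tools poppler ships. How aryn-sdk actually locates poppler is not stated here, so treat this as a rough pre-flight check; `poppler_available` is a hypothetical helper name:

```python
import shutil


def poppler_available():
    """Return True if poppler's pdftoppm utility is found on PATH."""
    # Assumption: a poppler install exposes pdftoppm on PATH.
    return shutil.which("pdftoppm") is not None


if not poppler_available():
    print("poppler does not appear to be installed; draw_with_boxes may fail")
```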

Convert image elements to more useful types, such as PIL images or byte strings in a specific image format:

from aryn_sdk.partition import partition_file, convert_image_element

with open("my-favorite-pdf.pdf", "rb") as f:
    data = partition_file(
        f,
        extract_images=True
    )
image_elts = [e for e in data['elements'] if e['type'] == 'Image']

pil_img = convert_image_element(image_elts[0])
jpg_bytes = convert_image_element(image_elts[1], format='JPEG')
png_str = convert_image_element(image_elts[2], format="PNG", b64encode=True)

Document storage

The DocParse storage APIs provide a simple interface to interact with documents processed and stored by DocParse.

DocSets

The DocSet APIs allow you to create, list, and delete DocSets to store your documents in.

from aryn_sdk.client.client import Client

client = Client()

# Create a new DocSet and get the ID.
new_docset = client.create_docset(name="My DocSet")
docset_id = new_docset.value.docset_id

# Retrieve a specific DocSet by ID.
docset = client.get_docset(docset_id=docset_id).value

# List all of the DocSets in your account.
docsets = client.list_docsets().get_all()

# Delete the DocSet you created
client.delete_docset(docset_id=docset_id)

Documents

The document APIs let you interact with individual documents, including the ability to retrieve the original file.

from aryn_sdk.client.client import Client

client = Client()

# Iterate through the documents in a single DocSet
docset_id = None # my docset id
paginator = client.list_docs(docset_id=docset_id)
for doc in paginator:
    print(f"Doc {doc.name} has id {doc.doc_id}")

# Get a single document
doc_id = None # my doc id
doc = client.get_doc(docset_id=docset_id, doc_id=doc_id).value

# Get the original PDF of a document and write it to a file.
with open("/path/to/outfile", "wb") as out:
    client.get_doc_binary(docset_id=docset_id, doc_id=doc_id, file=out)

# Delete a document by id.
client.delete_doc(docset_id=docset_id, doc_id=doc_id)

Query

You can run vector and keyword search queries on the documents stored in DocParse storage.

from aryn_sdk.client.client import Client
from aryn_sdk.types.search import SearchRequest  # assumed import path; not shown in the original

client = Client()
docset_id = None # my docset id

# Search by query
# (passing the SearchRequest as `query` is an assumption; the original built it but never used it)
search_request = SearchRequest(query="test_query")
results = client.search(docset_id=docset_id, query=search_request)

# Search by filter
filter_request = SearchRequest(query="test_filter_query", properties_filter="(properties.entity.name='test')")
results = client.search(docset_id=docset_id, query=filter_request)

Extract additional properties (metadata) from your documents

You can use LLMs to extract additional metadata from your documents in DocParse storage. These are stored as properties, and are extracted from every document in your DocSet.

from aryn_sdk.client.client import Client
from aryn_sdk.types.schema import Schema, SchemaField

client = Client()
docset_id = None # my docset id
schema_field = SchemaField(name="name", field_type="string")
schema = Schema(fields=[schema_field])

# Extract properties
client.extract_properties(docset_id=docset_id, schema=schema)

# Delete extracted properties
client.delete_properties(docset_id=docset_id, schema=schema)

Async APIs

Partitioning - Single Task Example

import time
from aryn_sdk.partition import partition_file_async_submit, partition_file_async_result

with open("my-favorite-pdf.pdf", "rb") as f:
    response = partition_file_async_submit(
        f,
        use_ocr=True,
        extract_table_structure=True,
    )

task_id = response["task_id"]

# Poll for the results
while True:
    result = partition_file_async_result(task_id)
    if result["task_status"] != "pending":
        break
    time.sleep(5)
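The loop above polls forever if a task never leaves the pending state. A minimal sketch of the same polling pattern with a timeout follows. `wait_for_task` is a hypothetical helper, not part of aryn-sdk. It takes the result function as a parameter so it can be exercised without network access, and the `"done"` status in the stand-in is a placeholder for any non-pending value:

```python
import time


def wait_for_task(fetch_result, task_id, poll_interval=5.0, timeout=600.0,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_result(task_id) until the task is no longer "pending".

    fetch_result is any callable shaped like partition_file_async_result.
    Raises TimeoutError if the task is still pending after `timeout` seconds.
    """
    deadline = clock() + timeout
    while True:
        result = fetch_result(task_id)
        if result["task_status"] != "pending":
            return result
        if clock() >= deadline:
            raise TimeoutError(f"Task {task_id} still pending after {timeout}s")
        sleep(poll_interval)


# Stand-in for partition_file_async_result: pending twice, then finished.
_responses = iter([
    {"task_status": "pending"},
    {"task_status": "pending"},
    {"task_status": "done"},  # placeholder status value
])


def fake_fetch(task_id):
    return next(_responses)


# Skip real sleeping in this demonstration.
final = wait_for_task(fake_fetch, "example-task", sleep=lambda seconds: None)
```

With the real SDK, the call would presumably look like `wait_for_task(partition_file_async_result, task_id)`.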

Optionally, you can also set a webhook for Aryn to call when your task is completed:

partition_file_async_submit("path/to/my/file.docx", webhook_url="https://example.com/alert")

Aryn will POST a request with a body like the following:

{"done": [{"task_id": "aryn:t-47gpd3604e5tz79z1jro5fc"}]}
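On the receiving side, a webhook handler needs to pull the task IDs out of that body. A minimal sketch, assuming the body always has the shape shown above (a `done` list of objects with a `task_id` key); `completed_task_ids` is a hypothetical helper name:

```python
import json


def completed_task_ids(body):
    """Extract task IDs from a webhook body shaped like the example above."""
    payload = json.loads(body)
    # Fall back to an empty list if the "done" key is absent.
    return [entry["task_id"] for entry in payload.get("done", [])]


example_body = '{"done": [{"task_id": "aryn:t-47gpd3604e5tz79z1jro5fc"}]}'
task_ids = completed_task_ids(example_body)
```

Each returned ID can then be passed to `partition_file_async_result` to fetch the finished result.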

Multi-Task Example

import logging
import time
from aryn_sdk.partition import partition_file_async_submit, partition_file_async_result

files = [open("file1.pdf", "rb"), open("file2.docx", "rb")]
task_ids = [None] * len(files)
for i, f in enumerate(files):
    try:
        task_ids[i] = partition_file_async_submit(f)["task_id"]
    except Exception as e:
        logging.warning(f"Failed to submit {f.name}: {e}")

results = [None] * len(files)
for i, task_id in enumerate(task_ids):
    if task_id is None:
        continue  # submission failed, so there is nothing to poll
    while True:
        result = partition_file_async_result(task_id)
        if result["task_status"] != "pending":
            break
        time.sleep(5)
    results[i] = result
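The loop above waits for each task in turn, so one slow task delays collecting the others. A sketch of an alternative that checks every outstanding task once per round follows. `wait_for_all` is a hypothetical helper, not part of aryn-sdk, and the stand-in fetch function and its `"done"` status are placeholders so the example runs offline:

```python
import time


def wait_for_all(fetch_result, task_ids, poll_interval=5.0, sleep=time.sleep):
    """Poll every pending task once per round until none remain pending.

    fetch_result is any callable shaped like partition_file_async_result.
    Returns a dict mapping task_id to its final result; None IDs are skipped.
    """
    results = {}
    pending = [t for t in task_ids if t is not None]
    while pending:
        still_pending = []
        for task_id in pending:
            result = fetch_result(task_id)
            if result["task_status"] == "pending":
                still_pending.append(task_id)
            else:
                results[task_id] = result
        pending = still_pending
        if pending:
            sleep(poll_interval)
    return results


# Stand-in: "task-a" reports pending twice, "task-b" finishes immediately.
_remaining_polls = {"task-a": 2, "task-b": 0}


def fake_fetch_many(task_id):
    if _remaining_polls[task_id] > 0:
        _remaining_polls[task_id] -= 1
        return {"task_status": "pending"}
    return {"task_status": "done", "task_id": task_id}  # placeholder status


all_results = wait_for_all(
    fake_fetch_many, ["task-a", None, "task-b"], sleep=lambda seconds: None
)
```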

Cancelling an async task

from aryn_sdk.partition import partition_file_async_submit, partition_file_async_cancel

task_id = partition_file_async_submit(
    "path/to/file.pdf",
    use_ocr=True,
    extract_table_structure=True,
    extract_images=True,
)["task_id"]

partition_file_async_cancel(task_id)

List pending tasks

from aryn_sdk.partition import partition_file_async_list
partition_file_async_list()

Async Properties (Extract and Delete) example

from aryn_sdk.client.client import Client
from aryn_sdk.types.schema import Schema, SchemaField

client = Client()
docset_id = None # my docset id

# Run extract_properties and delete_properties asynchronously
schema_field = SchemaField(name="name", field_type="string")
schema = Schema(fields=[schema_field])
client.extract_properties_async(docset_id=docset_id, schema=schema) # async implementation
client.delete_properties_async(docset_id=docset_id, schema=schema) # async implementation

# Check the status and get the task result
task_id = None # my task id
get_async_result = client.get_async_result(task=task_id)

# List all outstanding async tasks.
client.list_async_tasks()

Download files

Source Distribution: aryn_sdk-0.2.5.tar.gz (1.3 MB)

Built Distribution: aryn_sdk-0.2.5-py3-none-any.whl (1.3 MB)
