
The client library for Aryn services.


aryn-sdk is a simple client library for interacting with Aryn DocParse.

Partition (Parse) files

Partition PDF files with Aryn DocParse through aryn-sdk:

from aryn_sdk.partition import partition_file

with open("partition-me.pdf", "rb") as f:
    data = partition_file(
        f,
        text_mode="inline_fallback_to_ocr",
        table_mode="standard",
        extract_images=True
    )
elements = data['elements']
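
Once the elements are in hand, a quick tally by type can help confirm what the partitioner found. The sketch below assumes only what the examples on this page show: each element is a dict with a 'type' key. The helper name count_element_types and the sample data are illustrative and not part of aryn-sdk.

```python
from collections import Counter


def count_element_types(elements):
    """Return a Counter mapping each element type to its number of occurrences."""
    # Elements without a 'type' key are grouped under 'unknown'.
    return Counter(element.get("type", "unknown") for element in elements)


# Hand-written sample standing in for data['elements'].
sample_elements = [
    {"type": "Text"},
    {"type": "table"},
    {"type": "Image"},
    {"type": "Text"},
]
type_counts = count_element_types(sample_elements)
```

With real output, pass data['elements'] instead of the sample list.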

Convert a partitioned table element to a pandas dataframe for easier use:

from aryn_sdk.partition import partition_file, table_elem_to_dataframe

with open("partition-me.pdf", "rb") as f:
    data = partition_file(
        f,
        text_mode="standard_ocr",
        table_mode="vision",
        extract_images=True
    )

# Find the first table and convert it to a dataframe
df = None
for element in data['elements']:
    if element['type'] == 'table':
        df = table_elem_to_dataframe(element)
        break

Or convert all partitioned tables to pandas dataframes in one shot:

from aryn_sdk.partition import partition_file, tables_to_pandas

with open("partition-me.pdf", "rb") as f:
    data = partition_file(
        f,
        table_mode="standard",
        extract_images=True
    )
elements_and_tables = tables_to_pandas(data)
dataframes = [table for (element, table) in elements_and_tables if table is not None]

Visualize partitioned documents by drawing the bounding boxes on each page:

from aryn_sdk.partition import partition_file, draw_with_boxes

with open("partition-me.pdf", "rb") as f:
    data = partition_file(
        f,
        extract_images=True
    )
page_pics = draw_with_boxes("partition-me.pdf", data, draw_table_cells=True)

from IPython.display import display
display(page_pics[0])

Note: visualizing documents requires poppler, a PDF rendering library, to be installed. See the poppler installation instructions for your platform.

Convert image elements to more useful types, like PIL images or byte strings in a given image format:

from aryn_sdk.partition import partition_file, convert_image_element

with open("my-favorite-pdf.pdf", "rb") as f:
    data = partition_file(
        f,
        extract_images=True
    )
image_elts = [e for e in data['elements'] if e['type'] == 'Image']

pil_img = convert_image_element(image_elts[0])
jpg_bytes = convert_image_element(image_elts[1], format='JPEG')
png_str = convert_image_element(image_elts[2], format="PNG", b64encode=True)
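
If you request a base64-encoded string, you will usually need to turn it back into raw bytes before writing it to disk. The following is a minimal sketch using only the standard library. It assumes the string is plain base64 of PNG data. The helper decode_png_b64 is hypothetical, and the sample string is built locally instead of coming from DocParse.

```python
import base64

# Every PNG file starts with this fixed 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def decode_png_b64(png_str):
    """Decode a base64 string and check that the result looks like PNG data."""
    raw = base64.b64decode(png_str, validate=True)
    if not raw.startswith(PNG_SIGNATURE):
        raise ValueError("decoded data does not start with the PNG signature")
    return raw


# Locally built stand-in for a string returned with b64encode=True.
sample_str = base64.b64encode(PNG_SIGNATURE + b"placeholder").decode("ascii")
png_bytes = decode_png_b64(sample_str)
```

The returned bytes can then be written to a file opened in "wb" mode.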

Document storage

The DocParse storage APIs provide a simple interface to interact with documents processed and stored by DocParse.

DocSets

The DocSet APIs allow you to create, list, and delete DocSets to store your documents in.

from aryn_sdk.client.client import Client

client = Client()

# Create a new DocSet and get the ID.
new_docset = client.create_docset(name="My DocSet")
docset_id = new_docset.value.docset_id

# Retrieve a specific DocSet by ID.
docset = client.get_docset(docset_id=docset_id).value

# List all of the DocSets in your account.
docsets = client.list_docsets().get_all()

# Delete the DocSet you created
client.delete_docset(docset_id=docset_id)

Documents

The document APIs let you interact with individual documents, including the ability to retrieve the original file.

from aryn_sdk.client.client import Client

client = Client()

# Iterate through the documents in a single DocSet
docset_id = None # my docset id
paginator = client.list_docs(docset_id = docset_id)
for doc in paginator:
    print(f"Doc {doc.name} has id {doc.doc_id}")

# Get a single document
doc_id = None # my doc id
doc = client.get_doc(docset_id=docset_id, doc_id=doc_id).value

# Get the original pdf of a document and write to a file.
with open("/path/to/outfile", "wb") as out:
    client.get_doc_binary(docset_id=docset_id, doc_id=doc_id, file=out)

# Delete a document by id.
client.delete_doc(docset_id=docset_id, doc_id=doc_id)

Search

You can run vector and keyword search queries on the documents stored in DocParse storage.

from aryn_sdk.client.client import Client
from aryn_sdk.types.search import SearchRequest

client = Client()
docset_id = None # my docset id

# Search by query
search_request = SearchRequest(query="test_query")
results = client.search(docset_id=docset_id, query=search_request)

# Search by filter
filter_request = SearchRequest(query="test_filter_query", properties_filter="(properties.entity.name='test')")
results = client.search(docset_id=docset_id, query=filter_request)

Query

You can run RAG and Deep Analytics queries on the documents stored in DocParse storage.

from aryn_sdk.client.client import Client
from aryn_sdk.types.query import Query

client = Client()
docset_id = None # my docset id

# Do RAG on the documents
query = Query(docset_id=docset_id, query="test_query", stream=True, rag_mode=True)
results = client.query(query=query)

# Do Deep Analytics on the documents
query = Query(docset_id=docset_id, query="test_query", stream=True)
results = client.query(query=query)

Extract additional properties (metadata) from your documents

You can use LLMs to extract additional metadata from your documents in DocParse storage. These are stored as properties, and are extracted from every document in your DocSet.

from aryn_sdk.client.client import Client
from aryn_sdk.types.schema import Schema, SchemaField

client = Client()
docset_id = None # my docset id
schema_field = SchemaField(name="name", field_type="string")
schema = Schema(fields=[schema_field])

# Extract properties
client.extract_properties(docset_id=docset_id, schema=schema)

# Delete extracted properties
client.delete_properties(docset_id=docset_id, schema=schema)

Async APIs

Partitioning - Single Task Example

import time
from aryn_sdk.partition import partition_file_async_submit, partition_file_async_result

with open("my-favorite-pdf.pdf", "rb") as f:
    response = partition_file_async_submit(
        f,
        use_ocr=True,
        extract_table_structure=True,
    )

task_id = response["task_id"]

# Poll for the results
while True:
    result = partition_file_async_result(task_id)
    if result["task_status"] != "pending":
        break
    time.sleep(5)

Optionally, you can also set a webhook for Aryn to call when your task is completed:

partition_file_async_submit("path/to/my/file.docx", webhook_url="https://example.com/alert")

Aryn will send a POST request with a body like the following:

{"done": [{"task_id": "aryn:t-47gpd3604e5tz79z1jro5fc"}]}
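
On the receiving side, a webhook handler needs to pull the task IDs out of that body. The sketch below assumes the body always has the shape shown above, a "done" list of objects each carrying a "task_id". The function name completed_task_ids is illustrative and not part of aryn-sdk.

```python
import json


def completed_task_ids(body):
    """Extract task IDs from a webhook request body given as str or bytes."""
    payload = json.loads(body)
    # Tolerate a missing "done" key or entries without a "task_id".
    return [item["task_id"] for item in payload.get("done", []) if "task_id" in item]


# The example body from above.
example_body = '{"done": [{"task_id": "aryn:t-47gpd3604e5tz79z1jro5fc"}]}'
ids = completed_task_ids(example_body)
```

Each returned ID can then be passed to partition_file_async_result to fetch the finished result.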

Multi-Task Example

import logging
import time
from aryn_sdk.partition import partition_file_async_submit, partition_file_async_result

paths = ["file1.pdf", "file2.docx"]
task_ids = [None] * len(paths)
for i, path in enumerate(paths):
    try:
        # Open each file only for the duration of its submission
        with open(path, "rb") as f:
            task_ids[i] = partition_file_async_submit(f)["task_id"]
    except Exception as e:
        logging.warning(f"Failed to submit {path}: {e}")

results = [None] * len(paths)
for i, task_id in enumerate(task_ids):
    if task_id is None:
        continue  # submission failed, so there is nothing to poll
    while True:
        result = partition_file_async_result(task_id)
        if result["task_status"] != "pending":
            break
        time.sleep(5)
    results[i] = result
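
The polling loops above run forever if a task never leaves the pending state. One way to bound the wait is a small helper with a timeout, sketched here under stated assumptions. The helper wait_for_task is hypothetical, it takes the result-fetching function as an argument so it can be exercised without a network call, and the "done" status in the stand-in responses is made up for the demonstration. The only status value taken from this page is "pending".

```python
import time


def wait_for_task(fetch_result, task_id, interval=5.0, timeout=600.0, sleep=time.sleep):
    """Poll fetch_result(task_id) until the task is no longer pending or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = fetch_result(task_id)
        if result["task_status"] != "pending":
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_id} still pending after {timeout} seconds")
        sleep(interval)


# Stand-in for partition_file_async_result: pending twice, then finished.
_responses = iter([
    {"task_status": "pending"},
    {"task_status": "pending"},
    {"task_status": "done"},
])
final = wait_for_task(lambda _task_id: next(_responses), "example-task", sleep=lambda _s: None)
```

In real use, pass partition_file_async_result as fetch_result and leave sleep at its default.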

Cancelling an async task

from aryn_sdk.partition import partition_file_async_submit, partition_file_async_cancel

task_id = partition_file_async_submit(
    "path/to/file.pdf",
    use_ocr=True,
    extract_table_structure=True,
    extract_images=True,
)["task_id"]

partition_file_async_cancel(task_id)

List pending tasks

from aryn_sdk.partition import partition_file_async_list
partition_file_async_list()

Async Properties (Extract and Delete) example

from aryn_sdk.client.client import Client
from aryn_sdk.types.schema import Schema, SchemaField

client = Client()

# Run extract_properties and delete_properties asynchronously
docset_id = None # my docset id
schema_field = SchemaField(name="name", field_type="string")
schema = Schema(fields=[schema_field])
client.extract_properties_async(docset_id=docset_id, schema=schema) # async implementation
client.delete_properties_async(docset_id=docset_id, schema=schema) # async implementation

# Check the status and get the task result
task_id = None # my task id
get_async_result = client.get_async_result(task=task_id)

# List all outstanding async tasks.
client.list_async_tasks()

