aryn-sdk

The client library for Aryn services.

aryn-sdk is a simple client library for interacting with Aryn DocParse.

Partition (Parse) files

Partition PDF files with Aryn DocParse through aryn-sdk:

from aryn_sdk.partition import partition_file

with open("partition-me.pdf", "rb") as f:
    data = partition_file(
        f,
        use_ocr=True,
        extract_table_structure=True,
        extract_images=True
    )
elements = data['elements']
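
To get a quick feel for what came back, it can help to tally the elements by type. The following is a minimal sketch and not part of aryn-sdk: `count_element_types` is a hypothetical helper, and the sample dictionary only imitates the `data['elements']` shape used above (a list of dicts, each with a `type` key). The type names in the sample are illustrative.

```python
from collections import Counter


def count_element_types(data):
    """Tally elements by their 'type' field (hypothetical helper, not in aryn-sdk)."""
    # Elements without a 'type' key are grouped under 'unknown'.
    return Counter(element.get("type", "unknown") for element in data.get("elements", []))


# Hand-written stand-in for a partition_file() response; real responses carry more fields.
sample_data = {
    "elements": [
        {"type": "Text"},
        {"type": "table"},
        {"type": "Image"},
        {"type": "Text"},
    ]
}

type_counts = count_element_types(sample_data)
for element_type, count in type_counts.most_common():
    print(f"{element_type}: {count}")
```

With a real response, pass the `data` returned by `partition_file` instead of `sample_data`.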

Convert a partitioned table element to a pandas dataframe for easier use:

from aryn_sdk.partition import partition_file, table_elem_to_dataframe

with open("partition-me.pdf", "rb") as f:
    data = partition_file(
        f,
        use_ocr=True,
        extract_table_structure=True,
        extract_images=True
    )

# Find the first table and convert it to a dataframe
df = None
for element in data['elements']:
    if element['type'] == 'table':
        df = table_elem_to_dataframe(element)
        break

Or convert all partitioned tables to pandas dataframes in one shot:

from aryn_sdk.partition import partition_file, tables_to_pandas

with open("partition-me.pdf", "rb") as f:
    data = partition_file(
        f,
        use_ocr=True,
        extract_table_structure=True,
        extract_images=True
    )
elements_and_tables = tables_to_pandas(data)
dataframes = [table for (element, table) in elements_and_tables if table is not None]

Visualize partitioned documents by drawing their bounding boxes on the pages:

from aryn_sdk.partition import partition_file, draw_with_boxes

with open("partition-me.pdf", "rb") as f:
    data = partition_file(
        f,
        use_ocr=True,
        extract_table_structure=True,
        extract_images=True
    )
page_pics = draw_with_boxes("partition-me.pdf", data, draw_table_cells=True)

from IPython.display import display
display(page_pics[0])

Note: visualizing documents requires poppler, a PDF processing library, to be installed. See the poppler installation instructions for your platform.

Convert image elements to more useful types, such as PIL images or byte strings in a specific image format:

from aryn_sdk.partition import partition_file, convert_image_element

with open("my-favorite-pdf.pdf", "rb") as f:
    data = partition_file(
        f,
        extract_images=True
    )
image_elts = [e for e in data['elements'] if e['type'] == 'Image']

pil_img = convert_image_element(image_elts[0])
jpg_bytes = convert_image_element(image_elts[1], format='JPEG')
png_str = convert_image_element(image_elts[2], format="PNG", b64encode=True)
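
When `b64encode=True` is used, the result is a string rather than raw bytes. Here is a minimal sketch of turning such a string back into bytes and writing it to disk with the standard library only, assuming the string is ordinary base64 of the encoded image. `save_b64_image` is a hypothetical helper, and the sample string is just the 8-byte PNG signature, not a real image.

```python
import base64
import tempfile
from pathlib import Path

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def save_b64_image(b64_text, out_path):
    """Decode a base64 string and write the raw bytes to out_path (hypothetical helper)."""
    raw = base64.b64decode(b64_text)
    Path(out_path).write_bytes(raw)
    return raw


# Stand-in for png_str: the base64 of the PNG signature only, not a real image.
sample_png_str = base64.b64encode(PNG_SIGNATURE).decode("ascii")

out_file = Path(tempfile.mkdtemp()) / "element.png"
decoded = save_b64_image(sample_png_str, out_file)
print(decoded.startswith(PNG_SIGNATURE))
```

Checking the leading signature bytes is a cheap sanity test that the decoded data is in the format you asked for.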

Document storage

The DocParse storage APIs provide a simple interface to interact with documents processed and stored by DocParse.

DocSets

The DocSet APIs allow you to create, list, and delete DocSets to store your documents in.

from aryn_sdk.client.client import Client

client = Client()

# Create a new DocSet and get the ID.
new_docset = client.create_docset(name="My DocSet")
docset_id = new_docset.value.docset_id

# Retrieve a specific DocSet by ID.
docset = client.get_docset(docset_id=docset_id).value

# List all of the DocSets in your account.
docsets = client.list_docsets().get_all()

# Delete the DocSet you created
client.delete_docset(docset_id=docset_id)

Documents

The document APIs let you interact with individual documents, including the ability to retrieve the original file.

from aryn_sdk.client.client import Client

client = Client()

# Iterate through the documents in a single DocSet
docset_id = None # my docset id
paginator = client.list_docs(docset_id=docset_id)
for doc in paginator:
    print(f"Doc {doc.name} has id {doc.doc_id}")

# Get a single document
doc_id = None # my doc id
doc = client.get_doc(docset_id=docset_id, doc_id=doc_id).value

# Get the original pdf of a document and write to a file.
with open("/path/to/outfile", "wb") as out:
    client.get_doc_binary(docset_id=docset_id, doc_id=doc_id, file=out)

# Delete a document by id.
client.delete_doc(docset_id=docset_id, doc_id=doc_id)

Query

You can run vector and keyword search queries on the documents stored in DocParse storage.

from aryn_sdk.client.client import Client
from aryn_sdk.types.search import SearchRequest  # import path assumed; verify against your installed version

client = Client()
docset_id = None # my docset id

# Search by query (passing the request object to search() is assumed here)
search_request = SearchRequest(query="test_query")
results = client.search(docset_id=docset_id, query=search_request)

# Search by filter
filter_request = SearchRequest(query="test_filter_query", properties_filter="(properties.entity.name='test')")
results = client.search(docset_id=docset_id, query=filter_request)

Extract additional properties (metadata) from your documents

You can use LLMs to extract additional metadata from your documents in DocParse storage. These are stored as properties, and are extracted from every document in your DocSet.

from aryn_sdk.client.client import Client
from aryn_sdk.types.schema import Schema, SchemaField

client = Client()
docset_id = None # my docset id
schema_field = SchemaField(name="name", field_type="string")
schema = Schema(fields=[schema_field])

# Extract properties
client.extract_properties(docset_id=docset_id, schema=schema)

# Delete extracted properties
client.delete_properties(docset_id=docset_id, schema=schema)

Async APIs

Partitioning - Single Task Example

import time
from aryn_sdk.partition import partition_file_async_submit, partition_file_async_result

with open("my-favorite-pdf.pdf", "rb") as f:
    response = partition_file_async_submit(
        f,
        use_ocr=True,
        extract_table_structure=True,
    )

task_id = response["task_id"]

# Poll for the results
while True:
    result = partition_file_async_result(task_id)
    if result["task_status"] != "pending":
        break
    time.sleep(5)
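
The polling loop above runs until the task leaves the `pending` state, however long that takes. Here is a minimal sketch of the same loop with a deadline, written against an injected callable so it can be tried without network access. `wait_for_task` and `fake_fetch` are hypothetical names, and the `"done"` status in the stub is an assumption; the only thing taken from the example above is that `task_status` equals `"pending"` while work is in progress.

```python
import time


def wait_for_task(fetch_result, task_id, poll_interval=5.0, timeout=600.0):
    """Poll fetch_result(task_id) until the status is no longer 'pending'.

    fetch_result is any callable returning a dict with a 'task_status' key,
    for example partition_file_async_result. Raises TimeoutError if the
    next poll would fall after the deadline.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = fetch_result(task_id)
        if result["task_status"] != "pending":
            return result
        if time.monotonic() + poll_interval > deadline:
            raise TimeoutError(f"Task {task_id} still pending after {timeout} seconds")
        time.sleep(poll_interval)


# Stub standing in for partition_file_async_result: pending twice, then finished.
_responses = iter([
    {"task_status": "pending"},
    {"task_status": "pending"},
    {"task_status": "done", "elements": []},  # "done" is an assumed status value
])


def fake_fetch(task_id):
    return next(_responses)


final = wait_for_task(fake_fetch, "example-task", poll_interval=0.01, timeout=1.0)
print(final["task_status"])
```

In real use, pass `partition_file_async_result` as `fetch_result` and the `task_id` from the submit call.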

Optionally, you can also set a webhook for Aryn to call when your task is completed:

partition_file_async_submit("path/to/my/file.docx", webhook_url="https://example.com/alert")

Aryn will POST a request with a body like the following:

{"done": [{"task_id": "aryn:t-47gpd3604e5tz79z1jro5fc"}]}
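
On the receiving side, the handler only needs to pull the task IDs out of that body before fetching results. Here is a minimal sketch using the standard `json` module; `completed_task_ids` is a hypothetical helper, and it assumes the body always has the shape shown above (a `done` list of objects with a `task_id`).

```python
import json


def completed_task_ids(body):
    """Return the task IDs listed under 'done' in a webhook body (str or bytes)."""
    payload = json.loads(body)
    # Entries without a 'task_id' key are ignored rather than raising.
    return [entry["task_id"] for entry in payload.get("done", []) if "task_id" in entry]


example_body = '{"done": [{"task_id": "aryn:t-47gpd3604e5tz79z1jro5fc"}]}'
print(completed_task_ids(example_body))
```

Each returned ID can then be passed to `partition_file_async_result` to retrieve the finished task.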

Multi-Task Example

import logging
import time
from aryn_sdk.partition import partition_file_async_submit, partition_file_async_result

files = [open("file1.pdf", "rb"), open("file2.docx", "rb")]
task_ids = [None] * len(files)
for i, f in enumerate(files):
    try:
        task_ids[i] = partition_file_async_submit(f)["task_id"]
    except Exception as e:
        logging.warning(f"Failed to submit {f}: {e}")

results = [None] * len(files)
for i, task_id in enumerate(task_ids):
    if task_id is None:
        continue  # submission failed; leave results[i] as None
    while True:
        result = partition_file_async_result(task_id)
        if result["task_status"] != "pending":
            break
        time.sleep(5)
    results[i] = result
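
The loop above waits for each task in turn, so one slow task delays collecting the others. An alternative, sketched here under the same assumption that `task_status` is `"pending"` while work is in progress, is to check every outstanding task once per round. `wait_for_all` and `fake_fetch` are hypothetical names, and the `"done"` status in the stub is assumed.

```python
import time


def wait_for_all(fetch_result, task_ids, poll_interval=5.0):
    """Poll every pending task once per round until none are pending.

    Returns a dict mapping task_id to its final result. None entries
    (failed submissions) are skipped.
    """
    pending = [task_id for task_id in task_ids if task_id is not None]
    results = {}
    while pending:
        still_pending = []
        for task_id in pending:
            result = fetch_result(task_id)
            if result["task_status"] == "pending":
                still_pending.append(task_id)
            else:
                results[task_id] = result
        pending = still_pending
        if pending:
            time.sleep(poll_interval)
    return results


# Stub in place of partition_file_async_result: each task stays pending for N polls.
_polls_left = {"task-a": 0, "task-b": 2}


def fake_fetch(task_id):
    if _polls_left[task_id] > 0:
        _polls_left[task_id] -= 1
        return {"task_status": "pending"}
    return {"task_status": "done"}  # "done" is an assumed status value


all_results = wait_for_all(fake_fetch, ["task-a", None, "task-b"], poll_interval=0.01)
print(sorted(all_results))
```

Keying the results by task ID rather than list position makes it easy to match them back to webhook notifications or logs.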

Cancelling an async task

from aryn_sdk.partition import partition_file_async_submit, partition_file_async_cancel

task_id = partition_file_async_submit(
    "path/to/file.pdf",
    use_ocr=True,
    extract_table_structure=True,
    extract_images=True,
)["task_id"]

partition_file_async_cancel(task_id)

List pending tasks

from aryn_sdk.partition import partition_file_async_list
partition_file_async_list()

Async Properties (Extract and Delete) example

from aryn_sdk.client.client import Client
from aryn_sdk.types.schema import Schema, SchemaField

client = Client()
docset_id = None # my docset id

# Run extract_properties and delete_properties asynchronously
schema_field = SchemaField(name="name", field_type="string")
schema = Schema(fields=[schema_field])
client.extract_properties_async(docset_id=docset_id, schema=schema) # async implementation
client.delete_properties_async(docset_id=docset_id, schema=schema) # async implementation

# Check the status and get the task result
task_id = None # my task id
result = client.get_async_result(task=task_id)

# List all outstanding async tasks.
client.list_async_tasks()
