The client library for Aryn services.
aryn-sdk is a simple client library for interacting with Aryn DocParse.
Partition (Parse) files
Partition PDF files with Aryn DocParse through aryn-sdk:
from aryn_sdk.partition import partition_file

with open("partition-me.pdf", "rb") as f:
    data = partition_file(
        f,
        text_mode="inline_fallback_to_ocr",
        table_mode="standard",
        extract_images=True
    )
elements = data['elements']
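Once partitioning returns, the `elements` list can be grouped by element type before further processing. The following is a minimal sketch, assuming each element is a dict with a `type` key as in the examples in this document. The `group_by_type` helper and the sample data are hypothetical and not part of aryn-sdk.

```python
from collections import defaultdict


def group_by_type(elements):
    """Group element dicts by their 'type' field, preserving input order."""
    grouped = defaultdict(list)
    for element in elements:
        # Elements without a 'type' key are collected under "unknown".
        grouped[element.get("type", "unknown")].append(element)
    return dict(grouped)


# Hypothetical stand-in for data['elements'] returned by partition_file.
sample_elements = [
    {"type": "Text", "text_representation": "Hello"},
    {"type": "table", "text_representation": "a b"},
    {"type": "Text", "text_representation": "World"},
]

grouped = group_by_type(sample_elements)
print({kind: len(items) for kind, items in grouped.items()})
```

With real output, `group_by_type(data['elements'])` would take the place of the sample list.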
Convert a partitioned table element to a pandas dataframe for easier use:
from aryn_sdk.partition import partition_file, table_elem_to_dataframe

with open("partition-me.pdf", "rb") as f:
    data = partition_file(
        f,
        text_mode="standard_ocr",
        table_mode="vision",
        extract_images=True
    )

# Find the first table and convert it to a dataframe
df = None
for element in data['elements']:
    if element['type'] == 'table':
        df = table_elem_to_dataframe(element)
        break
Or convert all partitioned tables to pandas dataframes in one shot:
from aryn_sdk.partition import partition_file, tables_to_pandas

with open("partition-me.pdf", "rb") as f:
    data = partition_file(
        f,
        table_mode="standard",
        extract_images=True
    )
elements_and_tables = tables_to_pandas(data)
dataframes = [table for (element, table) in elements_and_tables if table is not None]
Visualize partitioned documents by drawing the bounding boxes on each page:
from aryn_sdk.partition import partition_file, draw_with_boxes

with open("partition-me.pdf", "rb") as f:
    data = partition_file(
        f,
        extract_images=True
    )
page_pics = draw_with_boxes("partition-me.pdf", data, draw_table_cells=True)

from IPython.display import display
display(page_pics[0])
Note: visualizing documents requires poppler, a PDF processing library, to be installed. See the poppler installation instructions for your platform.
Convert image elements to more useful types, such as PIL images or byte strings in a specific image format:
from aryn_sdk.partition import partition_file, convert_image_element

with open("my-favorite-pdf.pdf", "rb") as f:
    data = partition_file(
        f,
        extract_images=True
    )
image_elts = [e for e in data['elements'] if e['type'] == 'Image']
pil_img = convert_image_element(image_elts[0])
jpg_bytes = convert_image_element(image_elts[1], format='JPEG')
png_str = convert_image_element(image_elts[2], format="PNG", b64encode=True)
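A base64-encoded image string can be turned back into raw bytes with the standard library, for example before writing it to disk. The sketch below assumes that `b64encode=True` yields a plain base64 string (not a data URI). The helper names and the fake payload are hypothetical and only stand in for a real `png_str`.

```python
import base64

# The fixed 8-byte signature that starts every PNG file.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def decode_b64_image(b64_text):
    """Decode a base64 string into raw image bytes."""
    return base64.b64decode(b64_text)


def looks_like_png(raw):
    """Return True if the bytes start with the PNG signature."""
    return raw.startswith(PNG_SIGNATURE)


# Hypothetical stand-in for png_str; a real image would be much longer.
fake_png_str = base64.b64encode(PNG_SIGNATURE + b"not-real-image-data").decode("ascii")

raw_bytes = decode_b64_image(fake_png_str)
print(looks_like_png(raw_bytes))
```

Checking the signature before saving is a cheap way to catch a string that was encoded in a different format than expected.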
Document storage
The DocParse storage APIs provide a simple interface to interact with documents processed and stored by DocParse.
DocSets
The DocSet APIs allow you to create, list, and delete DocSets to store your documents in.
from aryn_sdk.client.client import Client
client = Client()
# Create a new DocSet and get the ID.
new_docset = client.create_docset(name="My DocSet")
docset_id = new_docset.value.docset_id
# Retrieve a specific DocSet by ID.
docset = client.get_docset(docset_id=docset_id).value
# List all of the DocSets in your account.
docsets = client.list_docsets().get_all()
# Delete the DocSet you created
client.delete_docset(docset_id=docset_id)
Documents
The document APIs let you interact with individual documents, including the ability to retrieve the original file.
from aryn_sdk.client.client import Client

client = Client()
# Iterate through the documents in a single DocSet
docset_id = None # my docset id
paginator = client.list_docs(docset_id=docset_id)
for doc in paginator:
    print(f"Doc {doc.name} has id {doc.doc_id}")
# Get a single document
doc_id = None # my doc id
doc = client.get_doc(docset_id=docset_id, doc_id=doc_id).value
# Get the original pdf of a document and write to a file.
with open("/path/to/outfile", "wb") as out:
    client.get_doc_binary(docset_id=docset_id, doc_id=doc_id, file=out)
# Delete a document by id.
client.delete_doc(docset_id=docset_id, doc_id=doc_id)
Search
You can run vector and keyword search queries on the documents stored in DocParse storage.
from aryn_sdk.client.client import Client
from aryn_sdk.types.search import SearchRequest
client = Client()
docset_id = None # my docset id
# Search by query
search_request = SearchRequest(query="test_query")
results = client.search(docset_id=docset_id, query=search_request)
# Search by filter
filter_request = SearchRequest(query="test_filter_query", properties_filter="(properties.entity.name='test')")
results = client.search(docset_id=docset_id, query=filter_request)
Query
You can run RAG and Deep Analytics queries on the documents stored in DocParse storage.
from aryn_sdk.client.client import Client
from aryn_sdk.types.query import Query
client = Client()
docset_id = None # my docset id
# Do RAG on the documents
query = Query(docset_id=docset_id, query="test_query", stream=True, rag_mode=True)
results = client.query(query=query)
# Do Deep Analytics on the documents
query = Query(docset_id=docset_id, query="test_query", stream=True)
results = client.query(query=query)
Extract additional properties (metadata) from your documents
You can use LLMs to extract additional metadata from your documents in DocParse storage. These are stored as properties, and are extracted from every document in your DocSet.
from aryn_sdk.client.client import Client
from aryn_sdk.types.schema import Schema, SchemaField
client = Client()
docset_id = None # my docset id
schema_field = SchemaField(name="name", field_type="string")
schema = Schema(fields=[schema_field])
# Extract properties
client.extract_properties(docset_id=docset_id, schema=schema)
# Delete extracted properties
client.delete_properties(docset_id=docset_id, schema=schema)
Async APIs
Partitioning - Single Task Example
import time
from aryn_sdk.partition import partition_file_async_submit, partition_file_async_result

with open("my-favorite-pdf.pdf", "rb") as f:
    response = partition_file_async_submit(
        f,
        use_ocr=True,
        extract_table_structure=True,
    )
task_id = response["task_id"]

# Poll for the results
while True:
    result = partition_file_async_result(task_id)
    if result["task_status"] != "pending":
        break
    time.sleep(5)
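A bare `while True` loop polls forever if a task never leaves the pending state. The sketch below adds a deadline to the same loop. `poll_until_done` is a hypothetical helper, not part of aryn-sdk, and it takes the result function as a parameter so the example runs without network access. The fake fetcher and its `"done"` status are stand-ins for `partition_file_async_result`.

```python
import time


def poll_until_done(fetch_result, task_id, interval=5.0, timeout=600.0, sleep=time.sleep):
    """Call fetch_result(task_id) until the status is no longer 'pending'.

    Raises TimeoutError if the task is still pending once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = fetch_result(task_id)
        if result["task_status"] != "pending":
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Task {task_id} still pending after {timeout} seconds")
        sleep(interval)


# Hypothetical fake: reports "pending" twice, then a finished status.
_responses = iter([
    {"task_status": "pending"},
    {"task_status": "pending"},
    {"task_status": "done", "result": {"elements": []}},
])


def fake_fetch(task_id):
    return next(_responses)


final = poll_until_done(fake_fetch, "aryn:t-example", interval=0.0)
print(final["task_status"])
```

In real use, `partition_file_async_result` would be passed as `fetch_result`, with `interval` and `timeout` chosen to match the expected document size.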
Optionally, you can also set a webhook for Aryn to call when your task is completed:
partition_file_async_submit("path/to/my/file.docx", webhook_url="https://example.com/alert")
Aryn will send a POST request with a body like the following:
{"done": [{"task_id": "aryn:t-47gpd3604e5tz79z1jro5fc"}]}
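On the receiving side, the task IDs can be pulled out of that body with the standard `json` module. This is a minimal sketch that assumes the body always has the shape shown above. `task_ids_from_webhook` is a hypothetical helper, not part of aryn-sdk.

```python
import json


def task_ids_from_webhook(body):
    """Return the task IDs listed under 'done' in a webhook request body."""
    payload = json.loads(body)
    # Skip entries that do not carry a task_id rather than failing on them.
    return [entry["task_id"] for entry in payload.get("done", []) if "task_id" in entry]


body = '{"done": [{"task_id": "aryn:t-47gpd3604e5tz79z1jro5fc"}]}'
print(task_ids_from_webhook(body))
```

Each returned ID can then be passed to `partition_file_async_result` to fetch the finished result.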
Multi-Task Example
import logging
import time
from aryn_sdk.partition import partition_file_async_submit, partition_file_async_result

files = [open("file1.pdf", "rb"), open("file2.docx", "rb")]
task_ids = [None] * len(files)
for i, f in enumerate(files):
    try:
        task_ids[i] = partition_file_async_submit(f)["task_id"]
    except Exception as e:
        logging.warning(f"Failed to submit {f}: {e}")

results = [None] * len(files)
for i, task_id in enumerate(task_ids):
    if task_id is None:
        continue  # submission failed, so there is nothing to poll
    while True:
        result = partition_file_async_result(task_id)
        if result["task_status"] != "pending":
            break
        time.sleep(5)
    results[i] = result
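Polling tasks one after another means a slow first task delays collecting results that are already finished. One alternative is to check every outstanding task on each round. The sketch below is a hypothetical helper (`wait_for_all` is not part of aryn-sdk) that takes the result function as a parameter. The fake fetcher and its `"done"` status stand in for `partition_file_async_result`.

```python
import time


def wait_for_all(fetch_result, task_ids, interval=5.0, sleep=time.sleep):
    """Poll every task each round until none is pending; return results by task ID."""
    results = {}
    remaining = list(task_ids)
    while remaining:
        still_pending = []
        for task_id in remaining:
            result = fetch_result(task_id)
            if result["task_status"] == "pending":
                still_pending.append(task_id)
            else:
                results[task_id] = result
        remaining = still_pending
        if remaining:
            # Only sleep when at least one task needs another round.
            sleep(interval)
    return results


# Hypothetical fake: each task reports "pending" a fixed number of times first.
_polls_left = {"task-a": 2, "task-b": 0}


def fake_fetch(task_id):
    if _polls_left[task_id] > 0:
        _polls_left[task_id] -= 1
        return {"task_status": "pending"}
    return {"task_status": "done"}


all_results = wait_for_all(fake_fetch, ["task-a", "task-b"], interval=0.0)
print(sorted(all_results))
```

Returning a dict keyed by task ID keeps each result tied to its task even when tasks finish out of order.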
Cancelling an async task
from aryn_sdk.partition import partition_file_async_submit, partition_file_async_cancel

task_id = partition_file_async_submit(
    "path/to/file.pdf",
    use_ocr=True,
    extract_table_structure=True,
    extract_images=True,
)["task_id"]
partition_file_async_cancel(task_id)
List pending tasks
from aryn_sdk.partition import partition_file_async_list
partition_file_async_list()
Async Properties (Extract and Delete) example
from aryn_sdk.client.client import Client
from aryn_sdk.types.schema import Schema, SchemaField
client = Client()
docset_id = None # my docset id
# Run extract_properties and delete_properties asynchronously
schema_field = SchemaField(name="name", field_type="string")
schema = Schema(fields=[schema_field])
client.extract_properties_async(docset_id=docset_id, schema=schema) # async implementation
client.delete_properties_async(docset_id=docset_id, schema=schema) # async implementation
# Check the status and get the task result
task_id = None # my task id
get_async_result = client.get_async_result(task=task_id)
# List all outstanding async tasks.
client.list_async_tasks()