Python binding to Java Archery Framework
PyArchery
PyArchery is a Python binding for the Java Archery Framework, enabling powerful semi-structured document processing directly from Python. It leverages JPype to bridge Python and Java, providing seamless access to Archery's intelligent extraction, layout analysis, and tag classification capabilities.
Description
Extracting data from semi-structured documents is difficult to do reliably. PyArchery brings the capabilities of the Archery framework to the Python ecosystem.
Archery combines algorithms and machine learning techniques to automate extraction while keeping you in control of the process through tweakable, repeatable settings. Automating extraction saves time and reduces errors, which makes Archery well suited to industries that handle large volumes of documents.
Key features include:
- Intelligent Extraction: Automatically extract structured data from documents.
- Layout Analysis: Understand the physical layout of document elements.
- Tag Classification: Classify document tags using customizable styles (snake case, camel case, etc.).
- Java Integration: Direct access to the underlying Java Archery API for advanced usage.
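To make the tag styles mentioned above concrete, the following sketch shows what snake case and camel case look like when applied to a column header. It illustrates the naming conventions only: the helper names `split_words`, `to_snake_case`, and `to_camel_case` are hypothetical and are not part of PyArchery or Archery, whose classifier may normalize tags differently.

```python
import re


def split_words(label: str) -> list[str]:
    """Split a header label into alphanumeric words, dropping punctuation."""
    return re.findall(r"[A-Za-z0-9]+", label)


def to_snake_case(label: str) -> str:
    """Hypothetical helper: 'Total Amount (USD)' -> 'total_amount_usd'."""
    return "_".join(word.lower() for word in split_words(label))


def to_camel_case(label: str) -> str:
    """Hypothetical helper: 'Total Amount (USD)' -> 'totalAmountUsd'."""
    words = split_words(label)
    if not words:
        return ""
    first, rest = words[0].lower(), words[1:]
    return first + "".join(word.capitalize() for word in rest)


if __name__ == "__main__":
    header = "Total Amount (USD)"
    print(to_snake_case(header))
    print(to_camel_case(header))
```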
Getting Started
Prerequisites
- Java Development Kit (JDK): Version 21 or higher is required.
- Python: Version 3.11 or higher.
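Both prerequisites can be checked before installing. The sketch below uses only the standard library; the function names are illustrative, and it inspects whichever `java` executable is on `PATH`. That is an assumption: JPype may locate the JVM differently (for example through `JAVA_HOME`), and finding a `java` launcher does not prove that a full JDK is installed.

```python
import re
import shutil
import subprocess
import sys
from typing import Optional

MIN_PYTHON = (3, 11)
MIN_JAVA_MAJOR = 21


def parse_java_major(version_output: str) -> Optional[int]:
    """Extract the major Java version from the text printed by `java -version`."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Legacy numbering: "1.8.0_292" means Java 8.
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def detect_java_major() -> Optional[int]:
    """Return the major version of the `java` found on PATH, or None."""
    java = shutil.which("java")
    if java is None:
        return None
    try:
        result = subprocess.run(
            [java, "-version"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.SubprocessError):
        return None
    # The version banner usually goes to stderr; check both streams.
    return parse_java_major(result.stderr + result.stdout)


if __name__ == "__main__":
    python_ok = sys.version_info >= MIN_PYTHON
    java_major = detect_java_major()
    java_ok = java_major is not None and java_major >= MIN_JAVA_MAJOR
    print(f"Python >= 3.11: {'ok' if python_ok else 'not satisfied'}")
    print(f"Java >= 21: {'ok' if java_ok else 'not satisfied'} (detected: {java_major})")
```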
Installation
Install PyArchery using pip. The package is published on PyPI as pyjarchery, while the module you import is pyarchery:
pip install pyjarchery
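Because the distribution name (`pyjarchery`) differs from the import name (`pyarchery`), it can help to confirm what pip actually installed. This is a minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_version` is illustrative.

```python
from importlib import metadata
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Look up the distribution name used with pip, not the import name.
    version = installed_version("pyjarchery")
    print(version if version is not None else "pyjarchery is not installed")
```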
Quick Start
Here's a simple example of how to use PyArchery to open a document and extract data from tables:
import pyarchery
from pyarchery.archery import defines

# Path to your document
file_path = "path/to/your/document.pdf"

# Load the document with intelligent extraction hints
# This returns a DocumentWrapper
with pyarchery.load(
    file_path,
    hints=[defines.INTELLI_EXTRACT, defines.INTELLI_LAYOUT]
) as doc:
    # Access sheets using the pythonic wrapper property
    for sheet in doc.sheets:
        # Check if sheet has a table
        if sheet.table:
            table = sheet.table
            # Convert to python dictionary
            data = table.to_pydict()
            print(f"Extracted data from table: {data.keys()}")
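Once a table has been converted with `to_pydict()`, it is often easier to work with one record per row. The sketch below assumes the dictionary maps each column name to a list of values of equal length; that shape is an assumption rather than something stated here, so check it against the PyArchery documentation. The helper `columns_to_rows` and the sample data are hypothetical and do not call PyArchery.

```python
from typing import Any, Dict, List


def columns_to_rows(columns: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Turn a column-oriented mapping into a list of row dictionaries."""
    names = list(columns)
    lengths = {len(columns[name]) for name in names}
    if len(lengths) > 1:
        raise ValueError(f"Columns have differing lengths: {sorted(lengths)}")
    return [
        dict(zip(names, values))
        for values in zip(*(columns[name] for name in names))
    ]


if __name__ == "__main__":
    # Stand-in for the result of table.to_pydict() (assumed shape).
    data = {
        "invoice_id": ["A-1", "A-2"],
        "amount": ["10.00", "12.50"],
    }
    for row in columns_to_rows(data):
        print(row)
```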
Documentation
For comprehensive documentation, tutorials, and API references, please visit:
- PyArchery Documentation: https://romualdrousseau.github.io/PyArchery/
- Java Archery Framework: https://github.com/RomualdRousseau/Archery
Contribute
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
Please make sure to update tests as appropriate.
Authors
- Romuald Rousseau, romualdrousseau@gmail.com
File details
Details for the file pyjarchery-0.1.12.tar.gz.
File metadata
- Download URL: pyjarchery-0.1.12.tar.gz
- Upload date:
- Size: 120.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e44dd70282282f04d07ddda4a80f834b38814cc1cf1c709ebc157cdccfa3cc81 |
| MD5 | e26c257c7bba82e0d7920d9807732cf0 |
| BLAKE2b-256 | 604d39898d82b3d8f6b41ebf6da050a40022168966fa2c8c34e2d2c3a3859e33 |
File details
Details for the file pyjarchery-0.1.12-py3-none-any.whl.
File metadata
- Download URL: pyjarchery-0.1.12-py3-none-any.whl
- Upload date:
- Size: 121.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 18a7bf5a47d7c99c248dd35b437f9225c73e83b9bf60134ee421ac4848c4476a |
| MD5 | 9ec3323afbca396407198a18fa44512d |
| BLAKE2b-256 | f14c7c56f8d2588b68f078577e8b160b59d73bd57405d56721326e7a939bb75c |
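A downloaded file can be checked against the SHA256 digest listed for it. The sketch below uses the standard library's `hashlib`; the helper names are illustrative, and the local path assumes the source distribution was saved to the current directory under its original file name.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sha256(path, expected_hex: str) -> bool:
    """Return True if the file's SHA256 digest equals the expected value."""
    return sha256_of(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    # Digest listed for pyjarchery-0.1.12.tar.gz.
    expected = "e44dd70282282f04d07ddda4a80f834b38814cc1cf1c709ebc157cdccfa3cc81"
    archive = Path("pyjarchery-0.1.12.tar.gz")  # assumed download location
    if archive.exists():
        print("digest matches" if matches_sha256(archive, expected) else "digest MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```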