
Python binding for the Java Archery Framework


PyArchery

License: GPL v3 | Servier Inspired

PyArchery is a Python binding for the Java Archery Framework that brings semi-structured document processing to Python. It uses JPype to bridge Python and Java, giving access to Archery's intelligent extraction, layout analysis, and tag classification capabilities.

Description

Semi-structured documents are hard to process reliably. PyArchery brings the capabilities of the Archery framework to the Python ecosystem.

Archery combines algorithmic and machine learning techniques to extract data from such documents, with tweakable, repeatable settings that keep you in control of the extraction process. Automating extraction saves time and reduces errors, which suits industries that handle large volumes of documents.

Key features include:

  • Intelligent Extraction: Automatically extract structured data from documents.
  • Layout Analysis: Understand the physical layout of document elements.
  • Tag Classification: Classify document tags using customizable naming styles (snake case, camel case, etc.).
  • Java Integration: Direct access to the underlying Java Archery API for advanced usage.
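
To make the naming styles mentioned under Tag Classification concrete, here is a minimal, standalone sketch of how a free-form column header could be normalized to snake case or camel case. It does not call PyArchery and is not Archery's implementation; the helper names split_words, to_snake_case, and to_camel_case are hypothetical and exist only for this illustration.

```python
import re
from typing import List


def split_words(label: str) -> List[str]:
    """Split a free-form header label into lowercase words."""
    # Insert a space at lower-to-upper boundaries so "unitPrice" becomes "unit Price".
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", label)
    # Treat any run of non-alphanumeric characters as a separator.
    return [word.lower() for word in re.split(r"[^A-Za-z0-9]+", spaced) if word]


def to_snake_case(label: str) -> str:
    """Join the words with underscores: "Unit Price" -> "unit_price"."""
    return "_".join(split_words(label))


def to_camel_case(label: str) -> str:
    """Keep the first word lowercase and capitalize the rest."""
    words = split_words(label)
    return "".join(words[:1] + [word.capitalize() for word in words[1:]])


print(to_snake_case("Unit Price (EUR)"))
print(to_camel_case("Unit Price (EUR)"))
```

With the sample header above, the two calls produce unit_price_eur and unitPriceEur respectively.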

Getting Started

Prerequisites

  • Java Development Kit (JDK): Version 21 or higher is required.
  • Python: Version 3.11 or higher.
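
The prerequisites above can be checked before installing. The following is a rough sketch using only the Python standard library: it compares the running interpreter against 3.11 and parses the banner printed by java -version. The function names are hypothetical, and the check only looks at the java executable found on PATH; JPype may locate the JVM differently (for example through JAVA_HOME), so treat the result as a hint rather than a guarantee.

```python
import re
import shutil
import subprocess
import sys
from typing import Optional

MIN_PYTHON = (3, 11)
MIN_JAVA_MAJOR = 21


def parse_java_major(version_output: str) -> Optional[int]:
    """Extract the major Java version from `java -version` output."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Legacy numbering: "1.8.0_292" means Java 8.
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def detect_java_major() -> Optional[int]:
    """Return the major version of the `java` on PATH, or None if unavailable."""
    java = shutil.which("java")
    if java is None:
        return None
    # `java -version` writes its banner to stderr on common JDK builds.
    result = subprocess.run(
        [java, "-version"], capture_output=True, text=True, check=False
    )
    return parse_java_major(result.stderr + result.stdout)


if __name__ == "__main__":
    java_major = detect_java_major()
    print("Python OK:", sys.version_info >= MIN_PYTHON)
    print("Java OK:", java_major is not None and java_major >= MIN_JAVA_MAJOR)
```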

Installation

Install PyArchery using pip. Note that the distribution is named pyjarchery, while the module is imported as pyarchery:

pip install pyjarchery
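
Because the distribution name (pyjarchery) differs from the import name (pyarchery), it can be worth confirming that both resolve after installation. The sketch below uses only the standard library and does not start the JVM; the helper names installed_version and is_importable are hypothetical.

```python
from importlib import metadata, util
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


def is_importable(module: str) -> bool:
    """Report whether a top-level module can be found, without importing it."""
    return util.find_spec(module) is not None


# The distribution name is what pip installs; the module name is what you import.
print("pyjarchery version:", installed_version("pyjarchery"))
print("pyarchery importable:", is_importable("pyarchery"))
```

If the first line prints None, the package is not installed in the active environment.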

Quick Start

Here's a simple example of how to use PyArchery to open a document and extract data from tables:

import pyarchery
from pyarchery.archery import defines

# Path to your document
file_path = "path/to/your/document.pdf"

# Load the document with intelligent extraction hints
# This returns a DocumentWrapper
with pyarchery.load(
    file_path,
    hints=[defines.INTELLI_EXTRACT, defines.INTELLI_LAYOUT]
) as doc:
    # Access sheets using the pythonic wrapper property
    for sheet in doc.sheets:
        # Check if sheet has a table
        if sheet.table:
            table = sheet.table
            # Convert to python dictionary
            data = table.to_pydict()
            print(f"Extracted data from table: {data.keys()}")
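
If to_pydict() returns a column-oriented mapping (column name to a list of values), which is an assumption based on the method name and the use of data.keys() above rather than something this README states, the result can be reshaped into one dictionary per row. The helper columns_to_rows and the sample data are hypothetical and only stand in for real extracted output.

```python
from typing import Any, Dict, List


def columns_to_rows(columns: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Turn {"col": [v1, v2]} into [{"col": v1}, {"col": v2}]."""
    if not columns:
        return []
    lengths = {len(values) for values in columns.values()}
    if len(lengths) != 1:
        raise ValueError(f"columns have differing lengths: {sorted(lengths)}")
    names = list(columns)
    # zip(*...) walks all columns in parallel, yielding one tuple per row.
    return [dict(zip(names, row)) for row in zip(*columns.values())]


# Hypothetical sample standing in for the result of table.to_pydict().
sample = {"name": ["bolt", "nut"], "quantity": [10, 25]}
for row in columns_to_rows(sample):
    print(row)
```

Row-oriented records are often easier to filter or serialize as JSON lines; the length check guards against ragged columns instead of silently truncating them.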

Documentation

For comprehensive documentation, tutorials, and API references, please visit:

Contribute

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.

Authors
