Skip to main content

The main standards for Latis Document AI project

Project description

DocumentAI-std

DocumentAI-std is a Python library crafted for streamlined and standardized document analysis, with a particular focus on tasks like document element processing, optical character recognition (OCR), and dataset management for key information extraction. Designed for ease of use and flexibility, it provides tools to work with Visually Rich Documents (VRDs) and supports datasets like Wildreceipt, XFUND, FUNSD, and CORD. Whether you're working on research or developing production systems, DocumentAI-std simplifies document processing workflows.

Features

  • Visually Rich Document Object Model (VRDOM): A framework for representing document content through structured elements with bounding boxes and content types.
  • Support for OCR: Seamlessly integrate OCR results to create a structured document representation.
  • Document Dataset Handling: Utilities for managing datasets like Wildreceipt, including loading, annotation, and validation.

Installation

You can install DocumentAI-std using the following command:

pip install DocumentAI-std

Quick Start

Here's a basic usage example demonstrating how to use the Wildreceipt dataset with DocumentAI-std.

Example: Loading and Using the Wildreceipt Dataset

from DocumentAI_std.datasets import Wildreceipt

# Load the training dataset
train_set = Wildreceipt(
    train=True,
    img_folder="/path/to/train/images/",
    label_path="/path/to/train/annotations.json",
)

# Load the test dataset
test_set = Wildreceipt(
    train=False,
    img_folder="/path/to/test/images/",
    label_path="/path/to/test/annotations.json",
)

# Assert the number of data samples in the train and test sets
assert len(train_set.data) == 1267
assert len(test_set.data) == 472

Explanation:

  • The Wildreceipt dataset is loaded by specifying the image folder and annotations path for both training and test data.
  • train=True loads the training subset, while train=False loads the test subset.
  • The number of data samples can be checked using the data attribute.

Core Components

VRDOM (Visually Rich Document Object Model)

VRDOM is a core component of DocumentAI-std that provides an abstraction layer for representing and manipulating document elements based on their visual structure. It allows users to analyze and extract meaningful information from documents, particularly those with complex layouts like invoices, receipts, and forms.

Document Class

The Document class represents a visually rich document, consisting of content elements defined by bounding boxes, content type, and associated metadata.

Attributes:

  • img_path (str): Path to the document image file.
  • ocr_output (dict): OCR output with bounding boxes and content.
    • bbox: List of bounding box coordinates.
    • content: List of content strings corresponding to the bounding boxes.
  • elements (List[DocElement]): A list of document elements (DocElement objects) representing individual content pieces.
  • device (str): The device to be used for processing (default: "cpu").

Example Usage:

from DocumentAI_std.base.document import Document

# Define OCR output for the document
ocr_output = {
    "bbox": [[10, 20, 100, 200], [110, 120, 180, 220]],
    "content": ["Total: $5.00", "Item: Coffee"]
}

# Create a Document instance
doc = Document(img_path="/path/to/document.jpg", ocr_output=ocr_output)

# Access the document's properties
print(f"Document filename: {doc.filename}")
print(f"Document shape: {doc.shape}")
print(f"Number of elements: {len(doc.elements)}")

# Serialize the document into JSON
doc_json = doc.to_json()
print(doc_json)

DocElement Class

Each Document is composed of several DocElement objects. These represent individual content elements within the document (e.g., text boxes, images) and are typically defined by their bounding box and content.

Attributes:

  • x, y (int): Coordinates for the bounding box's top-left corner.
  • w, h (int): Width and height of the bounding box.
  • content_type (ContentType): The type of content (e.g., text, image).
  • content (Any): The content inside the bounding box.
  • device (str): The processing device (e.g., "cpu").

Methods:

  • serialize: Serializes the DocElement into a dictionary format for JSON compatibility.
  • area: Calculates the area of the bounding box.
  • extract_pixels: Extracts pixel data from the bounding box region of the image.

Example Usage:

from DocumentAI_std.base.doc_element import DocElement, ContentType

# Create a document element
element = DocElement(x=10, y=20, w=100, h=50, content_type=ContentType.TEXT, content="Total: $5.00")

# Access properties
print(f"Bounding box: {element.x}, {element.y}, {element.w}, {element.h}")
print(f"Content: {element.content}")
print(f"Area: {element.area()}")

# Serialize the element
element_json = element.to_json()
print(element_json)

Dataset Support

Wildreceipt Dataset

Represents the WildReceipt dataset for key information extraction, as introduced in "Spatial Dual-Modality Graph Reasoning for Key Information Extraction".

Attributes:

  • data: List of document entities, including OCR and bounding box information.
  • root: The root directory where document images are stored.
  • train: Boolean indicating if the dataset is for training or testing.

XFUND Dataset

The XFUND dataset supports multilingual form understanding tasks. It’s designed for extracting structured information from documents in multiple languages.

FUNSD Dataset

The FUNSD dataset is focused on form understanding, useful for tasks that require extracting text and structured data from scanned forms.

CORD Dataset

The CORD dataset is designed for post-OCR parsing of consolidated receipt documents.


Additional Features

  • OCR Integration: The Document class can seamlessly ingest OCR outputs, structuring the raw text into a well-organized document representation. This is particularly useful for downstream tasks such as information extraction, table parsing, and content classification. For example, you can input OCR data from an invoice and easily extract line items, total amounts, or vendor details for further processing.

    ocr_output = [{"text": "Invoice", "bbox": [10, 20, 200, 40]}, {"text": "Total: $500", "bbox": [10, 50, 200, 70]}]
    document = Document.from_ocr(ocr_output)
    
  • Serialization: The Document and DocElement classes provide methods to serialize their structures into JSON-compatible dictionaries. This enables you to export or save documents for further analysis in a structured format.

    serialized_doc = document.to_dict()
    # Output JSON representation of the document
    
  • Distance Calculations: The library includes multiple utilities for calculating distances between elements within a document, which is useful for layout analysis and spatial relationships.

    • Euclidean Distance: Measures the straight-line distance between two points.
    • Manhattan Distance: Computes the distance along grid-like paths (right angles).
    • Chebyshev Distance: Identifies the maximum distance across any axis between two points.

    These can be leveraged for tasks like determining the proximity of elements in forms or detecting alignment in scanned documents.

    dist = Distance.euclidean((x1, y1), (x2, y2))
    
  • Image Utilities: The ImageUtils class provides essential functions for image analysis, including:

    • Entropy Calculation: Measures the randomness of pixel values, which is useful for detecting areas with high detail.
    • Gabor Filters: Generates Gabor filters for advanced texture analysis, often used in image segmentation.

    For instance, to calculate entropy within a document image:

    entropy = ImageUtils.calculate_entropy(image, bbox)
    
  • Textual Utilities: The library includes tools to handle various text-processing tasks, such as:

    • Levenshtein Distance: Measures how different two strings are, useful for fuzzy matching.
    • Special Character Identification: Detects symbols or special characters in text.
    • Numeric Percentage Calculation: Computes the percentage of numeric characters in a text, useful for extracting financial data.
    • Date Detection: Recognizes whether a string contains a date, useful in form and document parsing.
    is_date = TextUtils.is_date("2024-09-13")
    
  • Layout Utilities: Analyze the spatial layout of document elements with built-in functions:

    • Horizontal and Vertical Alignment: Determines if elements are aligned in a row or column.
    • Overlap Calculation: Calculates how much two bounding boxes overlap, which can help identify related fields in forms.
    • Relative Positioning: Checks whether an element is positioned in the top, middle, or bottom portion of a document.
    is_aligned = LayoutUtils.is_horizontally_aligned(element1, element2)
    
  • OCR Adapter: The OCRAdapter class simplifies the integration of various OCR engines, including PaddleOCR, EasyOCR, and Tesseract. It standardizes OCR outputs, making it easy to pass OCR results into the Document class regardless of the engine used. This ensures a consistent workflow for analyzing OCR results across different engines.

    ocr_adapter = OCRAdapter(engine="Tesseract")
    ocr_output = ocr_adapter.run_ocr(image)
    document = Document.from_ocr(ocr_output)
    

Contributing

Contributions to DocumentAI-std are welcome! If you’d like to contribute, please follow these steps:

  1. Fork the repository.
  2. Create a feature branch.
  3. Commit your changes.
  4. Submit a pull request.

Make sure to follow the contribution guidelines for more details.

License

DocumentAI-std is licensed under the MIT License. See the LICENSE file for more details.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

documentai_std-0.3.8.dev3.tar.gz (26.5 kB view details)

Uploaded Source

Built Distribution

DocumentAI_std-0.3.8.dev3-py3-none-any.whl (33.7 kB view details)

Uploaded Python 3

File details

Details for the file documentai_std-0.3.8.dev3.tar.gz.

File metadata

  • Download URL: documentai_std-0.3.8.dev3.tar.gz
  • Upload date:
  • Size: 26.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.6

File hashes

Hashes for documentai_std-0.3.8.dev3.tar.gz
Algorithm Hash digest
SHA256 d6538311f3f646fe3bda40b3d22b5a60514a1d5c9d631bb47ef66c4f3def67dd
MD5 42ab2ab6dc982b33dd74c69ea39ce17e
BLAKE2b-256 f2cb3a16b0474a20b718eed5cd9241352aa64c49121682578901b325b2c5c254

See more details on using hashes here.

File details

Details for the file DocumentAI_std-0.3.8.dev3-py3-none-any.whl.

File metadata

File hashes

Hashes for DocumentAI_std-0.3.8.dev3-py3-none-any.whl
Algorithm Hash digest
SHA256 8ed455112780b3ac61dfef3b8f693bafb21847df15971dc329e1916baeec3139
MD5 a9db8168075e8d288a161da343c726f1
BLAKE2b-256 983d9c73f9272a4da53db3dc776994ffd8f7879ff0287755e42bc910154a3008

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page