General-purpose document data ingestion library.

PyIngestion (Codename: Gaia): Generalized Document Data Extractor

PyIngestion (project codename Gaia) is a versatile and robust document data extraction system designed to retrieve structured key-value pair (KVP) records from text and files. It is packaged both as a programmatic Python library (pyingestion) and a feature-rich command-line tool (CLI).

PyIngestion has a modular architecture that combines fast native text extraction with an extensible parser interface, so extraction stays fast and faithful to the source while remaining adaptable to new file formats.


🚀 Key Features

  • Dual-Purpose Design:
    • Programmatic Library: Integrate the TransformStream, built-in or custom InputStream components, and observers directly into your own codebase.
    • Command-Line Interface: Run parsing pipelines directly from your shell with dynamic dashboards, detailed progress tracking, and configurable execution.
  • Extensible Input Stream (Parser) Architecture:
    • Fully decoupled document discovery and data extraction. Programmatic users can write and inject custom input streams (e.g., Docx, OCR, XML) by subclassing the abstract InputStream class.
  • Fast Native PDF Processing:
    • Employs fast native layout-based PDF text extraction (via pypdf) as a built-in default input stream.
  • Dynamic Terminal Interface (TUI):
    • Real-time metrics rendered via rich.live.
    • Live status dashboard featuring counters for processed files, pages, failures, and a progress bar with numerical Estimated Time of Arrival (ETA).
  • Robust Session Resume:
    • Automatically checkpoints progress using a state file (.gaia_resume.json). If interrupted, the --resume flag lets you pick up right where you left off.
  • Custom Regex Configurations:
    • Supply custom pattern matching rules via a JSON configuration file.
  • Multi-Page Unit Grouping:
    • Group multiple pages as a single unit using --pages-per-unit for patterns that span across page boundaries.
  • Internationalization (i18n):
    • Complete user interface and message translation support for English (en) and Portuguese (pt).
  • Graceful Interrupt Handlers:
    • Supports clean cancellation via ESC or Ctrl+C, ensuring resources, files, and terminal settings are restored safely.
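
The session resume feature listed above can be illustrated with a minimal, standard-library-only sketch. Only the file name .gaia_resume.json comes from this README; the JSON schema (a "done" list of file paths) and the helper names load_done, save_done, and run are assumptions made for illustration, not PyIngestion's actual implementation.

```python
import json
import os
import tempfile

# File name taken from the feature list; the schema used below is hypothetical.
STATE_FILE = ".gaia_resume.json"


def load_done(state_path: str) -> set[str]:
    """Return the set of file paths already processed, or an empty set."""
    try:
        with open(state_path, "r", encoding="utf-8") as f:
            return set(json.load(f).get("done", []))
    except FileNotFoundError:
        return set()


def save_done(state_path: str, done: set[str]) -> None:
    """Write the checkpoint atomically so an interrupt cannot leave a partial file."""
    directory = os.path.dirname(os.path.abspath(state_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump({"done": sorted(done)}, f)
    os.replace(tmp_path, state_path)  # atomic rename over the old checkpoint


def run(files: list[str], state_path: str, resume: bool = False) -> list[str]:
    """Process files in order, skipping checkpointed ones when resume is True."""
    done = load_done(state_path) if resume else set()
    processed = []
    for path in files:
        if path in done:
            continue
        processed.append(path)  # placeholder for the real extraction work
        done.add(path)
        save_done(state_path, done)  # checkpoint after every file
    return processed
```

Writing to a temporary file and renaming it means a Ctrl+C in the middle of a save leaves the previous checkpoint intact, which is the property a resume feature depends on.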

📁 Project Directory Structure

Gaia/
├── pyingestion/
│   ├── __init__.py           # Main entry points exposing library API classes
│   ├── __main__.py           # Main entry point for python -m pyingestion
│   ├── cli/
│   │   ├── __init__.py       # CLI subpackage initialization
│   │   ├── cli_helper.py     # CLI arguments parser and prevalidation helper
│   │   └── terminal_ui.py    # Rich TUI display and keyboard input handling
│   ├── pyingestion.py        # Main global program class (PyIngestion, codename: Gaia)
│   ├── extraction_session.py # Session progress tracking & state serialization
│   ├── options.py            # Config options container class & parameter validations
│   ├── input_stream.py       # Abstract InputStream base, InputStreamType Enum, and InputStreamFactory
│   ├── i18n.py               # Gettext wrappers and language initialization
│   ├── locale/               # Compiled translations directory
│   │   ├── en/LC_MESSAGES/messages.mo
│   │   └── pt/LC_MESSAGES/messages.mo
│   ├── observer.py           # Progress notification interface (observer pattern)
│   ├── output_stream.py      # Output stream interfaces (OutputStream, CsvWriteStream, DefaultOutputStream)
│   ├── parsers.py            # Concrete InputStream implementations (PdfParser, DocxParser, OcrParser)
│   ├── transform_stream.py   # Abstract and concrete TransformStream and RegexEngine implementations
│   └── main.py               # CLI entry point implementation
├── pyproject.toml            # Setuptools PEP 621 packaging definitions
├── requirements.txt          # Package requirements
├── tests/                    # Extensive test suites
└── tools/
    └── linux/
        ├── compile_locales.sh # Compiles Translation Catalog (.po -> .mo)
        └── run_tests.sh       # Script to execute unittest suite

🛠️ Requirements & Installation

Prerequisites

  1. Python 3.10+

Environment Setup & Packaging

  1. Clone or navigate to the repository:

    cd Trabalho/Gaia
    
  2. Set up a virtual environment:

    python -m venv .venv
    source .venv/bin/activate
    
  3. Install the package in editable mode:

    pip install -e .
    

💻 Usage

1. As a Python Library

You can integrate PyIngestion directly into your Python scripts.

Orchestrating the Full Pipeline Programmatically

To execute the entire extraction pipeline on a file or directory:

from pyingestion import PyIngestion, Options

# 1. Configure options programmatically
options = Options()
options.BASE_PATH = "path/to/pdfs"
options.REGEX_FILE = "path/to/rules.json"
options.OUTPUT_CSV = "custom_output.csv"
options.PAGES_PER_UNIT = 1

# 2. Run the orchestrator
controller = PyIngestion(options)
success = controller.run()
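
The file referenced by options.REGEX_FILE can be generated from Python. The sketch below assumes the JSON file uses the same shape as the in-memory rules dictionary shown in the component example later in this section (a field name mapped to "regex" and "required"); that shape is an assumption here, and the rule names, patterns, and the write_rules helper are hypothetical.

```python
import json
import re
from pathlib import Path

# Hypothetical rules; the field names and patterns are illustrative only.
rules = {
    "invoice_id": {"regex": r"Invoice\s+No\.?:\s*([A-Z0-9-]+)", "required": True},
    "due_date": {"regex": r"Due:\s*(\d{4}-\d{2}-\d{2})", "required": False},
}


def write_rules(path: Path, rules: dict) -> None:
    """Check that every pattern compiles, then write the rules as UTF-8 JSON."""
    for name, rule in rules.items():
        try:
            re.compile(rule["regex"])
        except re.error as exc:
            raise ValueError(f"rule {name!r} has an invalid regex: {exc}") from exc
    # ensure_ascii=False keeps accented characters readable in the file.
    path.write_text(json.dumps(rules, indent=2, ensure_ascii=False), encoding="utf-8")
```

Calling write_rules(Path("rules.json"), rules) produces a file whose path can then be assigned to options.REGEX_FILE. Compiling each pattern first surfaces a broken regex before a long extraction run starts.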

Creating & Injecting a Custom Input Stream

You can supply your own extraction parser format by subclassing the abstract base class InputStream:

from typing import Generator
from pyingestion import PyIngestion, Options, InputStream, ExtractionSession

class CustomTxtParser(InputStream):
    def accepts(self, file_path: str) -> bool:
        # Define what files this parser/stream accepts
        return file_path.lower().endswith(".txt")

    def process_file(
        self,
        file_path: str,
        session: ExtractionSession | None = None,
        pages_per_unit: int = 1
    ) -> Generator[tuple[int, int, str], None, None]:
        # Process the file and yield: (unit_index, total_units, content_text)
        with open(file_path, "r", encoding="utf-8") as f:
            content = f.read()
        yield 1, 1, content

# Inject it into PyIngestion orchestrator
options = Options()
options.BASE_PATH = "path/to/text/files"
options.REGEX_FILE = "rules.json"

# Supply your custom input_stream (and transform_stream)
controller = PyIngestion(options, transform_stream=..., input_stream=CustomTxtParser())
controller.run()

Using Input Stream and Engine Components Directly

To parse files manually and match patterns page-by-page:

from pyingestion import PdfParser, NativeRegexEngine

# 1. Set up the regex engine with in-memory rules (a dictionary)
regex_rules = {
    "infraction_id": {
        "regex": r"Código da Infração:\s*([A-Za-z0-9-]+)",
        "required": True
    },
    "plate": {
        "regex": r"Placa:\s*([A-Z]{3}-?\d[A-Z0-9]\d{2})",
        "required": True
    }
}
engine = NativeRegexEngine(regex_rules)

# Alternatively, load rules from a JSON file path:
# engine = NativeRegexEngine.from_file("path/to/rules.json")

# 2. Set up the input stream
input_stream = PdfParser()

# 3. Process files programmatically
# The input stream yields raw text segments for each page/unit.
# You then parse it using the engine.
for unit_index, total_units, raw_text in input_stream.process_file("path/to/infraction.pdf", pages_per_unit=1):
    record = engine.transform(raw_text)
    print("Parsed Record:", record)

2. Command-Line Interface (CLI)

PyIngestion can be run as a global shell command, as a Python module, or as a local script.

# 1. As a global command (after package installation)
pyingestion <input_dir> [options]

# 2. As a Python module (from the repository root)
python -m pyingestion <input_dir> [options]

Positional Arguments

  • <input_dir>: Path to the directory containing files to process.

Options

  • -o, --output <path>: Custom output CSV file path (Default: output.csv in your working directory).
  • -g, --regex <path>: Path to a JSON file containing customized regex extraction rules.
  • -r, --recursive: Search for files recursively within subdirectories.
  • --resume: Resume processing using checkpoint data from .gaia_resume.json.
  • -t, --test <file_path>: Test your regex rules on the first page of the provided file.
  • -p, --pages-per-unit <int>: The number of pages/chunks grouped together as a single block for extraction matching (Default: 1).
  • -l, --lang {"en", "pt"}: Force the interface language to English or Portuguese (Default: en).
  • --type {"pdf"}: Define the built-in parser type to use (Default: pdf).
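
The effect of --pages-per-unit can be sketched as follows. This is an illustrative helper, not the library's code: group_pages is a hypothetical name, the 1-based unit index mirrors the (unit_index, total_units, content_text) tuple used in the custom input stream example, and joining pages with a newline is an assumption.

```python
from typing import Iterator


def group_pages(pages: list[str], pages_per_unit: int = 1) -> Iterator[tuple[int, int, str]]:
    """Yield (unit_index, total_units, text) with 1-based unit indices."""
    if pages_per_unit < 1:
        raise ValueError("pages_per_unit must be >= 1")
    total_units = -(-len(pages) // pages_per_unit)  # ceiling division
    for unit_index in range(total_units):
        start = unit_index * pages_per_unit
        chunk = pages[start:start + pages_per_unit]
        # Joining pages lets a single regex match text that spans a page break.
        yield unit_index + 1, total_units, "\n".join(chunk)


for unit in group_pages(["page 1", "page 2", "page 3"], pages_per_unit=2):
    print(unit)
# (1, 2, 'page 1\npage 2')
# (2, 2, 'page 3')
```

The last unit may hold fewer pages than requested when the page count is not a multiple of the unit size.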

Examples

  • Basic processing run:

    pyingestion /path/to/pdfs -g rules.json
    
  • Resume an interrupted run:

    pyingestion /path/to/pdfs --resume
    
  • Test matching logic on a single file:

    pyingestion -t sample.pdf -g rules.json
    

🧪 Testing and Tools

Running the Test Suite

The unit and integration tests validate CLI logic, parser fallbacks, observers, and settings parsing.

./tools/linux/run_tests.sh

Compiling Localization Catalogs

To recompile updated translation catalogs (.po) into gettext binary files (.mo):

./tools/linux/compile_locales.sh
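
Compiled catalogs in the locale/<lang>/LC_MESSAGES/messages.mo layout shown in the directory tree can be loaded with Python's standard gettext module. The sketch below shows the general technique only; load_translator is a hypothetical helper and does not reflect what pyingestion/i18n.py actually exposes.

```python
import gettext
from pathlib import Path


def load_translator(lang: str, locale_dir: Path, domain: str = "messages"):
    """Return a gettext function for lang; fall back to the original strings."""
    translation = gettext.translation(
        domain,
        localedir=str(locale_dir),
        languages=[lang],
        fallback=True,  # no catalog found: return messages untranslated
    )
    return translation.gettext


# With no catalog at this (hypothetical) path, the message comes back unchanged.
_ = load_translator("pt", Path("does/not/exist"))
print(_("Processing complete"))
```

With fallback=True a missing or not-yet-compiled catalog degrades to the source strings instead of raising FileNotFoundError, which is convenient during development before the compile script has been run.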
