General-purpose document data ingestion library.

These details have not been verified by PyPI

Project description

PyIngestion (Codename: Gaia) — Generalized Document Data Extractor

PyIngestion (project codename Gaia) is a versatile and robust document data extraction system designed to retrieve structured key-value pair (KVP) records from text and files. It is packaged both as a programmatic Python library (pyingestion) and a feature-rich command-line tool (CLI).

PyIngestion has a modular architecture: fast native text extraction is paired with an extensible parser interface, which keeps extraction fast and faithful to the source and makes it straightforward to add new file formats.

🚀 Key Features

Dual-Purpose Design:
- Programmatic Library: Integrate the TransformStream, built-in or custom InputStream components, and observers directly into your own codebase.
- Command-Line Interface: Run parsing pipelines directly from your shell with dynamic dashboards, detailed progress tracking, and configurable execution.
Extensible Input Stream (Parser) Architecture:
- Fully decoupled document discovery and data extraction. Programmatic users can write and inject custom input streams (e.g., Docx, OCR, XML) by subclassing the abstract InputStream class.
Fast Native PDF Processing:
- Employs fast native layout-based PDF text extraction (via pypdf) as a built-in default input stream.
Dynamic Terminal Interface (TUI):
- Real-time metrics rendered via rich.live.
- Live status dashboard featuring counters for processed files, pages, failures, and a progress bar with numerical Estimated Time of Arrival (ETA).
Robust Session Resume:
- Automatically checkpoints progress using a state file (.gaia_resume.json). If interrupted, the --resume flag lets you pick up right where you left off.
Custom Regex Configurations:
- Supply custom pattern-matching rules via a JSON or TOML configuration file.
Multi-Page Unit Grouping:
- Group multiple pages as a single unit using --pages-per-unit for patterns that span across page boundaries.
Internationalization (i18n):
- Complete user interface and message translation support for English (en) and Portuguese (pt).
Graceful Interrupt Handlers:
- Supports clean cancellation via ESC or Ctrl+C, ensuring resources, files, and terminal settings are restored safely.
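
The multi-page unit grouping described above can be pictured with a small standalone sketch. It is not PyIngestion's own code: the helper name `group_pages` and the choice to join pages with a newline are assumptions made for illustration. Only the `(unit_index, total_units, text)` tuple shape is taken from the custom input stream example in this README.

```python
from typing import Iterator


def group_pages(pages: list[str], pages_per_unit: int = 1) -> Iterator[tuple[int, int, str]]:
    """Yield (unit_index, total_units, text), joining consecutive pages into units."""
    if pages_per_unit < 1:
        raise ValueError("pages_per_unit must be at least 1")
    # Ceiling division: a final, shorter unit holds any leftover pages.
    total_units = (len(pages) + pages_per_unit - 1) // pages_per_unit
    for unit_index in range(total_units):
        start = unit_index * pages_per_unit
        chunk = pages[start:start + pages_per_unit]
        # Units are numbered from 1 here (an assumption of this sketch).
        yield unit_index + 1, total_units, "\n".join(chunk)


pages = ["Invoice 42 total:", "100.00", "Invoice 43 total: 7.50"]
units = list(group_pages(pages, pages_per_unit=2))
```

With two pages per unit, a label on one page and its value on the next end up in the same text block, so a single regex can match across the page boundary.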

📁 Project Directory Structure

```
Gaia/
├── pyingestion/
│   ├── __init__.py          # Main entry points exposing library API classes
│   ├── __main__.py          # Main entry point for python -m pyingestion
│   ├── cli/
│   │   ├── __init__.py      # CLI subpackage initialization
│   │   ├── cli_helper.py    # CLI arguments parser and prevalidation helper
│   │   └── terminal_ui.py   # Rich TUI display and keyboard input handling
│   ├── pyingestion.py       # Main global program class (PyIngestion, codename: Gaia)
│   ├── extraction_session.py # Session progress tracking & state serialization
│   ├── options.py           # Config options container class & parameter validations
│   ├── input_stream.py      # Abstract InputStream base, InputStreamType Enum, and InputStreamFactory
│   ├── i18n.py              # Gettext wrappers and language initialization
│   ├── locale/              # Compiled translations directory
│   │   ├── en/LC_MESSAGES/messages.mo
│   │   └── pt/LC_MESSAGES/messages.mo
│   ├── observer.py          # Progress notification interface (observer pattern)
│   ├── output_stream.py     # Output stream interfaces (OutputStream, CsvWriteStream, DefaultOutputStream)
│   ├── parsers.py           # Concrete InputStream implementations (PdfParser, DocxParser, OcrParser)
│   ├── transform_stream.py  # Abstract and concrete TransformStream and RegexEngine implementations
│   └── main.py              # CLI entry point implementation
├── pyproject.toml           # Setuptools PEP 621 packaging definitions
├── requirements.txt         # Package requirements
├── tests/                   # Extensive test suites
└── tools/
    └── linux/
        ├── compile_locales.sh # Compiles Translation Catalog (.po -> .mo)
        └── run_tests.sh       # Script to execute unittest suite
```

🛠️ Requirements & Installation

Prerequisites

Python 3.10+

Environment Setup & Packaging

Clone or navigate to the repository:
```
cd Trabalho/Gaia
```

Set up a virtual environment:
```
python -m venv .venv
source .venv/bin/activate
```

Install the package in editable mode:
```
pip install -e .
```

💻 Usage

1. As a Python Library

You can integrate PyIngestion directly into your Python scripts.

Orchestrating the Full Pipeline Programmatically

To execute the entire extraction pipeline on a file or directory:

```
from pyingestion import PyIngestion, Options, NativeRegexEngine

# 1. Configure options programmatically
options = Options()
options.BASE_PATH = "path/to/pdfs"
options.OUTPUT_CSV = "custom_output.csv"
options.PAGES_PER_UNIT = 1

# 2. Load transform stream
transform = NativeRegexEngine.from_file("path/to/rules.json")

# 3. Run the orchestrator
controller = PyIngestion(options, transform_stream=transform)
success = controller.run()
```

Creating & Injecting a Custom Input Stream

You can add support for other file formats by subclassing the abstract base class InputStream:

```
from typing import Generator
from pyingestion import (
    PyIngestion,
    Options,
    InputStream,
    ExtractionSession,
    NativeRegexEngine,  # needed below to build the transform stream
)

class CustomTxtParser(InputStream):
    def accepts(self, file_path: str) -> bool:
        # Define what files this parser/stream accepts
        return file_path.lower().endswith(".txt")

    def process_file(
        self,
        file_path: str,
        session: ExtractionSession | None = None,
        pages_per_unit: int = 1
    ) -> Generator[tuple[int, int, str], None, None]:
        # Process the file and yield: (unit_index, total_units, content_text)
        with open(file_path, "r", encoding="utf-8") as f:
            content = f.read()
        yield 1, 1, content

# Inject it into the PyIngestion orchestrator
options = Options()
options.BASE_PATH = "path/to/text/files"

# Supply your custom input_stream and transform_stream
transform = NativeRegexEngine.from_file("rules.json")
controller = PyIngestion(options, transform_stream=transform, input_stream=CustomTxtParser())
controller.run()
```

Using Input Stream and Engine Components Directly

To parse files manually and match patterns page-by-page:

```
from pyingestion import PdfParser, NativeRegexEngine

# 1. Setup the Regex engine with rules in-memory (dictionary)
regex_rules = {
    "infraction_id": {
        "regex": r"Código da Infração:\s*([A-Za-z0-9-]+)",
        "required": True
    },
    "plate": {
        "regex": r"Placa:\s*([A-Z]{3}-?\d[A-Z0-9]\d{2})",
        "required": True
    }
}
engine = NativeRegexEngine(regex_rules)

# Alternatively, load rules from a JSON file path:
# engine = NativeRegexEngine.from_file("path/to/rules.json")

# 2. Setup the input stream
input_stream = PdfParser()

# 3. Process files programmatically
# The input stream yields raw text segments for each page/unit.
# You then parse it using the engine.
for unit_index, total_units, raw_text in input_stream.process_file("path/to/infraction.pdf", pages_per_unit=1):
    record = engine.transform(raw_text)
    print("Parsed Record:", record)
```

2. Command-Line Interface (CLI)

PyIngestion can be run as a global shell command, as a Python module, or as a local script.

```
# 1. As a global command (after package installation)
pyingestion <input_dir> [options]

# 2. As a Python module (from the repository root)
python -m pyingestion <input_dir> [options]
```

Positional Arguments

<input_dir>: Path to the directory containing files to process.

Options

-o, --output <path>: Custom output file or database path (Default: output.csv in your working directory).
-g, --regex <path>: Path to a JSON/TOML file containing customized regex extraction rules.
-r, --recursive: Search for files recursively within subdirectories.
--resume: Resume processing using checkpoint data from .gaia_resume.json.
-t, --test <file_path>: Test your regex rules on the first page of the provided file.
-p, --pages-per-unit <int>: The number of pages/chunks grouped together as a single block for extraction matching (Default: 1).
-l, --lang {"en", "pt"}: Force the interface language to English or Portuguese (Default: en).
--type {"pdf", "docx", "ocr"}: Define the built-in parser type to use (Default: pdf).
--to {"csv", "sqlite", "mysql"}: Force output destination type (Default: csv).
-c, --config <path>: Path to a JSON or TOML configuration file (see Configuration Files Layout).

Examples

Basic processing run:

```
pyingestion /path/to/pdfs -g rules.json
```

Resume an interrupted run:
```
pyingestion /path/to/pdfs --resume
```
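
For readers curious how checkpoint-and-resume of this kind generally works, the sketch below shows the pattern with only the standard library. The actual schema of .gaia_resume.json is not documented in this README, so the `{"done": [...]}` layout, the function names, and the per-file granularity are all hypothetical.

```python
import json
import os
import tempfile


def load_done(state_path: str) -> set[str]:
    """Return the set of files recorded as finished, or an empty set."""
    if not os.path.exists(state_path):
        return set()
    with open(state_path, "r", encoding="utf-8") as f:
        return set(json.load(f)["done"])


def save_done(state_path: str, done: set[str]) -> None:
    """Write the checkpoint via a temp file so an interrupt cannot truncate it."""
    directory = os.path.dirname(os.path.abspath(state_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump({"done": sorted(done)}, f)
    os.replace(tmp_path, state_path)  # atomic rename within one filesystem


def run(files: list[str], state_path: str, resume: bool) -> list[str]:
    """Process files, skipping checkpointed ones when resume is True."""
    done = load_done(state_path) if resume else set()
    processed = []
    for path in files:
        if path in done:
            continue
        processed.append(path)  # stand-in for the real extraction work
        done.add(path)
        save_done(state_path, done)
    return processed


with tempfile.TemporaryDirectory() as workdir:
    state = os.path.join(workdir, "resume_state.json")  # hypothetical file name
    first = run(["a.pdf", "b.pdf"], state, resume=False)
    second = run(["a.pdf", "b.pdf", "c.pdf"], state, resume=True)
```

Saving after every file costs a small write each time, but it bounds the work lost to an interrupt to a single file.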

Test matching logic on a single file:

```
pyingestion -t sample.pdf -g rules.json
```

Configuration Files Layout

You can configure options and pipelines declaratively using a JSON or TOML file via the -c or --config parameter.

1. Basic Configuration Format (Root level or [config] section)

To declare basic CLI options:

```
# config.toml
input_dir = "poc/pdfs"
output = "poc/resultados.csv"
regex = "poc/rules.toml"
to = "csv"
```

Or under a [config] section:

```
# config.toml
[config]
input_dir = "poc/pdfs"
output = "poc/resultados.csv"
regex = "poc/rules.toml"
```

2. Advanced Declarative Pipelines

To define inputs, transforms, and outputs dynamically:

```
# pipeline.toml
input_dir = "poc/pdfs"

[input]
type = "pdf"
pages_per_unit = 2

[transform]
type = "regex"
config_file = "rules.toml"

[output]
type = "sqlite"
db_path = "records.db"
table_name = "pdf_records"
```

You can also define multiple transforms and outputs (e.g. to write to both CSV and SQLite):

```
# multi_pipeline.toml
input_dir = "poc/pdfs"

[input]
type = "pdf"

[[transform]]
type = "regex"
config_file = "rules.toml"

[[output]]
type = "sqlite"
db_path = "records.db"
table_name = "invoices"

[[output]]
type = "csv"
path = "backup.csv"
```

🧪 Testing and Tools

Running the Test Suite

The unit and integration tests validate CLI logic, parser fallbacks, observers, and settings parsing.

```
./tools/linux/run_tests.sh
```

Compiling Localization Catalogs

To recompile updated translation catalogs (.po) into gettext binary files (.mo):

```
./tools/linux/compile_locales.sh
```
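
Once compiled, catalogs laid out as `<lang>/LC_MESSAGES/messages.mo` can be loaded with Python's standard `gettext` module. The sketch below shows the general mechanism rather than the contents of pyingestion/i18n.py: `load_translator` is a hypothetical helper, the domain name `messages` is inferred from the file name in the directory tree, and the message string is made up.

```python
import gettext


def load_translator(lang: str, localedir: str = "pyingestion/locale"):
    """Return a gettext function for `lang`, falling back to the source strings."""
    translation = gettext.translation(
        "messages",           # domain, inferred from the messages.mo file name
        localedir=localedir,
        languages=[lang],
        fallback=True,        # no catalog found: untranslated text is returned
    )
    return translation.gettext


# A directory with no catalogs exercises the fallback path.
_ = load_translator("pt", localedir="/nonexistent/locale")
greeting = _("Processing files")
```

With `fallback=True`, a missing or stale .mo file degrades to the original English text instead of raising an error.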

Project details

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Release history

- 0.5.3b1 (pre-release): Jun 17, 2026
- 0.5.2b1 (pre-release, this version): Jun 15, 2026
- 0.5.1b1 (pre-release): Jun 15, 2026
- 0.5.0b1 (pre-release): Jun 15, 2026

Download files

Source Distribution

pyingestion-0.5.2b1.tar.gz (45.2 kB)

Uploaded Jun 15, 2026 Source

File details

Details for the file pyingestion-0.5.2b1.tar.gz.

File metadata

Download URL: pyingestion-0.5.2b1.tar.gz
Upload date: Jun 15, 2026
Size: 45.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.4

File hashes

Hashes for pyingestion-0.5.2b1.tar.gz

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `a5a96acce274441fe8dea8db59d7e5f6cd1a63509e6880f5960cd957123c1b39` |
| MD5 | `5a7323d66e6eee0d6d3222f6a06ba80b` |
| BLAKE2b-256 | `c48ccbc7f6d4240bd3b468227efb9da0ef2166c4f467a407d330f919f24cba01` |
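
A downloaded archive can be checked against the SHA256 digest listed above with a few lines of standard-library Python. This is a generic sketch: the helper name `sha256_of` is made up, and the archive path assumes the file sits in the current directory.

```python
import hashlib
import os

EXPECTED_SHA256 = "a5a96acce274441fe8dea8db59d7e5f6cd1a63509e6880f5960cd957123c1b39"


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


archive = "pyingestion-0.5.2b1.tar.gz"  # assumed location of the downloaded sdist
if os.path.exists(archive):
    print("match" if sha256_of(archive) == EXPECTED_SHA256 else "MISMATCH")
```

Reading in chunks keeps memory use constant regardless of archive size.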
