General-purpose document-to-CSV data extraction library.
PyDocStructurer (Codename: Gaia): Generalized Document Data Extractor
PyDocStructurer (project codename Gaia) is a versatile and robust document data extraction system designed to retrieve structured key-value pair (KVP) records from text and files. It is packaged both as a programmatic Python library (pydocstructurer) and a feature-rich command-line tool (CLI).
PyDocStructurer has a modular architecture that combines fast native text extraction with an extensible parser interface, aiming for high speed, fidelity, and adaptability to new file formats.
Key Features
- Dual-Purpose Design:
  - Programmatic Library: Integrate the RegexEngine, built-in or custom Parser components, and observers directly into your own codebase.
  - Command-Line Interface: Run parsing pipelines directly from your shell with dynamic dashboards, detailed progress tracking, and configurable execution.
- Extensible Parser Architecture:
  - Fully decoupled document discovery and data extraction. Programmatic users can write and inject custom parsers (e.g., Docx, OCR, XML) by subclassing the abstract Parser class.
- Fast Native PDF Processing:
  - Employs fast native layout-based PDF text extraction (via pypdf) as a built-in default parser.
- Dynamic Terminal Interface (TUI):
  - Real-time metrics rendered via rich.live.
  - Live status dashboard featuring counters for processed files, pages, failures, and a progress bar with numerical Estimated Time of Arrival (ETA).
- Robust Session Resume:
  - Automatically checkpoints progress using a state file (.gaia_resume.json). If interrupted, the --resume flag lets you pick up right where you left off.
- Custom Regex Configurations:
  - Supply custom pattern matching rules via a JSON configuration file.
- Multi-Page Unit Grouping:
  - Group multiple pages as a single unit using --pages-per-unit for patterns that span across page boundaries.
- Internationalization (i18n):
  - Complete user interface and message translation support for English (en) and Portuguese (pt).
- Graceful Interrupt Handlers:
  - Supports clean cancellation via ESC or Ctrl+C, ensuring resources, files, and terminal settings are restored safely.
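The custom regex configuration listed above is supplied as a JSON file. As a rough sketch, such a file could be generated and sanity-checked from Python before a run. This assumes the file mirrors the in-memory rule dictionary used in the library examples (a field name mapped to `regex` and `required` keys); the on-disk schema is not spelled out in this README, and `write_rules` is a hypothetical helper, not part of the package.

```python
import json
import re
import tempfile
from pathlib import Path


def write_rules(rules: dict, path: Path) -> None:
    """Validate each pattern, then save the rules as UTF-8 JSON (hypothetical helper)."""
    for field, rule in rules.items():
        # re.compile raises re.error on a malformed pattern, before anything is written.
        compiled = re.compile(rule["regex"])
        if compiled.groups < 1:
            raise ValueError(f"rule {field!r} needs a capture group for its value")
    path.write_text(json.dumps(rules, indent=2, ensure_ascii=False), encoding="utf-8")


# Assumed schema: field name -> {"regex": ..., "required": ...}
rules = {
    "plate": {"regex": r"Placa:\s*([A-Z]{3}-?\d[A-Z0-9]\d{2})", "required": True},
}

rules_path = Path(tempfile.mkdtemp()) / "rules.json"
write_rules(rules, rules_path)

# Round-trip check: what was written can be read back unchanged.
loaded = json.loads(rules_path.read_text(encoding="utf-8"))
```

Compiling every pattern up front means a typo in a rule surfaces immediately rather than partway through a long batch.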
Project Directory Structure
Gaia/
├── pydocstructurer/
│   ├── __init__.py             # Main entry points exposing library API classes
│   ├── __main__.py             # Main entry point for python -m pydocstructurer
│   ├── cli/
│   │   ├── __init__.py         # CLI subpackage initialization
│   │   ├── cli_helper.py       # CLI arguments parser and prevalidation helper
│   │   └── terminal_ui.py      # Rich TUI display and keyboard input handling
│   ├── pydocstructurer.py      # Main global program class (PyDocStructurer, codename: Gaia)
│   ├── extraction_session.py   # Session progress tracking & state serialization
│   ├── options.py              # Config options container class & parameter validations
│   ├── parser.py               # Abstract Parser base, ParserType Enum, and ParserFactory
│   ├── i18n.py                 # Gettext wrappers and language initialization
│   ├── locale/                 # Compiled translations directory
│   │   ├── en/LC_MESSAGES/messages.mo
│   │   └── pt/LC_MESSAGES/messages.mo
│   ├── observer.py             # Progress notification interface (observer pattern)
│   ├── output_stream.py        # Output stream interfaces (OutputStream, CsvWriteStream, DefaultOutputStream)
│   ├── pdf_parser.py           # Native PDF Parser implementation
│   ├── regex_engine.py         # Abstracted matching engine
│   └── main.py                 # CLI entry point implementation
├── pyproject.toml              # Setuptools PEP 621 packaging definitions
├── requirements.txt            # Package requirements
├── tests/                      # Extensive test suites
└── tools/
    └── linux/
        ├── compile_locales.sh  # Compiles Translation Catalog (.po -> .mo)
        └── run_tests.sh        # Script to execute unittest suite
Requirements & Installation
Prerequisites
- Python 3.10+
Environment Setup & Packaging
- Clone or navigate to the repository:
  cd Trabalho/Gaia
- Set up a virtual environment:
  python -m venv .venv
  source .venv/bin/activate
- Install the package in editable mode:
  pip install -e .
Usage
1. As a Python Library
You can integrate PyDocStructurer directly into your Python scripts.
Orchestrating the Full Pipeline Programmatically
To execute the entire extraction pipeline on a file or directory:
from pydocstructurer import PyDocStructurer, Options
# 1. Configure options programmatically
options = Options()
options.BASE_PATH = "path/to/pdfs"
options.REGEX_FILE = "path/to/rules.json"
options.OUTPUT_CSV = "custom_output.csv"
options.PAGES_PER_UNIT = 1
# 2. Run the orchestrator
controller = PyDocStructurer(options)
success = controller.run()
Creating & Injecting a Custom Parser
You can supply your own extraction parser format by subclassing the abstract base class Parser:
from typing import Generator

from pydocstructurer import PyDocStructurer, Options, Parser, ExtractionSession


class CustomTxtParser(Parser):
    def accepts(self, file_path: str) -> bool:
        # Define what files this parser accepts
        return file_path.lower().endswith(".txt")

    def process_file(
        self,
        file_path: str,
        session: ExtractionSession | None = None,
        pages_per_unit: int = 1
    ) -> Generator[tuple[int, int, str], None, None]:
        # Process the file and yield: (unit_index, total_units, content_text)
        with open(file_path, "r", encoding="utf-8") as f:
            content = f.read()
        yield 1, 1, content


# Inject it into PyDocStructurer orchestrator
options = Options()
options.BASE_PATH = "path/to/text/files"
options.REGEX_FILE = "rules.json"
controller = PyDocStructurer(options, parser=CustomTxtParser())
controller.run()
Using Parser and Engine Components Directly
To parse files manually and match patterns page-by-page:
from pydocstructurer import PdfParser, NativeRegexEngine
# 1. Setup the Regex engine with rules in-memory (dictionary)
regex_rules = {
    "infraction_id": {
        "regex": r"Código da Infração:\s*([A-Za-z0-9-]+)",
        "required": True
    },
    "plate": {
        "regex": r"Placa:\s*([A-Z]{3}-?\d[A-Z0-9]\d{2})",
        "required": True
    }
}
engine = NativeRegexEngine(regex_rules)
# Alternatively, load rules from a JSON file path:
# engine = NativeRegexEngine.from_file("path/to/rules.json")
# 2. Setup the parser
parser = PdfParser()
# 3. Process files programmatically
# The parser yields raw text segments for each page/unit.
# You then normalize the text and parse it using the RegexEngine.
for unit_index, total_units, raw_text in parser.process_file("path/to/infraction.pdf", pages_per_unit=1):
    record = engine.parse(raw_text)
    print("Parsed Record:", record)
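To make the matching step concrete, here is a standard-library sketch of what applying such a rule dictionary to one unit of text might look like. It is not the NativeRegexEngine implementation: `apply_rules` is a hypothetical function, and both taking the first capture group as the value and returning None when a required field is missing are assumptions made for illustration.

```python
import re
from typing import Optional


def apply_rules(rules: dict, text: str) -> Optional[dict]:
    """Return {field: captured value}, or None if a required field is absent."""
    record = {}
    for field, rule in rules.items():
        match = re.search(rule["regex"], text)
        if match:
            record[field] = match.group(1)
        elif rule.get("required", False):
            return None  # assumed policy: drop incomplete records
        else:
            record[field] = ""  # optional field left empty
    return record


rules = {
    "infraction_id": {"regex": r"Código da Infração:\s*([A-Za-z0-9-]+)", "required": True},
    "plate": {"regex": r"Placa:\s*([A-Z]{3}-?\d[A-Z0-9]\d{2})", "required": True},
}

# Made-up sample text standing in for one extracted page.
sample = "Código da Infração: 745-50\nPlaca: ABC1D23\n"
print(apply_rules(rules, sample))
```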
2. Command-Line Interface (CLI)
PyDocStructurer can be run as a global shell command, as a Python module, or as a local script.
# 1. As a global command (after package installation)
pydocstructurer <input_dir> [options]
# 2. As a python module run (from the repository root)
python -m pydocstructurer <input_dir> [options]
Positional Arguments
<input_dir>: Path to the directory containing files to process.
Options
- -o, --output <path>: Custom output CSV file path (Default: output.csv in your working directory).
- -g, --regex <path>: Path to a JSON file containing customized regex extraction rules.
- -r, --recursive: Search for files recursively within subdirectories.
- --resume: Resume processing using checkpoint data from .gaia_resume.json.
- -t, --test <file_path>: Test your regex rules on the first page of the provided file.
- -p, --pages-per-unit <int>: The number of pages/chunks grouped together as a single block for extraction matching (Default: 1).
- -l, --lang {"en", "pt"}: Force the interface language to English or Portuguese (Default: en).
- --type {"pdf"}: Define the built-in parser type to use (Default: pdf).
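The effect of --pages-per-unit can be pictured with a small grouping sketch. The tuple shape follows the (unit_index, total_units, content_text) contract shown in the custom parser example; `group_pages` itself is hypothetical, and joining pages with a newline is an assumption about how a unit's text is assembled.

```python
import re
from typing import Iterator


def group_pages(pages: list[str], pages_per_unit: int = 1) -> Iterator[tuple[int, int, str]]:
    """Yield (unit_index, total_units, text) with consecutive pages joined per unit."""
    if pages_per_unit < 1:
        raise ValueError("pages_per_unit must be >= 1")
    total_units = -(-len(pages) // pages_per_unit)  # ceiling division
    for unit_index, start in enumerate(range(0, len(pages), pages_per_unit), start=1):
        yield unit_index, total_units, "\n".join(pages[start:start + pages_per_unit])


# A label at the end of page 1 and its value at the top of page 2.
pages = ["... Placa:", "ABC1D23 ...", "page three"]
pattern = re.compile(r"Placa:\s*([A-Z]{3}-?\d[A-Z0-9]\d{2})")

single = [bool(pattern.search(text)) for _, _, text in group_pages(pages, 1)]
paired = [bool(pattern.search(text)) for _, _, text in group_pages(pages, 2)]
```

With one page per unit the pattern never matches, because the label and the value sit on different pages; with two pages per unit the first unit contains both and the match succeeds.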
Examples
- Basic processing run:
  pydocstructurer /path/to/pdfs -g rules.json
- Resume an interrupted run:
  pydocstructurer /path/to/pdfs --resume
- Test matching logic on a single file:
  pydocstructurer -t sample.pdf -g rules.json
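For readers curious about how a --resume style checkpoint can work in general, the sketch below records finished files in a JSON state file and skips them on the next run. The actual layout of .gaia_resume.json is not documented here, so the `{"done": [...]}` schema and the `load_done`, `save_done`, and `run` helpers are all assumptions for illustration only.

```python
import json
import os
import tempfile
from pathlib import Path


def load_done(state_path: Path) -> set[str]:
    """Return the set of already processed files, or an empty set if no checkpoint exists."""
    if not state_path.exists():
        return set()
    return set(json.loads(state_path.read_text(encoding="utf-8"))["done"])


def save_done(state_path: Path, done: set[str]) -> None:
    """Write to a temp file, then swap it in, so a crash cannot leave a truncated checkpoint."""
    tmp_path = state_path.with_name(state_path.name + ".tmp")
    tmp_path.write_text(json.dumps({"done": sorted(done)}), encoding="utf-8")
    os.replace(tmp_path, state_path)


def run(files: list[str], state_path: Path, resume: bool) -> list[str]:
    """Process files, skipping checkpointed ones when resume is True; return what was processed."""
    done = load_done(state_path) if resume else set()
    processed = []
    for name in files:
        if name in done:
            continue
        processed.append(name)  # real extraction work would happen here
        done.add(name)
        save_done(state_path, done)  # checkpoint after every file
    return processed


state = Path(tempfile.mkdtemp()) / ".gaia_resume.json"
first = run(["a.pdf", "b.pdf"], state, resume=False)
second = run(["a.pdf", "b.pdf", "c.pdf"], state, resume=True)
```

Checkpointing after each file trades a little extra I/O for losing at most one file's worth of work on interruption.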
Testing and Tools
Running the Test Suite
The unit and integration tests validate CLI logic, parser fallbacks, observers, and settings parsing.
./tools/linux/run_tests.sh
Compiling Localization Catalogs
To recompile updated translation catalogs (.po) into gettext binary files (.mo):
./tools/linux/compile_locales.sh
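Once compiled, .mo catalogs laid out as `<lang>/LC_MESSAGES/messages.mo` can be loaded with Python's standard gettext module. The sketch below assumes the gettext domain is `messages` (inferred from the file name in the directory tree); whether the package's own i18n.py loads catalogs this way is an assumption, and `load_translator` is a hypothetical helper.

```python
import gettext
import tempfile


def load_translator(lang: str, locale_dir: str) -> gettext.NullTranslations:
    """Load the 'messages' catalog for lang, falling back to untranslated strings."""
    return gettext.translation(
        "messages",
        localedir=locale_dir,
        languages=[lang],
        fallback=True,  # return NullTranslations instead of raising if no .mo is found
    )


# An empty directory stands in for a locale folder with no compiled catalogs.
translator = load_translator("pt", tempfile.mkdtemp())
_ = translator.gettext
message = _("Processing complete")
```

With `fallback=True`, a missing catalog degrades to the original message text rather than an exception, which is a common choice when English source strings double as the default language.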
File details
Details for the file pydocstructurer-1.0.3.tar.gz.
File metadata
- Download URL: pydocstructurer-1.0.3.tar.gz
- Upload date:
- Size: 34.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7fe957c93bdaae38c04b579d1c9d6cd426862f970fabd1f41752d28af9956909 |
| MD5 | 361b454417dfd137cd1577e95ad4598a |
| BLAKE2b-256 | a8d143151bb5938f5f4f3e1358c8e4a08a533091602b5e28f6e03229d2f16b2c |
File details
Details for the file pydocstructurer-1.0.3-py3-none-any.whl.
File metadata
- Download URL: pydocstructurer-1.0.3-py3-none-any.whl
- Upload date:
- Size: 32.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ca0c4b695b967553660946b22bc92cd18b1b29dffa3e4b4bd7aea9b054796a85 |
| MD5 | 010c9ca891200ef48844d1859b8aa798 |
| BLAKE2b-256 | 70a8666d7deb97bc543a22dfb6e27fef59c8be34b5a30b08ac84dd1fa713c962 |