abstract-pdfs

A modular OCR and PDF-processing toolkit for automated text extraction, deduplication, and multi-engine column-aware OCR using Tesseract, EasyOCR, and PaddleOCR

Development Status
- 3 - Alpha
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Programming Language
- Python :: 3
- Python :: 3.11

Project description

Abstract PDFs

Abstract PDFs is a modular OCR and PDF-processing toolkit built for automation pipelines. It provides a structured way to ingest, deduplicate, split, and extract text from PDF documents, including column-aware OCR through multiple engines (PaddleOCR, EasyOCR, and Tesseract).

Designed to integrate seamlessly with other Abstract modules (like abstract_ocr and abstract_utilities), it forms the foundation for scalable document analysis, digital archiving, and machine learning dataset preparation.

Features
Installation
Usage
Architecture
Classes
Dependencies
Example Workflow
License

Architecture

The package is organized around two folders: 'pdf_utils', which handles PDF operations, and 'imports', which holds the shared import helpers. Two top-level modules, AbstractPDFManager.py and SliceManager.py, sit alongside them.

abstract_pdfs
├── abstract_pdfs
│   ├── AbstractPDFManager.py # Primary module for managing PDF operations
│   ├── SliceManager.py # Module for handling slice operations
│   ├── imports
│   │   ├── imports.py # General import functions used across the module
│   │   └── manifest_utils.py # Utility functions for manifest handling
│   ├── pdf_utils
│   │   ├── imports.py # Import functions for PDF utilities
│   │   ├── pdf_to_text.py # Converts PDF files to text
│   │   └── pdf_tools.py # Utility functions for PDF manipulation
│   └── __init__.py # Initializer file for the abstract_pdfs module
└── __init__.py # Initializer file for the outer structure

This layout separates concerns: each Python file has one job, such as converting PDFs to text or managing PDF operations, which keeps the module easy to maintain and extend.

Classes

Classes & API

The main classes and their key methods are:

AbstractPDFManager: This class manages various PDF operations.
- convert_pdf_to_image(): Converts PDF files to images.
- extract_text_from_pdf(): Extracts text from PDF files.
- split_pdf(): Splits a PDF into separate pages.
SliceManager: This class handles slice operations.
- generate_slices(): Generates slices from an image.
- save_slices(): Saves the generated slices to a directory.
PDFTools: This class contains utility functions for PDF manipulation.
- merge_pdfs(): Merges multiple PDFs into a single PDF.
- rotate_pdf(): Rotates pages in a PDF.
- resize_pdf(): Resizes pages in a PDF.

Each class has a single responsibility: AbstractPDFManager covers image conversion, text extraction, and splitting; SliceManager covers image slicing; PDFTools covers merging, rotating, and resizing. Keeping these apart encourages code reuse and makes the module easier to maintain and extend.

Features

PDF to Image Conversion: Convert PDF files to images using the convert_pdf_to_image() method, which works with the PIL and pdf2image libraries.
Text Extraction from PDFs: Extract text from PDF files with the extract_text_from_pdf() method. The package relies on abstract_ocr for this functionality.
PDF Manipulation: Split, merge, rotate, and resize PDF files using the split_pdf(), merge_pdfs(), rotate_pdf(), and resize_pdf() methods.
Slice Management: Generate and save slices from an image with generate_slices() and save_slices() methods. This feature is integral to the OCR process.
Modular Architecture: The architecture of this module is designed to be modular, which promotes code reuse and simplifies complex tasks related to PDFs and image processing.
Compatibility: The README states a minimum of Python 3.6; the package classifiers list Python 3 and Python 3.11.
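
The slicing step listed under Slice Management can be pictured as cutting a page image into column boxes before OCR. This README does not show how generate_slices() works, so the following is a hypothetical, stdlib-only sketch: `column_boxes` and its parameters are illustrative names and are not part of abstract_pdfs.

```python
from typing import List, Tuple

# (left, top, right, bottom) in pixels, the box order PIL's Image.crop expects.
Box = Tuple[int, int, int, int]


def column_boxes(width: int, height: int, columns: int, overlap: int = 0) -> List[Box]:
    """Split a page of the given pixel size into vertical column boxes.

    `overlap` widens each box by that many pixels on both sides (clamped to
    the page edges) so text near a column boundary is not cut in half.
    """
    if columns < 1:
        raise ValueError("columns must be at least 1")
    if overlap < 0:
        raise ValueError("overlap must not be negative")

    boxes: List[Box] = []
    for i in range(columns):
        # Integer boundaries that cover the full width with no gaps.
        left = (i * width) // columns
        right = ((i + 1) * width) // columns
        boxes.append((max(0, left - overlap), 0, min(width, right + overlap), height))
    return boxes
```

Each box could then be passed to an image-cropping call and the crops sent to the OCR engine one column at a time. How many columns a page has is an input here; detecting that automatically is a separate problem this sketch does not address.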

Installation

Prerequisites

The abstract_pdfs module requires Python 3.6 or higher. It also needs the following system packages installed:

poppler-utils
tesseract
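
Before installing, it can be useful to confirm that both system packages are reachable from Python. The sketch below is one way to do that with the standard library only. It assumes that poppler-utils is detected through its `pdftoppm` executable and Tesseract through `tesseract`; the helper name `missing_tools` is illustrative and not part of abstract_pdfs.

```python
import shutil
from typing import Callable, Iterable, List, Optional


def missing_tools(
    names: Iterable[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the executables from `names` that cannot be found on PATH.

    `which` is injectable so the check can be exercised without touching
    the real system.
    """
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    # pdftoppm ships with poppler-utils; tesseract ships with Tesseract OCR.
    missing = missing_tools(["pdftoppm", "tesseract"])
    if missing:
        print("Missing system tools:", ", ".join(missing))
    else:
        print("All system prerequisites found.")
```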

Installation with pip

You can install abstract_pdfs via pip.

pip install abstract_pdfs

Installation from source

If you prefer to install from the source, you can clone the repository and use pip to handle the installation:

# Clone the repository

git clone https://github.com/AbstractEndeavors/abstract_pdfs.git

# Navigate to the project directory

cd abstract_pdfs

# Install the package

pip install .
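
After either installation route, a quick way to confirm which release ended up in the environment is to ask the packaging metadata. This is a small sketch using only the standard library (`importlib.metadata`, Python 3.8+); the helper name `installed_version` is illustrative.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("abstract_pdfs")
    print(found if found is not None else "abstract_pdfs is not installed")
```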

License and Summary

The abstract_pdfs module is released under the MIT License and is authored by putkoff. It handles PDF-to-image conversion, text extraction from PDF files, and operations such as splitting, merging, rotating, and resizing PDF files.

For more information on this module, visit the official repository, whose clone URL is given under Installation from source. For other Abstract projects, refer to the AbstractEndeavors GitHub page.

Overview

`abstract_pdfs` - Powerful PDF Handling for the Modern Python Developer - Version 0.0.0.001

Welcome to the documentation for the abstract_pdfs Python module. Authored by putkoff and maintained by AbstractEndeavors, this module is a part of a larger ecosystem of Python tools designed for tackling a host of programming challenges.

The primary purpose of abstract_pdfs is to give developers a flexible interface for managing PDF files. Building on PIL, abstract_ocr, and abstract_utilities, it lets you convert PDFs to images, extract text from PDFs, and split, merge, rotate, and resize PDF files.

Within the broader Abstract ecosystem, abstract_pdfs is the module for PDF management. It works together with abstract_ocr, which supplies the optical character recognition.

This module is in the Alpha stage (Development Status 3) and is aimed at developers. It requires Python 3.6 or higher. Installation and features are covered in the other sections of this README.

Usage

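Here is a basic Python example of how to use the abstract_pdfs module. It uses a PdfHandler class, which the Classes section of this README does not list, so check the names exported by your installed version before relying on it.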

# Import required modules
from abstract_pdfs import PdfHandler

# Specify the required PDF file
path_to_pdf = "/path/to/your/pdf"

# Create an instance of PdfHandler
pdf_handler = PdfHandler(path_to_pdf)

# Now you can perform various operations like
# Converting PDF to Image
image_path = pdf_handler.to_image()

# Extract text from PDF
text = pdf_handler.extract_text()

# The extracted text will be in string format
print(text)

The abstract_pdfs module also comes with built-in support for handling multiple PDF files at once.

# Import required modules
from abstract_pdfs import PdfHandler

# Specify a list of PDF files
paths_to_pdfs = ["/path/to/your/pdf1", "/path/to/your/pdf2", "/path/to/your/pdf3"]

# Create instances of PdfHandler in a single line
pdf_handlers = [PdfHandler(path) for path in paths_to_pdfs]

# Now you can iterate over pdf_handlers to do various operations. For example:
for handler in pdf_handlers:
    # Print the number of pages in each PDF
    print(handler.number_of_pages)

These are just a few basic usage examples. The abstract_pdfs module exposes a rich API for manipulating and interrogating PDF files. Check out the API documentation for a full list of available methods and their descriptions.
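
When handling several PDFs at once, as in the loop above, it can help to drop byte-identical files before creating handlers. The package description mentions deduplication, but this README does not show the API for it, so the following is a stdlib-only sketch under that assumption. The names `file_sha256` and `unique_files` are illustrative and not part of abstract_pdfs.

```python
import hashlib
from pathlib import Path
from typing import Iterable, List, Set, Union

PathLike = Union[str, Path]


def file_sha256(path: PathLike, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large PDFs are never loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def unique_files(paths: Iterable[PathLike]) -> List[Path]:
    """Keep the first occurrence of each distinct file content, in order."""
    seen: Set[str] = set()
    unique: List[Path] = []
    for path in map(Path, paths):
        fingerprint = file_sha256(path)
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(path)
    return unique
```

This only catches exact byte-for-byte duplicates. Two scans of the same document would hash differently and both would be kept.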

Release history

0.0.33: Apr 6, 2026
0.0.32: Mar 28, 2026
0.0.31: Mar 28, 2026 (this version)
0.0.30: Mar 28, 2026
0.0.29: Mar 28, 2026
0.0.28: Mar 28, 2026
0.0.27: Mar 28, 2026
0.0.26: Mar 28, 2026
0.0.25: Mar 28, 2026
0.0.24: Mar 28, 2026
0.0.23: Mar 28, 2026
0.0.22: Mar 28, 2026
0.0.21: Mar 17, 2026
0.0.20: Mar 17, 2026
0.0.19: Mar 17, 2026
0.0.18: Mar 16, 2026
0.0.17: Mar 15, 2026
0.0.16: Mar 15, 2026
0.0.15: Mar 15, 2026
0.0.14: Mar 15, 2026
0.0.13: Mar 15, 2026
0.0.12: Mar 15, 2026
0.0.11: Mar 15, 2026
0.0.10: Mar 15, 2026
0.0.9: Mar 15, 2026
0.0.8: Mar 12, 2026
0.0.7: Mar 11, 2026
0.0.6: Mar 11, 2026
0.0.5: Mar 10, 2026
0.0.4: Oct 21, 2025
0.0.3: Oct 21, 2025
0.0.2: Oct 21, 2025
0.0.1: Oct 21, 2025

Download files

Source Distribution

abstract_pdfs-0.0.31.tar.gz (57.0 kB)

Uploaded Mar 28, 2026 Source

Built Distribution

abstract_pdfs-0.0.31-py3-none-any.whl (75.2 kB)

Uploaded Mar 28, 2026 Python 3

File details

Details for the file abstract_pdfs-0.0.31.tar.gz.

File metadata

Download URL: abstract_pdfs-0.0.31.tar.gz
Upload date: Mar 28, 2026
Size: 57.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.11

File hashes

Hashes for abstract_pdfs-0.0.31.tar.gz
Algorithm	Hash digest
SHA256	`dab2fd74d1f9c4088b2e3c337ae6713d014a523e871d16d752acb558f36847f9`
MD5	`8208553eaa98c118edee7a06adb58974`
BLAKE2b-256	`34773f2ee223b331715556a466ac0645d6d6d500372eb972e7bd13a19218ba55`
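
The digests in the table above can be used to check a downloaded copy of the source distribution. The sketch below computes SHA256 and BLAKE2b-256 with the standard library and compares them against the published values. It assumes BLAKE2b-256 means BLAKE2b with a 32-byte digest; the helper names and the local file path in the usage comment are illustrative.

```python
import hashlib
import hmac
from typing import Dict

# Values copied from the hash table for abstract_pdfs-0.0.31.tar.gz.
PUBLISHED = {
    "SHA256": "dab2fd74d1f9c4088b2e3c337ae6713d014a523e871d16d752acb558f36847f9",
    "BLAKE2b-256": "34773f2ee223b331715556a466ac0645d6d6d500372eb972e7bd13a19218ba55",
}


def file_digests(path: str, chunk_size: int = 1 << 20) -> Dict[str, str]:
    """Compute both digests in a single pass over the file."""
    hashers = {
        "SHA256": hashlib.sha256(),
        "BLAKE2b-256": hashlib.blake2b(digest_size=32),
    }
    with open(path, "rb") as handle:
        while chunk := handle.read(chunk_size):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


def matches_published(path: str, published: Dict[str, str]) -> bool:
    """True only if every published digest we can compute matches the file."""
    actual = file_digests(path)
    checked = [name for name in published if name in actual]
    if not checked:
        raise ValueError("no supported algorithm in the published digests")
    return all(
        hmac.compare_digest(actual[name], published[name].lower())
        for name in checked
    )


# Usage, assuming the archive was downloaded to the current directory:
# matches_published("abstract_pdfs-0.0.31.tar.gz", PUBLISHED)
```

MD5 is left out on purpose: it is not suitable for integrity checks against tampering, and some Python builds disable it.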

File details

Details for the file abstract_pdfs-0.0.31-py3-none-any.whl.

File metadata

Download URL: abstract_pdfs-0.0.31-py3-none-any.whl
Upload date: Mar 28, 2026
Size: 75.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.11

File hashes

Hashes for abstract_pdfs-0.0.31-py3-none-any.whl
Algorithm	Hash digest
SHA256	`76ef208c9a8d3dec20488398256c00d889b0b1d90a6c53bcdc26c01817f3dac5`
MD5	`059fc8329f25b4f640dd874e42f029dd`
BLAKE2b-256	`c0043eae03f3e78e63924bfca50ba145e036a49c23c51f6145f4955132c5313e`
