
Advanced PDF layout analysis engine for extracting figures, tables, and structured content


Quanta

Advanced PDF Layout Analysis Engine

Python 3.8+ | License: MIT | Code style: black


Quanta is a PDF layout analysis engine that extracts figures, tables, and structured text from PDF documents. It combines custom computer-vision heuristics for layout and figure detection with Mistral Document OCR for tables and text.

🎯 Problem Solved: Traditional PDF tools often lose critical visual information (figures, diagrams, technical drawings) when parsing complex engineering documents. This engine targets accurate detection and extraction of visual elements from technical and engineering PDFs with intricate layouts, multi-column designs, and embedded graphics.

[Figure: Debug overlay showing detected layout elements: columns (blue), text blocks (green), figures (red), and tables (yellow)]

✨ Features

  • 🔍 Multi-column Layout Detection - Automatically identifies and processes complex multi-column layouts
  • 📊 Table Recognition (Mistral OCR) - Extracts tables and text via Mistral Document OCR
  • 🖼️ Figure Extraction (Custom) - Identifies and extracts figures, diagrams, and images using custom algorithms
  • 📝 Text Block Analysis (Mistral + Heuristics) - Uses Mistral OCR output and in-house grouping for reading order
  • 🏷️ Caption Linking - Automatically links captions to their corresponding figures and tables
  • 🎯 Accuracy - About 80% on figures and tables at present (see Performance)
  • ⚡ Processing Speed - About 2-5 seconds per page (see Performance)
  • 🛠️ Easy Integration - Simple API for integration into existing workflows
  • 🔧 Debug Mode - Visualize layout analysis with overlay images

🚀 Quick Start

Install via PyPI

pip install quanta-pdf

Basic Usage (Python)

from quanta import extract_document

result = extract_document("document.pdf", "output/")
print(f"Pages: {len(result['pages'])}")

Command Line Interface

quanta --input document.pdf --output output/

If you want Mistral OCR tables/text, set MISTRAL_API_KEY first (see below).

Environment configuration (.env)

To enable Mistral OCR for tables and text blocks, set your API key. You can either export it or place it in a .env file at your project root.

# Option A: environment variable
export MISTRAL_API_KEY="your-mistral-api-key"

# Option B: .env file (same directory where you run the code)
echo "MISTRAL_API_KEY=your-mistral-api-key" > .env

The library loads .env automatically; the CLI also picks it up when run from that directory.
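Before a long run it can help to confirm that the key is actually visible. The following is a minimal sketch using only the standard library; read_dotenv and mistral_key_available are hypothetical helper names, not part of the quanta API, and the parser handles only plain KEY=VALUE lines.

```python
import os
from pathlib import Path


def read_dotenv(path=".env"):
    """Parse simple KEY=VALUE lines from a .env file into a dict."""
    values = {}
    env_file = Path(path)
    if not env_file.is_file():
        return values
    for raw in env_file.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blanks, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def mistral_key_available(dotenv_path=".env", environ=None):
    """Return True if MISTRAL_API_KEY is set in the environment or the .env file."""
    environ = os.environ if environ is None else environ
    if environ.get("MISTRAL_API_KEY"):
        return True
    return bool(read_dotenv(dotenv_path).get("MISTRAL_API_KEY"))
```

Calling mistral_key_available() from the directory where you run the code reports whether either option above has taken effect.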

📖 Documentation

Core Concepts

Layout Analysis Pipeline

The engine runs a seven-stage pipeline:

  1. PDF Rendering - Converts PDF pages to high-resolution images
  2. Column Detection - Identifies multi-column layouts using whitespace analysis
  3. Text Extraction - Extracts and groups text blocks
  4. Figure Detection - Identifies figures using vector clustering and image analysis
  5. Table & Text Recognition (Mistral OCR) - Leverages Mistral Document OCR to extract tables (CSV) and text blocks
  6. Caption Linking - Links captions to their corresponding figures/tables
  7. Reading Order - Determines proper reading sequence
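Stages 6 and 7 can be illustrated with a small geometric sketch. The source does not describe how Quanta implements them, so the rules below (column-major sorting, and nearest region by vertical gap with horizontal overlap) are assumptions chosen for illustration; reading_order and link_captions are hypothetical names. Boxes are (x0, y0, x1, y1) tuples with y growing downward.

```python
def reading_order(blocks, column_edges):
    """Sort text blocks column by column, top to bottom within each column.

    column_edges: sorted x positions where each column starts, e.g. [0, 300].
    """
    def column_index(box):
        center_x = (box[0] + box[2]) / 2
        # Index of the last column whose left edge is at or before the center.
        return sum(1 for edge in column_edges if edge <= center_x) - 1

    return sorted(blocks, key=lambda b: (column_index(b), b[1], b[0]))


def link_captions(captions, regions):
    """Map each caption index to the index of the nearest figure/table region.

    A region qualifies only if it overlaps the caption horizontally; among
    those, the one with the smallest vertical gap wins. None means no match.
    """
    links = {}
    for ci, cap in enumerate(captions):
        best, best_gap = None, float("inf")
        for ri, reg in enumerate(regions):
            overlap = min(cap[2], reg[2]) - max(cap[0], reg[0])
            if overlap <= 0:
                continue  # no horizontal overlap, likely a different column
            # Vertical gap between the boxes (0 if they touch or overlap).
            gap = max(cap[1] - reg[3], reg[1] - cap[3], 0)
            if gap < best_gap:
                best, best_gap = ri, gap
        links[ci] = best
    return links
```

Requiring horizontal overlap keeps a caption in the left column from being attached to a figure in the right column that happens to sit at a similar height.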

Mathematical Foundations

Column Detection Algorithm:

  • Uses whitespace valley analysis to identify column boundaries
  • Applies Gaussian smoothing to detect consistent vertical gaps
  • Implements adaptive thresholding for varying document layouts
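The three steps above can be sketched in pure Python. This is an illustration of whitespace valley analysis under stated assumptions, not Quanta's actual code: the page is a binarized 2D list (1 = ink), the adaptive threshold is taken as a fraction of the peak smoothed ink density, and the names and defaults (find_column_gaps, sigma, ratio, min_width) are hypothetical.

```python
import math


def gaussian_kernel(sigma):
    """Normalized 1D Gaussian weights covering about three sigma each side."""
    radius = max(1, int(3 * sigma))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(profile, sigma):
    """Convolve the profile with a Gaussian, clamping indices at the edges."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(profile)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)
            acc += w * profile[j]
        out.append(acc)
    return out


def find_column_gaps(page, sigma=2.0, ratio=0.1, min_width=5):
    """Return (start_x, end_x) runs of near-empty columns, end exclusive."""
    height, width = len(page), len(page[0])
    # Vertical projection: fraction of inked pixels in each pixel column.
    profile = [sum(row[x] for row in page) / height for x in range(width)]
    smoothed = smooth(profile, sigma)
    # Adaptive threshold: scales with this page's peak ink density.
    threshold = ratio * max(smoothed)
    gaps, start = [], None
    for x, value in enumerate(smoothed + [float("inf")]):  # sentinel ends the last run
        if value <= threshold:
            if start is None:
                start = x
        else:
            if start is not None and x - start >= min_width:
                gaps.append((start, x))
            start = None
    # Runs touching the page edge are margins, not column separators.
    return [(s, e) for s, e in gaps if s > 0 and e < width]
```

Each returned run marks a candidate gutter; the midpoint of a run is a natural place to split the page into columns.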

Table/Text Extraction:

  • Uses Mistral Document OCR to obtain markdown-like structured output
  • Parses tables into CSV files and groups text into blocks
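As an illustration of the second step, the sketch below converts pipe-style tables in markdown-like text into CSV using only the standard library. It assumes the OCR output uses the common pipe table syntax; the exact format Mistral returns and the parsing Quanta performs are not specified in the source, and markdown_tables and table_to_csv are hypothetical names. Escaped pipes inside cells are not handled.

```python
import csv
import io


def markdown_tables(markdown):
    """Return each pipe table in the text as a list of rows (lists of cells)."""
    tables, current = [], []
    for line in markdown.splitlines() + [""]:  # trailing blank flushes the last table
        stripped = line.strip()
        if len(stripped) > 1 and stripped.startswith("|") and stripped.endswith("|"):
            cells = [c.strip() for c in stripped[1:-1].split("|")]
            # Skip the header separator row, e.g. |---|:--:|
            if all(c and set(c) <= set("-: ") for c in cells):
                continue
            current.append(cells)
        elif current:
            tables.append(current)
            current = []
    return tables


def table_to_csv(rows):
    """Serialize rows to CSV text, quoting cells that contain commas."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()
```

Using the csv module rather than joining on commas matters for cells such as part descriptions that contain commas themselves.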

Figure Detection:

  • Vector clustering using DBSCAN algorithm
  • Aspect ratio analysis to distinguish figures from tables
  • Image XObject extraction for embedded graphics
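The clustering step can be sketched with a small pure-Python DBSCAN over the center points of vector drawing primitives. This is an illustration only: the source does not say which features Quanta clusters or how it applies the aspect ratio, so the inputs, the area filter, the use of aspect ratio to drop rule-like clusters, and all names and defaults here are assumptions.

```python
from collections import deque


def dbscan(points, eps, min_samples):
    """Label each 2D point with a cluster id (0, 1, ...) or -1 for noise."""
    def neighbors(i):
        xi, yi = points[i]
        return [j for j, (xj, yj) in enumerate(points)
                if (xi - xj) ** 2 + (yi - yj) ** 2 <= eps * eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # noise for now; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = deque(seeds)
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            more = neighbors(j)
            if len(more) >= min_samples:
                queue.extend(more)  # j is a core point, so keep expanding
    return labels


def figure_candidates(points, eps=15.0, min_samples=4,
                      min_area=1000.0, max_aspect=8.0):
    """Return bounding boxes of clusters large and compact enough to be figures."""
    labels = dbscan(points, eps, min_samples)
    boxes = []
    for cid in sorted(set(labels) - {-1}):
        xs = [p[0] for p, lab in zip(points, labels) if lab == cid]
        ys = [p[1] for p, lab in zip(points, labels) if lab == cid]
        x0, y0, x1, y1 = min(xs), min(ys), max(xs), max(ys)
        w, h = x1 - x0, y1 - y0
        if w <= 0 or h <= 0:
            continue  # degenerate cluster such as a single rule line
        if w * h >= min_area and max(w / h, h / w) <= max_aspect:
            boxes.append((x0, y0, x1, y1))
    return boxes
```

The neighbor search here is quadratic in the number of points, which is acceptable for a sketch; a spatial index would be the usual choice for pages with many primitives.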

API Reference (package)

extract_document(input_pdf: str | Path, output_dir: str | Path) -> dict

Process a PDF document and extract structured content.

Parameters:

  • input_pdf: Path to the input PDF file
  • output_dir: Directory to save extracted content

Returns:

  • Dict[str, Any]: Processing results containing figures, tables, and metadata. The examples in this document read the pages and summary_path keys.

Example:

from quanta import extract_document
result = extract_document("research_paper.pdf", "output/")
print(result["summary_path"])  # JSON summary path
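To inspect the summary rather than only print its path, a small helper can load it. This sketch assumes, based on the example above, that summary_path points to a JSON file on disk; load_summary is a hypothetical helper, and no particular keys inside the summary are assumed.

```python
import json
from pathlib import Path


def load_summary(result):
    """Return the parsed JSON summary for an extract_document result, or None."""
    summary_path = result.get("summary_path")
    if not summary_path or not Path(summary_path).is_file():
        return None  # key missing or file not written
    with open(summary_path, encoding="utf-8") as handle:
        return json.load(handle)
```

Returning None instead of raising lets batch scripts skip documents whose summary was not produced.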

🎯 Use Cases

Engineering & Technical Documents

  • Technical Drawings: Extract engineering diagrams and CAD drawings
  • Specification Sheets: Parse technical specifications and data tables
  • Engineering Reports: Process complex multi-column technical reports
  • Manufacturing Docs: Extract assembly instructions and part diagrams

Academic Research

  • Extract figures and tables from research papers
  • Analyze document structure and layout
  • Process large collections of academic PDFs

Document Digitization

  • Convert PDF documents to structured data
  • Extract content for database storage
  • Prepare documents for text analysis

Content Management

  • Automatically categorize document content
  • Extract metadata and captions
  • Generate document summaries

Data Analysis

  • Extract tabular data from reports
  • Process financial documents
  • Analyze technical specifications

🔧 Advanced Configuration

Custom Parameters

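# Note: this snippet imports process_pdf from pdf_layout_engine, while the rest
# of this document uses extract_document from the quanta package, whose
# documented signature takes no config argument. Check which entry point your
# installed version provides before relying on these parameters.
from pdf_layout_engine import process_pdf

# Custom processing parameters
config = {
    'min_figure_area': 1000,
    'table_detection_threshold': 0.7,
    'column_detection_sensitivity': 0.8
}

result = process_pdf("document.pdf", "output/", config=config)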

Debug Mode

Enable debug mode to visualize the layout analysis process:

python main.py --debug

This generates overlay images showing:

  • 🟦 Blue rectangles: Column boundaries
  • 🟢 Green rectangles: Text blocks
  • 🟥 Red rectangles: Figures
  • 🟡 Yellow rectangles: Tables

Output Structure

Results are organized per page under the PDF name inside output/.

Example:

output/<pdf_name>/
├── page_01/
│   ├── figures/
│   │   └── figure_01.png
│   ├── tables/
│   │   └── table_01.csv          # tables saved as CSV only (no table PNGs)
│   ├── text/
│   │   └── text_blocks.txt       # text blocks from Mistral OCR
│   └── page_01.png               # full page image
├── page_02/
│   └── ...
├── page_XX_debug_overlay.png     # debug overlay for each processed page (at root)
└── summary.json                  # high-level summary (counts, filenames)

Key points:

  • Tables are saved as CSV files only (no table images).
  • Figures are cropped from the page using custom detection and saved as PNGs.
  • Text blocks (from Mistral OCR) are written to text/text_blocks.txt per page.
  • A full-page PNG is saved in each page_XX/ directory.
  • Debug overlays (page_XX_debug_overlay.png) are saved at the PDF root inside output/<pdf_name>/.
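Given that layout, the extracted files can be gathered per page with pathlib. This is a minimal sketch that relies only on the directory and file names listed above; collect_outputs is a hypothetical helper, not part of the quanta API.

```python
from pathlib import Path


def _names(directory, pattern):
    """Sorted file names matching pattern, or [] if the directory is absent."""
    if not directory.is_dir():
        return []
    return sorted(f.name for f in directory.glob(pattern))


def collect_outputs(pdf_output_dir):
    """Map each page_XX directory name to its figures, tables, and text flag."""
    root = Path(pdf_output_dir)
    pages = {}
    # The is_dir() filter skips page_XX_debug_overlay.png files at the root.
    for page_dir in sorted(p for p in root.glob("page_*") if p.is_dir()):
        pages[page_dir.name] = {
            "figures": _names(page_dir / "figures", "*.png"),
            "tables": _names(page_dir / "tables", "*.csv"),
            "has_text": (page_dir / "text" / "text_blocks.txt").is_file(),
        }
    return pages
```

For example, collect_outputs("output/<pdf_name>") returns one entry per page directory, with empty lists for pages that have no figures or tables.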

📊 Performance

Current Benchmarks

  • Processing Speed: ~2-5 seconds per page
  • Current Accuracy: ~80% for figures and tables
  • Memory Usage: ~200MB for typical documents
  • Supported Formats: PDF 1.4 - PDF 2.0

🚧 Active Development

The engine is under active development with regular updates. We are fine-tuning our base models and aim to raise accuracy from the current ~80% to 90%+ through:

  • Model fine-tuning on engineering document datasets
  • Improved preprocessing pipelines
  • Enhanced feature extraction algorithms
  • Community feedback integration

Optimization Tips

  • Use high-resolution rendering for better accuracy
  • Adjust parameters based on document type
  • Process pages in parallel for batch operations
  • Use debug mode to tune detection parameters
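The batch tip can be sketched with concurrent.futures. Because the documented API takes a whole PDF, this example parallelizes across documents rather than pages. It is a sketch under assumptions: process_batch is a hypothetical helper, the worker is passed in as a callable, and threads are chosen on the guess that OCR calls are network-bound; the source does not state whether extract_document is safe to call from several threads.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_batch(pdf_paths, output_dir, worker, max_workers=4):
    """Run worker(pdf_path, output_dir) for each PDF concurrently.

    Returns (results, errors): two dicts keyed by the PDF path string.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(worker, str(path), str(output_dir)): str(path)
            for path in pdf_paths
        }
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # one bad document should not stop the batch
                errors[path] = exc
    return results, errors
```

A call might look like process_batch(sorted(Path("pdfs").glob("*.pdf")), "output/", extract_document), with Path imported from pathlib and extract_document from quanta.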

🖼️ Examples

Debug Overlay Analysis

[Figure: Debug overlay showing detected layout elements: columns (blue), text blocks (green), figures (red), and tables (yellow)]

Engineering Document Processing

[Figure: Complex engineering document with multi-column layout and technical drawings]

Extracted Figure

[Figure: Automatically extracted figure from a PDF document]

Extracted Table

[Figure: Automatically extracted table with preserved formatting]

Multi-Page Analysis

[Figure: Consistent layout analysis across multiple pages of technical documents]

👥 Contributors

Developers & Maintainers:

🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details.

Development Setup

# Install in development mode
pip install -e ".[dev]"

# Run tests
pytest

# Run with coverage
pytest --cov=src

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • Built with PyMuPDF for PDF processing
  • Uses OpenCV for computer vision operations
  • Inspired by research in document layout analysis

📞 Support


Made with ❤️ for the open source community

