Advanced PDF layout analysis engine for extracting figures, tables, and structured content

Quanta

Advanced PDF Layout Analysis Engine

A PDF layout analysis engine that automatically extracts figures, tables, and structured content from PDF documents using computer vision and machine learning techniques.

🎯 Problem Solved: Traditional PDF tools often lose critical visual information (figures, diagrams, technical drawings) when parsing complex engineering documents. This engine targets accurate detection and extraction of visual elements from technical and engineering PDFs that contain intricate layouts, multi-column designs, and embedded graphics.

Debug overlay showing detected layout elements: columns (blue), text blocks (green), figures (red), and tables (yellow)

✨ Features

🔍 Multi-column Layout Detection - Automatically identifies and processes complex multi-column layouts
📊 Intelligent Table Recognition (Mistral OCR) - Extracts tables and text with high accuracy via Mistral Document OCR
🖼️ Figure Extraction (Custom) - Identifies and extracts figures, diagrams, and images using custom algorithms
📝 Text Block Analysis (Mistral + Heuristics) - Uses Mistral OCR output and in-house grouping for reading order
🏷️ Caption Linking - Automatically links captions to their corresponding figures and tables
🎯 Accuracy - About 80% for figures and tables in current benchmarks (see Performance)
⚡ Processing Speed - About 2-5 seconds per page in current benchmarks
🛠️ Easy Integration - Simple API for integration into existing workflows
🔧 Debug Mode - Visualize layout analysis with overlay images

🚀 Quick Start

Install via PyPI

pip install quanta-pdf

Basic Usage (Python)

from quanta import extract_document

result = extract_document("document.pdf", "output/")
print(f"Pages: {len(result['pages'])}")

Command Line Interface

quanta --input document.pdf --output output/

If you want Mistral OCR tables/text, set MISTRAL_API_KEY first (see below).

Environment configuration (.env)

To enable Mistral OCR for tables and text blocks, set your API key. You can either export it or place it in a .env file at your project root.

# Option A: environment variable
export MISTRAL_API_KEY="your-mistral-api-key"

# Option B: .env file (same directory where you run the code)
echo "MISTRAL_API_KEY=your-mistral-api-key" > .env

The library loads .env automatically; the CLI also picks it up when run from that directory.
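Before starting a long run, it can help to confirm that the key is actually visible. The sketch below is not part of Quanta: `find_api_key` is a hypothetical helper that checks the process environment first and then a simple KEY=value .env file. It handles only the basic format shown above, not the full dotenv syntax.

```python
import os
from pathlib import Path


def find_api_key(name="MISTRAL_API_KEY", env_file=".env", environ=None):
    """Return the key from the environment, else from a simple .env file, else None."""
    environ = os.environ if environ is None else environ
    value = environ.get(name)
    if value:
        return value
    path = Path(env_file)
    if not path.is_file():
        return None
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blanks, comments, and anything that is not KEY=value
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, val = line.partition("=")
        if key.strip() == name:
            return val.strip().strip('"').strip("'") or None
    return None


if __name__ == "__main__":
    print("MISTRAL_API_KEY found" if find_api_key() else "MISTRAL_API_KEY not set")
```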

📖 Documentation

Core Concepts

Layout Analysis Pipeline

The engine runs each document through a seven-stage pipeline:

1. PDF Rendering - Converts PDF pages to high-resolution images
2. Column Detection - Identifies multi-column layouts using whitespace analysis
3. Text Extraction - Extracts and groups text blocks
4. Figure Detection - Identifies figures using vector clustering and image analysis
5. Table & Text Recognition (Mistral OCR) - Leverages Mistral Document OCR to extract tables (CSV) and text blocks
6. Caption Linking - Links captions to their corresponding figures/tables
7. Reading Order - Determines proper reading sequence
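As an illustration of the final stage, here is a minimal sketch of one common way to derive reading order once column boundaries are known: assign each block to a column by its horizontal center, then sort by column and by vertical position. The `Block` class and `reading_order` function are hypothetical names, not Quanta's API, and the sketch assumes a top-left coordinate origin. Blocks that span several columns (titles, wide tables) are not handled.

```python
from dataclasses import dataclass


@dataclass
class Block:
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def reading_order(blocks, column_edges):
    """Sort blocks column by column (left to right), then top to bottom.

    column_edges is a sorted list of x positions that separate columns,
    for example [300.0] for a two-column page split at x=300.
    """
    def column_index(block):
        center = (block.x0 + block.x1) / 2
        # Count how many separators lie at or left of the block center
        return sum(center >= edge for edge in column_edges)

    return sorted(blocks, key=lambda b: (column_index(b), b.y0, b.x0))


blocks = [
    Block("right-top", 320, 50, 580, 100),
    Block("left-bottom", 20, 400, 280, 450),
    Block("left-top", 20, 50, 280, 100),
]
print([b.text for b in reading_order(blocks, [300.0])])
# → ['left-top', 'left-bottom', 'right-top']
```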

Mathematical Foundations

Column Detection Algorithm:

Uses whitespace valley analysis to identify column boundaries
Applies Gaussian smoothing to detect consistent vertical gaps
Implements adaptive thresholding for varying document layouts
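The three ideas above can be sketched in plain Python. This is an illustrative reconstruction under stated assumptions, not the engine's actual code: the page is a binarized grid (1 = ink), the "adaptive" threshold is taken to be a fraction of the mean smoothed ink density, and `find_column_gaps`, `sigma`, `rel_threshold`, and `min_width` are hypothetical names and defaults.

```python
import math


def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian weights truncated at three standard deviations."""
    radius = max(1, int(3 * sigma))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(profile, sigma):
    """Convolve the profile with a Gaussian, clamping indices at the page edges."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(profile)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)
            acc += w * profile[j]
        out.append(acc)
    return out


def find_column_gaps(page, sigma=2.0, rel_threshold=0.2, min_width=5):
    """Return x positions of vertical whitespace valleys between columns.

    page is a list of rows; each row is a list of 0/1 values (1 = ink).
    """
    width = len(page[0])
    # Vertical projection: amount of ink in each pixel column
    profile = [sum(row[x] for row in page) for x in range(width)]
    smoothed = smooth(profile, sigma)
    threshold = rel_threshold * (sum(smoothed) / width)  # scales with ink density
    gaps, start = [], None
    for x, value in enumerate(smoothed + [float("inf")]):  # sentinel closes a trailing run
        if value < threshold and start is None:
            start = x
        elif value >= threshold and start is not None:
            # Runs touching either page edge are margins, not column gutters
            if start > 0 and x < width and x - start >= min_width:
                gaps.append((start + x - 1) / 2)
            start = None
    return gaps
```

Smoothing suppresses narrow gaps between words or characters, so only whitespace that persists across the full page height and a reasonable width survives as a valley.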

Table/Text Extraction:

Uses Mistral Document OCR to obtain markdown-like structured output
Parses tables into CSV files and groups text into blocks
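To make the table step concrete, the sketch below converts pipe-style markdown tables into CSV text. It assumes the OCR output uses standard markdown pipe tables, which the source does not specify in detail; the function names are hypothetical and this is not Quanta's parser. Escaped pipes inside cells and rows consisting only of dashes are not handled.

```python
import csv
import io
import re

# Matches the header rule of a pipe table, e.g. |---|:---:|
SEPARATOR = re.compile(r"^\s*\|?(\s*:?-+:?\s*\|)*\s*:?-+:?\s*\|?\s*$")


def split_row(line):
    """Split one markdown table row into stripped cell strings."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def markdown_tables(markdown):
    """Yield each pipe table in the text as a list of rows (header first)."""
    table = []
    for line in markdown.splitlines():
        if line.strip().startswith("|"):
            if not SEPARATOR.match(line):  # drop the |---|---| rule
                table.append(split_row(line))
        elif table:
            yield table
            table = []
    if table:
        yield table


def to_csv(rows):
    """Serialize rows to CSV text, quoting cells where needed."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


sample = "Intro\n\n| Part | Qty |\n|------|-----|\n| Bolt, M6 | 4 |\n| Nut | 8 |\n\nMore text"
for rows in markdown_tables(sample):
    print(to_csv(rows))
```

Using the csv module rather than joining on commas keeps cells such as "Bolt, M6" intact, because the writer quotes any cell that contains a delimiter.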

Figure Detection:

Vector clustering using DBSCAN algorithm
Aspect ratio analysis to distinguish figures from tables
Image XObject extraction for embedded graphics
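A minimal sketch of the clustering idea follows: group the center points of vector drawing primitives with DBSCAN, then take each cluster's bounding box and aspect ratio. This is a textbook DBSCAN written in plain Python for illustration, not the engine's implementation; the input representation, `eps`, and `min_pts` are assumptions, and the source does not state which aspect-ratio cutoff separates figures from tables, so none is applied here.

```python
import math


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    labels = [None] * len(points)

    def neighbors(i):
        # Includes the point itself, as in the standard definition
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point is a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nearby = neighbors(j)
            if len(nearby) >= min_pts:  # j is a core point, so expand through it
                queue.extend(nearby)
    return labels


def cluster_boxes(points, labels):
    """Return {cluster_id: (x0, y0, x1, y1)} bounding boxes, ignoring noise."""
    boxes = {}
    for (x, y), label in zip(points, labels):
        if label == -1:
            continue
        x0, y0, x1, y1 = boxes.get(label, (x, y, x, y))
        boxes[label] = (min(x0, x), min(y0, y), max(x1, x), max(y1, y))
    return boxes


def aspect_ratio(box):
    """Width divided by height, guarding against zero-height boxes."""
    x0, y0, x1, y1 = box
    return (x1 - x0) / max(y1 - y0, 1e-9)
```

DBSCAN suits this task because the number of figures on a page is not known in advance, and isolated strokes (rules, underlines) fall out as noise instead of being forced into a cluster.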

API Reference (package)

`extract_document(input_pdf: str | Path, output_dir: str | Path) -> dict`

Process a PDF document and extract structured content.

Parameters:

input_pdf: Path to the input PDF file
output_dir: Directory to save extracted content

Returns:

dict: Processing results containing figures, tables, and metadata. The examples in this README read the `pages` and `summary_path` keys.

Example:

from quanta import extract_document
result = extract_document("research_paper.pdf", "output/")
print(result["summary_path"])  # JSON summary path
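Since the summary is a JSON file, it can be loaded with the standard library. The helpers below are a hypothetical sketch, not part of the package; the summary's schema is not documented here, so the code only lists top-level keys and does not assume specific fields.

```python
import json
from pathlib import Path


def load_summary(summary_path):
    """Load the JSON summary written by a run and return it as Python data."""
    return json.loads(Path(summary_path).read_text(encoding="utf-8"))


def top_level_keys(summary):
    """Return the sorted top-level keys, or an empty list if it is not a mapping."""
    return sorted(summary) if isinstance(summary, dict) else []


# Usage, assuming `result` comes from extract_document as in the example above:
# summary = load_summary(result["summary_path"])
# print(top_level_keys(summary))
```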

🎯 Use Cases

Engineering & Technical Documents

Technical Drawings: Extract engineering diagrams and CAD drawings
Specification Sheets: Parse technical specifications and data tables
Engineering Reports: Process complex multi-column technical reports
Manufacturing Docs: Extract assembly instructions and part diagrams

Academic Research

Extract figures and tables from research papers
Analyze document structure and layout
Process large collections of academic PDFs

Document Digitization

Convert PDF documents to structured data
Extract content for database storage
Prepare documents for text analysis

Content Management

Automatically categorize document content
Extract metadata and captions
Generate document summaries

Data Analysis

Extract tabular data from reports
Process financial documents
Analyze technical specifications

🔧 Advanced Configuration

Custom Parameters

# Module name as given here; Quick Start imports from `quanta` instead.
from pdf_layout_engine import process_pdf

# Custom processing parameters
config = {
    'min_figure_area': 1000,
    'table_detection_threshold': 0.7,
    'column_detection_sensitivity': 0.8
}

result = process_pdf("document.pdf", "output/", config=config)

Debug Mode

Enable debug mode to visualize the layout analysis process (this command runs main.py from a source checkout, not the installed quanta CLI):

python main.py --debug

This generates overlay images showing:

🟦 Blue rectangles: Column boundaries
🟢 Green rectangles: Text blocks
🟥 Red rectangles: Figures
🟡 Yellow rectangles: Tables

Output Structure

Results are organized per page under the PDF name inside output/.

Example:

output/<pdf_name>/
├── page_01/
│   ├── figures/
│   │   └── figure_01.png
│   ├── tables/
│   │   └── table_01.csv          # tables saved as CSV only (no table PNGs)
│   ├── text/
│   │   └── text_blocks.txt       # text blocks from Mistral OCR
│   └── page_01.png               # full page image
├── page_02/
│   └── ...
├── page_XX_debug_overlay.png     # debug overlay for each processed page (at root)
└── summary.json                  # high-level summary (counts, filenames)

Key points:

Tables are saved as CSV files only (no table images).
Figures are cropped from the page using custom detection and saved as PNGs.
Text blocks (from Mistral OCR) are written to text/text_blocks.txt per page.
A full-page PNG is saved in each page_XX/ directory.
Debug overlays (page_XX_debug_overlay.png) are saved at the PDF root inside output/<pdf_name>/.
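Given this layout, downstream code can gather the extracted files with pathlib. The sketch below relies only on the directory names shown above; `collect_outputs` is a hypothetical helper, not part of the package.

```python
from pathlib import Path


def _files(directory, pattern):
    """Sorted matches for pattern, or an empty list if the directory is absent."""
    return sorted(directory.glob(pattern)) if directory.is_dir() else []


def collect_outputs(pdf_output_dir):
    """Map each page_XX directory name to its figure PNGs and table CSVs."""
    root = Path(pdf_output_dir)
    pages = {}
    # The is_dir check skips page_XX_debug_overlay.png files at the root
    for page_dir in sorted(p for p in root.glob("page_*") if p.is_dir()):
        pages[page_dir.name] = {
            "figures": _files(page_dir / "figures", "*.png"),
            "tables": _files(page_dir / "tables", "*.csv"),
        }
    return pages


# Usage (the directory name is a placeholder):
# for page, found in collect_outputs("output/document").items():
#     print(page, len(found["figures"]), "figures,", len(found["tables"]), "tables")
```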

📊 Performance

Current Benchmarks

Processing Speed: ~2-5 seconds per page
Current Accuracy: ~80% for figures and tables
Memory Usage: ~200MB for typical documents
Supported Formats: PDF 1.4 - PDF 2.0

🚧 Active Development

We're currently fine-tuning our base models to improve accuracy. The engine is in active development with regular updates to enhance detection performance. We're working towards achieving 90%+ accuracy through:

Model fine-tuning on engineering document datasets
Improved preprocessing pipelines
Enhanced feature extraction algorithms
Community feedback integration

Optimization Tips

Use high-resolution rendering for better accuracy
Adjust parameters based on document type
Process pages in parallel for batch operations
Use debug mode to tune detection parameters
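The documented API works on whole documents, so the sketch below parallelizes across PDFs, not across pages within one file. It is a hedged example and not a feature of the package: `process_batch` is a hypothetical helper, the extraction function is passed in as a parameter, and whether `extract_document` is safe to call from several threads at once is not stated in the source, so test on a small batch first.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path


def process_batch(pdf_paths, output_dir, extract, max_workers=4):
    """Run extract(pdf, output_dir) for each PDF; return (results, errors) dicts."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(extract, str(p), str(output_dir)): Path(p) for p in pdf_paths
        }
        for future in as_completed(futures):
            pdf = futures[future]
            try:
                results[pdf.name] = future.result()
            except Exception as exc:  # keep going if one document fails
                errors[pdf.name] = exc
    return results, errors


# Usage (directory names are placeholders):
# from quanta import extract_document
# results, errors = process_batch(Path("pdfs").glob("*.pdf"), "output/", extract_document)
```

Threads are a reasonable first choice when much of the time is spent waiting on the OCR service; for CPU-bound rendering, a process pool may scale better.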

🖼️ Examples

Debug Overlay Analysis

Debug overlay showing detected layout elements: columns (blue), text blocks (green), figures (red), and tables (yellow)

Engineering Document Processing

Complex engineering document with multi-column layout and technical drawings

Extracted Figure

Automatically extracted figure from PDF document

Extracted Table

Automatically extracted table with preserved formatting

Multi-Page Analysis

Consistent layout analysis across multiple pages of technical documents

👥 Contributors

Developers & Maintainers:

@soovittt - Core Developer
@Manushpm8 - Core Developer
@Magnet-AI - Organization

🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details.

Development Setup

# Install in development mode
pip install -e ".[dev]"

# Run tests
pytest

# Run with coverage
pytest --cov=src

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

Built with PyMuPDF for PDF processing
Uses OpenCV for computer vision operations
Inspired by research in document layout analysis

📞 Support

📧 Email: sovitnayak1258@gmail.com
🐛 Issues: GitHub Issues
💬 Discussions: GitHub Discussions

Made with ❤️ for the open source community

Release history

1.0.5 - Dec 11, 2025 (this version)
1.0.4 - Dec 11, 2025
1.0.3 - Oct 18, 2025
1.0.2 - Oct 9, 2025
1.0.1 - Oct 9, 2025

Download files

Source Distribution

quanta_pdf-1.0.5.tar.gz (61.2 kB) - Uploaded Dec 11, 2025, Source

Built Distribution

quanta_pdf-1.0.5-py3-none-any.whl (63.0 kB) - Uploaded Dec 11, 2025, Python 3

File details

Details for the file quanta_pdf-1.0.5.tar.gz.

File metadata

Download URL: quanta_pdf-1.0.5.tar.gz
Upload date: Dec 11, 2025
Size: 61.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for quanta_pdf-1.0.5.tar.gz
Algorithm	Hash digest
SHA256	`dc81ee26feeace3112af701a43fd1f2ca19ac9366c5c29e26db1daa771144d83`
MD5	`2641589f0e9e853eccf362f6a6e8c0d8`
BLAKE2b-256	`f77463d8c75da5c31c1b8606ef11d0071206149a27ef42700b3b00e26109c91e`

File details

Details for the file quanta_pdf-1.0.5-py3-none-any.whl.

File metadata

Download URL: quanta_pdf-1.0.5-py3-none-any.whl
Upload date: Dec 11, 2025
Size: 63.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for quanta_pdf-1.0.5-py3-none-any.whl
Algorithm	Hash digest
SHA256	`775cba15d41365229edcce0c84801091c04b96f184f3a116c7f7c9d446f24618`
MD5	`60a3279882aa703ee99b5ed0adb293aa`
BLAKE2b-256	`117e54a7260c9c892730186a05cf589380b1a8ec9dea2595d1e13761e3bb6af0`
