Advanced PDF layout analysis engine for extracting figures, tables, and structured content
A powerful and intelligent PDF layout analysis engine that automatically extracts figures, tables, and structured content from PDF documents using advanced computer vision and machine learning techniques.
Problem Solved: Complex engineering documents often lose critical visual information (figures, diagrams, technical drawings) when being parsed by traditional PDF tools. This engine specifically addresses the challenge of accurately detecting and extracting visual elements from technical and engineering PDFs that contain intricate layouts, multi-column designs, and embedded graphics.
Debug overlay showing detected layout elements: columns (blue), text blocks (green), figures (red), and tables (yellow)
Features
- Multi-column Layout Detection - Automatically identifies and processes complex multi-column layouts
- Intelligent Table Recognition (Mistral OCR) - Extracts tables and text with high accuracy via Mistral Document OCR
- Figure Extraction (Custom) - Identifies and extracts figures, diagrams, and images using custom algorithms
- Text Block Analysis (Mistral + Heuristics) - Uses Mistral OCR output and in-house grouping for reading order
- Caption Linking - Automatically links captions to their corresponding figures and tables
- High Accuracy - Advanced algorithms ensure reliable content extraction
- Fast Processing - Optimized for speed and efficiency
- Easy Integration - Simple API for integration into existing workflows
- Debug Mode - Visualize layout analysis with overlay images
Quick Start
Install via PyPI
pip install quanta-pdf
Basic Usage (Python)
from quanta import extract_document
result = extract_document("document.pdf", "output/")
print(f"Pages: {len(result['pages'])}")
Command Line Interface
quanta --input document.pdf --output output/
If you want Mistral OCR tables/text, set MISTRAL_API_KEY first (see below).
Environment configuration (.env)
To enable Mistral OCR for tables and text blocks, set your API key. You can either export it or place it in a .env file at your project root.
# Option A: environment variable
export MISTRAL_API_KEY="your-mistral-api-key"
# Option B: .env file (same directory where you run the code)
echo "MISTRAL_API_KEY=your-mistral-api-key" > .env
The library loads .env automatically; the CLI also picks it up when run from that directory.
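How the library reads .env internally is not shown here. As a rough illustration of the file format only, a stdlib-only parser and a pre-flight check could look like the sketch below. `parse_env_text` and `has_mistral_key` are hypothetical helper names and are not part of quanta.

```python
import os


def parse_env_text(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks and # comments (hypothetical helper)."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def has_mistral_key(env_text: str = "") -> bool:
    """True if MISTRAL_API_KEY is set in the environment or in the given .env text."""
    if os.environ.get("MISTRAL_API_KEY"):
        return True
    return bool(parse_env_text(env_text).get("MISTRAL_API_KEY"))
```

Calling `has_mistral_key(open(".env").read())` before a long batch run is one way to fail early when the key is missing.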
Documentation
Core Concepts
Layout Analysis Pipeline
The engine runs a multi-stage pipeline:
- PDF Rendering - Converts PDF pages to high-resolution images
- Column Detection - Identifies multi-column layouts using whitespace analysis
- Text Extraction - Extracts and groups text blocks
- Figure Detection - Identifies figures using vector clustering and image analysis
- Table & Text Recognition (Mistral OCR) - Leverages Mistral Document OCR to extract tables (CSV) and text blocks
- Caption Linking - Links captions to their corresponding figures/tables
- Reading Order - Determines proper reading sequence
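The Caption Linking step in the pipeline above can be pictured as a nearest-neighbour match between caption boxes and figure boxes. The sketch below is one plausible heuristic, not the engine's actual rule: the `Box` type, the `max_gap` value, and the requirement of horizontal overlap are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Box:
    x0: float
    y0: float
    x1: float
    y1: float


def horizontal_overlap(a: Box, b: Box) -> float:
    """Width of the shared x-range of two boxes (0 if they do not overlap)."""
    return max(0.0, min(a.x1, b.x1) - max(a.x0, b.x0))


def link_captions(captions: list[Box], figures: list[Box],
                  max_gap: float = 40.0) -> dict[int, int]:
    """Map caption index -> index of the closest figure directly above or below it."""
    links = {}
    for ci, cap in enumerate(captions):
        best = None
        for fi, fig in enumerate(figures):
            if horizontal_overlap(cap, fig) <= 0:
                continue  # caption and figure sit in different columns
            # Vertical gap between the two boxes (0 if they overlap vertically).
            gap = max(cap.y0 - fig.y1, fig.y0 - cap.y1, 0.0)
            if gap <= max_gap and (best is None or gap < best[0]):
                best = (gap, fi)
        if best is not None:
            links[ci] = best[1]
    return links
```

Captions with no figure within `max_gap` are left unlinked rather than forced onto a distant figure.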
Mathematical Foundations
Column Detection Algorithm:
- Uses whitespace valley analysis to identify column boundaries
- Applies Gaussian smoothing to detect consistent vertical gaps
- Implements adaptive thresholding for varying document layouts
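A minimal NumPy sketch of the three ideas above, under stated assumptions: the page is given as a binary "ink" mask, the threshold is a fixed fraction of the smoothed profile's peak, and `sigma`, `rel_threshold`, and `min_width` are illustrative values rather than the engine's settings.

```python
import numpy as np


def gaussian_kernel(sigma: float) -> np.ndarray:
    """Normalized 1D Gaussian kernel truncated at 3 sigma."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-(x ** 2) / (2 * sigma ** 2))
    return kernel / kernel.sum()


def find_column_gaps(ink: np.ndarray, sigma: float = 2.0,
                     rel_threshold: float = 0.1,
                     min_width: int = 5) -> list[tuple[int, int]]:
    """Return (start, end) x-ranges of interior whitespace valleys.

    ink is a 2D array that is nonzero where a pixel is dark.
    """
    profile = ink.sum(axis=0).astype(float)           # ink per pixel column
    smooth = np.convolve(profile, gaussian_kernel(sigma), mode="same")
    threshold = rel_threshold * smooth.max()          # adapts to page density
    low = smooth <= threshold

    gaps, start = [], None
    for x, is_low in enumerate(low):
        if is_low and start is None:
            start = x
        elif not is_low and start is not None:
            gaps.append((start, x))
            start = None
    if start is not None:
        gaps.append((start, len(low)))

    width = len(low)
    # Ignore page margins and valleys too narrow to be a column gutter.
    return [(a, b) for a, b in gaps if a > 0 and b < width and b - a >= min_width]
```

Each returned range marks a candidate gutter; the midpoint of the range is a natural column boundary.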
Table/Text Extraction:
- Uses Mistral Document OCR to obtain markdown-like structured output
- Parses tables into CSV files and groups text into blocks
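To make the table step concrete, here is a small stdlib-only sketch that turns a pipe-delimited markdown table into CSV text. It assumes simple tables with no escaped pipes inside cells; the function names are illustrative and the engine's own parser may differ.

```python
import csv
import io


def markdown_table_to_rows(markdown: str) -> list[list[str]]:
    """Extract cell rows from pipe-delimited markdown table lines."""
    rows = []
    for line in markdown.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # not a table line
        if line.endswith("|"):
            line = line[:-1]
        cells = [cell.strip() for cell in line[1:].split("|")]
        # Skip the header separator row, e.g. |---|:---:|
        if all(cell and set(cell) <= set("-: ") for cell in cells):
            continue
        rows.append(cells)
    return rows


def rows_to_csv(rows: list[list[str]]) -> str:
    """Serialize rows as CSV, quoting cells that contain commas."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()
```

Using the `csv` module rather than joining with commas keeps cells such as `Bolt, M6` intact.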
Figure Detection:
- Vector clustering using DBSCAN algorithm
- Aspect ratio analysis to distinguish figures from tables
- Image XObject extraction for embedded graphics
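The clustering idea can be sketched without any dependencies. The example below is a plain-Python DBSCAN over 2D points, assuming each point is the center of one vector drawing primitive; the `eps` and `min_samples` values used by the engine are not documented here, so any values passed in are placeholders.

```python
import math


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    px, py = points[i]
    return [j for j, (qx, qy) in enumerate(points)
            if math.hypot(px - qx, py - qy) <= eps]


def dbscan(points, eps, min_samples):
    """Return one label per point: 0..k-1 for clusters, -1 for noise."""
    UNVISITED, NOISE = -2, -1
    labels = [UNVISITED] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] != UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_samples:
            labels[i] = NOISE
            continue
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster      # border point: join, do not expand
            if labels[j] != UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # core point: keep growing
        cluster += 1
    return labels


def cluster_bounds(points, labels):
    """Bounding box (x0, y0, x1, y1) of each cluster, ignoring noise."""
    bounds = {}
    for (x, y), label in zip(points, labels):
        if label < 0:
            continue
        x0, y0, x1, y1 = bounds.get(label, (x, y, x, y))
        bounds[label] = (min(x0, x), min(y0, y), max(x1, x), max(y1, y))
    return bounds
```

Each cluster's bounding box is a candidate figure region, which an aspect-ratio check could then accept or reject.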
API Reference (package)
extract_document(input_pdf: str | Path, output_dir: str | Path) -> dict
Process a PDF document and extract structured content.
Parameters:
input_pdf: Path to the input PDF file
output_dir: Directory to save extracted content
Returns:
Dict[str, Any]: Processing results containing figures, tables, and metadata
Example:
from quanta import extract_document
result = extract_document("research_paper.pdf", "output/")
print(result["summary_path"]) # JSON summary path
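Since `result["summary_path"]` points at a JSON file, a short helper can load it and list what it contains. The summary's schema is not documented here, so the sketch below only reports top-level keys and their types; `load_summary` and `describe_summary` are hypothetical names.

```python
import json
from pathlib import Path


def load_summary(summary_path) -> dict:
    """Read the JSON summary written alongside the extracted content."""
    with Path(summary_path).open(encoding="utf-8") as handle:
        return json.load(handle)


def describe_summary(summary: dict) -> list[str]:
    """List top-level keys with their value types, sorted by key."""
    return [f"{key}: {type(value).__name__}" for key, value in sorted(summary.items())]
```

A typical call would be `describe_summary(load_summary(result["summary_path"]))`, which is a quick way to discover the fields before relying on them.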
Use Cases
Engineering & Technical Documents
- Technical Drawings: Extract engineering diagrams and CAD drawings
- Specification Sheets: Parse technical specifications and data tables
- Engineering Reports: Process complex multi-column technical reports
- Manufacturing Docs: Extract assembly instructions and part diagrams
Academic Research
- Extract figures and tables from research papers
- Analyze document structure and layout
- Process large collections of academic PDFs
Document Digitization
- Convert PDF documents to structured data
- Extract content for database storage
- Prepare documents for text analysis
Content Management
- Automatically categorize document content
- Extract metadata and captions
- Generate document summaries
Data Analysis
- Extract tabular data from reports
- Process financial documents
- Analyze technical specifications
Advanced Configuration
Custom Parameters
from pdf_layout_engine import process_pdf
# Custom processing parameters
config = {
    'min_figure_area': 1000,
    'table_detection_threshold': 0.7,
    'column_detection_sensitivity': 0.8,
}
result = process_pdf("document.pdf", "output/", config=config)
Debug Mode
Enable debug mode to visualize the layout analysis process:
python main.py --debug
This generates overlay images showing:
- Blue rectangles: Column boundaries
- Green rectangles: Text blocks
- Red rectangles: Figures
- Yellow rectangles: Tables
Output Structure
Results are organized per page under the PDF name inside output/.
Example:
output/<pdf_name>/
├── page_01/
│   ├── figures/
│   │   └── figure_01.png
│   ├── tables/
│   │   └── table_01.csv          # tables saved as CSV only (no table PNGs)
│   ├── text/
│   │   └── text_blocks.txt       # text blocks from Mistral OCR
│   └── page_01.png               # full page image
├── page_02/
│   └── ...
├── page_XX_debug_overlay.png     # debug overlay for each processed page (at root)
└── summary.json                  # high-level summary (counts, filenames)
Key points:
- Tables are saved as CSV files only (no table images).
- Figures are cropped from the page using custom detection and saved as PNGs.
- Text blocks (from Mistral OCR) are written to text/text_blocks.txt per page.
- A full-page PNG is saved in each page_XX/ directory.
- Debug overlays (page_XX_debug_overlay.png) are saved at the PDF root inside output/<pdf_name>/.
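Given the layout above, downstream code can gather the per-page artifacts with `pathlib`. The sketch below relies only on the directory and file naming shown in the example tree; `collect_outputs` is a hypothetical helper, not part of the package.

```python
from pathlib import Path


def collect_outputs(pdf_output_dir) -> dict[str, dict[str, list[str]]]:
    """Map each page_XX directory to its figure PNGs and table CSVs."""
    root = Path(pdf_output_dir)
    pages = {}
    # page_XX_debug_overlay.png also matches "page_*", so keep directories only.
    for page_dir in sorted(p for p in root.glob("page_*") if p.is_dir()):
        pages[page_dir.name] = {
            "figures": sorted(f.name for f in (page_dir / "figures").glob("*.png")),
            "tables": sorted(f.name for f in (page_dir / "tables").glob("*.csv")),
        }
    return pages
```

Pages without figures or tables simply map to empty lists, so callers do not need to check whether the subdirectories exist.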
Performance
Current Benchmarks
- Processing Speed: ~2-5 seconds per page
- Current Accuracy: ~80% for figures and tables
- Memory Usage: ~200MB for typical documents
- Supported Formats: PDF 1.4 - PDF 2.0
Active Development
We're currently fine-tuning our base models to improve accuracy. The engine is in active development with regular updates to enhance detection performance. We're working towards achieving 90%+ accuracy through:
- Model fine-tuning on engineering document datasets
- Improved preprocessing pipelines
- Enhanced feature extraction algorithms
- Community feedback integration
Optimization Tips
- Use high-resolution rendering for better accuracy
- Adjust parameters based on document type
- Process pages in parallel for batch operations
- Use debug mode to tune detection parameters
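The public API shown earlier works per document, so one way to apply the parallelism tip is to run several PDFs concurrently. The sketch below uses a thread pool and takes the per-document function as a parameter; `run_batch` is a hypothetical helper, and whether `extract_document` is safe to call from multiple threads is an assumption that should be verified before relying on it.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_batch(pdf_paths, worker, max_workers=4):
    """Apply worker(path) to each path concurrently.

    Returns {path: result}, with the raised exception stored in place of
    the result for any document that failed.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, path): path for path in pdf_paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # keep going if one document fails
                results[path] = exc
    return results
```

For example, `run_batch(paths, lambda p: extract_document(p, "output/"))` would process each PDF into its own folder under output/. A thread pool suits the Mistral OCR calls, which are network-bound; CPU-heavy rendering may benefit more from a process pool.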
Examples
Debug Overlay Analysis
Debug overlay showing detected layout elements: columns (blue), text blocks (green), figures (red), and tables (yellow)
Engineering Document Processing
Complex engineering document with multi-column layout and technical drawings
Extracted Figure
Automatically extracted figure from PDF document
Extracted Table
Automatically extracted table with preserved formatting
Multi-Page Analysis
Consistent layout analysis across multiple pages of technical documents
Contributors
Developers & Maintainers:
- @soovittt - Core Developer
- @Manushpm8 - Core Developer
- @Magnet-AI - Organization
Contributing
We welcome contributions! Please see our Contributing Guide for details.
Development Setup
# Install in development mode
pip install -e ".[dev]"
# Run tests
pytest
# Run with coverage
pytest --cov=src
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- Built with PyMuPDF for PDF processing
- Uses OpenCV for computer vision operations
- Inspired by research in document layout analysis
Support
- Email: sovitnayak1258@gmail.com
- Issues: GitHub Issues
- Discussions: GitHub Discussions
Made with ❤️ for the open source community