A high-performance PDF to JSON extraction library with layout-aware text extraction
pdf_2_json_extractor
A high-performance Python library for extracting structured content from PDF documents with layout-aware text extraction. pdf_2_json_extractor preserves document structure, including headings (H1-H6) and body text, and outputs clean JSON.
Features
- Layout-aware extraction: Detects document structure including headings of different levels using font size and style analysis
- Multilingual support: Handles Latin, Cyrillic, Asian scripts (Chinese, Japanese, Korean), Arabic, Hebrew, and other complex Unicode scripts
- High performance: Processes 50-page PDFs in ≤10 seconds on modern CPUs
- Small footprint: Minimal dependencies, no heavy ML models used
- Offline operation: No internet connectivity required to run
- Cross-platform: AMD64 compatible, runs purely on CPU
- Easy to use: Simple API with both programmatic and CLI interfaces
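The font-size idea behind layout-aware extraction can be illustrated with a small sketch. This is not the library's actual algorithm; it assumes the most frequent font size is body text and that every larger size is a heading, ranked from largest to smallest. The function name `assign_heading_levels` is hypothetical.

```python
def assign_heading_levels(font_histogram, max_levels=6):
    """Map font sizes larger than the body size to H1..Hn, largest first.

    Illustrative heuristic only (an assumption, not the library's code).
    """
    if not font_histogram:
        return {}
    # Assumption: the most frequently used size is the body text size.
    body_size = max(font_histogram, key=font_histogram.get)
    # Every size above the body size is treated as a heading candidate.
    larger = sorted((s for s in font_histogram if s > body_size), reverse=True)
    return {
        size: f"H{rank}"
        for rank, size in enumerate(larger[:max_levels], start=1)
    }


histogram = {12.0: 1500, 14.0: 200, 16.0: 50}
print(assign_heading_levels(histogram))  # {16.0: 'H1', 14.0: 'H2'}
```

A real implementation would likely also consider font style and how often a size occurs, as the feature list suggests.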
Installation
pip install pdf_2_json_extractor
Quick Start
Python API
import pdf_2_json_extractor
# Extract PDF to dictionary
result = pdf_2_json_extractor.extract_pdf_to_dict("document.pdf")
print(f"Title: {result['title']}")
print(f"Number of sections: {result['stats']['num_sections']}")
# Extract PDF to JSON string
json_output = pdf_2_json_extractor.extract_pdf_to_json("document.pdf")
print(json_output)
# Save to file
pdf_2_json_extractor.extract_pdf_to_json("document.pdf", "output.json")
Command Line Interface
# Extract to stdout
pdf_2_json_extractor document.pdf
# Save to file
pdf_2_json_extractor document.pdf -o output.json
# Compact output
pdf_2_json_extractor document.pdf --compact
# Pretty print (default)
pdf_2_json_extractor document.pdf --pretty
JSON Output Format
{
  "title": "Document Title",
  "sections": [
    {
      "level": "H1",
      "title": "Chapter 1: Introduction",
      "paragraphs": ["This is the introduction text..."]
    },
    {
      "level": "H2",
      "title": "1.1 Overview",
      "paragraphs": ["Overview content..."]
    },
    {
      "level": "content",
      "title": null,
      "paragraphs": ["Body text content..."]
    }
  ],
  "font_histogram": {
    "12.0": 1500,
    "14.0": 200,
    "16.0": 50
  },
  "heading_levels": {
    "16.0": "H1",
    "14.0": "H2"
  },
  "stats": {
    "page_count": 25,
    "processing_time": 2.34,
    "num_sections": 15,
    "num_headings": 8,
    "num_paragraphs": 45
  }
}
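As a sketch of how this output might be consumed, the snippet below builds an indented outline from the `sections` list. It relies only on the fields shown above; the helper name `build_outline` and the inline sample data are illustrative, not part of the library.

```python
def build_outline(result):
    """Return indented heading titles from an extraction result dict."""
    lines = []
    for section in result["sections"]:
        level = section["level"]
        # Skip body-text sections such as "content", which have no title.
        if not (level.startswith("H") and level[1:].isdigit()):
            continue
        depth = int(level[1:]) - 1  # H1 -> 0, H2 -> 1, ...
        lines.append("  " * depth + section["title"])
    return lines


# Sample data mirroring the format above (hand-written for illustration).
sample = {
    "title": "Document Title",
    "sections": [
        {"level": "H1", "title": "Chapter 1: Introduction", "paragraphs": ["..."]},
        {"level": "H2", "title": "1.1 Overview", "paragraphs": ["..."]},
        {"level": "content", "title": None, "paragraphs": ["..."]},
    ],
}
print("\n".join(build_outline(sample)))
```

In practice, `sample` would be replaced by the dictionary returned from `extract_pdf_to_dict`.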
Advanced Usage
Custom Configuration
from pdf_2_json_extractor import PDFStructureExtractor, Config
# Create custom configuration
config = Config()
config.MAX_PAGES_FOR_FONT_ANALYSIS = 5
config.MIN_HEADING_FREQUENCY = 0.002
# Use with custom config
extractor = PDFStructureExtractor(config)
result = extractor.extract_text_with_structure("document.pdf")
Error Handling
from pdf_2_json_extractor import extract_pdf_to_dict
from pdf_2_json_extractor.exceptions import PdfToJsonError, InvalidPDFError, PDFFileNotFoundError
try:
    result = extract_pdf_to_dict("document.pdf")
except PDFFileNotFoundError:
    print("PDF file not found")
except InvalidPDFError:
    print("Invalid or corrupted PDF file")
except PdfToJsonError as e:
    print(f"Processing error: {e}")
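When processing many files, the same pattern can be wrapped so that one bad PDF does not stop the run. The helper below is a hypothetical sketch: `extract_many` is not part of the library, and the extraction function is passed in as a parameter so the example runs without any PDFs on disk.

```python
from typing import Any, Callable, Dict, Iterable, Tuple, Type


def extract_many(
    paths: Iterable[str],
    extract: Callable[[str], Dict[str, Any]],
    errors: Tuple[Type[BaseException], ...] = (Exception,),
) -> Tuple[Dict[str, Dict[str, Any]], Dict[str, str]]:
    """Run `extract` on each path, collecting results and failures separately."""
    results: Dict[str, Dict[str, Any]] = {}
    failures: Dict[str, str] = {}
    for path in paths:
        key = str(path)
        try:
            results[key] = extract(key)
        except errors as exc:
            # Record the error and continue with the next file.
            failures[key] = f"{type(exc).__name__}: {exc}"
    return results, failures


# Stand-in extractor used only for this demonstration.
def fake_extract(path: str) -> Dict[str, Any]:
    if path.endswith("broken.pdf"):
        raise ValueError("not a PDF")
    return {"title": path, "sections": []}


ok, failed = extract_many(["a.pdf", "broken.pdf"], fake_extract)
print(sorted(ok))  # ['a.pdf']
print(failed)      # {'broken.pdf': 'ValueError: not a PDF'}
```

With the real library, one would presumably pass `extract_pdf_to_dict` as `extract` and `(PdfToJsonError,)` as `errors`, assuming the more specific exceptions derive from `PdfToJsonError`.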
Configuration Options
You can configure pdf_2_json_extractor using environment variables:
# Font analysis settings
export PDF_TO_JSON_MAX_PAGES_FOR_FONT_ANALYSIS=10
export PDF_TO_JSON_FONT_SIZE_PRECISION=0.1
export PDF_TO_JSON_MIN_HEADING_FREQUENCY=0.001
# Text processing settings
export PDF_TO_JSON_MIN_TEXT_LENGTH=3
export PDF_TO_JSON_MAX_HEADING_LEVELS=6
export PDF_TO_JSON_COMBINE_CONSECUTIVE_TEXT=True
# Language support
export PDF_TO_JSON_MULTILINGUAL_SUPPORT=True
export PDF_TO_JSON_DEFAULT_ENCODING=utf-8
# Performance settings
export PDF_TO_JSON_PROCESS_PAGES_IN_CHUNKS=False
export PDF_TO_JSON_CHUNK_SIZE=10
# Debug settings
export PDF_TO_JSON_DEBUG_MODE=False
export PDF_TO_JSON_LOG_LEVEL=INFO
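How the library parses these variables internally is not documented here. The sketch below shows one common way to read prefixed environment variables into typed values; the helper `read_setting` and the defaults passed to it are assumptions for illustration only.

```python
import os


def read_setting(name, default, prefix="PDF_TO_JSON_"):
    """Read PREFIX+name from the environment, coerced to the type of `default`."""
    raw = os.environ.get(prefix + name)
    if raw is None:
        return default
    # Check bool before int, because bool is a subclass of int in Python.
    if isinstance(default, bool):
        return raw.strip().lower() in {"1", "true", "yes", "on"}
    if isinstance(default, int):
        return int(raw)
    if isinstance(default, float):
        return float(raw)
    return raw


# Simulate the `export` lines above for this demonstration.
os.environ["PDF_TO_JSON_CHUNK_SIZE"] = "20"
os.environ["PDF_TO_JSON_DEBUG_MODE"] = "True"
os.environ["PDF_TO_JSON_MIN_HEADING_FREQUENCY"] = "0.002"

print(read_setting("CHUNK_SIZE", 10))                # 20
print(read_setting("DEBUG_MODE", False))             # True
print(read_setting("MIN_HEADING_FREQUENCY", 0.001))  # 0.002
```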
Development
Installation from Source
# Install the released package from PyPI
pip install pdf_2_json_extractor
# Or clone the repository and install it in editable mode
git clone https://github.com/your-username/pdf_2_json_extractor.git
cd pdf_2_json_extractor
pip install -e .
Building the Library
# Build the package
./build.sh
# Or manually
python -m build
Running Tests
pip install -e ".[dev]"
pytest
Docker Development
# Build Docker image
docker build -t pdf_2_json_extractor:latest .
# Run with Docker
docker run --rm -v "$(pwd)/test:/test" pdf_2_json_extractor:latest /test/document.pdf
Performance
pdf_2_json_extractor is optimized for high performance:
- CPU-only processing: No GPU requirements
- Memory efficient: Processes large documents without excessive memory usage
- Fast extraction, with typical processing times of:
  - 10-page document: ~1-2 seconds
  - 50-page document: ~5-10 seconds
  - 100-page document: ~15-25 seconds
Supported Languages
pdf_2_json_extractor supports text extraction from PDFs containing:
- Latin scripts (English, Spanish, French, German, etc.)
- Cyrillic scripts (Russian, Bulgarian, Serbian, etc.)
- Asian scripts (Chinese, Japanese, Korean)
- Arabic and Hebrew scripts
- Other Unicode scripts
License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
References
This library is inspired by the research paper:
"Layout-Aware Text Extraction from Full-text PDF of Scientific Articles"
Cartic Ramakrishnan, Abhishek Patnia, Eduard Hovy, Gully APC Burns
Published in Source Code for Biology and Medicine (2012)
Support
For questions, issues, or contributions:
- 📧 Email: rishibalapure12@gmail.com
- 🐛 Issues: GitHub Issues
- 📖 Documentation: GitHub Wiki