
A high-performance PDF to JSON extraction library with layout-aware text extraction


pdf_2_json_extractor


A high-performance Python library for extracting structured content from PDF documents with layout-aware text extraction. pdf_2_json_extractor preserves document structure, including headings (H1-H6) and body text, and outputs clean JSON.

Features

  • Layout-aware extraction: Detects document structure including headings of different levels using font size and style analysis
  • Multilingual support: Handles Latin, Cyrillic, Asian scripts (Chinese, Japanese, Korean), Arabic, Hebrew, and other complex Unicode scripts
  • High performance: Processes 50-page PDFs in ≤10 seconds on modern CPUs
  • Small footprint: Minimal dependencies, no heavy ML models used
  • Offline operation: No internet connectivity required to run
  • Cross-platform: AMD64 compatible, runs purely on CPU
  • Easy to use: Simple API with both programmatic and CLI interfaces
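To make the font-size heuristic from the first feature concrete, here is a minimal sketch of how heading levels could be derived from a list of font sizes. It is an illustration only, not the library's actual implementation: the function name `assign_heading_levels`, the "most frequent size is body text" rule, and the sample sizes are all assumptions.

```python
from collections import Counter


def assign_heading_levels(span_sizes, max_levels=6):
    """Hypothetical helper: map font sizes larger than the body size to H1..Hn."""
    histogram = Counter(span_sizes)
    if not histogram:
        return {}
    # Assumption: the most frequent font size belongs to body text.
    body_size, _ = histogram.most_common(1)[0]
    # Sizes larger than the body size are heading candidates; largest first.
    heading_sizes = sorted((s for s in histogram if s > body_size), reverse=True)
    # The largest candidate becomes H1, the next H2, and so on.
    return {
        size: f"H{rank}"
        for rank, size in enumerate(heading_sizes[:max_levels], start=1)
    }


# Made-up sample: 1500 spans at 12pt, 200 at 14pt, 50 at 16pt.
sizes = [12.0] * 1500 + [14.0] * 200 + [16.0] * 50
levels = assign_heading_levels(sizes)
print(levels)
```

A real extractor would likely also weigh font style (bold, italic) and how often each size occurs, as the feature description suggests.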

Installation

pip install pdf_2_json_extractor

Quick Start

Python API

import pdf_2_json_extractor

# Extract PDF to dictionary
result = pdf_2_json_extractor.extract_pdf_to_dict("document.pdf")
print(f"Title: {result['title']}")
print(f"Number of sections: {result['stats']['num_sections']}")

# Extract PDF to JSON string
json_output = pdf_2_json_extractor.extract_pdf_to_json("document.pdf")
print(json_output)

# Save to file
pdf_2_json_extractor.extract_pdf_to_json("document.pdf", "output.json")

Command Line Interface

# Extract to stdout
pdf_2_json_extractor document.pdf

# Save to file
pdf_2_json_extractor document.pdf -o output.json

# Compact output
pdf_2_json_extractor document.pdf --compact

# Pretty print (default)
pdf_2_json_extractor document.pdf --pretty

JSON Output Format

{
  "title": "Document Title",
  "sections": [
    {
      "level": "H1",
      "title": "Chapter 1: Introduction",
      "paragraphs": ["This is the introduction text..."]
    },
    {
      "level": "H2", 
      "title": "1.1 Overview",
      "paragraphs": ["Overview content..."]
    },
    {
      "level": "content",
      "title": null,
      "paragraphs": ["Body text content..."]
    }
  ],
  "font_histogram": {
    "12.0": 1500,
    "14.0": 200,
    "16.0": 50
  },
  "heading_levels": {
    "16.0": "H1",
    "14.0": "H2"
  },
  "stats": {
    "page_count": 25,
    "processing_time": 2.34,
    "num_sections": 15,
    "num_headings": 8,
    "num_paragraphs": 45
  }
}
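As a sketch of how this output might be consumed, the snippet below builds an indented outline from the `sections` list. It uses a hand-written dictionary that mirrors the format shown above, so it runs without the library; the helper name `build_outline` is hypothetical.

```python
# Sample data mirroring the documented output format (not real extraction output).
result = {
    "title": "Document Title",
    "sections": [
        {"level": "H1", "title": "Chapter 1: Introduction",
         "paragraphs": ["This is the introduction text..."]},
        {"level": "H2", "title": "1.1 Overview",
         "paragraphs": ["Overview content..."]},
        {"level": "content", "title": None,
         "paragraphs": ["Body text content..."]},
    ],
}


def build_outline(document):
    """Hypothetical helper: return heading titles indented by heading depth."""
    lines = []
    for section in document["sections"]:
        level = section["level"]
        # Skip untitled body sections, whose level is "content".
        if not (level.startswith("H") and level[1:].isdigit()):
            continue
        depth = int(level[1:])
        lines.append("  " * (depth - 1) + section["title"])
    return lines


outline = build_outline(result)
print("\n".join(outline))
```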

Advanced Usage

Custom Configuration

from pdf_2_json_extractor import PDFStructureExtractor, Config

# Create custom configuration
config = Config()
config.MAX_PAGES_FOR_FONT_ANALYSIS = 5
config.MIN_HEADING_FREQUENCY = 0.002

# Use with custom config
extractor = PDFStructureExtractor(config)
result = extractor.extract_text_with_structure("document.pdf")

Error Handling

from pdf_2_json_extractor import extract_pdf_to_dict
from pdf_2_json_extractor.exceptions import PdfToJsonError, InvalidPDFError, PDFFileNotFoundError

try:
    result = extract_pdf_to_dict("document.pdf")
except PDFFileNotFoundError:
    print("PDF file not found")
except InvalidPDFError:
    print("Invalid or corrupted PDF file")
except PdfToJsonError as e:
    print(f"Processing error: {e}")

Configuration Options

You can configure pdf_2_json_extractor using environment variables:

# Font analysis settings
export PDF_TO_JSON_MAX_PAGES_FOR_FONT_ANALYSIS=10
export PDF_TO_JSON_FONT_SIZE_PRECISION=0.1
export PDF_TO_JSON_MIN_HEADING_FREQUENCY=0.001

# Text processing settings
export PDF_TO_JSON_MIN_TEXT_LENGTH=3
export PDF_TO_JSON_MAX_HEADING_LEVELS=6
export PDF_TO_JSON_COMBINE_CONSECUTIVE_TEXT=True

# Language support
export PDF_TO_JSON_MULTILINGUAL_SUPPORT=True
export PDF_TO_JSON_DEFAULT_ENCODING=utf-8

# Performance settings
export PDF_TO_JSON_PROCESS_PAGES_IN_CHUNKS=False
export PDF_TO_JSON_CHUNK_SIZE=10

# Debug settings
export PDF_TO_JSON_DEBUG_MODE=False
export PDF_TO_JSON_LOG_LEVEL=INFO
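For readers wondering how string-valued environment variables like these can become typed settings, here is a minimal sketch. It is not the library's own loader: the helper `read_setting`, the truthy strings it accepts, and the default values passed in are assumptions made for illustration.

```python
import os


def read_setting(name, default, env=os.environ, prefix="PDF_TO_JSON_"):
    """Hypothetical helper: read PREFIX+name from env, coerced to type(default)."""
    raw = env.get(prefix + name)
    if raw is None:
        return default
    # bool must be checked before int, because bool is a subclass of int.
    if isinstance(default, bool):
        return raw.strip().lower() in ("1", "true", "yes", "on")
    return type(default)(raw)


# A plain dict stands in for os.environ so the example is deterministic.
sample_env = {
    "PDF_TO_JSON_MAX_PAGES_FOR_FONT_ANALYSIS": "10",
    "PDF_TO_JSON_FONT_SIZE_PRECISION": "0.1",
    "PDF_TO_JSON_COMBINE_CONSECUTIVE_TEXT": "True",
}

# The defaults below are placeholders, not the library's real defaults.
max_pages = read_setting("MAX_PAGES_FOR_FONT_ANALYSIS", 5, env=sample_env)
precision = read_setting("FONT_SIZE_PRECISION", 0.5, env=sample_env)
combine = read_setting("COMBINE_CONSECUTIVE_TEXT", False, env=sample_env)
log_level = read_setting("LOG_LEVEL", "INFO", env=sample_env)
print(max_pages, precision, combine, log_level)
```

Using each default's type to drive the conversion keeps the parsing rules in one place. A malformed value such as `MAX_PAGES_FOR_FONT_ANALYSIS=ten` raises `ValueError` instead of being silently ignored.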

Development

Installation from Source

# Install the released package from PyPI
pip install pdf_2_json_extractor

# Or install an editable copy from a source checkout
git clone https://github.com/your-username/pdf_2_json_extractor.git
cd pdf_2_json_extractor
pip install -e .

Building the Library

# Build the package
./build.sh

# Or manually
python -m build

Running Tests

pip install -e ".[dev]"
pytest

Docker Development

# Build Docker image
docker build -t pdf_2_json_extractor:latest .

# Run with Docker
docker run --rm -v "$(pwd)/test:/test" pdf_2_json_extractor:latest /test/document.pdf

Performance

pdf_2_json_extractor is optimized for high performance:

  • CPU-only processing: No GPU requirements
  • Memory efficient: Processes large documents without excessive memory usage
  • Fast extraction: Typical processing times:
    • 10-page document: ~1-2 seconds
    • 50-page document: ~5-10 seconds
    • 100-page document: ~15-25 seconds
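To compare a run against these figures, throughput can be computed from the `stats` block of the output. The sketch below uses the sample values from the JSON Output Format section rather than a real measurement; the helper `pages_per_second` is hypothetical.

```python
def pages_per_second(stats):
    """Hypothetical helper: throughput from the "stats" block of the output."""
    seconds = stats["processing_time"]
    if seconds <= 0:
        raise ValueError("processing_time must be positive")
    return stats["page_count"] / seconds


# Sample values copied from the documented output format, not a benchmark.
sample_stats = {"page_count": 25, "processing_time": 2.34}
rate = pages_per_second(sample_stats)
print(f"{rate:.1f} pages/second")
```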

Supported Languages

pdf_2_json_extractor supports text extraction from PDFs containing:

  • Latin scripts (English, Spanish, French, German, etc.)
  • Cyrillic scripts (Russian, Bulgarian, Serbian, etc.)
  • Asian scripts (Chinese, Japanese, Korean)
  • Arabic and Hebrew scripts
  • Other Unicode scripts

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

References

This library is inspired by the research paper:

"Layout-Aware Text Extraction from Full-text PDF of Scientific Articles"
Cartic Ramakrishnan, Abhishek Patnia, Eduard Hovy, Gully APC Burns
Published in Source Code for Biology and Medicine (2012)

Support

For questions, issues, or contributions:
