
A high-performance PDF to JSON extraction library with layout-aware text extraction

pdf_2_json_extractor

A high-performance Python library for layout-aware extraction of structured content from PDF documents. pdf_2_json_extractor preserves document structure, including headings (H1-H6) and body text, and outputs clean JSON.

Features

  • Layout-aware extraction: Detects document structure including headings of different levels using font size and style analysis
  • Multilingual support: Handles Latin, Cyrillic, Asian scripts (Chinese, Japanese, Korean), Arabic, Hebrew, and other complex Unicode scripts
  • High performance: Processes 50-page PDFs in ≤10 seconds on modern CPUs
  • Small footprint: Minimal dependencies, no heavy ML models used
  • Offline operation: No internet connectivity required to run
  • Cross-platform: AMD64 compatible, runs purely on CPU
  • Easy to use: Simple API with both programmatic and CLI interfaces
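
The layout-aware bullet above describes heading detection from font sizes. As a rough illustration of that idea (not the library's actual algorithm, which this README does not spell out), the sketch below treats the most frequent font size as body text and ranks every larger size as H1, H2, and so on. The function name `assign_heading_levels` and the sample counts are hypothetical.

```python
def assign_heading_levels(font_histogram, max_levels=6):
    """Map font sizes larger than the body size to H1..Hn.

    Illustrative heuristic only; the library may use additional
    signals such as font style or frequency thresholds.
    """
    if not font_histogram:
        return {}
    # Assumption: the most frequent font size is body text.
    body_size = max(font_histogram, key=font_histogram.get)
    # Larger sizes are heading candidates, biggest first.
    candidates = sorted(
        (size for size in font_histogram if size > body_size),
        reverse=True,
    )
    return {
        size: f"H{rank}"
        for rank, size in enumerate(candidates[:max_levels], start=1)
    }


# Hypothetical histogram: font size -> number of text spans.
histogram = {12.0: 1500, 14.0: 200, 16.0: 50}
print(assign_heading_levels(histogram))  # {16.0: 'H1', 14.0: 'H2'}
```

Sizes at or below the body size are never promoted to headings in this sketch, which keeps footnotes and captions out of the outline.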

Installation

pip install pdf_2_json_extractor

Quick Start

Python API

import pdf_2_json_extractor

# Extract PDF to dictionary
result = pdf_2_json_extractor.extract_pdf_to_dict("document.pdf")
print(f"Title: {result['title']}")
print(f"Number of sections: {result['stats']['num_sections']}")

# Extract PDF to JSON string
json_output = pdf_2_json_extractor.extract_pdf_to_json("document.pdf")
print(json_output)

# Save to file
pdf_2_json_extractor.extract_pdf_to_json("document.pdf", "output.json")

Command Line Interface

# Extract to stdout
pdf_2_json_extractor document.pdf

# Save to file
pdf_2_json_extractor document.pdf -o output.json

# Compact output
pdf_2_json_extractor document.pdf --compact

# Pretty print (default)
pdf_2_json_extractor document.pdf --pretty

JSON Output Format

{
  "title": "Document Title",
  "sections": [
    {
      "level": "H1",
      "title": "Chapter 1: Introduction",
      "paragraphs": ["This is the introduction text..."]
    },
    {
      "level": "H2", 
      "title": "1.1 Overview",
      "paragraphs": ["Overview content..."]
    },
    {
      "level": "content",
      "title": null,
      "paragraphs": ["Body text content..."]
    }
  ],
  "font_histogram": {
    "12.0": 1500,
    "14.0": 200,
    "16.0": 50
  },
  "heading_levels": {
    "16.0": "H1",
    "14.0": "H2"
  },
  "stats": {
    "page_count": 25,
    "processing_time": 2.34,
    "num_sections": 15,
    "num_headings": 8,
    "num_paragraphs": 45
  }
}
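
As a small usage sketch, the structure above can be walked to build an indented outline of the headings. It relies only on the `sections`, `level`, and `title` keys shown in the sample; the helper name `build_outline` is hypothetical and not part of the library.

```python
def build_outline(result):
    """Return indented heading titles from an extraction result dict."""
    lines = []
    for section in result.get("sections", []):
        level = section.get("level") or ""
        title = section.get("title")
        # Skip body-only sections ("level": "content", "title": null).
        if not title or not level.startswith("H") or not level[1:].isdigit():
            continue
        depth = int(level[1:]) - 1  # H1 -> 0 indents, H2 -> 1, ...
        lines.append("  " * depth + title)
    return lines


# Sample data copied from the JSON format shown above.
sample = {
    "title": "Document Title",
    "sections": [
        {"level": "H1", "title": "Chapter 1: Introduction", "paragraphs": ["..."]},
        {"level": "H2", "title": "1.1 Overview", "paragraphs": ["..."]},
        {"level": "content", "title": None, "paragraphs": ["..."]},
    ],
}

for line in build_outline(sample):
    print(line)
```

With a real document, the dictionary returned by `extract_pdf_to_dict` would take the place of `sample`.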

Advanced Usage

Custom Configuration

from pdf_2_json_extractor import PDFStructureExtractor, Config

# Create custom configuration
config = Config()
config.MAX_PAGES_FOR_FONT_ANALYSIS = 5
config.MIN_HEADING_FREQUENCY = 0.002

# Use with custom config
extractor = PDFStructureExtractor(config)
result = extractor.extract_text_with_structure("document.pdf")

Error Handling

from pdf_2_json_extractor import extract_pdf_to_dict
from pdf_2_json_extractor.exceptions import PdfToJsonError, InvalidPDFError, PDFFileNotFoundError

try:
    result = extract_pdf_to_dict("document.pdf")
except PDFFileNotFoundError:
    print("PDF file not found")
except InvalidPDFError:
    print("Invalid or corrupted PDF file")
except PdfToJsonError as e:
    print(f"Processing error: {e}")
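
When processing many files, it can be convenient to collect failures instead of stopping at the first one. The sketch below is one possible pattern, not something the library provides: `extract_many` is a hypothetical helper that takes the extraction function and the exception types to catch as arguments, so it runs here with a stand-in extractor.

```python
def extract_many(paths, extract, errors=(Exception,)):
    """Run `extract` on each path, collecting results and failures separately."""
    results, failures = {}, {}
    for path in paths:
        try:
            results[path] = extract(path)
        except errors as exc:
            # Record the error type and message, then continue with the next file.
            failures[path] = f"{type(exc).__name__}: {exc}"
    return results, failures


def fake_extract(path):
    """Stand-in for a real extractor so the example runs without any PDFs."""
    if path.endswith("broken.pdf"):
        raise ValueError("not a valid PDF")
    return {"title": path, "sections": []}


ok, failed = extract_many(["a.pdf", "broken.pdf"], fake_extract)
print(sorted(ok))  # ['a.pdf']
print(failed)      # {'broken.pdf': 'ValueError: not a valid PDF'}
```

In real use, `extract_pdf_to_dict` would be passed as `extract`, and `errors` could be narrowed to the exception classes imported in the example above.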

Configuration Options

You can configure pdf_2_json_extractor using environment variables:

# Font analysis settings
export PDF_TO_JSON_MAX_PAGES_FOR_FONT_ANALYSIS=10
export PDF_TO_JSON_FONT_SIZE_PRECISION=0.1
export PDF_TO_JSON_MIN_HEADING_FREQUENCY=0.001

# Text processing settings
export PDF_TO_JSON_MIN_TEXT_LENGTH=3
export PDF_TO_JSON_MAX_HEADING_LEVELS=6
export PDF_TO_JSON_COMBINE_CONSECUTIVE_TEXT=True

# Language support
export PDF_TO_JSON_MULTILINGUAL_SUPPORT=True
export PDF_TO_JSON_DEFAULT_ENCODING=utf-8

# Performance settings
export PDF_TO_JSON_PROCESS_PAGES_IN_CHUNKS=False
export PDF_TO_JSON_CHUNK_SIZE=10

# Debug settings
export PDF_TO_JSON_DEBUG_MODE=False
export PDF_TO_JSON_LOG_LEVEL=INFO
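
Environment variables are always strings, so values such as `10`, `0.1`, and `True` have to be converted before use. How the library performs this conversion is not documented here; the sketch below shows one common approach under that caveat. The helper `read_setting` and the set of accepted boolean spellings are assumptions made for illustration.

```python
import os

PREFIX = "PDF_TO_JSON_"


def read_setting(name, default, environ=None):
    """Read PREFIX + name and cast it to the type of `default`."""
    environ = os.environ if environ is None else environ
    raw = environ.get(PREFIX + name)
    if raw is None:
        return default
    # Check bool first: bool is a subclass of int, and bool("False") is True.
    if isinstance(default, bool):
        return raw.strip().lower() in {"1", "true", "yes", "on"}
    return type(default)(raw)


# A plain dict stands in for os.environ so the example is reproducible.
env = {
    "PDF_TO_JSON_CHUNK_SIZE": "20",
    "PDF_TO_JSON_DEBUG_MODE": "True",
}
print(read_setting("CHUNK_SIZE", 10, env))             # 20
print(read_setting("DEBUG_MODE", False, env))          # True
print(read_setting("FONT_SIZE_PRECISION", 0.1, env))   # 0.1 (default, not set)
```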

Development

Installation from Source

pip install pdf_2_json_extractor

or, for an editable install from a clone of the repository:

git clone https://github.com/your-username/pdf_2_json_extractor.git
cd pdf_2_json_extractor
pip install -e .

Building the Library

# Build the package
./build.sh

# Or manually
python -m build

Running Tests

pip install -e ".[dev]"
pytest

Docker Development

# Build Docker image
docker build -t pdf_2_json_extractor:latest .

# Run with Docker
docker run --rm -v "$(pwd)/test:/test" pdf_2_json_extractor:latest /test/document.pdf

Performance

pdf_2_json_extractor is optimized for high performance:

  • CPU-only processing: No GPU requirements
  • Memory efficient: Processes large documents without excessive memory usage
  • Fast extraction: Typical processing times:
    • 10-page document: ~1-2 seconds
    • 50-page document: ~5-10 seconds
    • 100-page document: ~15-25 seconds
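
To compare your own documents against these figures, throughput can be derived from the `stats` block of the output. This is a minimal sketch that assumes `processing_time` is reported in seconds, which the sample output suggests but does not state; `pages_per_second` is a hypothetical helper.

```python
def pages_per_second(stats):
    """Compute throughput from a `stats` dict; returns None if time is not positive."""
    seconds = stats["processing_time"]  # assumed to be in seconds
    if seconds <= 0:
        return None
    return stats["page_count"] / seconds


# Values taken from the sample JSON output in this README.
sample_stats = {"page_count": 25, "processing_time": 2.34}
print(round(pages_per_second(sample_stats), 1))  # 10.7

# Upper bound quoted above for a 50-page document: 10 seconds.
print(pages_per_second({"page_count": 50, "processing_time": 10.0}))  # 5.0
```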

Supported Languages

pdf_2_json_extractor supports text extraction from PDFs containing:

  • Latin scripts (English, Spanish, French, German, etc.)
  • Cyrillic scripts (Russian, Bulgarian, Serbian, etc.)
  • Asian scripts (Chinese, Japanese, Korean)
  • Arabic and Hebrew scripts
  • Other Unicode scripts

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

References

This library is inspired by the research paper:

"Layout-Aware Text Extraction from Full-text PDF of Scientific Articles"
Cartic Ramakrishnan, Abhishek Patnia, Eduard Hovy, Gully APC Burns
Published in Source Code for Biology and Medicine (2012)

Support

For questions, issues, or contributions:
