
A high-performance PDF to JSON extraction library with layout-aware text extraction

pdf_2_json_extractor


A high-performance Python library for layout-aware extraction of structured content from PDF documents. pdf_2_json_extractor preserves document structure, including headings (H1-H6) and body text, and outputs the result as JSON.

Features

  • Layout-aware extraction: Detects document structure including headings of different levels using font size and style analysis
  • Multilingual support: Handles Latin, Cyrillic, Asian scripts (Chinese, Japanese, Korean), Arabic, Hebrew, and other complex Unicode scripts
  • High performance: Processes 50-page PDFs in ≤10 seconds on modern CPUs
  • Small footprint: Minimal dependencies, no heavy ML models used
  • Offline operation: No internet connectivity required to run
  • Cross-platform: AMD64 compatible, runs purely on CPU
  • Easy to use: Simple API with both programmatic and CLI interfaces
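
The layout-aware feature above is described as relying on font size and style analysis. The following is a minimal sketch of how a font-size histogram could be turned into heading levels. It assumes that the most frequent size is body text and that larger sizes are headings, ranked from largest to smallest. The function name `assign_heading_levels` and its parameters are hypothetical and are not taken from the library's implementation.

```python
def assign_heading_levels(font_histogram, max_levels=6, min_frequency=0.001):
    """Map font sizes larger than the body size to H1..Hn (illustrative heuristic)."""
    total = sum(font_histogram.values())
    if total == 0:
        return {}
    # Assumption: the most frequent font size is the body text size.
    body_size = max(font_histogram, key=font_histogram.get)
    # Candidate heading sizes are larger than the body size and not vanishingly rare.
    candidates = sorted(
        (
            size
            for size, count in font_histogram.items()
            if size > body_size and count / total >= min_frequency
        ),
        reverse=True,
    )
    # The largest size becomes H1, the next H2, and so on, up to max_levels.
    return {size: f"H{rank}" for rank, size in enumerate(candidates[:max_levels], start=1)}


histogram = {12.0: 1500, 14.0: 200, 16.0: 50}
levels = assign_heading_levels(histogram)
print(levels)
```

A real extractor would likely also consider font weight and position on the page; this sketch uses size alone to keep the idea visible.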

Installation

pip install pdf_2_json_extractor

Quick Start

Python API

import pdf_2_json_extractor

# Extract PDF to dictionary
result = pdf_2_json_extractor.extract_pdf_to_dict("document.pdf")
print(f"Title: {result['title']}")
print(f"Number of sections: {result['stats']['num_sections']}")

# Extract PDF to JSON string
json_output = pdf_2_json_extractor.extract_pdf_to_json("document.pdf")
print(json_output)

# Save to file
pdf_2_json_extractor.extract_pdf_to_json("document.pdf", "output.json")

Command Line Interface

# Extract to stdout
pdf_2_json_extractor document.pdf

# Save to file
pdf_2_json_extractor document.pdf -o output.json

# Compact output
pdf_2_json_extractor document.pdf --compact

# Pretty print (default)
pdf_2_json_extractor document.pdf --pretty

JSON Output Format

{
  "title": "Document Title",
  "sections": [
    {
      "level": "H1",
      "title": "Chapter 1: Introduction",
      "paragraphs": ["This is the introduction text..."]
    },
    {
      "level": "H2", 
      "title": "1.1 Overview",
      "paragraphs": ["Overview content..."]
    },
    {
      "level": "content",
      "title": null,
      "paragraphs": ["Body text content..."]
    }
  ],
  "font_histogram": {
    "12.0": 1500,
    "14.0": 200,
    "16.0": 50
  },
  "heading_levels": {
    "16.0": "H1",
    "14.0": "H2"
  },
  "stats": {
    "page_count": 25,
    "processing_time": 2.34,
    "num_sections": 15,
    "num_headings": 8,
    "num_paragraphs": 45
  }
}
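
As an illustration of consuming this format, the sketch below builds a heading outline and collects body paragraphs from a dictionary shaped like the example above. It uses only the keys shown in that example; the helper names `build_outline` and `all_paragraphs` are hypothetical, and the sample data is a trimmed copy of the example rather than real extractor output.

```python
import json

# Trimmed copy of the documented output format, used as sample input.
sample = json.loads("""
{
  "title": "Document Title",
  "sections": [
    {"level": "H1", "title": "Chapter 1: Introduction",
     "paragraphs": ["This is the introduction text..."]},
    {"level": "H2", "title": "1.1 Overview",
     "paragraphs": ["Overview content..."]},
    {"level": "content", "title": null,
     "paragraphs": ["Body text content..."]}
  ]
}
""")


def build_outline(document):
    """Return (level, title) pairs for heading sections, skipping untitled content."""
    return [
        (section["level"], section["title"])
        for section in document["sections"]
        if section["level"] != "content"
    ]


def all_paragraphs(document):
    """Flatten the paragraphs of every section into one list, in document order."""
    return [text for section in document["sections"] for text in section["paragraphs"]]


for level, title in build_outline(sample):
    # Indent each heading by its depth: H1 has no indent, H2 two spaces, and so on.
    depth = int(level[1:]) - 1
    print("  " * depth + title)
```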

Advanced Usage

Custom Configuration

from pdf_2_json_extractor import PDFStructureExtractor, Config

# Create custom configuration
config = Config()
config.MAX_PAGES_FOR_FONT_ANALYSIS = 5
config.MIN_HEADING_FREQUENCY = 0.002

# Use with custom config
extractor = PDFStructureExtractor(config)
result = extractor.extract_text_with_structure("document.pdf")

Error Handling

from pdf_2_json_extractor import extract_pdf_to_dict
from pdf_2_json_extractor.exceptions import PdfToJsonError, InvalidPDFError, PDFFileNotFoundError

try:
    result = extract_pdf_to_dict("document.pdf")
except PDFFileNotFoundError:
    print("PDF file not found")
except InvalidPDFError:
    print("Invalid or corrupted PDF file")
except PdfToJsonError as e:
    print(f"Processing error: {e}")
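
When processing many files, it can be useful to collect failures instead of stopping at the first one. The sketch below shows one way to do that. The helper `extract_many` is hypothetical, and it takes the extraction function as a parameter so the example runs without any PDFs; `fake_extract` is a stand-in for illustration only.

```python
def extract_many(paths, extract):
    """Run `extract` on each path and return (results, failures) dictionaries."""
    results, failures = {}, {}
    for path in paths:
        try:
            results[path] = extract(path)
        except Exception as exc:
            # Broad catch for the sketch; in real use, narrow this to the
            # library's PdfToJsonError hierarchy shown above.
            failures[path] = f"{type(exc).__name__}: {exc}"
    return results, failures


def fake_extract(path):
    """Stand-in extractor for this example only (not part of the library)."""
    if path.endswith(".pdf"):
        return {"title": path, "sections": []}
    raise ValueError("not a PDF")


results, failures = extract_many(["a.pdf", "notes.txt"], fake_extract)
print(sorted(results))
print(failures)
```

In practice, `extract_pdf_to_dict` from the example above would be passed in place of `fake_extract`.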

Configuration Options

You can configure pdf_2_json_extractor using environment variables:

# Font analysis settings
export PDF_TO_JSON_MAX_PAGES_FOR_FONT_ANALYSIS=10
export PDF_TO_JSON_FONT_SIZE_PRECISION=0.1
export PDF_TO_JSON_MIN_HEADING_FREQUENCY=0.001

# Text processing settings
export PDF_TO_JSON_MIN_TEXT_LENGTH=3
export PDF_TO_JSON_MAX_HEADING_LEVELS=6
export PDF_TO_JSON_COMBINE_CONSECUTIVE_TEXT=True

# Language support
export PDF_TO_JSON_MULTILINGUAL_SUPPORT=True
export PDF_TO_JSON_DEFAULT_ENCODING=utf-8

# Performance settings
export PDF_TO_JSON_PROCESS_PAGES_IN_CHUNKS=False
export PDF_TO_JSON_CHUNK_SIZE=10

# Debug settings
export PDF_TO_JSON_DEBUG_MODE=False
export PDF_TO_JSON_LOG_LEVEL=INFO
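
To illustrate how prefixed environment variables like these can be turned into typed settings, here is a minimal sketch. It is not the library's actual configuration loader: the function `load_settings` and the `DEFAULTS` table are hypothetical, and the default values are simply copied from the example exports above.

```python
import os

# Illustrative defaults, copied from the example exports (assumed, not confirmed).
DEFAULTS = {
    "MAX_PAGES_FOR_FONT_ANALYSIS": 10,
    "MIN_HEADING_FREQUENCY": 0.001,
    "COMBINE_CONSECUTIVE_TEXT": True,
    "LOG_LEVEL": "INFO",
}


def load_settings(environ=None, prefix="PDF_TO_JSON_"):
    """Read prefixed variables and coerce each one to the type of its default."""
    environ = os.environ if environ is None else environ
    settings = {}
    for name, default in DEFAULTS.items():
        raw = environ.get(prefix + name)
        if raw is None:
            settings[name] = default
        elif isinstance(default, bool):
            # bool("False") is True in Python, so booleans are parsed explicitly.
            settings[name] = raw.strip().lower() in {"1", "true", "yes", "on"}
        else:
            settings[name] = type(default)(raw)
    return settings


example_env = {
    "PDF_TO_JSON_MAX_PAGES_FOR_FONT_ANALYSIS": "5",
    "PDF_TO_JSON_COMBINE_CONSECUTIVE_TEXT": "False",
}
settings = load_settings(example_env)
print(settings)
```

Passing a plain dictionary instead of `os.environ` keeps the example independent of the shell it runs in.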

Development

Installation from Source

# Install the released package from PyPI
pip install pdf_2_json_extractor

# Or clone the repository and install it in editable mode
git clone https://github.com/your-username/pdf_2_json_extractor.git
cd pdf_2_json_extractor
pip install -e .

Building the Library

# Build the package
./build.sh

# Or manually
python -m build

Running Tests

pip install -e ".[dev]"
pytest

Docker Development

# Build Docker image
docker build -t pdf_2_json_extractor:latest .

# Run with Docker
docker run --rm -v "$(pwd)/test:/test" pdf_2_json_extractor:latest /test/document.pdf

Performance

pdf_2_json_extractor is optimized for high performance:

  • CPU-only processing: No GPU requirements
  • Memory efficient: Processes large documents without excessive memory usage
  • Fast extraction: Typical processing times:
    • 10-page document: ~1-2 seconds
    • 50-page document: ~5-10 seconds
    • 100-page document: ~15-25 seconds
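
To compare these figures against your own documents, the `stats` block of the output can be turned into a throughput number. The sketch below assumes that `processing_time` is reported in seconds, which the documentation does not state explicitly; the helper `pages_per_second` is hypothetical.

```python
def pages_per_second(stats):
    """Compute throughput from the documented stats fields (time assumed in seconds)."""
    elapsed = stats["processing_time"]
    if elapsed <= 0:
        raise ValueError("processing_time must be positive")
    return stats["page_count"] / elapsed


# Values taken from the example in the JSON Output Format section.
stats = {"page_count": 25, "processing_time": 2.34}
print(f"{pages_per_second(stats):.1f} pages/s")
```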

Supported Languages

pdf_2_json_extractor supports text extraction from PDFs containing:

  • Latin scripts (English, Spanish, French, German, etc.)
  • Cyrillic scripts (Russian, Bulgarian, Serbian, etc.)
  • Asian scripts (Chinese, Japanese, Korean)
  • Arabic and Hebrew scripts
  • Other Unicode scripts

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

References

This library is inspired by the research paper:

"Layout-Aware Text Extraction from Full-text PDF of Scientific Articles"
Cartic Ramakrishnan, Abhishek Patnia, Eduard Hovy, Gully APC Burns
Published in Source Code for Biology and Medicine (2012)
