A high-performance PDF to JSON extraction library with layout-aware text extraction

pdf_2_json_extractor

A high-performance Python library that extracts structured content from PDF documents using layout-aware text extraction. pdf_2_json_extractor preserves document structure, including headings (H1-H6) and body text, and outputs clean JSON.

Features

Layout-aware extraction: Detects document structure including headings of different levels using font size and style analysis
Multilingual support: Handles Latin, Cyrillic, Asian scripts (Chinese, Japanese, Korean), Arabic, Hebrew, and other complex Unicode scripts
High performance: Processes 50-page PDFs in ≤10 seconds on modern CPUs
Small footprint: Minimal dependencies, no heavy ML models used
Offline operation: No internet connectivity required to run
Cross-platform: AMD64 compatible, runs purely on CPU
Easy to use: Simple API with both programmatic and CLI interfaces

Installation

pip install pdf_2_json_extractor

Quick Start

Python API

import pdf_2_json_extractor

# Extract PDF to dictionary
result = pdf_2_json_extractor.extract_pdf_to_dict("document.pdf")
print(f"Title: {result['title']}")
print(f"Number of sections: {result['stats']['num_sections']}")

# Extract PDF to JSON string
json_output = pdf_2_json_extractor.extract_pdf_to_json("document.pdf")
print(json_output)

# Save to file
pdf_2_json_extractor.extract_pdf_to_json("document.pdf", "output.json")
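
Once you have the result dictionary, a common next step is to turn its sections into a readable outline. The sketch below assumes only the dictionary shape shown under JSON Output Format; `build_outline` is a hypothetical helper, not part of the library, and the sample data is hand-written rather than real extractor output.

```python
def build_outline(result):
    """Return one indented line per heading in a result dictionary."""
    lines = []
    for section in result.get("sections", []):
        level = section.get("level", "")
        title = section.get("title")
        # Skip body-only sections, which use level "content" and a null title.
        if not title or not (level.startswith("H") and level[1:].isdigit()):
            continue
        depth = int(level[1:]) - 1  # H1 -> 0, H2 -> 1, ...
        lines.append("  " * depth + title)
    return lines


# Hand-written sample in the documented shape. A real call would use
# pdf_2_json_extractor.extract_pdf_to_dict("document.pdf") instead.
sample = {
    "title": "Document Title",
    "sections": [
        {"level": "H1", "title": "Chapter 1: Introduction", "paragraphs": ["..."]},
        {"level": "H2", "title": "1.1 Overview", "paragraphs": ["..."]},
        {"level": "content", "title": None, "paragraphs": ["..."]},
    ],
}

for line in build_outline(sample):
    print(line)
```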

Command Line Interface

# Extract to stdout
pdf_2_json_extractor document.pdf

# Save to file
pdf_2_json_extractor document.pdf -o output.json

# Compact output
pdf_2_json_extractor document.pdf --compact

# Pretty print (default)
pdf_2_json_extractor document.pdf --pretty
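
If you prefer to drive the CLI from a script, the commands above can be wrapped as follows. This is a minimal sketch: `build_command` and `run_cli` are hypothetical helpers, only the flags shown above are used, and it assumes that the CLI prints nothing but JSON to stdout and that `-o` and `--compact` can be combined.

```python
import json
import subprocess


def build_command(pdf_path, output_path=None, compact=False):
    """Assemble the argument list for the pdf_2_json_extractor CLI."""
    cmd = ["pdf_2_json_extractor", str(pdf_path)]
    if output_path is not None:
        cmd += ["-o", str(output_path)]
    if compact:
        cmd.append("--compact")
    return cmd


def run_cli(pdf_path):
    """Run the CLI and parse the JSON it writes to stdout (assumed JSON only)."""
    completed = subprocess.run(
        build_command(pdf_path, compact=True),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(completed.stdout)


print(build_command("document.pdf", output_path="output.json"))
```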

JSON Output Format

{
  "title": "Document Title",
  "sections": [
    {
      "level": "H1",
      "title": "Chapter 1: Introduction",
      "paragraphs": ["This is the introduction text..."]
    },
    {
      "level": "H2", 
      "title": "1.1 Overview",
      "paragraphs": ["Overview content..."]
    },
    {
      "level": "content",
      "title": null,
      "paragraphs": ["Body text content..."]
    }
  ],
  "font_histogram": {
    "12.0": 1500,
    "14.0": 200,
    "16.0": 50
  },
  "heading_levels": {
    "16.0": "H1",
    "14.0": "H2"
  },
  "stats": {
    "page_count": 25,
    "processing_time": 2.34,
    "num_sections": 15,
    "num_headings": 8,
    "num_paragraphs": 45
  }
}
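
To illustrate how `font_histogram` and `heading_levels` relate, here is a simplified sketch of font-size-based heading assignment. It is not the library's actual algorithm: it assumes that the most frequent size is body text and that every larger size is a heading, and it ignores font style and frequency thresholds. `assign_heading_levels` is a hypothetical name.

```python
def assign_heading_levels(font_histogram, max_levels=6):
    """Map font sizes larger than the body size to H1, H2, ... labels."""
    if not font_histogram:
        return {}
    # Assumption: the most frequent font size is the body text size.
    body_size = max(font_histogram, key=lambda size: font_histogram[size])
    body = float(body_size)
    # Larger sizes are heading candidates, ordered from largest to smallest.
    candidates = sorted(
        (size for size in font_histogram if float(size) > body),
        key=float,
        reverse=True,
    )
    return {
        size: f"H{rank}"
        for rank, size in enumerate(candidates[:max_levels], start=1)
    }


histogram = {"12.0": 1500, "14.0": 200, "16.0": 50}
levels = assign_heading_levels(histogram)
print(levels)  # {'16.0': 'H1', '14.0': 'H2'}
```

With the sample histogram, the 12.0 size is treated as body text, so only 16.0 and 14.0 receive heading labels, which matches the `heading_levels` entry in the example output.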

Advanced Usage

Custom Configuration

from pdf_2_json_extractor import PDFStructureExtractor, Config

# Create custom configuration
config = Config()
config.MAX_PAGES_FOR_FONT_ANALYSIS = 5
config.MIN_HEADING_FREQUENCY = 0.002

# Use with custom config
extractor = PDFStructureExtractor(config)
result = extractor.extract_text_with_structure("document.pdf")

Error Handling

from pdf_2_json_extractor import extract_pdf_to_dict
from pdf_2_json_extractor.exceptions import PdfToJsonError, InvalidPDFError, PDFFileNotFoundError

try:
    result = extract_pdf_to_dict("document.pdf")
except PDFFileNotFoundError:
    print("PDF file not found")
except InvalidPDFError:
    print("Invalid or corrupted PDF file")
except PdfToJsonError as e:
    print(f"Processing error: {e}")
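
The same pattern extends to batch processing, where one bad file should not stop the run. The sketch below uses a hypothetical `extract_many` helper that takes the extraction function as an argument, so it runs here with a stand-in extractor. With the library you would pass `extract_pdf_to_dict` and could narrow the `except` clause to `PdfToJsonError`, assuming the more specific exceptions derive from it (the source does not state this).

```python
def extract_many(paths, extract):
    """Apply `extract` to each path, collecting results and failures separately."""
    results, failures = {}, {}
    for path in paths:
        try:
            results[path] = extract(path)
        except Exception as exc:  # with the library, PdfToJsonError may be enough
            failures[path] = f"{type(exc).__name__}: {exc}"
    return results, failures


def fake_extract(path):
    """Stand-in extractor used only for this illustration."""
    if path.endswith("missing.pdf"):
        raise FileNotFoundError(path)
    return {"title": path, "sections": []}


results, failures = extract_many(["a.pdf", "missing.pdf"], fake_extract)
print(sorted(results))
print(failures)
```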

Configuration Options

You can configure pdf_2_json_extractor using environment variables:

# Font analysis settings
export PDF_TO_JSON_MAX_PAGES_FOR_FONT_ANALYSIS=10
export PDF_TO_JSON_FONT_SIZE_PRECISION=0.1
export PDF_TO_JSON_MIN_HEADING_FREQUENCY=0.001

# Text processing settings
export PDF_TO_JSON_MIN_TEXT_LENGTH=3
export PDF_TO_JSON_MAX_HEADING_LEVELS=6
export PDF_TO_JSON_COMBINE_CONSECUTIVE_TEXT=True

# Language support
export PDF_TO_JSON_MULTILINGUAL_SUPPORT=True
export PDF_TO_JSON_DEFAULT_ENCODING=utf-8

# Performance settings
export PDF_TO_JSON_PROCESS_PAGES_IN_CHUNKS=False
export PDF_TO_JSON_CHUNK_SIZE=10

# Debug settings
export PDF_TO_JSON_DEBUG_MODE=False
export PDF_TO_JSON_LOG_LEVEL=INFO
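
Environment variables are always strings, so they need converting to typed values before use. The following sketch shows one way to do that for a few of the variables above. `load_settings` and `parse_bool` are hypothetical, the library's own parsing may differ, and the fallback values simply mirror the example values shown above rather than confirmed defaults.

```python
import os


def parse_bool(raw):
    """Interpret common truthy spellings such as 'True', 'true' or '1'."""
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env=os.environ, prefix="PDF_TO_JSON_"):
    """Read a few of the documented variables into typed Python values."""
    # (variable suffix, converter, fallback); fallbacks are illustrative only.
    spec = [
        ("MAX_PAGES_FOR_FONT_ANALYSIS", int, 10),
        ("MIN_HEADING_FREQUENCY", float, 0.001),
        ("COMBINE_CONSECUTIVE_TEXT", parse_bool, True),
        ("DEBUG_MODE", parse_bool, False),
        ("LOG_LEVEL", str, "INFO"),
    ]
    settings = {}
    for suffix, convert, fallback in spec:
        raw = env.get(prefix + suffix)
        settings[suffix] = fallback if raw is None else convert(raw)
    return settings


example_env = {
    "PDF_TO_JSON_MAX_PAGES_FOR_FONT_ANALYSIS": "5",
    "PDF_TO_JSON_DEBUG_MODE": "True",
}
print(load_settings(example_env))
```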

Development

Installation from Source

git clone https://github.com/your-username/pdf_2_json_extractor.git
cd pdf_2_json_extractor
pip install -e .

Building the Library

# Build the package
./build.sh

# Or manually
python -m build

Running Tests

pip install -e ".[dev]"
pytest

Docker Development

# Build Docker image
docker build -t pdf_2_json_extractor:latest .

# Run with Docker
docker run --rm -v "$(pwd)/test:/test" pdf_2_json_extractor:latest /test/document.pdf

Performance

pdf_2_json_extractor is optimized for high performance:

CPU-only processing: No GPU requirements
Memory efficient: Processes large documents without excessive memory usage
Fast extraction, with typical processing times of:
- 10-page document: ~1-2 seconds
- 50-page document: ~5-10 seconds
- 100-page document: ~15-25 seconds

Supported Languages

pdf_2_json_extractor supports text extraction from PDFs containing:

Latin scripts (English, Spanish, French, German, etc.)
Cyrillic scripts (Russian, Bulgarian, Serbian, etc.)
Asian scripts (Chinese, Japanese, Korean)
Arabic and Hebrew scripts
Other Unicode scripts
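
To check which scripts actually appear in extracted text, you can tally characters by the script prefix of their Unicode names. This sketch uses only the standard library. `script_counts` is a hypothetical helper and not part of pdf_2_json_extractor, and taking the first word of the character name is a rough heuristic rather than a full Unicode script lookup.

```python
import unicodedata
from collections import Counter


def script_counts(text):
    """Count alphabetic characters by the first word of their Unicode name."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # ignore digits, punctuation and whitespace
        name = unicodedata.name(ch, "")
        if name:
            # e.g. "CYRILLIC SMALL LETTER EM" -> "CYRILLIC"
            counts[name.split()[0]] += 1
    return counts


counts = script_counts("Hello мир 中")
print(dict(counts))
```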

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

References

This library is inspired by the research paper:

"Layout-Aware Text Extraction from Full-text PDF of Scientific Articles"
Cartic Ramakrishnan, Abhishek Patnia, Eduard Hovy, Gully APC Burns
Published in Source Code for Biology and Medicine (2012)

Support

For questions, issues, or contributions:

📧 Email: rishibalapure12@gmail.com
🐛 Issues: GitHub Issues
📖 Documentation: GitHub Wiki

Release history

1.3.1: Apr 22, 2026
1.2.0: Jan 6, 2026
1.1.0: Jan 1, 2026 (this version)
1.0.0: Dec 8, 2025
