A Python package to parse scientific PDF documents into structured Markdown.

pyScientificPdfParser

A Python package to parse scientific PDF documents into structured Markdown, leveraging modern Document AI models.

Overview

This package provides a pipeline that processes scientific PDFs (both born-digital and scanned) and converts them into structured, machine-readable formats such as GitHub Flavored Markdown and JSON. The pipeline chains models for layout analysis and table recognition, with optional LLM-based refinement.

Features

  • PDF Processing: Handles single files, lists of files, or entire directories.
  • OCR: Uses Tesseract for robust text extraction from scanned documents.
  • Document Layout Analysis (DLA): Employs models like LayoutLMv3 or DiT to identify page regions (title, text, tables, figures).
  • Table Structure Recognition (TSR): Utilizes models like Table Transformer to parse the structure of complex tables.
  • Section Segmentation: Logically groups content into IMRaD sections.
  • LLM Refinement (Optional): Uses Large Language Models for OCR correction, text flow normalization, and structured data extraction (e.g., references).
  • Multiple Output Formats: Generates clean Markdown, structured JSON, and extracts image assets.
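
To make the section segmentation idea concrete, here is a minimal sketch of heading-based IMRaD grouping. It is not the package's implementation: the keyword patterns, the (kind, text) block format, and the names classify_heading and group_blocks are all assumptions made for illustration.

```python
import re
from typing import Dict, List, Optional, Tuple

# Hypothetical keyword map; the package's real rules may differ.
IMRAD_PATTERNS = {
    "introduction": re.compile(r"^(introduction|background)\b", re.I),
    "methods": re.compile(r"^(materials and methods|methods|methodology)\b", re.I),
    "results": re.compile(r"^results\b", re.I),
    "discussion": re.compile(r"^(discussion|conclusions?)\b", re.I),
}


def classify_heading(heading: str) -> Optional[str]:
    """Map a heading to an IMRaD label, or None if it is not recognised."""
    # Drop leading numbering such as "2." or "3.1" before matching.
    cleaned = re.sub(r"^\s*\d+(\.\d+)*\.?\s*", "", heading).strip()
    for label, pattern in IMRAD_PATTERNS.items():
        if pattern.match(cleaned):
            return label
    return None


def group_blocks(blocks: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (kind, text) blocks under the most recent IMRaD heading.

    kind is assumed to be "heading" or "text". Unrecognised headings
    (for example sub-headings) stay inside the current section.
    """
    sections: Dict[str, List[str]] = {"front_matter": []}
    current = "front_matter"
    for kind, text in blocks:
        if kind == "heading":
            label = classify_heading(text)
            if label is not None:
                current = label
                sections.setdefault(current, [])
                continue
        sections[current].append(text)
    return sections
```

A real pipeline would take its headings from the layout analysis step rather than from hand-labelled tuples; the sketch only shows the grouping logic.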

Installation

Install the package from PyPI:

# Base installation
pip install pyscientificpdfparser

# To include machine learning models for layout analysis and table recognition
# (the quotes keep shells such as zsh from expanding the square brackets)
pip install "pyscientificpdfparser[ml]"

# For full functionality, including LLM-based refinement
pip install "pyscientificpdfparser[ml,llm]"

Note: This package requires a system-level installation of Tesseract for OCR. Please see the full installation guide for details.
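
Because Tesseract is a system dependency, it can help to confirm that the binary is on the PATH before parsing scanned documents. The following is a small sketch using only the standard library; the helper name tesseract_version is hypothetical and not part of the package.

```python
import shutil
import subprocess
from typing import Optional


def tesseract_version(executable: str = "tesseract") -> Optional[str]:
    """Return the first line of the version banner, or None if not installed."""
    path = shutil.which(executable)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    # Depending on the Tesseract release, the banner goes to stdout or stderr.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else ""


if tesseract_version() is None:
    print("Tesseract was not found on PATH; OCR of scanned PDFs will likely fail.")
```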

Usage

Command-Line Interface (CLI)

scipdfparser process path/to/your/document.pdf --output-dir ./output
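
To run the CLI over several files from a script, one option is to build one command per PDF and pass each to subprocess. This sketch uses only the single-file invocation shown above and assumes no other flags; the helper names build_commands and run_all are hypothetical.

```python
import pathlib
import subprocess
from typing import Iterable, List


def build_commands(
    pdf_paths: Iterable[pathlib.Path], output_dir: pathlib.Path
) -> List[List[str]]:
    """Build one 'scipdfparser process' command per PDF path."""
    return [
        ["scipdfparser", "process", str(pdf), "--output-dir", str(output_dir)]
        for pdf in pdf_paths
    ]


def run_all(commands: List[List[str]]) -> None:
    """Run each command in turn, raising CalledProcessError on the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


# Example (not executed here):
# pdfs = sorted(pathlib.Path("papers").glob("*.pdf"))
# run_all(build_commands(pdfs, pathlib.Path("output")))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.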

Python API

import pathlib
from pyscientificpdfparser.core import parse_pdf

# Define the path to your PDF and the desired output directory
pdf_path = pathlib.Path("path/to/your/document.pdf")
output_dir = pathlib.Path("path/to/output")

# Create the output directory if it doesn't exist
output_dir.mkdir(parents=True, exist_ok=True)

print(f"Processing {pdf_path.name}...")

# Call the parser
# The 'document' object contains all the parsed data.
document = parse_pdf(
    pdf_path=pdf_path,
    output_dir=output_dir,
    llm_refine=False,  # Optional: set to True to enable LLM refinement
)

print("Done.")
print(f"Markdown and assets saved to: {output_dir}")
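
The same call can be wrapped to process a whole directory while keeping going after a failure. This is a sketch under assumptions: the wrapper parse_directory is hypothetical, the one-sub-directory-per-PDF layout is an illustrative choice, and the parser is passed in as an argument so the wrapper relies only on the three keyword arguments shown above.

```python
import pathlib
from typing import Any, Callable, Dict, Tuple


def parse_directory(
    pdf_dir: pathlib.Path,
    output_root: pathlib.Path,
    parse_fn: Callable[..., Any],
    llm_refine: bool = False,
) -> Tuple[Dict[str, Any], Dict[str, Exception]]:
    """Parse every *.pdf in pdf_dir; return (documents, failures) keyed by file name."""
    documents: Dict[str, Any] = {}
    failures: Dict[str, Exception] = {}
    for pdf_path in sorted(pdf_dir.glob("*.pdf")):
        # Assumption: one output sub-directory per PDF, named after the file stem.
        output_dir = output_root / pdf_path.stem
        output_dir.mkdir(parents=True, exist_ok=True)
        try:
            documents[pdf_path.name] = parse_fn(
                pdf_path=pdf_path, output_dir=output_dir, llm_refine=llm_refine
            )
        except Exception as exc:  # keep going so one bad file does not stop the batch
            failures[pdf_path.name] = exc
    return documents, failures


# Example (requires the package to be installed):
# from pyscientificpdfparser.core import parse_pdf
# docs, errors = parse_directory(pathlib.Path("papers"), pathlib.Path("out"), parse_pdf)
```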

Development

This project uses poetry for dependency management and pre-commit for code quality. It was initially developed at https://github.com/gowthamrao/pyScientificPdfParser/tree/develop

# Install development dependencies
poetry install

# Activate pre-commit hooks
poetry run pre-commit install

Download files

Source Distribution

pyscientificpdfparser-0.1.2.tar.gz (19.1 kB)

Uploaded Source

Built Distribution

pyscientificpdfparser-0.1.2-py3-none-any.whl (23.3 kB)

Uploaded Python 3

File details

Details for the file pyscientificpdfparser-0.1.2.tar.gz.

File metadata

  • Download URL: pyscientificpdfparser-0.1.2.tar.gz
  • Size: 19.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for pyscientificpdfparser-0.1.2.tar.gz
Algorithm Hash digest
SHA256 678e4a67bf2c8ff328814ac6da56e05f4590dbbae3ce7a0c86e31e68dd3202e2
MD5 378bc29ad951f050dfc65d4141a7f180
BLAKE2b-256 9bf4fd9ed76f28f9c54a4ed74d2c14302e4058933e7368d52bc83a1aa95c6ce6
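
A downloaded file can be checked against the SHA256 digest listed above with the standard library. The helper names below are hypothetical, and the sketch assumes the archive has already been downloaded to a local path.

```python
import hashlib
import pathlib

# SHA256 digest published above for pyscientificpdfparser-0.1.2.tar.gz.
EXPECTED_SDIST_SHA256 = (
    "678e4a67bf2c8ff328814ac6da56e05f4590dbbae3ce7a0c86e31e68dd3202e2"
)


def sha256_of(path: pathlib.Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_sha256(path: pathlib.Path, expected_hex: str) -> bool:
    """Compare a file's digest with an expected hex string, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()


# Example (assumes the archive is in the current directory):
# ok = verify_sha256(
#     pathlib.Path("pyscientificpdfparser-0.1.2.tar.gz"), EXPECTED_SDIST_SHA256
# )
```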


Provenance

The following attestation bundles were made for pyscientificpdfparser-0.1.2.tar.gz:

Publisher: publish.yml on OHDSI/pyscientificpdfparser


File details

Details for the file pyscientificpdfparser-0.1.2-py3-none-any.whl.

File hashes

Hashes for pyscientificpdfparser-0.1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 feceeb55f9587a56c65b14651e2437a327c52b22ffa413adbcbafe38a1e52fc8
MD5 92e7370562f8e60a582c711675f41abb
BLAKE2b-256 13b60c7a6d051cf93dff23e713c3043a5bcb05c505d50dbdfa8ddc0d6b795b34


Provenance

The following attestation bundles were made for pyscientificpdfparser-0.1.2-py3-none-any.whl:

Publisher: publish.yml on OHDSI/pyscientificpdfparser

