pyScientificPdfParser

A Python package to parse scientific PDF documents into structured Markdown, leveraging modern Document AI models.

Overview

This package provides a pipeline to process scientific PDFs (both born-digital and scanned) and convert them into structured, machine-readable formats like GitHub Flavored Markdown and JSON. It uses a series of state-of-the-art models for layout analysis, table recognition, and optional LLM-based refinement.

Features

  • PDF Processing: Handles single files, lists of files, or entire directories.
  • OCR: Uses Tesseract for robust text extraction from scanned documents.
  • Document Layout Analysis (DLA): Employs models like LayoutLMv3 or DiT to identify page regions (title, text, tables, figures).
  • Table Structure Recognition (TSR): Utilizes models like Table Transformer to parse the structure of complex tables.
  • Section Segmentation: Logically groups content into IMRaD sections.
  • LLM Refinement (Optional): Uses Large Language Models for OCR correction, text flow normalization, and structured data extraction (e.g., references).
  • Multiple Output Formats: Generates clean Markdown, structured JSON, and extracts image assets.
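
To make the section segmentation step concrete: IMRaD stands for Introduction, Methods, Results, and Discussion. The sketch below is not the package's implementation. It is a minimal keyword-based illustration, and the keyword table and the names `classify_heading` and `group_by_section` are hypothetical. It shows how heading text could be mapped to an IMRaD section and how the following blocks could be grouped under it.

```python
import re
from typing import Optional

# Hypothetical keyword table; the package's actual rules may differ.
IMRAD_KEYWORDS = {
    "introduction": ("introduction", "background"),
    "methods": ("method", "materials"),
    "results": ("result", "findings"),
    "discussion": ("discussion", "conclusion"),
}


def classify_heading(heading: str) -> Optional[str]:
    """Return the IMRaD section a heading maps to, or None if no keyword matches."""
    # Drop leading numbering such as "2." or "3.1" and normalize case.
    text = re.sub(r"^[\d.\s]+", "", heading).lower()
    for section, keywords in IMRAD_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return section
    return None


def group_by_section(blocks: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Group (kind, text) blocks under the most recently recognized heading."""
    grouped: dict[str, list[str]] = {"front_matter": []}
    current = "front_matter"
    for kind, text in blocks:
        if kind == "heading":
            # Unrecognized headings stay inside the current section.
            current = classify_heading(text) or current
            grouped.setdefault(current, [])
        else:
            grouped[current].append(text)
    return grouped
```

A combined heading such as "Results and Discussion" maps to the first matching entry in the table, which is one reason a real pipeline would likely need more than keyword matching.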

Installation

Install the package from PyPI:

# Base installation
pip install pyscientificpdfparser

# To include machine learning models for layout analysis and table recognition
# (the quotes keep shells such as zsh from expanding the square brackets)
pip install "pyscientificpdfparser[ml]"

# For full functionality, including LLM-based refinement
pip install "pyscientificpdfparser[ml,llm]"

Note: This package requires a system-level installation of Tesseract for OCR. Please see the full installation guide for details.
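
Because Tesseract is a system dependency rather than a Python one, it can help to confirm it is reachable before processing scanned documents. The following is a minimal sketch using only the standard library. It assumes the executable is named `tesseract` and is on `PATH`; the helper names are hypothetical and not part of this package.

```python
import shutil
import subprocess
from typing import Optional


def find_tesseract() -> Optional[str]:
    """Return the full path of the `tesseract` executable on PATH, or None."""
    return shutil.which("tesseract")


def tesseract_version() -> Optional[str]:
    """Return the first line of `tesseract --version`, or None if unavailable."""
    executable = find_tesseract()
    if executable is None:
        return None
    result = subprocess.run(
        [executable, "--version"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Check both streams, since the one used for version output can vary by build.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


if __name__ == "__main__":
    version = tesseract_version()
    print(version if version else "Tesseract not found on PATH")
```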

Usage

Command-Line Interface (CLI)

scipdfparser process path/to/your/document.pdf --output-dir ./output

Python API

import pathlib
from pyscientificpdfparser.core import parse_pdf

# Define the path to your PDF and the desired output directory
pdf_path = pathlib.Path("path/to/your/document.pdf")
output_dir = pathlib.Path("path/to/output")

# Create the output directory if it doesn't exist
output_dir.mkdir(parents=True, exist_ok=True)

print(f"Processing {pdf_path.name}...")

# Call the parser
# The 'document' object contains all the parsed data.
document = parse_pdf(
    pdf_path=pdf_path,
    output_dir=output_dir,
    llm_refine=False,  # Optional: set to True to enable LLM refinement
)

print("Done.")
print(f"Markdown and assets saved to: {output_dir}")
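
For more than one document, a thin wrapper around the call above is one option. This is a sketch under stated assumptions, not the package's own batch interface: `collect_pdfs` and `parse_many` are hypothetical helpers, and the parser is passed in as `parse_fn` so that only the keyword arguments shown above (`pdf_path`, `output_dir`, `llm_refine`) are relied on.

```python
import pathlib
from typing import Callable, Iterable, Union

PathLike = Union[str, pathlib.Path]


def collect_pdfs(source: Union[PathLike, Iterable[PathLike]]) -> list[pathlib.Path]:
    """Expand a file, a directory, or an iterable of paths into a sorted list."""
    if isinstance(source, (str, pathlib.Path)):
        path = pathlib.Path(source)
        if path.is_dir():
            # Recursive search; the pattern is case-sensitive on most Linux systems.
            return sorted(p for p in path.rglob("*.pdf") if p.is_file())
        return [path]
    return sorted(pathlib.Path(item) for item in source)


def parse_many(
    source: Union[PathLike, Iterable[PathLike]],
    output_root: PathLike,
    parse_fn: Callable[..., object],
) -> dict[pathlib.Path, object]:
    """Call parse_fn once per PDF, giving each document its own output folder."""
    results: dict[pathlib.Path, object] = {}
    for pdf_path in collect_pdfs(source):
        # One subdirectory per document, named after the file stem.
        # Files sharing a stem would share a folder in this simple sketch.
        output_dir = pathlib.Path(output_root) / pdf_path.stem
        output_dir.mkdir(parents=True, exist_ok=True)
        try:
            results[pdf_path] = parse_fn(
                pdf_path=pdf_path,
                output_dir=output_dir,
                llm_refine=False,
            )
        except Exception as exc:
            # Record the failure and continue with the remaining files.
            results[pdf_path] = exc
    return results
```

With the package installed, `parse_many("papers/", "output/", parse_pdf)` would process every PDF under `papers/`, where `parse_pdf` is the function imported in the example above.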

Development

This project uses Poetry for dependency management and pre-commit for code quality checks. It was initially developed at https://github.com/gowthamrao/pyScientificPdfParser/tree/develop.

# Install development dependencies
poetry install

# Activate pre-commit hooks
poetry run pre-commit install
