pyScientificPdfParser
A Python package to parse scientific PDF documents into structured Markdown, leveraging modern Document AI models.
Overview
This package provides a pipeline to process scientific PDFs (both born-digital and scanned) and convert them into structured, machine-readable formats like GitHub Flavored Markdown and JSON. It uses a series of state-of-the-art models for layout analysis, table recognition, and optional LLM-based refinement.
Features
- PDF Processing: Handles single files, lists of files, or entire directories.
- OCR: Uses Tesseract for robust text extraction from scanned documents.
- Document Layout Analysis (DLA): Employs models like LayoutLMv3 or DiT to identify page regions (title, text, tables, figures).
- Table Structure Recognition (TSR): Utilizes models like Table Transformer to parse the structure of complex tables.
- Section Segmentation: Logically groups content into IMRaD sections.
- LLM Refinement (Optional): Uses Large Language Models for OCR correction, text flow normalization, and structured data extraction (e.g., references).
- Multiple Output Formats: Generates clean Markdown, structured JSON, and extracts image assets.
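As a rough illustration of the section segmentation step, the sketch below maps heading text to IMRaD labels by keyword matching. It is not the package's implementation: the function name `classify_heading` and the keyword table `IMRAD_KEYWORDS` are assumptions made purely for this example.

```python
import re
from typing import Optional

# Hypothetical keyword table; the package's actual rules may differ.
IMRAD_KEYWORDS = {
    "introduction": ("introduction", "background"),
    "methods": ("method", "materials"),
    "results": ("result", "findings"),
    "discussion": ("discussion", "conclusion"),
}


def classify_heading(heading: str) -> Optional[str]:
    """Return an IMRaD label for a heading, or None if no keyword matches."""
    # Drop leading numbering such as "2." or "3.1" and normalise case.
    text = re.sub(r"^[\d.\s]+", "", heading).strip().lower()
    # The first label whose keyword appears in the heading wins.
    for label, keywords in IMRAD_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return label
    return None
```

A keyword lookup like this is only a baseline. Headings that combine sections (for example "Results and Discussion") resolve to whichever label comes first in the table.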
Installation
Install the package from PyPI:
# Base installation
pip install pyscientificpdfparser
# To include machine learning models for layout analysis and table recognition
pip install "pyscientificpdfparser[ml]"
# For full functionality, including LLM-based refinement
pip install "pyscientificpdfparser[ml,llm]"
Note: This package requires a system-level installation of Tesseract for OCR. Please see the full installation guide for details.
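Before running OCR, it can help to confirm that the Tesseract executable is reachable. The following is a minimal sketch using only the standard library; the helper name `find_tesseract` is an assumption for this example and is not part of the package.

```python
import shutil
from typing import Optional


def find_tesseract(executable: str = "tesseract") -> Optional[str]:
    """Return the full path of the Tesseract executable, or None if it is not on PATH."""
    return shutil.which(executable)


path = find_tesseract()
print(path or "Tesseract was not found on PATH; install it before processing scanned PDFs.")
```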
Usage
Command-Line Interface (CLI)
scipdfparser process path/to/your/document.pdf --output-dir ./output
Python API
import pathlib
from pyscientificpdfparser.core import parse_pdf
# Define the path to your PDF and the desired output directory
pdf_path = pathlib.Path("path/to/your/document.pdf")
output_dir = pathlib.Path("path/to/output")
# Create the output directory if it doesn't exist
output_dir.mkdir(parents=True, exist_ok=True)
print(f"Processing {pdf_path.name}...")
# Call the parser
# The 'document' object contains all the parsed data.
document = parse_pdf(
    pdf_path=pdf_path,
    output_dir=output_dir,
    llm_refine=False,  # Optional: set to True to enable LLM refinement
)
print("Done.")
print(f"Markdown and assets saved to: {output_dir}")
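To process every PDF in a directory, the single-file call above can be wrapped in a loop. The sketch below is one possible approach, not the package's own batch interface: `parse_directory` and its `parse_one` parameter are hypothetical names, and the parser is passed in as a callable so the helper does not depend on any particular signature.

```python
import pathlib
from typing import Any, Callable, Dict


def parse_directory(
    input_dir: pathlib.Path,
    output_root: pathlib.Path,
    parse_one: Callable[[pathlib.Path, pathlib.Path], Any],
) -> Dict[str, Any]:
    """Run parse_one on each *.pdf in input_dir, giving each file its own output folder."""
    results: Dict[str, Any] = {}
    for pdf_path in sorted(input_dir.glob("*.pdf")):
        # One sub-directory per document keeps extracted assets separate.
        target = output_root / pdf_path.stem
        target.mkdir(parents=True, exist_ok=True)
        try:
            results[pdf_path.name] = parse_one(pdf_path, target)
        except Exception as exc:
            # Record the failure and continue with the remaining files.
            results[pdf_path.name] = exc
    return results
```

With the package installed, `parse_one` could wrap the call shown above, for example `lambda pdf, out: parse_pdf(pdf_path=pdf, output_dir=out, llm_refine=False)`.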
Development
This project uses poetry for dependency management and pre-commit for code quality.
Initially developed at https://github.com/gowthamrao/pyScientificPdfParser/tree/develop.
# Install development dependencies
poetry install
# Activate pre-commit hooks
poetry run pre-commit install
File details
Details for the file pyscientificpdfparser-0.2.0.tar.gz.
File metadata
- Download URL: pyscientificpdfparser-0.2.0.tar.gz
- Upload date:
- Size: 19.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | aa139f1d43cc7885cc7714b10faadca739b1821477cc11f94599fb254135c547 |
| MD5 | fba9a1320181eabb3144d6de79c4f775 |
| BLAKE2b-256 | 6c33a4ce2a2598307b2d74cc50e45f380be5f3efabe4fccc7ea195fe406a1f29 |
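A downloaded file can be checked against the SHA256 digest listed above. The sketch below uses only `hashlib` from the standard library; the helper names `sha256_of_file` and `matches_published_digest` are assumptions for this example, and the local file path is whatever location the archive was saved to.

```python
import hashlib


def sha256_of_file(path, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For an intact copy of the source distribution, `matches_published_digest("pyscientificpdfparser-0.2.0.tar.gz", "aa139f1d43cc7885cc7714b10faadca739b1821477cc11f94599fb254135c547")` should return `True`.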
Provenance
The following attestation bundles were made for pyscientificpdfparser-0.2.0.tar.gz:
- Publisher: publish.yml on OHDSI/pyscientificpdfparser
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pyscientificpdfparser-0.2.0.tar.gz
- Subject digest: aa139f1d43cc7885cc7714b10faadca739b1821477cc11f94599fb254135c547
- Sigstore transparency entry: 521162153
- Sigstore integration time:
- Permalink: OHDSI/pyscientificpdfparser@4e41f231efb64b413f5df5de27944a71aab9751d
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/OHDSI
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@4e41f231efb64b413f5df5de27944a71aab9751d
- Trigger Event: release
File details
Details for the file pyscientificpdfparser-0.2.0-py3-none-any.whl.
File metadata
- Download URL: pyscientificpdfparser-0.2.0-py3-none-any.whl
- Upload date:
- Size: 23.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c04e320b1595f796da037c7e1a325fc1669fefaabd288da0f219d9b380558b04 |
| MD5 | 9fe706de5e8c2aa471b9b4b03961b5cb |
| BLAKE2b-256 | 72a1cff1fac8dd7938aacb3a73f66197991e8e531198e00f184bb25386cb7241 |
Provenance
The following attestation bundles were made for pyscientificpdfparser-0.2.0-py3-none-any.whl:
- Publisher: publish.yml on OHDSI/pyscientificpdfparser
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pyscientificpdfparser-0.2.0-py3-none-any.whl
- Subject digest: c04e320b1595f796da037c7e1a325fc1669fefaabd288da0f219d9b380558b04
- Sigstore transparency entry: 521162159
- Sigstore integration time:
- Permalink: OHDSI/pyscientificpdfparser@4e41f231efb64b413f5df5de27944a71aab9751d
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/OHDSI
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@4e41f231efb64b413f5df5de27944a71aab9751d
- Trigger Event: release