Extract clean, structured text from scientific papers in PDF format

These details have not been verified by PyPI

Project links

Homepage

Project description

Science-OCR

Science-OCR is a lightweight, ready-to-use Python package designed to extract clean, structured text from scientific papers in PDF format. It wraps a simple interface around Surya-OCR's layout, text_detection, and text_recognition models, while using PyMuPDF (fitz) to rasterize PDF pages into images for processing.

This tool is ideal for researchers, data scientists, and developers who want reliable OCR extraction from research papers—without dealing with complicated pipelines.

✨ Features

📄 Optimized for scientific PDFs
🔍 High-accuracy OCR with Surya-OCR (layout + detection + recognition)
🧩 Minimal API — only one method to use
🐍 Easy installation via pip
🚀 Zero setup — works out-of-the-box

📦 Installation

pip install science-ocr

No additional configuration required — models load automatically.

🚀 Quick Start

from science_ocr import ScienceOCR

ocr = ScienceOCR()

text = ocr.parse_text(
    path="path/to/paper.pdf",
    first_page=0,      # optional
    last_page=None,    # optional
    dpi=300            # optional
)

print(text)
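
To process several papers at once, the call above can be wrapped in a small loop. The following is a minimal sketch, not part of Science-OCR: `ocr_directory` is a hypothetical helper, and it assumes only that the object passed in exposes `parse_text(path=..., first_page=..., last_page=..., dpi=...)` as shown in the Quick Start.

```python
from pathlib import Path


def ocr_directory(ocr, pdf_dir, out_dir, **parse_kwargs):
    """Run ocr.parse_text on every PDF in pdf_dir and save one .txt per paper.

    Hypothetical helper (not part of Science-OCR). `ocr` is any object with a
    parse_text(path=..., ...) method, e.g. a ScienceOCR instance.
    """
    pdf_dir, out_dir = Path(pdf_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    written = []
    # Sort for a deterministic processing order.
    for pdf_path in sorted(pdf_dir.glob("*.pdf")):
        text = ocr.parse_text(path=str(pdf_path), **parse_kwargs)
        txt_path = out_dir / (pdf_path.stem + ".txt")
        txt_path.write_text(text, encoding="utf-8")
        written.append(txt_path)
    return written
```

A typical call would look like `ocr_directory(ScienceOCR(), "papers/", "texts/", dpi=300)`. Passing a single instance in means it is constructed only once for the whole batch.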

📘 API Reference

`parse_text(self, path, first_page=0, last_page=None, dpi=300)`

Extracts OCR text from a PDF.

Parameter	Type	Default	Description
path	str	—	Path to the PDF file.
first_page	int	0	0-indexed first page to process.
last_page	int | None	None	Last page index (inclusive). If `None`, processes until the final page.
dpi	int	300	Rasterization DPI for PyMuPDF before OCR.

Returns: A single string containing the concatenated OCR text from the selected page range.
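
The page-range semantics in the table can be illustrated with a small sketch. `resolve_page_range` is a hypothetical helper written for this example, not a function exported by Science-OCR. How the package itself handles out-of-range values is not documented here, so the error behavior below is an assumption.

```python
def resolve_page_range(page_count, first_page=0, last_page=None):
    """Return the 0-indexed page numbers selected by first_page/last_page.

    Illustrative only: last_page is inclusive, and None means "through the
    final page". Raising on an invalid range is an assumption of this sketch.
    """
    if last_page is None:
        last_page = page_count - 1
    if not 0 <= first_page <= last_page < page_count:
        raise ValueError(
            f"invalid range {first_page}..{last_page} for {page_count} pages"
        )
    # range() excludes its stop value, so add 1 to keep last_page inclusive.
    return list(range(first_page, last_page + 1))


print(resolve_page_range(10, first_page=2, last_page=4))  # [2, 3, 4]
print(len(resolve_page_range(10)))                        # 10
```

With `first_page=2` and `last_page=4`, three pages are processed (the third, fourth, and fifth pages of the PDF), because both ends of the range are included.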

🧠 How It Works (Behind the Scenes)

1. PyMuPDF (fitz) loads the PDF and renders each page at the specified DPI.
2. Each rendered page image is passed through Surya-OCR:
   - layout model to detect structure
   - text_detection model to find text regions
   - text_recognition model to extract text
3. Results are merged and returned as clean, readable text.

This hybrid pipeline is optimized for the complex layouts of scientific literature (equations, tables, multi-column layouts, etc.).
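
The rasterization step can be sketched as follows. This is not the package's actual source; it is a minimal illustration using PyMuPDF calls (`fitz.open`, `Document.load_page`, `Page.get_pixmap`), and the function names `pixel_size` and `rasterize_pages` are made up for the example. PDF page sizes are measured in points (72 per inch), which is why the DPI setting determines the size of the image handed to the OCR models.

```python
def pixel_size(width_pt, height_pt, dpi=300):
    """Pixel dimensions of a page rendered at `dpi` (PDF points are 1/72 inch)."""
    return round(width_pt * dpi / 72), round(height_pt * dpi / 72)


def rasterize_pages(path, first_page=0, last_page=None, dpi=300):
    """Yield (page_index, png_bytes) for a 0-indexed, inclusive page range."""
    import fitz  # PyMuPDF; imported here so pixel_size works without it installed

    with fitz.open(path) as doc:
        # last_page is inclusive, so the exclusive stop is last_page + 1.
        stop = doc.page_count if last_page is None else last_page + 1
        for index in range(first_page, stop):
            pixmap = doc.load_page(index).get_pixmap(dpi=dpi)
            yield index, pixmap.tobytes("png")


# US Letter is 612 x 792 points (8.5 x 11 inches).
print(pixel_size(612, 792, dpi=300))  # (2550, 3300)
```

At the default 300 DPI a Letter page becomes an image of roughly 8.4 megapixels, so lowering `dpi` reduces memory use and processing time, likely at some cost to recognition of small text such as subscripts and footnotes.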

📦 Model Weights

This package uses Surya-OCR models that are mirrored on HuggingFace for reliability:

Mirror: https://huggingface.co/TomasGD/surya-ocr-mirror-models-2025_05_07

Models are subject to Surya's licensing terms (see MODEL_LICENSE).

🤝 Contributing

Pull requests and suggestions are welcome! If you encounter any issues, please open an issue on the project's repository.

📄 License

Science-OCR is licensed under AGPL-3.0. Its dependencies carry their own licenses:

PyMuPDF: AGPL-3.0
Surya-OCR: code is GPL-3.0; models are under the AI Pubs Open Rail-M license (free for research, personal use, and startups under $2M funding/revenue)

For commercial use exceeding $2M funding/revenue, you must obtain commercial licenses for the dependencies.

Disclaimer

The maintainers of Science-OCR are not responsible for ensuring your compliance with third-party licenses. It is your responsibility to review and comply with all applicable licenses for dependencies used in this project.

Project details

Release history

0.3.0: Jan 13, 2026
0.2.0: Dec 3, 2025 (this version)
0.1.0: Dec 1, 2025

Download files

Source Distribution

science_ocr-0.2.0.tar.gz (16.1 kB)

Uploaded Dec 3, 2025 Source

Built Distribution

science_ocr-0.2.0-py3-none-any.whl (17.3 kB)

Uploaded Dec 3, 2025 Python 3

File details

Details for the file science_ocr-0.2.0.tar.gz.

File metadata

Download URL: science_ocr-0.2.0.tar.gz
Upload date: Dec 3, 2025
Size: 16.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for science_ocr-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`7354c21bf3941afa4d602857447e8ffabd00eb164011b4f144f70e7e498e4e35`
MD5	`0b614ff4a0cc956cdf2aee763fdad329`
BLAKE2b-256	`8c2968db5218b433b861e91396f8ae23c985d91111c74bd8918a1f7677e544a0`

File details

Details for the file science_ocr-0.2.0-py3-none-any.whl.

File metadata

Download URL: science_ocr-0.2.0-py3-none-any.whl
Upload date: Dec 3, 2025
Size: 17.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for science_ocr-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`4287c698a92f15157c6eb854d61b6687bec6c5a67eba8eb2b436b5c09f7f6d6e`
MD5	`5d1739b3542fcdea6b401fc94359130c`
BLAKE2b-256	`8269aa98ab2a9c67a5d7ed68e473882c337b14e4db0ff2518e69d04587f0dc31`
