Extract clean, structured text from scientific papers in PDF format
Science-OCR
Science-OCR is a lightweight, ready-to-use Python package designed to extract clean, structured text from scientific papers in PDF format. It wraps a simple interface around Surya-OCR's layout, text_detection, and text_recognition models, while using PyMuPDF (fitz) to rasterize PDF pages into images for processing.
This tool is ideal for researchers, data scientists, and developers who want reliable OCR extraction from research papers—without dealing with complicated pipelines.
✨ Features
- 📄 Optimized for scientific PDFs
- 🔍 High-accuracy OCR with Surya-OCR (layout + detection + recognition)
- 🧩 Minimal API — only one method to use
- 🐍 Easy installation via pip
- 🚀 Zero setup: works out of the box
📦 Installation
pip install science-ocr
No additional configuration required — models load automatically.
🚀 Quick Start
from science_ocr import ScienceOCR
ocr = ScienceOCR()
text = ocr.parse_text(
path="path/to/paper.pdf",
first_page=0, # optional
last_page=None, # optional
dpi=300 # optional
)
print(text)
📘 API Reference
parse_text(self, path, first_page=0, last_page=None, dpi=300)
Extracts OCR text from a PDF.
| Parameter | Type | Default | Description |
|---|---|---|---|
| path | str | — | Path to the PDF file. |
| first_page | int | 0 | 0-indexed first page to process. |
| last_page | int or None | None | Last page index (inclusive). If None, processes until the final page. |
| dpi | int | 300 | Rasterization DPI for PyMuPDF before OCR. |
Returns: A single string containing the concatenated OCR text from the selected page range.
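The page-range semantics in the table above (0-indexed `first_page`, inclusive `last_page`, `None` meaning the final page) can be illustrated with a small standalone sketch. The helper name `resolve_page_range` is hypothetical and is not part of Science-OCR. How the package itself handles out-of-range values is not documented here, so raising `ValueError` is an assumption of this example.

```python
from typing import Optional


def resolve_page_range(n_pages: int, first_page: int = 0,
                       last_page: Optional[int] = None) -> range:
    """Return the 0-indexed pages selected by first_page..last_page (inclusive).

    Hypothetical helper: it mirrors the documented parameter semantics,
    not the package's internal code.
    """
    if n_pages <= 0:
        raise ValueError("document has no pages")
    # None means "through the final page".
    last = n_pages - 1 if last_page is None else last_page
    # Assumption: invalid ranges are rejected rather than clamped.
    if not 0 <= first_page <= last < n_pages:
        raise ValueError(
            f"invalid page range {first_page}..{last} for {n_pages} pages"
        )
    # last_page is inclusive, so the range end is last + 1.
    return range(first_page, last + 1)


print(list(resolve_page_range(10, 2, 4)))  # [2, 3, 4]
print(list(resolve_page_range(3)))         # [0, 1, 2]
```

Under these semantics, `first_page=2, last_page=4` selects three pages, which is easy to get wrong if `last_page` is read as an exclusive bound.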
🧠 How It Works (Behind the Scenes)
- PyMuPDF (fitz) loads the PDF and renders each page at the specified DPI.
- Each rendered page image is passed through Surya-OCR:
  - layout model to detect structure
  - text_detection model to find text regions
  - text_recognition model to extract text
- Results are merged and returned as clean, readable text.
This hybrid pipeline is optimized for the complex layouts of scientific literature (equations, tables, multi-column layouts, etc.).
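To get a feel for what the `dpi` setting means in the rendering step, recall that PDF page sizes are expressed in points, where one point is 1/72 inch. The sketch below estimates the raster size and the uncompressed RGB footprint of one page. The names `rendered_size` and `rgb_bytes` are hypothetical, and the figures are approximations: the exact pixel counts PyMuPDF produces may differ by a pixel because of rounding.

```python
POINTS_PER_INCH = 72  # PDF user-space unit: 1 point = 1/72 inch


def rendered_size(width_pt: float, height_pt: float, dpi: int = 300):
    """Approximate pixel dimensions of a page rasterized at the given DPI."""
    width_px = round(width_pt * dpi / POINTS_PER_INCH)
    height_px = round(height_pt * dpi / POINTS_PER_INCH)
    return width_px, height_px


def rgb_bytes(width_px: int, height_px: int) -> int:
    """Uncompressed size of an 8-bit RGB image (3 bytes per pixel)."""
    return width_px * height_px * 3


# US Letter is 612 x 792 points.
w, h = rendered_size(612, 792, dpi=300)
print(w, h)             # 2550 3300
print(rgb_bytes(w, h))  # 25245000 (about 25 MB per page before OCR)
```

Lowering `dpi` shrinks each page image quadratically, which trades recognition quality on small text (subscripts, footnotes) for speed and memory.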
📦 Model Weights
This package uses Surya-OCR models that are mirrored on HuggingFace for reliability:
- Mirror: https://huggingface.co/TomasGD/surya-ocr-mirror-models-2025_05_07
Models are subject to Surya's licensing terms (see MODEL_LICENSE).
🤝 Contributing
Pull requests and suggestions are welcome! If you encounter any issues, please open an issue on the project's repository.
📄 License
Science-OCR is licensed under AGPL-3.0, but depends on:
- PyMuPDF: AGPL-3.0
- Surya-OCR: code is GPL-3.0; models are under the AI Pubs Open Rail-M license (free for research, personal use, and startups under $2M funding/revenue)
For commercial use exceeding $2M funding/revenue, you must obtain commercial licenses for the dependencies.
Disclaimer
The maintainers of Science-OCR are not responsible for ensuring your compliance with third-party licenses. It is your responsibility to review and comply with all applicable licenses for dependencies used in this project.
File details
Details for the file science_ocr-0.1.0.tar.gz.
File metadata
- Download URL: science_ocr-0.1.0.tar.gz
- Upload date:
- Size: 16.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e12c3bfbbd836e7d9ab40e296e6445dd182b2eb1e86b4561fab06e9897f18361 |
| MD5 | 52f28c7c5467f5b7f1d5ded0a1d6b1a9 |
| BLAKE2b-256 | 9b856022be66b8821e51886ad8795e586e1ded610444245e036c78c40de99386 |
File details
Details for the file science_ocr-0.1.0-py3-none-any.whl.
File metadata
- Download URL: science_ocr-0.1.0-py3-none-any.whl
- Upload date:
- Size: 17.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e0b8e2aa37bda75091c3924737abc3ca3a82b57300a0d8e4d377f68f6ca2988c |
| MD5 | 87eb7fca132ddbc68e2dd87042c5da72 |
| BLAKE2b-256 | c6a51caa656d74e40acc6df2fcebb324ffd90e736e7d6a3644214cdc9c697e76 |