Extract clean, structured text from scientific papers in PDF format

Science-OCR

Science-OCR is a lightweight, ready-to-use Python package designed to extract clean, structured text from scientific papers in PDF format. It wraps a simple interface around Surya-OCR's layout, text_detection, and text_recognition models, while using PyMuPDF (fitz) to rasterize PDF pages into images for processing.

This tool is ideal for researchers, data scientists, and developers who want reliable OCR extraction from research papers—without dealing with complicated pipelines.

✨ Features

  • 📄 Optimized for scientific PDFs
  • 🔍 High-accuracy OCR with Surya-OCR (layout + detection + recognition)
  • 🧩 Minimal API — only one method to use
  • 🐍 Easy installation via pip
  • 🚀 Zero setup — works out-of-the-box

📦 Installation

pip install science-ocr

No additional configuration required — models load automatically.

🚀 Quick Start

from science_ocr import ScienceOCR

ocr = ScienceOCR()

text = ocr.parse_text(
    path="path/to/paper.pdf",
    first_page=0,      # optional
    last_page=None,    # optional
    dpi=300            # optional
)

print(text)
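
For a folder of papers, the same call can be wrapped in a small loop. The sketch below is an illustration and not part of the package: `extract_folder` is a hypothetical helper, and it takes the parsing function as an argument so the loop can be tried without loading any models. It assumes only the `path` and `dpi` keyword arguments shown above.

```python
from pathlib import Path
from typing import Callable


def extract_folder(folder: str, parse: Callable[..., str], dpi: int = 300) -> dict[str, Path]:
    """Run `parse` on every PDF in `folder` and write a .txt file beside each one.

    `parse` is expected to accept the keyword arguments `path` and `dpi`
    and to return the extracted text as a string.
    """
    written = {}
    for pdf in sorted(Path(folder).glob("*.pdf")):
        text = parse(path=str(pdf), dpi=dpi)
        out = pdf.with_suffix(".txt")  # paper.pdf -> paper.txt
        out.write_text(text, encoding="utf-8")
        written[pdf.name] = out
    return written
```

In real use one would pass the bound method, for example `extract_folder("papers/", ScienceOCR().parse_text)`, so the models are loaded once and reused for every file.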

📘 API Reference

parse_text(self, path, first_page=0, last_page=None, dpi=300)

Extracts OCR text from a PDF.

Parameter    Type         Default     Description
path         str          (required)  Path to the PDF file.
first_page   int          0           0-indexed first page to process.
last_page    int | None   None        Last page index (inclusive). If None, processes until the final page.
dpi          int          300         Rasterization DPI for PyMuPDF before OCR.

Returns: A single string containing the concatenated OCR text from the selected page range.
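
The page-range semantics (0-indexed, inclusive `last_page`, `None` meaning "to the end") can be illustrated with a small standalone function. This is a sketch of the documented behaviour, not the package's code: `resolve_page_range` is a hypothetical name, and raising `ValueError` for out-of-range values is an assumption, since the handling of invalid ranges is not described here.

```python
from typing import Optional


def resolve_page_range(
    page_count: int, first_page: int = 0, last_page: Optional[int] = None
) -> list[int]:
    """Return the 0-indexed pages selected by an inclusive [first_page, last_page] range."""
    if page_count <= 0:
        raise ValueError("document has no pages")
    if last_page is None:
        last_page = page_count - 1  # None means "through the final page"
    if not 0 <= first_page <= last_page < page_count:
        raise ValueError(
            f"invalid range {first_page}..{last_page} for {page_count} pages"
        )
    return list(range(first_page, last_page + 1))
```

For a 10-page paper, `first_page=2, last_page=4` selects three pages (indices 2, 3 and 4), which corresponds to the third through fifth pages of the PDF.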

🧠 How It Works (Behind the Scenes)

  1. PyMuPDF (fitz) loads the PDF and renders each page at the specified DPI.
  2. Each rendered page image is passed through Surya-OCR:
    • layout model to detect structure
    • text_detection model to find text regions
    • text_recognition model to extract text
  3. Results are merged and returned as clean, readable text.

This hybrid pipeline is optimized for the complex layouts of scientific literature (equations, tables, multi-column layouts, etc.).
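
Step 1 can be sketched with PyMuPDF alone. The code below is a minimal illustration under stated assumptions, not the package's implementation: `pixel_size` and `rasterize` are hypothetical names, and the Surya-OCR stages are left out. PDF page sizes are measured in points (72 per inch), so the DPI setting determines how many pixels each page image contains, and therefore how much work the OCR models have to do.

```python
POINTS_PER_INCH = 72  # PDF user-space units per inch


def pixel_size(width_pt: float, height_pt: float, dpi: int = 300) -> tuple[int, int]:
    """Approximate pixel dimensions of a page rendered at `dpi`."""
    return (
        round(width_pt * dpi / POINTS_PER_INCH),
        round(height_pt * dpi / POINTS_PER_INCH),
    )


def rasterize(path, first_page=0, last_page=None, dpi=300):
    """Yield (page_index, png_bytes) for each page in the inclusive range."""
    import fitz  # PyMuPDF, imported lazily so pixel_size works without it

    with fitz.open(path) as doc:
        stop = doc.page_count - 1 if last_page is None else last_page
        for index in range(first_page, stop + 1):
            pix = doc.load_page(index).get_pixmap(dpi=dpi)
            yield index, pix.tobytes("png")
```

A US Letter page (612 × 792 pt) at the default 300 DPI comes out at roughly 2550 × 3300 pixels; PyMuPDF's own rounding may differ by a pixel. Lowering `dpi` reduces memory use and processing time, at the cost of less detail for small text such as subscripts and footnotes.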

📦 Model Weights

This package uses Surya-OCR models that are mirrored on HuggingFace for reliability:

  • Mirror: https://huggingface.co/TomasGD/surya-ocr-mirror-models-2025_05_07

Models are subject to Surya's licensing terms (see MODEL_LICENSE).

🤝 Contributing

Pull requests and suggestions are welcome! If you encounter any issues, please open an issue on the project's repository.

📄 License

Science-OCR is licensed under AGPL-3.0 and depends on the following components, each with its own license:

  • PyMuPDF: AGPL-3.0
  • Surya-OCR: the code is licensed under GPL-3.0; the models are under the AI Pubs Open Rail-M license (free for research, personal use, and startups with under $2M in funding or revenue)

For commercial use by organizations above $2M in funding or revenue, you must obtain commercial licenses for the dependencies.

Disclaimer

The maintainers of Science-OCR are not responsible for ensuring your compliance with third-party licenses. It is your responsibility to review and comply with all applicable licenses for dependencies used in this project.
