Extract clean, structured text from scientific papers in PDF format

Science-OCR

Science-OCR is a lightweight, ready-to-use Python package designed to extract clean, structured text from scientific papers in PDF format. It wraps a simple interface around Surya-OCR's layout, text_detection, and text_recognition models, while using PyMuPDF (fitz) to rasterize PDF pages into images for processing.

This tool is aimed at researchers, data scientists, and developers who want OCR text extraction from research papers without assembling a multi-stage pipeline themselves.

✨ Features

  • 📄 Optimized for scientific PDFs
  • 🔍 High-accuracy OCR with Surya-OCR (layout + detection + recognition)
  • 🧩 Minimal API — only one method to use
  • 🐍 Easy installation via pip
  • 🚀 Zero setup — works out-of-the-box

📦 Installation

pip install science-ocr

No additional configuration required — models load automatically.

🚀 Quick Start

from science_ocr import ScienceOCR

ocr = ScienceOCR()

text = ocr.parse_text(
    path="path/to/paper.pdf",
    first_page=0,      # optional
    last_page=None,    # optional
    dpi=300            # optional
)

print(text)
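
To run the same call over a folder of papers, a small helper along these lines may be enough. This is only a sketch: `extract_folder` and its arguments are hypothetical names, not part of Science-OCR. It relies solely on the `parse_text(path=...)` call shown above and accepts any object that provides it.

```python
from pathlib import Path


def extract_folder(ocr, pdf_dir, out_dir):
    """Run ocr.parse_text on every PDF in pdf_dir and save one .txt per paper.

    `ocr` is any object exposing parse_text(path=...), for example a
    ScienceOCR instance. Illustrative helper, not part of the package.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)

    written = []
    # Sort so the processing order is deterministic.
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        text = ocr.parse_text(path=str(pdf))
        target = out_path / (pdf.stem + ".txt")
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written
```

With the package installed, a call such as `extract_folder(ScienceOCR(), "papers", "texts")` would write one text file per PDF. Passing the OCR object in, rather than constructing it inside the helper, means it is created once and reused for every file.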

📘 API Reference

parse_text(self, path, first_page=0, last_page=None, dpi=300)

Extracts OCR text from a PDF.

Parameter    Type         Default      Description
path         str          (required)   Path to the PDF file.
first_page   int          0            0-indexed first page to process.
last_page    int | None   None         Last page index (inclusive). If None, processes until the final page.
dpi          int          300          Rasterization DPI for PyMuPDF before OCR.

Returns: A single string containing the concatenated OCR text from the selected page range.
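
The `first_page` / `last_page` semantics can be made concrete with a short helper. This is a sketch of the documented behavior (0-indexed, inclusive, `None` meaning the final page), not the package's internal code. The name `resolve_page_range` and the bounds checks are assumptions added for illustration.

```python
def resolve_page_range(page_count, first_page=0, last_page=None):
    """Return the 0-indexed page numbers selected by an inclusive range."""
    if page_count <= 0:
        raise ValueError("document has no pages")
    if last_page is None:
        last_page = page_count - 1  # None means "through the final page"
    if not 0 <= first_page <= last_page < page_count:
        raise ValueError(
            f"invalid range {first_page}..{last_page} for {page_count} pages"
        )
    # last_page is inclusive, so add 1 for Python's half-open range().
    return list(range(first_page, last_page + 1))


print(resolve_page_range(3))          # [0, 1, 2]
print(resolve_page_range(10, 2, 4))   # [2, 3, 4]
```

Note that `first_page=0, last_page=0` selects exactly one page, which is the main practical consequence of the inclusive upper bound.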

🧠 How It Works (Behind the Scenes)

  1. PyMuPDF (fitz) loads the PDF and renders each page at the specified DPI.
  2. Each rendered page image is passed through Surya-OCR:
    • layout model to detect structure
    • text_detection model to find text regions
    • text_recognition model to extract text
  3. Results are merged and returned as clean, readable text.

This hybrid pipeline is optimized for the complex layouts of scientific literature (equations, tables, multi-column layouts, etc.).
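
Step 1 can be sketched as follows, to show what the `dpi` setting means for image size. PDF page dimensions are measured in points (1/72 inch), so the pixel size grows linearly with DPI. `pixel_size` and `render_pages` are hypothetical names; `render_pages` assumes a PyMuPDF version recent enough to accept `get_pixmap(dpi=...)` and is not Science-OCR's actual implementation. The exact pixel counts PyMuPDF produces may differ by a pixel due to rounding.

```python
PDF_POINTS_PER_INCH = 72  # PDF page dimensions are expressed in points


def pixel_size(width_pt, height_pt, dpi=300):
    """Approximate pixel dimensions of a page rasterized at `dpi`."""
    return (
        round(width_pt * dpi / PDF_POINTS_PER_INCH),
        round(height_pt * dpi / PDF_POINTS_PER_INCH),
    )


def render_pages(path, first_page=0, last_page=None, dpi=300):
    """Yield (page_index, pixmap) for an inclusive, 0-indexed page range.

    Illustrative only; requires PyMuPDF to be installed.
    """
    import fitz  # PyMuPDF, imported lazily so pixel_size works without it

    with fitz.open(path) as doc:
        stop = doc.page_count - 1 if last_page is None else last_page
        for index in range(first_page, stop + 1):
            yield index, doc[index].get_pixmap(dpi=dpi)


# US Letter is 612 x 792 points.
print(pixel_size(612, 792, dpi=300))  # (2550, 3300)
print(pixel_size(612, 792, dpi=150))  # (1275, 1650)
```

Halving the DPI cuts the pixel count to a quarter, which reduces the work the OCR models have to do per page but may cost accuracy on small text such as subscripts and footnotes.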

📦 Model Weights

This package uses Surya-OCR models that are mirrored on HuggingFace for reliability:

  • Mirror: https://huggingface.co/TomasGD/surya-ocr-mirror-models-2025_05_07

Models are subject to Surya's licensing terms (see MODEL_LICENSE).

🤝 Contributing

Pull requests and suggestions are welcome! If you encounter any issues, please open an issue on the project's repository.

📄 License

Science-OCR is licensed under AGPL-3.0. It depends on the following components, each with its own license:

  • PyMuPDF: AGPL-3.0
  • Surya-OCR: code is GPL-3.0; models are under the AI Pubs Open Rail-M license (free for research, personal use, and startups under $2M in funding/revenue)

For commercial use exceeding $2M funding/revenue, you must obtain commercial licenses for the dependencies.

Disclaimer

The maintainers of Science-OCR are not responsible for ensuring your compliance with third-party licenses. It is your responsibility to review and comply with all applicable licenses for dependencies used in this project.
