Extract clean, structured text from scientific papers in PDF format

Science-OCR

Science-OCR is a lightweight, ready-to-use Python package designed to extract clean, structured text from scientific papers in PDF format. It wraps a simple interface around Surya-OCR's layout, text_detection, and text_recognition models, while using PyMuPDF (fitz) to rasterize PDF pages into images for processing.

This tool is aimed at researchers, data scientists, and developers who want OCR extraction from research papers without building and maintaining a multi-model pipeline themselves.

✨ Features

  • 📄 Optimized for scientific PDFs
  • 🔍 High-accuracy OCR with Surya-OCR (layout + detection + recognition)
  • 🧩 Minimal API — only one method to use
  • 🐍 Easy installation via pip
  • 🚀 Zero setup — works out-of-the-box

📦 Installation

pip install science-ocr

No additional configuration required — models load automatically.

🚀 Quick Start

from science_ocr import ScienceOCR

ocr = ScienceOCR()

text = ocr.parse_text(
    path="path/to/paper.pdf",
    first_page=0,      # optional
    last_page=None,    # optional
    dpi=300            # optional
)

print(text)
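
The same call extends naturally to a folder of papers. The sketch below is one possible batch helper, not part of the package: `ocr_directory` and its arguments are hypothetical names, and it only assumes that the object passed in exposes `parse_text(path=..., dpi=...)` as shown in the Quick Start.

```python
from pathlib import Path


def ocr_directory(ocr, pdf_dir, out_dir, dpi=300):
    """Run ocr.parse_text on every PDF in pdf_dir and write one .txt per paper.

    `ocr` is any object exposing parse_text(path=..., dpi=...), for example a
    ScienceOCR instance. Returns the list of text files written.
    """
    pdf_dir, out_dir = Path(pdf_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # Sorted so that output order is stable across runs.
    for pdf_path in sorted(pdf_dir.glob("*.pdf")):
        text = ocr.parse_text(path=str(pdf_path), dpi=dpi)
        target = out_dir / (pdf_path.stem + ".txt")
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written


# Usage with the real engine (requires science-ocr to be installed):
#   from science_ocr import ScienceOCR
#   ocr_directory(ScienceOCR(), "papers", "papers_txt")
```

Passing the engine in as an argument means the models are loaded once and reused for every file.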

📘 API Reference

class ScienceOCR(use_gpu=True)

Initializes the OCR engine.

Parameters (name, type, default, description):

  • use_gpu (bool, default True): If True, uses the GPU when one is available. If False, forces CPU execution, which may be slower but is more stable on some systems and avoids GPU memory issues.
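
One way to use this flag is to try the GPU first and retry on the CPU if the run fails. The helper below is a hypothetical sketch, not part of the package: `parse_with_cpu_fallback` is an invented name, and treating `RuntimeError` as the signal for a GPU failure is an assumption that you should adjust to the errors you actually observe.

```python
def parse_with_cpu_fallback(engine_cls, path, fallback_errors=(RuntimeError,),
                            **parse_kwargs):
    """Parse `path` on the GPU, retrying once on the CPU if that fails.

    engine_cls is a class that accepts use_gpu=... and whose instances expose
    parse_text(path=..., ...), such as ScienceOCR. The exception types in
    fallback_errors are an assumption of this sketch.
    """
    try:
        return engine_cls(use_gpu=True).parse_text(path=path, **parse_kwargs)
    except fallback_errors:
        # Build a fresh CPU-only engine and repeat the same request.
        return engine_cls(use_gpu=False).parse_text(path=path, **parse_kwargs)


# Usage with the real engine (requires science-ocr to be installed):
#   from science_ocr import ScienceOCR
#   text = parse_with_cpu_fallback(ScienceOCR, "path/to/paper.pdf", dpi=300)
```

The retry constructs a second engine, so the models are loaded twice when the fallback is taken.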

parse_text(self, path, first_page=0, last_page=None, dpi=300)

Extracts OCR text from a PDF.

Parameters (name, type, default, description):

  • path (str, required): Path to the PDF file.
  • first_page (int, default 0): 0-indexed first page to process.
  • last_page (int | None, default None): Last page index (inclusive). If None, processes until the final page.
  • dpi (int, default 300): Rasterization DPI for PyMuPDF before OCR.

Returns: A single string containing the concatenated OCR text from the selected page range.
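
To make the page-range semantics concrete (0-indexed, inclusive `last_page`, `None` meaning "through the final page"), here is a small illustrative helper. `resolve_page_range` is a hypothetical name and is not part of the package. How Science-OCR itself treats out-of-range values is not documented here, so raising `ValueError` is this sketch's own choice.

```python
def resolve_page_range(page_count, first_page=0, last_page=None):
    """Return the 0-indexed page numbers selected by first_page/last_page.

    last_page is inclusive; None selects everything up to the final page.
    """
    if last_page is None:
        last_page = page_count - 1
    if not 0 <= first_page <= last_page < page_count:
        raise ValueError(
            f"invalid range {first_page}..{last_page} for {page_count} pages"
        )
    return list(range(first_page, last_page + 1))


# For a 10-page paper:
#   resolve_page_range(10)        selects pages 0 through 9
#   resolve_page_range(10, 2, 4)  selects pages 2, 3 and 4 (three pages)
```

Because the upper bound is inclusive, `first_page=2, last_page=4` covers three pages, unlike a Python slice `[2:4]`, which would cover two.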

🧠 How It Works (Behind the Scenes)

  1. PyMuPDF (fitz) loads the PDF and renders each page at the specified DPI.
  2. Each rendered page image is passed through Surya-OCR:
    • layout model to detect structure
    • text_detection model to find text regions
    • text_recognition model to extract text
  3. Results are merged and returned as clean, readable text.

This hybrid pipeline is optimized for the complex layouts of scientific literature (equations, tables, multi-column layouts, etc.).
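
The `dpi` setting in step 1 determines how large each page image is before it reaches the models. PDF page sizes are expressed in points (1 pt = 1/72 inch), so the pixel dimensions scale linearly with DPI. The sketch below estimates those dimensions and the uncompressed RGB size; the function names are hypothetical, and PyMuPDF's exact rounding may differ by a pixel.

```python
def rendered_size(width_pt, height_pt, dpi=300):
    """Approximate pixel dimensions of a page rasterized at `dpi`.

    Page sizes are given in PDF points, where 1 pt = 1/72 inch.
    """
    return round(width_pt * dpi / 72), round(height_pt * dpi / 72)


def rgb_megabytes(width_px, height_px):
    """Size of an uncompressed 8-bit RGB image in MB (10**6 bytes)."""
    return width_px * height_px * 3 / 1e6


# A US Letter page is 612 x 792 pt.
letter_300 = rendered_size(612, 792, dpi=300)   # (2550, 3300)
letter_150 = rendered_size(612, 792, dpi=150)   # (1275, 1650)
print(letter_300, rgb_megabytes(*letter_300))   # about 25 MB per page
print(letter_150, rgb_megabytes(*letter_150))   # about 6 MB per page
```

Halving the DPI cuts the pixel count to a quarter, which is one lever to try if memory is tight, at the cost of less detail for small glyphs such as subscripts.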

📦 Model Weights

This package uses Surya-OCR models that are mirrored on HuggingFace for reliability:

  • Mirror: https://huggingface.co/TomasGD/surya-ocr-mirror-models-2025_05_07

Models are subject to Surya's licensing terms (see MODEL_LICENSE).

🤝 Contributing

Pull requests and suggestions are welcome! If you encounter any issues, please open an issue on the project's repository.

📄 License

Science-OCR is licensed under AGPL-3.0. It depends on the following projects, each with its own license:

  • PyMuPDF: AGPL-3.0
  • Surya-OCR: the code is GPL-3.0; the models are under the AI Pubs Open Rail-M license (free for research, personal use, and startups under $2M funding/revenue)

For commercial use exceeding $2M funding/revenue, you must obtain commercial licenses for the dependencies.

Disclaimer

The maintainers of Science-OCR are not responsible for ensuring your compliance with third-party licenses. It is your responsibility to review and comply with all applicable licenses for dependencies used in this project.

Download files

Source Distribution

science_ocr-0.3.0.tar.gz (16.3 kB)

Uploaded Source

Built Distribution

science_ocr-0.3.0-py3-none-any.whl (17.5 kB)

Uploaded Python 3

File details

Details for the file science_ocr-0.3.0.tar.gz.

File metadata

  • Download URL: science_ocr-0.3.0.tar.gz
  • Size: 16.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for science_ocr-0.3.0.tar.gz
Algorithm Hash digest
SHA256 d5f081669ca1ed5108cac4e6f0778ddeb677aab426f64bdc4b1c437113f6fc26
MD5 d920e482821d03feb1624a7d6d964897
BLAKE2b-256 52e63bfca269c29fb206dc0d2303f2eb5737ca8749fb36a8d1e7ca6d119d93d5

File details

Details for the file science_ocr-0.3.0-py3-none-any.whl.

File metadata

  • Download URL: science_ocr-0.3.0-py3-none-any.whl
  • Size: 17.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for science_ocr-0.3.0-py3-none-any.whl
Algorithm Hash digest
SHA256 be90bad0d08f7d8bdf9574c4aa4dd83df61625f49b19c9f308e94ef70ec24239
MD5 f83d8c3fc6940b01f20f501cb1a4ae2c
BLAKE2b-256 574bd0f4f082d82677f22d54974ba3b211c5abd4a569dfd24e9a12ad8e8fd958
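
The SHA256 digests listed for each file can be checked against a local download with Python's standard `hashlib` module. This is a minimal sketch; `sha256_of` and `matches_published_hash` are hypothetical helper names, and the file path in the usage comment assumes the archive sits in the current directory.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex string."""
    return sha256_of(path) == expected_hex.strip().lower()


# Usage, with the SHA256 value copied from the listing for the sdist:
#   matches_published_hash(
#       "science_ocr-0.3.0.tar.gz",
#       "d5f081669ca1ed5108cac4e6f0778ddeb677aab426f64bdc4b1c437113f6fc26",
#   )
```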
