jupyter_anywidget_tesseract_pdfjs

tesseract.js / pdf.js anywidget for previewing PDFs and extracting text from PDFs, images, etc. in JupyterLab

Inspired by and building on Simon Willison's (@simonw) OCR tool [about], this package lets you use tesseract.js in a Jupyter notebook environment via an anywidget wrapper.

Using the anywidget framework, we can load JavaScript and WASM models into a sidebar widget and use that widget for "side-processing" with the browser's own machinery.

For example, we can use tesseract.js for OCR/text extraction on images, and pdf.js to convert PDF documents into images that can then be OCR'd with tesseract.js.
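The flow described above (PDF pages rendered to images, then each image passed to OCR) can be sketched conceptually in Python. This is only an illustration of the control flow: in the widget itself these steps run in the browser via pdf.js and tesseract.js, and the names `extract_text`, `render_pages` and `recognize` are hypothetical stand-ins, not part of this package.

```python
from typing import Callable, Iterable, List

# Hypothetical stand-ins: in the widget these steps run in the browser,
# with pdf.js rendering pages and tesseract.js recognising text.
RenderPages = Callable[[bytes], Iterable[bytes]]  # PDF bytes -> page images
Recognize = Callable[[bytes], str]                # image bytes -> text


def is_pdf(data: bytes) -> bool:
    """PDF files normally start with the magic bytes %PDF-."""
    return data.startswith(b"%PDF-")


def extract_text(
    data: bytes, render_pages: RenderPages, recognize: Recognize
) -> List[str]:
    """Return one text string per page (a single entry for a plain image)."""
    # A PDF is first rendered to one image per page; an image is used as-is.
    images = render_pages(data) if is_pdf(data) else [data]
    return [recognize(image) for image in images]
```

Passing the two steps in as callables keeps the sketch independent of any particular rendering or OCR backend.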

This reduces the number of Python dependencies that need to be installed on the host machine, albeit at the expense of loading resources into the browser.

I'm not much of a packaging expert, so some assets are likely to be loaded from a URI; ideally, everything would be bundled into the anywidget extension.

Installation

pip install jupyter_anywidget_tesseract_pdfjs

Usage

Import the jupyter_anywidget_tesseract_pdfjs package and launch a widget:

from jupyter_anywidget_tesseract_pdfjs import tesseract_panel

t = tesseract_panel()

By default, this loads the widget into a new panel using jupyterlab_sidecar.

You can then drag and drop an image file or PDF file onto the landing area, or load an image or file path from a notebook code cell.
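Because the OCR runs in the browser, a file that lives on the Python side has to be handed over in a form the browser can read. One common way to do that is a base64 data URI. The helper below is a minimal sketch using only the standard library; `file_to_data_uri` is a hypothetical name, and how the widget actually accepts image data from a code cell is not covered here.

```python
import base64
import mimetypes
from pathlib import Path


def file_to_data_uri(path: str) -> str:
    """Encode a local file as a data URI (hypothetical helper, not package API)."""
    # Guess the MIME type from the file extension, e.g. image/png for .png.
    mime, _ = mimetypes.guess_type(path)
    if mime is None:
        mime = "application/octet-stream"
    # Base64-encode the raw bytes so they can travel as plain text.
    payload = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{payload}"
```

Data URIs grow to roughly four thirds of the original file size, so for large PDFs dragging the file onto the landing area may be the more practical route.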

Load in widget from code, display in panel
