jupyter_anywidget_tesseract_pdfjs

Tesseract.js / pdf.js anywidget for previewing PDFs and extracting text from PDFs, images, etc. in JupyterLab.

Inspired by and building on Simon Willison's (@simonw) OCR tool [about], this package lets you use tesseract.js in a Jupyter notebook environment via an anywidget wrapper.

Using the anywidget framework, we can load JavaScript and WASM models into a sidebar widget and use the widget for "side-processing" with the browser's own machinery.

For example, we can use tesseract.js for OCR/text extraction on images, and pdf.js for converting PDF documents to images, which can then be OCR'd using tesseract.js.

This reduces the number of Python dependencies that need to be installed on the host machine, albeit at the expense of loading resources into the browser.

I'm not much of a packaging expert, so some assets are likely to be loaded from a URI; ideally, everything would be bundled into the anywidget extension.

Installation

pip install jupyter_anywidget_tesseract_pdfjs

Usage

Import the jupyter_anywidget_tesseract_pdfjs package and launch a widget:

from jupyter_anywidget_tesseract_pdfjs import tesseract_panel

t = tesseract_panel()
#t = tesseract_panel("example panel title")
#t = tesseract_panel(None, "split-bottom")

By default, this loads the widget into a new panel using jupyterlab_sidecar.

You can then drag and drop an image file or PDF file onto the landing area, or load an image or path from a notebook code cell.

Load in widget from code, display in panel

| Filetype | Local file | Web URL |
| --- | --- | --- |
| Image | File drag / select; widget.set_datauri(?) | widget.url=?, widget.set_url(?), widget.set_datauri(?) |
| PDF | File drag / select | widget.pdf=?, widget.set_url(?) |
| Image Data URI | widget.datauri=? | N/A |
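The table lists `widget.datauri=?` as the route for loading an image supplied as a data URI. As a rough sketch, a data URI can be built from a local file using only the Python standard library. The helper name `to_data_uri` and the file name `scan.png` are hypothetical and not part of the package, and the commented-out assignment at the end assumes a widget created earlier as `t = tesseract_panel()`.

```python
import base64
import mimetypes
from pathlib import Path


def to_data_uri(path):
    """Encode a local file as a base64 data URI.

    Hypothetical helper for illustration; not part of the package.
    """
    path = Path(path)
    # Guess the MIME type from the file extension; fall back to a generic type.
    mime, _ = mimetypes.guess_type(path.name)
    if mime is None:
        mime = "application/octet-stream"
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


# Assumed usage with an existing widget `t` (file name is a placeholder):
# t.datauri = to_data_uri("scan.png")
```

Building the URI in Python keeps the file read on the kernel side, so the browser only receives the encoded string.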

Accessing extracted text

We can access extracted text via: t.pagedata

The results object takes the form:

{'typ': 'pdf',
 'pages': 3,
 'name': 'sample-3pp.pdf',
 'p3': 'elementum. Morbi in ipsum sit ...',
 'processed': 2,
 'p1': 'Created for testing ...'
}

The keys of the form pN are page numbers; the processed item keeps a count of pages that have been processed; the pages item is the total number of pages submitted for processing.
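Because the pN keys may arrive out of order, and processed may lag behind pages, it can help to tidy the results before using them. The following is a minimal sketch that works on a plain dict shaped like the results object above; the helper names `is_complete` and `pages_in_order` are hypothetical and not part of the package, and the sample text is illustrative only.

```python
import re


def is_complete(pagedata):
    """Return True when every submitted page has been processed."""
    total = pagedata.get("pages", 0)
    return total > 0 and pagedata.get("processed", 0) >= total


def pages_in_order(pagedata):
    """Return extracted page texts sorted by page number (keys 'p1', 'p2', ...)."""
    numbered = []
    for key, value in pagedata.items():
        # Only keys that are exactly 'p' followed by digits are page entries,
        # so 'pages' and 'processed' are skipped.
        match = re.fullmatch(r"p(\d+)", key)
        if match:
            numbered.append((int(match.group(1)), value))
    return [text for _, text in sorted(numbered)]


# Sample dict shaped like the results object (text is illustrative).
sample = {
    "typ": "pdf",
    "pages": 3,
    "name": "sample-3pp.pdf",
    "p3": "elementum. Morbi in ipsum sit ...",
    "processed": 2,
    "p1": "Created for testing ...",
}

print(is_complete(sample))
print(pages_in_order(sample))
```

With a live widget, the same helpers could presumably be applied to `t.pagedata` in a later cell, once processing has had time to finish.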

We can also review the extracted text for the last processed image: t.extracted

Review a history of files that have been processed: t.history

Examples

Example code follows; see also the notebooks in examples:

# Image at URL
image_url = "https://tesseract.projectnaptha.com/img/eng_bw.png"
t.set_datauri(image_url)

# New cell
# We also need to "manually" wait for processing to finish
# before trying to inspect the retrieved data
t.pagedata



# View history of OCR lookups
t.history
