Project description

jupyter_anywidget_tesseract_pdfjs

A tesseract.js / pdf.js anywidget for previewing PDFs and extracting text from PDFs, images, etc. in JupyterLab.

Inspired by and building on @simonw's (Simon Willison) OCR tool, this package lets you use tesseract.js in a Jupyter notebook environment or a VS Code notebook via an anywidget wrapper.

Using the anywidget framework, we can load JavaScript and WASM models into a sidebar widget and use the widget for "side-processing" using the browser / Electron machinery.

For example, we can use tesseract.js for OCR / text extraction on images, and pdf.js for converting PDF documents to images, which can then be OCR'd using tesseract.js.

This reduces the number of Python dependencies that need to be installed on the host machine, albeit at the expense of loading resources into the browser.

I'm not much of a packaging expert, so some assets are likely to be loaded from a URI; ideally, everything would be bundled into the anywidget extension.

Installation

pip install jupyter_anywidget_tesseract_pdfjs

Usage

Import the jupyter_anywidget_tesseract_pdfjs package and launch a widget:

from jupyter_anywidget_tesseract_pdfjs import tesseract_panel

t = tesseract_panel()
#t = tesseract_panel("example panel title")
#t = tesseract_panel(None, "split-bottom")

# We can also render the widget into the output
# of the initiating cell
#from jupyter_anywidget_tesseract_pdfjs import tesseract_inline
#t = tesseract_inline()

# Alternatively, create a "headless" version
# - does not display UI panel
# - BUT still needs to be able to attach widget to DOM
#from jupyter_anywidget_tesseract_pdfjs import tesseract_headless
#t = tesseract_headless()

This loads the widget by default into a new panel using jupyterlab_sidecar.

For use in VS Code, use either tesseract_inline() or tesseract_headless().

You can then drag and drop an image or PDF file onto the landing area, or load an image or path from a notebook code cell.

Load in widget from code, display in panel

Load in widget from code, display in panel, ocr passed image

Filetype	Local file	Web URL
Image	File drag / select; `widget.set_datauri(?)`	`widget.url=?`, `widget.set_url(?)`, `widget.set_datauri(?)`
PDF	File drag / select	`widget.pdf=?`, `widget.set_url(?)`
Image Data URI	`widget.datauri=?`	N/A
`matplotlib` axes object	`widget.set_datauri(ax)`	N/A
IPython `Image` displayed object	`widget.set_datauri(_)` in next run cell	N/A
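The table lists `widget.datauri=?` as the route for image data URIs. As a minimal sketch of how such a URI could be built from a local file, assuming the widget accepts a standard `data:<mime>;base64,<payload>` string: the helper name `file_to_datauri` is hypothetical and not part of the package.

```python
import base64
import mimetypes
from pathlib import Path


def file_to_datauri(path, default_mime="application/octet-stream"):
    """Encode a local file as a base64 data URI.

    Hypothetical helper for illustration; not part of the package.
    """
    # Guess the MIME type from the file extension (e.g. .png -> image/png)
    mime, _ = mimetypes.guess_type(str(path))
    # Read the raw bytes and base64-encode them as ASCII text
    payload = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime or default_mime};base64,{payload}"


# Usage in a notebook, assuming `t` is a widget created as shown earlier
# and that assigning to `t.datauri` triggers OCR as the table suggests:
# t.datauri = file_to_datauri("local_file.png")
```

For local image files, `widget.set_datauri(?)` with a file path is the documented shortcut, so a helper like this is only useful when you need the data URI string itself.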

Accessing extracted text

We can access extracted text via: t.pagedata

The results object takes the form:

{'typ': 'pdf',
 'pages': 3,
 'name': 'sample-3pp.pdf',
 'p3': 'elementum. Morbi in ipsum sit ...',
 'processed': 2,
 'p1': 'Created for testing ...'
}

The keys of the form pN are page numbers; the processed item keeps a count of pages that have been processed; the pages item is the total number of pages submitted for processing.
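Because pages may be processed out of order (in the sample above, p3 appears before p1 and p2 is still pending), it can help to sort the pN keys and to check the counters before reading the text. The following is a minimal sketch that works on a plain dictionary shaped like the sample results object; the helper names `pages_in_order` and `is_complete` are hypothetical, and it assumes the pages and processed keys are present, as they are for PDFs.

```python
import re


def pages_in_order(pagedata):
    """Return [(page_number, text), ...] sorted by page number."""
    pages = []
    for key, value in pagedata.items():
        # Match only keys of the form p1, p2, ...; skips "pages", "processed"
        match = re.fullmatch(r"p(\d+)", key)
        if match:
            pages.append((int(match.group(1)), value))
    return sorted(pages)


def is_complete(pagedata):
    """True once the processed count has reached the total page count."""
    total = pagedata.get("pages")
    return total is not None and pagedata.get("processed", 0) >= total


# A dictionary shaped like the sample results object shown above
sample = {
    "typ": "pdf",
    "pages": 3,
    "name": "sample-3pp.pdf",
    "p3": "elementum. Morbi in ipsum sit ...",
    "processed": 2,
    "p1": "Created for testing ...",
}

print(is_complete(sample))                      # False: 2 of 3 pages done
print([n for n, _ in pages_in_order(sample)])   # [1, 3]
```

In a notebook you would pass `t.pagedata` instead of `sample`, in a cell run after processing has had time to finish.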

We can also review the extracted text for the last processed image: t.extracted

Review a history of files that have been processed: t.history

Examples

See also the notebooks in examples.

Image at URL:

# Image at URL
image_url = "https://tesseract.projectnaptha.com/img/eng_bw.png"
t.set_datauri(image_url)

#New cell
# We also need to "manually" wait for processing to finish
# before trying to inspect the retrieved data
t.pagedata

# Image at URL
image_url = "https://tesseract.projectnaptha.com/img/eng_bw.png"
t.set_url(image_url)
# Also:
# t.set_url(image_url, True) or t.set_url(image_url, force=True)
# Alternatively: t.url = image_url

#New cell
# We also need to "manually" wait for processing to finish
# before trying to inspect the retrieved data
t.pagedata

Parse local image file:

# Local image
# Save a URL as a local file
import urllib.request
local_image = 'local_file.png'
urllib.request.urlretrieve(image_url, local_image)

t.set_datauri('') # Force a change in the URI
t.set_datauri(local_image)

# Alternatively, to force the repeated OCR:
# t.set_datauri(local_image, True)
# t.set_datauri(local_image, force=True)

#New cell
# We also need to "manually" wait for processing to finish
# before trying to inspect the retrieved data
t.pagedata

Parse online PDF from web URL:

# PDF at URL
pdf_url = "https://pdfobject.com/pdf/sample-3pp.pdf"
t.set_url(pdf_url)
## Alternatively:
# t.pdf = pdf_url

Parse IPython Image display object:

# Display a local image file as an IPython Image object
from IPython.display import Image
Image(local_image)

#Next run cell
t.set_datauri(_)

Parse matplotlib axes object:

# matplotlib axes object
import pandas as pd
df = pd.DataFrame({'length': [1.5, 0.5, 1.2, 0.9, 3],
                  'width': [0.7, 0.2, 0.15, 0.2, 1.1]},
                  index=['pig', 'rabbit', 'duck', 'chicken', 'horse'])
ax = df.plot(title="DataFrame Plot")

#New cell
t.set_datauri(ax)

View history of OCR lookups:

t.history

Release history

0.0.8 (this version): Dec 2, 2024
0.0.7: Dec 2, 2024
0.0.6: Aug 23, 2024
0.0.5: Aug 21, 2024
0.0.4: Aug 13, 2024
0.0.2: Aug 12, 2024
0.0.1: Aug 12, 2024

Download files

Source distribution: jupyter_anywidget_tesseract_pdfjs-0.0.8.tar.gz (110.1 kB)
Upload date: Dec 2, 2024
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.0.0 CPython/3.11.0

Algorithm	Hash digest
SHA256	`da4af6599f8881db4a53452dbfdd243e57d9193a3a6c1952ab8b6d754df38e6b`
MD5	`b090e640738f01bba271b12d1c780897`
BLAKE2b-256	`c702e9b80108557f83084c364f77bb613561ff605db394cde0481e064711a0eb`

Built distribution: jupyter_anywidget_tesseract_pdfjs-0.0.8-py2.py3-none-any.whl (113.1 kB)
Upload date: Dec 2, 2024
Tags: Python 2, Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.0.0 CPython/3.11.0

Algorithm	Hash digest
SHA256	`174a7db594614eab4c00eecd55339a4651f91e1aa5585fa60890304aec18f776`
MD5	`9f35ae75ad05488d16bae207fd12cc91`
BLAKE2b-256	`19c3c0ae0d756acbadbdf754b427d6e2024fcc1b0e0d9ec55597d77728d58109`