jupyter_anywidget_tesseract_pdfjs
Tesseract.js / pdf.js anywidget for previewing PDFs and extracting text from PDFs, images, etc. in JupyterLab
Inspired by, and building on, @simonw's (Simon Willison) OCR tool [about], this package lets you use tesseract.js in a Jupyter notebook environment via an anywidget wrapper.
Using the anywidget framework, we can essentially load JavaScript and WASM models into a sidebar widget and use the widget for "side-processing" using the browser machinery.
For example, we can use tesseract.js for OCR/text extraction on images, and pdf.js for converting PDF documents to images which can then be OCR'd using tesseract.js.
This reduces the number of Python dependencies that need to be installed on the host machine, albeit at the expense of loading resources into the browser.
I'm not much of a packaging expert, so some assets are likely to be loaded from a URI; ideally, everything would be bundled into the anywidget extension.
Related blog post: Jupyter tesseract/pdfjs anywidget — sideloaded OCR for Python environments
Installation
pip install jupyter_anywidget_tesseract_pdfjs
Usage
Import the jupyter_anywidget_tesseract_pdfjs package and launch a widget:
from jupyter_anywidget_tesseract_pdfjs import tesseract_panel
t = tesseract_panel()
#t = tesseract_panel("example panel title")
#t = tesseract_panel(None, "split-bottom")
# We can also render the widget into the output
# of the initiating cell
#from jupyter_anywidget_tesseract_pdfjs import tesseract_inline
#t = tesseract_inline()
# Alternatively, create a "headless" version
# - does not display UI panel
# - BUT still needs to be able to attach widget to DOM
#from jupyter_anywidget_tesseract_pdfjs import tesseract_headless
#t = tesseract_headless()
By default, this loads the widget into a new panel using jupyterlab_sidecar.
You can then drag and drop an image or PDF file onto the landing area, or load an image, URL or file path from a notebook code cell.
Filetype | Local file | Web URL |
---|---|---|
Image | File drag / select; widget.set_datauri(?) | widget.url=? , widget.set_url(?) , widget.set_datauri(?) |
PDF | File drag / select | widget.pdf=? , widget.set_url(?) |
Image Data URI | widget.datauri=? | N/A |
matplotlib axes object | widget.set_datauri(ax) | N/A |
IPython Image displayed object | widget.set_datauri(_) in next run cell | N/A |
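As an illustration of the "Image Data URI" row, the following is a minimal sketch of how a local image file might be encoded as a data URI using only the Python standard library. The helper name file_to_datauri is hypothetical and not part of the package, and the exact data URI form the widget expects is an assumption here; the package's own set_datauri() method already accepts a local file path directly, as the table shows.

```python
import base64
import mimetypes
from pathlib import Path


def file_to_datauri(path):
    """Encode a local file as a base64 data URI string.

    Hypothetical helper for illustration; not part of the package.
    """
    path = Path(path)
    # Guess the MIME type from the file extension, with a generic fallback
    mime, _ = mimetypes.guess_type(path.name)
    if mime is None:
        mime = "application/octet-stream"
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


# Assumed usage with a widget instance `t` created as above:
# t.datauri = file_to_datauri("local_file.png")
```

In most cases t.set_datauri(local_image) is the simpler route; building the data URI by hand is only useful if you want to inspect or reuse the encoded string.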
Accessing extracted text
We can access extracted text via: t.pagedata
The results object takes the form:
{'typ': 'pdf',
'pages': 3,
'name': 'sample-3pp.pdf',
'p3': 'elementum. Morbi in ipsum sit ...',
'processed': 2,
'p1': 'Created for testing ...'
}
The keys of the form pN hold the text extracted from page N; the processed item keeps a count of pages that have been processed; the pages item is the total number of pages submitted for processing.
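Because pages can finish out of order (in the sample above, p3 appears before p1 and only 2 of 3 pages have been processed), it can help to sort the page keys and check for completion before using the text. The following is a minimal sketch that works on a plain dictionary of the form shown above; the helper names pages_in_order and is_complete are hypothetical and not part of the package, and it assumes that the pages and processed keys are present (which may not hold for single images).

```python
def pages_in_order(pagedata):
    """Return the extracted page texts sorted by page number."""
    # Keep only keys of the form pN ("pages" and "processed" are skipped
    # because what follows the leading "p" is not a number)
    page_keys = [k for k in pagedata if k.startswith("p") and k[1:].isdigit()]
    page_keys.sort(key=lambda k: int(k[1:]))
    return [pagedata[k] for k in page_keys]


def is_complete(pagedata):
    """Return True once every submitted page has been processed."""
    # Assumes "pages" and "processed" are present, as in the PDF example
    return "pages" in pagedata and pagedata.get("processed", 0) >= pagedata["pages"]


sample = {
    "typ": "pdf",
    "pages": 3,
    "name": "sample-3pp.pdf",
    "p3": "elementum. Morbi in ipsum sit ...",
    "processed": 2,
    "p1": "Created for testing ...",
}

print(is_complete(sample))          # False: 2 of 3 pages processed
print(len(pages_in_order(sample)))  # 2
```

With a live widget, these would be called in a later cell as pages_in_order(t.pagedata) and is_complete(t.pagedata), once the browser has had time to do the OCR.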
We can also review the extracted text for the last processed image: t.extracted
Review a history of files that have been processed: t.history
Examples
See also the notebooks in examples.
Image at URL:
# Image at URL
image_url = "https://tesseract.projectnaptha.com/img/eng_bw.png"
t.set_datauri(image_url)
#New cell
# We also need to "manually" wait for processing to finish
# before trying to inspect the retrieved data
t.pagedata
# Image at URL
image_url = "https://tesseract.projectnaptha.com/img/eng_bw.png"
t.set_url(image_url)
# Also:
# t.set_url(image_url, True) or t.set_url(image_url, force=True)
# Alternatively: t.url = image_url
#New cell
# We also need to "manually" wait for processing to finish
# before trying to inspect the retrieved data
t.pagedata
Parse local image file:
# Local image
# Save a URL as a local file
import urllib.request
local_image = 'local_file.png'
urllib.request.urlretrieve(image_url, local_image)
t.set_datauri('') # Force a change in the URI
t.set_datauri(local_image)
# Alternatively, to force the repeated OCR:
# t.set_datauri(local_image, True)
# t.set_datauri(local_image, force=True)
#New cell
# We also need to "manually" wait for processing to finish
# before trying to inspect the retrieved data
t.pagedata
Parse online PDF from web URL:
# PDF at URL
pdf_url = "https://pdfobject.com/pdf/sample-3pp.pdf"
t.set_url(pdf_url)
## Alternatively:
# t.pdf = pdf_url
Parse IPython Image display object:
# IPython Image display object
from IPython.display import Image
Image(local_image)
#Next run cell
t.set_datauri(_)
Parse matplotlib axes object:
# matplotlib axes object
import pandas as pd
df = pd.DataFrame({'length': [1.5, 0.5, 1.2, 0.9, 3],
'width': [0.7, 0.2, 0.15, 0.2, 1.1]},
index=['pig', 'rabbit', 'duck', 'chicken', 'horse'])
ax = df.plot(title="DataFrame Plot")
#New cell
t.set_datauri(ax)
View history of OCR lookups:
t.history