Skip to main content

RemarkableOCR is a simple ocr tool with improved data, analytics, and rendering tools.

Project description

RemarkableOCR is a simple ocr tool with improved data, analytics, and rendering tools.

RemarkableOCR creates Image-to-Text positional data and analytics for natural language processing on images. RemarkableOCR is based on the Google pytesseract package with additional lightweight processing to make its more user-friendly and expansive data, plus provides one-line simple tools for:

  • especially books, newspapers, screenshots
  • images to debug
  • highlights and in-doc search
  • and redaction.

five-minute demo: data, debug

demo.data.png

from remarkable import RemarkableOCR, colors
from PIL import Image

# Operation Moonglow; annotated by David Bernat
image_filename = "_db/docs/moonglow.jpg"
im = Image.open(image_filename)

##################################################################
#  using data
##################################################################
data = RemarkableOCR.ocr(image_filename)

# we can debug using an image
RemarkableOCR.create_debug_image(im, data).show()

# hey. what are all the c words?
cwords = [d for d in data if "sea" in d["text"].lower()]
cwords = RemarkableOCR.create_debug_image(im, cwords).show()

# nevermind; apply filters because this is a book page
# removes annotations on the edges; which are often numerous
data = RemarkableOCR.filter_assumption_blocks_of_text(data)
margins = [d for d in data if d["is_first_in_line"] or d["is_last_in_line"]]
RemarkableOCR.create_debug_image(im, margins).show()

# transforms data to a space-separated string; adding new-lines at paragraph breaks.
readable = RemarkableOCR.readable_lines(data)

five-minute demo: highlighting

demo.highlighting.jpg

from remarkable import RemarkableOCR, colors
from PIL import Image

# Operation Moonglow; annotated by David Bernat
image_filename = "_db/docs/moonglow.jpg"
im = Image.open(image_filename)

##################################################################
#  using data
##################################################################
data = RemarkableOCR.ocr(image_filename)
data = RemarkableOCR.filter_assumption_blocks_of_text(data)

# to create a highlight bar based on token pixel sizes
# if None will calculate on max/min height of the sequence
base = RemarkableOCR.document_statistics(data)
wm, ws = base["char"]["wm"], base["char"]["ws"]
height_px = wm + 6*ws

# simple search for phrases (lowercase, punctuation removed) returns one result for each four
phrases = ["the Space Age", "US Information Agency", "US State Department", "Neil Armstrong"]
found = RemarkableOCR.find_statements(phrases, data)

# we can highlight these using custom highlights
as_list = list(found.values())  # the start/end only
configs = [dict(highlight_color=colors.starlight),
           dict(highlight_color=colors.green),
           dict(highlight_color=colors.starlight),
           dict(highlight_color=colors.orange, highlight_alpha=0.40),
]

highlight = RemarkableOCR.highlight_statements(im, as_list, data, configs, height_px=height_px)
highlight.show()

# we can redact our secret activities shh :)
phrases = ["I spent the summer reading memos, reports, letters"]
found = RemarkableOCR.find_statements(phrases, data)
as_list = list(found.values())
config = dict(highlight_color=colors.black, highlight_alpha=1.0)
RemarkableOCR.highlight_statements(highlight, as_list, data, config, height_px=height_px).show()

what is all this data?

key value ours description
text US the token text, whitespace removed
conf 0.96541046 confidence score 0 to 1; 0.40 and up is reliable
page_num 1 page number will always be 1 using single images
block_num 13 a page consists of blocks top to bottom, 1 at top
par_num 1 a block consists of paragraphs top to bottom, 1 at top of block
line_num 3 a paragraph consists of lines top to bottom, 1 at top of paragraph
word_num 6 a line consists of words left to right, 1 at the far left
absolute_line_number 26 * line number relative to page as a whole
is_first_in_line False * is the token the left-most in the line?
is_last_in_line False * is the token the right-most in the line?
is_punct False * is every character a punctuation character?
is_alnum True * is every character alphanumeric?
left 1160.0 left-edge pixel value of token bounding box
right 1238.0 * right-edge pixel value of token bounding box
top 2590.0 top-edge pixel value of token bounding box
bottom 2638.0 * bottom-edge pixel value of token bounding box
width 78.0 width pixel value of token bounding box, equal to right minus left
height 48.0 height pixel value of token bounding box; equal to bottom minus top
block_left 116.0 * left-edge of block of token; useful for fixed-width cross-line highlighting
block_right 2195.0 * right-edge of block of token; useful for fixed-width cross-line highlighting
level 5 describes granularity of the token, and will always be 5, indicating a token

RemarkableOCR methods to notice

RemarkableOCR.ocr(filename, confidence_threshold=0.50)  # The core RemarkableOCR functionality returns a dictionary of data about each token detected in the image.
RemarkableOCR.filter_assumption_blocks_of_text(data, confidence_threshold=0.40) # a filter for identifying one solid block of text; like a book page or newspaper without ads in between
RemarkableOCR.readable_lines(data)  # Convenience function to string sequential words to each line; with new lines at breaks; i.e. readable text
RemarkableOCR.document_statistics(data)  # Calculate basic statistics of the document itself; i.e., statistics on the pixel size of the font
RemarkableOCR.create_debug_image(im, data)  # Draws a black bounding box around each token to visually confirm every token was identified correctly.
RemarkableOCR.find_statements(statements, data)  # Uses simple regex to identify exact string matches in sequences of tokens, after string normalization
RemarkableOCR.highlight_statements(im, found, data, config=None, height_px=None)  # Convenience function for highlighting multiple sequences found=Array<[_, start_i, end_i]> using custom config.

Licensing & Stuff

Hey. I took time to build this. There are a lot of pain points that I solved for you, and a lot of afternoons staring outside the coffeeshop window at the sunshine. Not years, because I am a very skilled, competent software engineer. But enough, okay? Use this package. Ask for improvements. Integrate this into your products. Complain when it breaks. Reference the package by company and name. Starlight Remarkable and RemarkableOCR. Email us to let us know!




Starlight LLC
Copyright 2024
All Rights Reserved
GNU GENERAL PUBLIC LICENSE

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

remarkableocr-0.2.0.tar.gz (16.2 kB view details)

Uploaded Source

Built Distribution

RemarkableOCR-0.2.0-py3-none-any.whl (16.2 kB view details)

Uploaded Python 3

File details

Details for the file remarkableocr-0.2.0.tar.gz.

File metadata

  • Download URL: remarkableocr-0.2.0.tar.gz
  • Upload date:
  • Size: 16.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.10.12

File hashes

Hashes for remarkableocr-0.2.0.tar.gz
Algorithm Hash digest
SHA256 24ec45a5e8f0a9add513edb2ebb1c2d66c0cb62bc6472cbaf59b338f1c82c930
MD5 dd09e689c7026fdc2e28cd41fcbddb4e
BLAKE2b-256 25cfb4fd6118c4936cd77c3498423effb23531a3c5c9762e2f3f4df7c04a9ce3

See more details on using hashes here.

File details

Details for the file RemarkableOCR-0.2.0-py3-none-any.whl.

File metadata

File hashes

Hashes for RemarkableOCR-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 b8dee17fd2edd82b76d9758f086c86f39f05ba6cbb72e768d22e8b54a60439b8
MD5 150cba1b0c2f00cc96bb8fac316c9dad
BLAKE2b-256 f19926a7356626517a3eac8237e9a58f9e41c9b41b0e9e1ae49c7f7cdee9048a

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page