WordMaze

Words and textboxes made amazing.

About

WordMaze is a standardized format for text extracted from documents.

When designing OCR engines, developers have to decide how to give their clients the list of extracted textboxes, including each textbox's position on the page, the text it contains and the confidence associated with that extraction.

Many patterns arise in the wild, for instance:

(x1, x2, y1, y2, text, confidence) # a flat tuple
((x1, y1), (x2, y2), text, confidence) # nested tuples
{'x1': x1, 'x2': x2, 'y1': y1, 'y2': y2, 'text': text, 'confidence': confidence} # a dict
{'x': x1, 'y': y1, 'w': width, 'h': height, 'text': text, 'conf': confidence} # another dict
... # and many others

With WordMaze, textboxes are defined using a unified interface:

from wordmaze import TextBox

textbox = TextBox(
	x1=x1,
	x2=x2,
	y1=y1,
	y2=y2,
	text=text,
	confidence=confidence
)
# or
textbox = TextBox(
	x1=x,
	width=w,
	y1=y,
	height=h,
	text=text,
	confidence=conf
)
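
To make the reduction concrete, here is a minimal sketch of adapters that turn three of the patterns listed above into the keyword arguments used in the two constructor calls. The helper names (from_flat_tuple, from_nested_tuple, from_xywh_dict) are hypothetical and are not part of WordMaze; the sketch deliberately does not import the library.

```python
from typing import Any, Dict, Tuple

Fields = Dict[str, Any]


def from_flat_tuple(record: Tuple) -> Fields:
    """(x1, x2, y1, y2, text, confidence) -> keyword arguments."""
    x1, x2, y1, y2, text, confidence = record
    return dict(x1=x1, x2=x2, y1=y1, y2=y2, text=text, confidence=confidence)


def from_nested_tuple(record: Tuple) -> Fields:
    """((x1, y1), (x2, y2), text, confidence) -> keyword arguments."""
    (x1, y1), (x2, y2), text, confidence = record
    return dict(x1=x1, x2=x2, y1=y1, y2=y2, text=text, confidence=confidence)


def from_xywh_dict(record: Dict[str, Any]) -> Fields:
    """{'x', 'y', 'w', 'h', 'text', 'conf'} -> keyword arguments."""
    return dict(
        x1=record['x'],
        width=record['w'],
        y1=record['y'],
        height=record['h'],
        text=record['text'],
        confidence=record['conf'],
    )


flat = from_flat_tuple((3, 14, 15, 92, 'Dr. White.', 0.85))
nested = from_nested_tuple(((3, 15), (14, 92), 'Dr. White.', 0.85))
xywh = from_xywh_dict(
    {'x': 3, 'y': 15, 'w': 11, 'h': 77, 'text': 'Dr. White.', 'conf': 0.85}
)
print(flat == nested)  # True
print(xywh['width'], xywh['height'])  # 11 77
```

Each resulting dict has the same keys as one of the two constructor calls above, so it should be usable as TextBox(**fields).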

Usage

Perhaps the best example of usage is pdfmap.PDFMaze, the first application of WordMaze in a public repository.

The exact expected behaviour of every piece of code in WordMaze can be checked in the tests folder.

There are three main groups of objects defined in WordMaze:

Textboxes

`Box`es

The first and most fundamental (data)class is the Box, which contains only positional information of a textbox inside a document's page:

from wordmaze import Box

box1 = Box(x1=3, x2=14, y1=15, y2=92) # using coordinates
box2 = Box(x1=3, width=11, y1=15, height=77) # using coordinates and sizes
box3 = Box(x1=3, x2=14, y2=92, height=77) # mixing everything

We enforce x1<=x2 and y1<=y2 (if x1>x2, for instance, their values are automatically swapped upon initialization). Whether (y1, y2) means (top, bottom) or (bottom, top) depends on the context.
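
The arithmetic behind these constructor variants can be sketched along one axis. The function below is a hypothetical stand-in, not WordMaze's actual implementation: it assumes exactly two of the three values (start, end, size) are given, and it swaps the result so that start <= end.

```python
from typing import Optional, Tuple


def resolve_interval(
    start: Optional[float] = None,
    end: Optional[float] = None,
    size: Optional[float] = None,
) -> Tuple[float, float]:
    """Return (start, end) given exactly two of start, end and size."""
    given = [value is not None for value in (start, end, size)]
    if sum(given) != 2:
        raise ValueError('exactly two of start, end and size are required')
    if start is None:
        start = end - size
    elif end is None:
        end = start + size
    # Swap if needed so that start <= end always holds.
    return (start, end) if start <= end else (end, start)


print(resolve_interval(start=3, end=14))   # (3, 14)
print(resolve_interval(start=3, size=11))  # (3, 14)
print(resolve_interval(end=92, size=77))   # (15, 92)
print(resolve_interval(start=14, end=3))   # (3, 14)
```

Applied once per axis, this reproduces the coordinates of box1, box2 and box3 above: x spans 3 to 14 and y spans 15 to 92 in all three cases.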

Boxes have some interesting attributes to facilitate further calculation using them:

from wordmaze import Box

box = Box(x1=1, x2=3, y1=10, y2=22)
# coordinates:
print(box.x1) # 1
print(box.x2) # 3
print(box.y1) # 10
print(box.y2) # 22
# sizes:
print(box.height) # 12 
print(box.width) # 2
# midpoints:
print(box.xmid) # 2
print(box.ymid) # 16

`TextBox`es

To include textual information in a textbox, use a TextBox:

from wordmaze import TextBox

textbox = TextBox(
	# Box arguments:
	x1=3,
	x2=14,
	y1=15,
	height=77,
	# textual content:
	text='Dr. White.',
	# confidence with which this text was extracted:
	confidence=0.85 # 85% confidence
)

Note that TextBoxes inherit from Boxes, so you can inspect .x1, .width and so on as shown previously. Moreover, you have two more properties:

# textbox from the previous example
print(textbox.text) # Dr. White.
print(textbox.confidence) # 0.85

`PageTextBox`es

If you also wish to include the page number from which your textbox was extracted, you can use a PageTextBox:

from wordmaze import PageTextBox

textbox = PageTextBox(
	# TextBox arguments:
	x1=2,
	x2=10,
	y1=5,
	height=20,
	text='Sichermann and Sichelero are the same person!',
	confidence=0.6,
	# page info:
	page=3 # this textbox was extracted from the 4th page of the document
)
print(textbox.page) # 3

Note that page counting starts from 0 as is common in Python, so that page #3 is the 4th page of the document.

Pages

The basics

Pages are a representation of a document's page. They contain information regarding their size, their coordinate system's origin and their textboxes. For instance:

from wordmaze import Page, Shape, Origin

page = Page(
	shape=Shape(height=210, width=297), # landscape A4 page size in mm
	origin=Origin.TOP_LEFT
)
print(page.shape.height) # 210
print(page.shape.width) # 297
print(page.origin) # Origin.TOP_LEFT

A Page is a MutableSequence of TextBoxes:

page = Page(
	shape=Shape(height=210, width=297), # landscape A4 page size in mm
	origin=Origin.TOP_LEFT,
	entries=[ # define textboxes at initialization
		TextBox(...),
		TextBox(...),
		...
	]
)

page.append(TextBox(...)) # list-like append

for textbox in page: # iteration
	assert isinstance(textbox, TextBox)

print(page[3]) # 4th textbox
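
For readers unfamiliar with collections.abc.MutableSequence, the sketch below shows what that interface gives a container. MiniPage is a hypothetical stand-in, not the library's Page: it implements only the five required methods, and the standard library supplies append, extend, reversed iteration and membership tests on top of them.

```python
from collections.abc import MutableSequence


class MiniPage(MutableSequence):
    """Hypothetical list-like container illustrating MutableSequence."""

    def __init__(self, entries=()):
        self._entries = list(entries)

    def __getitem__(self, index):
        return self._entries[index]

    def __setitem__(self, index, value):
        self._entries[index] = value

    def __delitem__(self, index):
        del self._entries[index]

    def __len__(self):
        return len(self._entries)

    def insert(self, index, value):
        self._entries.insert(index, value)


page = MiniPage(['first', 'second'])
page.append('third')      # mixin method, implemented via insert()
page.extend(['fourth'])   # mixin method, implemented via append()
del page[0]               # uses __delitem__

print(len(page))             # 3
print(page[2])               # fourth
print('second' in page)      # True
print(list(reversed(page)))  # ['fourth', 'third', 'second']
```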

Different origins

There are two Origins your page may have:

Origin.TOP_LEFT: y==0 means top, y==page.shape.height means bottom;
Origin.BOTTOM_LEFT: y==0 means bottom, y==page.shape.height means top;

If one textbox provider returned textboxes in Origin.BOTTOM_LEFT coordinates, but you'd like to have them in Origin.TOP_LEFT coordinates, you can use Page.rebase as follows:

bad_page = Page(
	shape=Shape(width=10, height=10),
	origin=Origin.BOTTOM_LEFT,
	entries=[
		TextBox(
			x1=2,
			x2=3,
			y1=7,
			y2=8,
			text='Lofi defi',
			confidence=0.99
		)
	]
)

nice_page = bad_page.rebase(Origin.TOP_LEFT)
assert nice_page.shape == bad_page.shape # rebasing preserves page shape
print(nice_page[0].y1, nice_page[0].y2) # 2 3
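
The numbers printed above follow from measuring each y coordinate from the opposite edge of the page. A minimal sketch of that arithmetic, using a hypothetical helper flip_vertical that is not part of WordMaze:

```python
def flip_vertical(y1, y2, page_height):
    """Convert a (y1, y2) pair between TOP_LEFT and BOTTOM_LEFT origins.

    Each coordinate becomes its distance from the opposite edge. Because
    that reverses the ordering, the old y2 yields the new y1 and vice
    versa, which keeps y1 <= y2.
    """
    return page_height - y2, page_height - y1


print(flip_vertical(7, 8, page_height=10))  # (2, 3)
# Flipping twice returns the original coordinates.
print(flip_vertical(*flip_vertical(7, 8, page_height=10), page_height=10))  # (7, 8)
```

The x coordinates are untouched because both origins sit on the left edge.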

Transforming and filtering `TextBox`es

You can easily modify and filter out TextBoxes contained in a Page using Page.map and Page.filter, which behave like map and filter where the iterable is fixed and equal to the page's textboxes:

page = Page(...)

def pad(textbox: TextBox, horizontal, vertical) -> TextBox:
	return TextBox(
		x1=textbox.x1 - horizontal,
		x2=textbox.x2 + horizontal,
		y1=textbox.y1 - vertical,
		y2=textbox.y2 + vertical,
		text=textbox.text,
		confidence=textbox.confidence
	)

# get a new page with textboxes padded by 3 to the left and to the right
# and by 5 to the top and to the bottom
padded_page = page.map(lambda textbox: pad(textbox, horizontal=3, vertical=5))

# filters out textboxes with low confidence
good_page = padded_page.filter(lambda textbox: textbox.confidence >= 0.25)

Page.map and Page.filter also accept keyword arguments. Each keyword names a textbox property and takes a function that is applied to the value of that property. The previous padding and filtering can be equivalently written as:

# get a new page with textboxes padded by 3 to the left and to the right
# and by 5 to the top and to the bottom
padded_page = page.map(
	x1=lambda x1: x1-3,
	x2=lambda x2: x2+3,
	y1=lambda y1: y1-5,
	y2=lambda y2: y2+5,
)

# filters out textboxes with low confidence
good_page = padded_page.filter(confidence=lambda conf: conf >= 0.25)
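
One way such keyword-based mapping and filtering could work is sketched below on a plain list of dataclass instances. MiniTextBox, map_fields and filter_fields are hypothetical names, and this is not WordMaze's implementation. In particular, combining several keyword predicates with a logical AND is an assumption of this sketch.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class MiniTextBox:
    """Hypothetical stand-in for a textbox."""
    x1: float
    x2: float
    y1: float
    y2: float
    text: str
    confidence: float


def map_fields(boxes, **transforms):
    """Apply each function to the field named by its keyword."""
    return [
        replace(box, **{
            name: fn(getattr(box, name)) for name, fn in transforms.items()
        })
        for box in boxes
    ]


def filter_fields(boxes, **predicates):
    """Keep boxes for which every keyword predicate holds (assumed AND)."""
    return [
        box for box in boxes
        if all(pred(getattr(box, name)) for name, pred in predicates.items())
    ]


boxes = [
    MiniTextBox(10, 20, 10, 20, 'clear', 0.9),
    MiniTextBox(30, 40, 10, 20, 'blurry', 0.1),
]
padded = map_fields(boxes, x1=lambda x1: x1 - 3, x2=lambda x2: x2 + 3)
good = filter_fields(padded, confidence=lambda conf: conf >= 0.25)

print(padded[0].x1, padded[0].x2)    # 7 23
print([box.text for box in good])    # ['clear']
```

Fields that receive no keyword are copied unchanged, which is what makes the keyword form shorter than writing a full function such as pad.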

`tuple`s and `dict`s

You can also convert a page's textboxes to tuples or dicts with Page.tuples and Page.dicts:

page = Page(...)
for tpl in page.tuples():
	# prints a tuple in the form
	# (x1, x2, y1, y2, text, confidence)
	print(tpl)

for dct in page.dicts():
	# prints a dict in the form
	# {'x1': x1, 'x2': x2, 'y1': y1, 'y2': y2, 'text': text, 'confidence': confidence}
	print(dct)
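
Dicts of this shape are convenient to hand to other tools. As a sketch, the snippet below writes a hand-made list of such dicts to CSV with the standard library. The rows are sample data standing in for the output of page.dicts(), so the example runs without WordMaze installed.

```python
import csv
import io

# Sample rows in the documented dict form (stand-in for page.dicts()).
rows = [
    {'x1': 2, 'x2': 3, 'y1': 7, 'y2': 8, 'text': 'Lofi defi', 'confidence': 0.99},
    {'x1': 4, 'x2': 9, 'y1': 1, 'y2': 2, 'text': 'Dr. White.', 'confidence': 0.85},
]
fieldnames = ['x1', 'x2', 'y1', 'y2', 'text', 'confidence']

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```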

`WordMaze`s

The top-level class in WordMaze is, of course, WordMaze. WordMazes are simply sequences of Pages:

from wordmaze import WordMaze

wm = WordMaze([
	Page(...),
	Page(...),
	...
])

for page in wm: # iterating
	print(page.shape)

first_page = wm[0] # indexing

WordMaze objects also provide WordMaze.map and WordMaze.filter, which work the same way as Page.map and Page.filter.

If you wish to access the shapes of a WordMaze's pages, there is the property WordMaze.shapes, which is a tuple satisfying wm.shapes[N] == wm[N].shape.

Additionally, you can iterate over WordMaze's textboxes in two ways:

wm = WordMaze(...)

# 1
for page in wm:
	for textbox in page:
		print(textbox)

# 2
for textbox in wm.textboxes():
	print(textbox)

The main difference between #1 and #2 is that the textboxes in #1 are instances of TextBox, whereas the ones in #2 are PageTextBoxes including their containing page index.
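
The relationship between the two loops can be sketched with plain lists. The generator iter_with_page below is a hypothetical stand-in, not the library's textboxes() method: it flattens nested pages while keeping track of the 0-based index of the page each item came from.

```python
def iter_with_page(pages):
    """Yield (page_number, textbox) pairs, flattening pages in order."""
    for page_number, page in enumerate(pages):
        for textbox in page:
            yield page_number, textbox


# Three pages of plain strings; the second page is empty.
pages = [['title', 'subtitle'], [], ['footnote']]
for page_number, textbox in iter_with_page(pages):
    print(page_number, textbox)
# 0 title
# 0 subtitle
# 2 footnote
```

Empty pages contribute no items, but they still advance the page counter, so page numbers stay aligned with the document.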

WordMaze objects also have WordMaze.tuples and WordMaze.dicts, which behave just like their Page counterparts except that they also return the page number of each textbox:

wm = WordMaze(...)
for tpl in wm.tuples():
	# prints a tuple in the form
	# (x1, x2, y1, y2, text, confidence, page_number)
	print(tpl)

for dct in wm.dicts():
	# prints a dict in the form
	# {'x1': x1, 'x2': x2, 'y1': y1, 'y2': y2, 'text': text, 'confidence': confidence, 'page': page_number}
	print(dct)
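
Because every record carries its page number, the flat output can be regrouped later. The sketch below groups texts by page with collections.defaultdict. The records list is hand-made sample data in the documented dict form, standing in for the output of wm.dicts().

```python
from collections import defaultdict

# Sample records in the documented form (stand-in for wm.dicts()).
records = [
    {'x1': 2, 'x2': 3, 'y1': 7, 'y2': 8, 'text': 'Lofi defi',
     'confidence': 0.99, 'page': 0},
    {'x1': 4, 'x2': 9, 'y1': 1, 'y2': 2, 'text': 'Dr. White.',
     'confidence': 0.85, 'page': 0},
    {'x1': 1, 'x2': 6, 'y1': 3, 'y2': 4, 'text': 'Appendix',
     'confidence': 0.70, 'page': 2},
]

text_by_page = defaultdict(list)
for record in records:
    text_by_page[record['page']].append(record['text'])

print(dict(text_by_page))
# {0: ['Lofi defi', 'Dr. White.'], 2: ['Appendix']}
```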

Installing

Install WordMaze from PyPI:

pip install wordmaze

Projects using WordMaze

elint-tech/pdfmap: easily extract textboxes from PDF files.


Release history

0.3.6: Jun 4, 2021 (this version)
0.3.5: May 22, 2021
0.3.4: May 21, 2021
0.3.3: May 20, 2021
0.3.2: May 7, 2021
0.3.1rc1: Apr 27, 2021 (pre-release)
0.2.2: Apr 22, 2021
0.2.1: Apr 21, 2021
0.1.0: Apr 21, 2021

