Multiprocessing OCR with Tesseract

pip install tesseractmultiprocessing

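Worth using if you:

  1. have many different image files to process

  2. are using numpy (the images are passed as numpy arrays)

Timings from the example below (400 images, cpus=5), in seconds:

Multi: 23.9910116

One CPU: 100.61128 (pytesseract)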
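
To put the two timings in perspective, the speedup and the per-CPU efficiency can be worked out directly. This is only a small sketch using the numbers reported above and the cpus=5 setting from the package example; the variable names are illustrative and not part of the package.

```python
# Timings reported above, in seconds (differences of perf_counter values).
multi_seconds = 23.9910116   # tesser2df with cpus=5
single_seconds = 100.61128   # pytesseract in a single process
cpus = 5                     # value passed to tesser2df in the example

speedup = single_seconds / multi_seconds
efficiency = speedup / cpus  # 1.0 would mean perfect linear scaling

print(f"Speedup: {speedup:.2f}x")
print(f"Efficiency per CPU: {efficiency:.0%}")
```

On this particular run that works out to roughly a 4.2x speedup on 5 CPUs. Results on other machines or image sets may differ.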

from tesseractmultiprocessing import tesser2df
from a_cv_imwrite_imread_plus import open_image_in_cv
from time import perf_counter

picslinks = [
    r"https://github.com/hansalemaos/screenshots/raw/main/pandsnesteddicthtml.png",
    r"https://github.com/hansalemaos/screenshots/raw/main/cv2_putTrueTypeText_000000.png",
    r"https://github.com/hansalemaos/screenshots/raw/main/cv2_putTrueTypeText_000008.png",
    r"https://github.com/hansalemaos/screenshots/raw/main/cv2_putTrueTypeText_000017.png",
]

# Load each image once, then repeat the set 100 times (400 images in total).
picsunique = [open_image_in_cv(x) for x in picslinks]
pics = []
for _ in range(100):
    pics.extend(picsunique)

# Time the multiprocessing run.
start = perf_counter()
output = tesser2df(
    pics,
    language="eng",
    pandas_kwargs={"on_bad_lines": "warn"},
    tesser_args=(),
    cpus=5,
    tesser_path=r"C:\Program Files\Tesseract-OCR\tesseract.exe",
)
print(f"Multi: {perf_counter()-start}")





################################################################################



import pytesseract

pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def st():
    # Single-process baseline: OCR every image sequentially with pytesseract.
    alla = []
    for p in pics:
        alla.append(pytesseract.image_to_data(p))
    return alla


# Time the single-process run on the same images.
start = perf_counter()
output2 = st()
print(f"One CPU: {perf_counter()-start}")





# Multi: 23.9910116
# One CPU: 100.61128

# output[0]
# Out[4]:
# (    level  page_num  block_num  par_num  ...  start_x  start_y  end_x  end_y
#  0       1         1          0        0  ...        0        0   1465    654
#  1       2         1          1        0  ...      322       64    327    540
#  2       3         1          1        1  ...      322       64    327    540
#  3       4         1          1        1  ...      322       64    327    540
#  4       5         1          1        1  ...      322       64    327    540
#  ..    ...       ...        ...      ...  ...      ...      ...    ...    ...
#  60      5         1         11        1  ...       14      633   1448    644
#  61      2         1         12        0  ...     1445       15   1450    639
#  62      3         1         12        1  ...     1445       15   1450    639
#  63      4         1         12        1  ...     1445       15   1450    639
#  64      5         1         12        1  ...     1445       15   1450    639
#
#  [65 rows x 19 columns],
#  array([[[255, 255, 255],
#          [255, 255, 255],
#          [255, 255, 255],
#          ...,
#          [255, 255, 255],
#          [255, 255, 255],
#          [255, 255, 255]],
#
#         [[255, 255, 255],
#          [255, 255, 255],
#          [255, 255, 255],
#          ...,
#          [255, 255, 255],
#          [255, 255, 255],
#          [255, 255, 255]],
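
The single-process baseline above handles one image at a time. As a rough illustration of the general idea of spreading OCR calls over several workers, here is a minimal sketch that uses only the standard library. It is not how tesseractmultiprocessing is implemented. `run_in_parallel` and `fake_ocr` are hypothetical names, and `fake_ocr` is a stand-in for a real call such as `pytesseract.image_to_data`. Threads are used on the assumption that the heavy work happens in the external tesseract executable, which pytesseract launches as a subprocess.

```python
from concurrent.futures import ThreadPoolExecutor


def run_in_parallel(items, worker, max_workers=5):
    """Apply worker to every item with a thread pool; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, items))


def fake_ocr(image):
    # Hypothetical stand-in for a real OCR call such as
    # pytesseract.image_to_data(image); it only reports the input length.
    return f"{len(image)} pixels"


# Placeholder "images": plain lists instead of numpy arrays.
images = [[0] * n for n in (3, 1, 2)]
results = run_in_parallel(images, fake_ocr, max_workers=2)
print(results)
```

With pytesseract installed, passing `pytesseract.image_to_data` as the worker and `pics` as the items would be the analogous call. How that compares with the timings above has not been measured here.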
