
Multiprocessing OCR with Tesseract

pip install tesseractmultiprocessing

Worth using if you:

  1. have many different image files to process

  2. work with images as NumPy arrays

Benchmark from the example below (400 images, times in seconds):

Multi (cpus=5): 23.9910116

One CPU (pytesseract): 100.61128
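To put the two timings in perspective, here is a small sketch of the arithmetic. The timings are copied from the benchmark; the image count (4 screenshots repeated 100 times) and the worker count (cpus=5) are read off the example code on this page. The variable names are only illustrative.

```python
# Timings in seconds, copied from the benchmark on this page.
multi_seconds = 23.9910116
single_seconds = 100.61128

# Workload taken from the example: 4 unique screenshots repeated 100 times,
# processed with cpus=5 in the tesser2df call.
n_images = 4 * 100
n_cpus = 5

# Speedup: how many times faster the multiprocessing run finished.
speedup = single_seconds / multi_seconds

# Parallel efficiency: speedup relative to the number of worker processes.
efficiency = speedup / n_cpus

print(f"speedup: {speedup:.2f}x")
print(f"parallel efficiency: {efficiency:.0%}")
print(f"seconds per image (multi): {multi_seconds / n_images:.3f}")
print(f"seconds per image (one CPU): {single_seconds / n_images:.3f}")
```

On these numbers the multiprocessing run is roughly 4.2 times faster with 5 workers. This is a single run on one machine, so treat it as an indication rather than a guarantee.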

from tesseractmultiprocessing import tesser2df
from a_cv_imwrite_imread_plus import open_image_in_cv
from time import perf_counter

# Four sample screenshots, loaded once as images
picslinks = [
    r"https://github.com/hansalemaos/screenshots/raw/main/pandsnesteddicthtml.png",
    r"https://github.com/hansalemaos/screenshots/raw/main/cv2_putTrueTypeText_000000.png",
    r"https://github.com/hansalemaos/screenshots/raw/main/cv2_putTrueTypeText_000008.png",
    r"https://github.com/hansalemaos/screenshots/raw/main/cv2_putTrueTypeText_000017.png",
]
picsunique = [open_image_in_cv(x) for x in picslinks]

# Repeat the four images 100 times to get a workload of 400 images
pics = []
for _ in range(100):
    pics.extend(picsunique)

# Multiprocessing run with 5 CPUs
start = perf_counter()
output = tesser2df(
    pics,
    language="eng",
    pandas_kwargs={"on_bad_lines": "warn"},
    tesser_args=(),
    cpus=5,
    tesser_path=r"C:\Program Files\Tesseract-OCR\tesseract.exe",
)
print(f"Multi: {perf_counter()-start}")

################################################################################

# Baseline for comparison: the same images with pytesseract in a single process
import pytesseract

pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def st():
    # Run OCR on every image, one after another
    alla = []
    for p in pics:
        alla.append(pytesseract.image_to_data(p))
    return alla


start = perf_counter()
output2 = st()
print(f"One CPU: {perf_counter()-start}")

# Multi: 23.9910116
# One CPU: 100.61128

# output[0]
# Out[4]:
# (    level  page_num  block_num  par_num  ...  start_x  start_y  end_x  end_y
#  0       1         1          0        0  ...        0        0   1465    654
#  1       2         1          1        0  ...      322       64    327    540
#  2       3         1          1        1  ...      322       64    327    540
#  3       4         1          1        1  ...      322       64    327    540
#  4       5         1          1        1  ...      322       64    327    540
#  ..    ...       ...        ...      ...  ...      ...      ...    ...    ...
#  60      5         1         11        1  ...       14      633   1448    644
#  61      2         1         12        0  ...     1445       15   1450    639
#  62      3         1         12        1  ...     1445       15   1450    639
#  63      4         1         12        1  ...     1445       15   1450    639
#  64      5         1         12        1  ...     1445       15   1450    639
#
#  [65 rows x 19 columns],
#  array([[[255, 255, 255],
#          [255, 255, 255],
#          [255, 255, 255],
#          ...,
#          [255, 255, 255],
#          [255, 255, 255],
#          [255, 255, 255]],
#
#         [[255, 255, 255],
#          [255, 255, 255],
#          [255, 255, 255],
#          ...,
#          [255, 255, 255],
#          [255, 255, 255],
#          [255, 255, 255]],
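The single-process baseline above collects the raw return values of pytesseract.image_to_data, which by default are tab-separated strings rather than DataFrames. As a minimal sketch, using only the standard library, this is one way to turn such a string into rows and keep only recognized words. SAMPLE_TSV is made-up illustrative data in Tesseract's usual TSV column layout, not output from the benchmark images, and the helper names parse_tsv and words_only are hypothetical.

```python
import csv
import io

# Hypothetical sample in Tesseract's TSV layout (NOT real benchmark output).
SAMPLE_TSV = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\t"
    "left\ttop\twidth\theight\tconf\ttext\n"
    "1\t1\t0\t0\t0\t0\t0\t0\t1465\t654\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t14\t633\t60\t11\t96.5\tHello\n"
)


def parse_tsv(tsv_text):
    """Parse a Tesseract TSV string into a list of dicts, one per row."""
    reader = csv.DictReader(
        io.StringIO(tsv_text),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # OCR text may contain quote characters
    )
    return list(reader)


def words_only(rows, min_conf=0.0):
    """Keep word-level rows (level 5) with non-empty text and conf >= min_conf."""
    return [
        r
        for r in rows
        if r["level"] == "5"
        and (r["text"] or "").strip()
        and float(r["conf"]) >= min_conf
    ]


rows = parse_tsv(SAMPLE_TSV)
words = words_only(rows, min_conf=50)
print([w["text"] for w in words])
```

With real data, the same helpers could be applied to each string in output2. If pandas is already installed, asking pytesseract for a DataFrame directly via output_type=pytesseract.Output.DATAFRAME is an alternative.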
