html/ocr parser using Cython/lxml/Tesseract/ImageMagick/Pandas

Tested against Windows 10 / Python 3.11 / Anaconda

pip install xmlhtml2pandas

Cython and a C compiler must be installed!
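A quick way to confirm these prerequisites before running pip is sketched below. It is not part of xmlhtml2pandas: the helper name `check_build_prerequisites` and the list of compiler executables are assumptions made for illustration, and a compiler can be installed without being on PATH (for example MSVC outside a developer prompt), so an empty result is only a hint.

```python
import importlib.util
import shutil


def check_build_prerequisites(compilers=("cl", "gcc", "clang", "cc")):
    """Report whether Cython is importable and which C compilers are on PATH.

    Hypothetical helper for illustration, not part of xmlhtml2pandas.
    The compiler names are assumptions; adjust them for your toolchain.
    """
    cython_found = importlib.util.find_spec("Cython") is not None
    found_compilers = [name for name in compilers if shutil.which(name)]
    return {"cython": cython_found, "compilers": found_compilers}


if __name__ == "__main__":
    print(check_build_prerequisites())
```

If `cython` is False or `compilers` is empty, the build step at install or first import may fail.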

import os
# Tesseract and ImageMagick must be installed!
os.environ["OMP_THREAD_LIMIT"] = "1"  # to limit the number of threads (tesseract)
os.environ["MAGICK_THREAD_LIMIT"] = "1"  # to limit the number of threads (ImageMagick)
from xmlhtml2pandas import parse_xmlhtml, preprocess_images_and_run_tesseract
from cythondfprint import add_printer  # fast color printer for pandas df

add_printer(1)
for file2parse in [
    r"C:\Users\hansc\Downloads\Apostas Futebol _ Sportingbet.mhtml",
    r"C:\Users\hansc\Downloads\bet365 - Apostas Desportivas Online.mhtml",
    r"C:\Users\hansc\Downloads\bet365 - Apostas Desportivas Online2.mhtml",
]:
    # open in binary mode and pass the file object to the parser
    with open(file2parse, "rb") as f:
        df_html = parse_xmlhtml(f, "html", ())
        print(df_html)
        print(df_html.dtypes)


for picture in preprocess_images_and_run_tesseract(
    density=200,
    resize_percentage=100,
    tesser_cpus=1,
    image_magick_cpus=1,
    path_in=r"C:\Users\hansc\Desktop\testimg",  # for folders
    path_out=r"C:\Users\hansc\Desktop\testimg_outfiles",  #  for folders
    # path_in=r"C:\Users\hansc\Downloads\apicture.png",# single file
    # path_out=r"C:\Users\hansc\Downloads\afolderforapicture", # single file - folder as output
    magick_options="""-colorspace LinearGray  -normalize -auto-level -alpha deactivate  -adaptive-blur 1 -adaptive-sharpen 1 -trim -fuzz 60 -antialias -auto-gamma -auto-level -black-point-compensation -normalize -enhance -white-balance -antialias -black-threshold 4 -mean-shift 1x5+17%""",
    magick_path=r"C:\Program Files\ImageMagick-7.1.1-Q16-HDRI\magick.exe",
    tesseractpath=r"C:\Program Files\Tesseract-OCR\tesseract.exe",
    tessdata_dir=r"C:\Program Files\Tesseract-OCR\tessdata",
    tesser_options_str="-l por+eng --oem 3 --psm 6 -c tessedit_create_hocr=1 -c hocr_font_info=1 -c tessedit_pageseg_mode=6",
    debug=False,
    subprocess_kwargs_tesser=None,
    subprocess_kwargs_magick=None,
    include_screenshots=True,
):
    print(picture)
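The example above passes absolute paths for `magick_path` and `tesseractpath`. A small sketch for checking such paths before starting a long OCR run follows. The helper `find_missing_tools` is hypothetical and not part of xmlhtml2pandas; it only relies on `shutil.which` from the standard library, which accepts both bare command names and full paths.

```python
import shutil


def find_missing_tools(tools):
    """Return the labels of tools whose executable cannot be found.

    `tools` maps a label to a command name or a full path.
    Hypothetical helper for illustration, not part of xmlhtml2pandas.
    """
    return [label for label, path in tools.items() if shutil.which(path) is None]


if __name__ == "__main__":
    missing = find_missing_tools(
        {
            "ImageMagick": r"C:\Program Files\ImageMagick-7.1.1-Q16-HDRI\magick.exe",
            "Tesseract": r"C:\Program Files\Tesseract-OCR\tesseract.exe",
        }
    )
    if missing:
        print("Not found:", ", ".join(missing))
```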

# On Android (the paths below assume Termux)

import os
import subprocess

# limit the number of threads before importing the package
os.environ["OMP_THREAD_LIMIT"] = "1"
os.environ["MAGICK_THREAD_LIMIT"] = "1"
os.environ["KMP_ALL_THREADS"] = "1"
os.environ["KMP_TEAMS_THREAD_LIMIT"] = "1"
os.environ["KMP_DEVICE_THREAD_LIMIT"] = "1"

from xmlhtml2pandas import parse_xmlhtml, preprocess_images_and_run_tesseract

# take a screenshot of the device screen
subprocess.run("screencap -p > /sdcard/shot.png", shell=True)
for picture in preprocess_images_and_run_tesseract(
    density=200,
    resize_percentage=100,
    tesser_cpus=1,
    image_magick_cpus=1,
    path_in=r"/sdcard/shot.png", 
    path_out=r"/sdcard/Downloadsout",
    magick_options="""-colorspace LinearGray  -normalize -auto-level -alpha deactivate  -adaptive-blur 1 -adaptive-sharpen 1 -trim -fuzz 60 -antialias -auto-gamma -auto-level -black-point-compensation -normalize -enhance -white-balance -antialias -black-threshold 4 -mean-shift 1x5+17%""",
    magick_path=r"/data/data/com.termux/files/usr/bin/magick",
    tesseractpath=r"/data/data/com.termux/files/usr/bin/tesseract",
    tessdata_dir=r"/data/data/com.termux/files/usr/share/tessdata_fast",
    tesser_options_str="-l por+eng --oem 3 --psm 6 -c tessedit_create_hocr=1 -c hocr_font_info=1 -c tessedit_pageseg_mode=6",
    debug=False,
    subprocess_kwargs_tesser=None,
    subprocess_kwargs_magick=None,
    include_screenshots=False,
):
    print(picture)
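Both examples pass `-c tessedit_create_hocr=1`, which asks Tesseract for hOCR output: HTML in which each recognized word is a `span` of class `ocrx_word` whose `title` attribute holds the bounding box and confidence. What the package yields for each picture is not shown here, so the following is only a standalone sketch of how such hOCR text could be read with the standard library. The names `HocrWordParser` and `parse_hocr_words` are hypothetical, and the sketch assumes word spans contain no nested spans.

```python
import re
from html.parser import HTMLParser

# hOCR keeps geometry in the title attribute, e.g.
# title="bbox 36 92 96 116; x_wconf 95"
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")
CONF_RE = re.compile(r"x_wconf (\d+(?:\.\d+)?)")


class HocrWordParser(HTMLParser):
    """Collect text, bounding box and confidence of every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None  # word being read, or None outside a word span

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            title = attrs.get("title") or ""
            bbox = BBOX_RE.search(title)
            conf = CONF_RE.search(title)
            self._current = {
                "text": "",
                "bbox": tuple(int(v) for v in bbox.groups()) if bbox else None,
                "conf": float(conf.group(1)) if conf else None,
            }

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        # assumes no spans are nested inside a word span
        if tag == "span" and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.words.append(self._current)
            self._current = None


def parse_hocr_words(hocr_text):
    """Return a list of dicts with the keys text, bbox and conf."""
    parser = HocrWordParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


if __name__ == "__main__":
    sample = (
        "<span class='ocr_line' title='bbox 10 10 120 30'>"
        "<span class='ocrx_word' title='bbox 10 10 60 30; x_wconf 95'>Hello</span> "
        "<span class='ocrx_word' title='bbox 70 10 120 30; x_wconf 88'>world</span>"
        "</span>"
    )
    for word in parse_hocr_words(sample):
        print(word)
```

The resulting list of dicts can be handed to `pandas.DataFrame` if a tabular view of the words is wanted.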
