html/ocr parser using Cython/lxml/Tesseract/ImageMagick/Pandas
Project description
Tested against Windows 10 / Python 3.11 / Anaconda
pip install xmlhtml2pandas
Cython and a C compiler must be installed!
import os

# Tesseract and ImageMagick must be installed!
os.environ["OMP_THREAD_LIMIT"] = "1"  # to limit the number of threads (tesseract)
os.environ["MAGICK_THREAD_LIMIT"] = "1"  # to limit the number of threads (ImageMagick)

from xmlhtml2pandas import parse_xmlhtml, preprocess_images_and_run_tesseract
from cythondfprint import add_printer  # fast color printer for pandas df

add_printer(1)

for file2parse in [
    r"C:\Users\hansc\Downloads\Apostas Futebol _ Sportingbet.mhtml",
    r"C:\Users\hansc\Downloads\bet365 - Apostas Desportivas Online.mhtml",
    r"C:\Users\hansc\Downloads\bet365 - Apostas Desportivas Online2.mhtml",
]:
    with open(
        file2parse,
        "rb",
    ) as f:
        df_html = parse_xmlhtml(f, "html", ())
    print(df_html)
    print(df_html.dtypes)

for picture in preprocess_images_and_run_tesseract(
    density=200,
    resize_percentage=100,
    tesser_cpus=1,
    image_magick_cpus=1,
    path_in=r"C:\Users\hansc\Desktop\testimg",  # for folders
    path_out=r"C:\Users\hansc\Desktop\testimg_outfiles",  # for folders
    # path_in=r"C:\Users\hansc\Downloads\apicture.png",  # single file
    # path_out=r"C:\Users\hansc\Downloads\afolderforapicture",  # single file - folder as output
    magick_options="""-colorspace LinearGray -normalize -auto-level -alpha deactivate -adaptive-blur 1 -adaptive-sharpen 1 -trim -fuzz 60 -antialias -auto-gamma -auto-level -black-point-compensation -normalize -enhance -white-balance -antialias -black-threshold 4 -mean-shift 1x5+17%""",
    magick_path=r"C:\Program Files\ImageMagick-7.1.1-Q16-HDRI\magick.exe",
    tesseractpath=r"C:\Program Files\Tesseract-OCR\tesseract.exe",
    tessdata_dir=r"C:\Program Files\Tesseract-OCR\tessdata",
    tesser_options_str="-l por+eng --oem 3 --psm 6 -c tessedit_create_hocr=1 -c hocr_font_info=1 -c tessedit_pageseg_mode=6",
    debug=False,
    subprocess_kwargs_tesser=None,
    subprocess_kwargs_magick=None,
    include_screenshots=True,
):
    print(picture)
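The install paths in the example above are specific to one machine. The following is a minimal sketch, assuming the executables may be on PATH, of how the values for `magick_path` and `tesseractpath` could be resolved with the standard library before calling `preprocess_images_and_run_tesseract`. The helper `find_tool` is hypothetical and not part of xmlhtml2pandas.

```python
import shutil
from typing import Optional


def find_tool(name: str, fallback: Optional[str] = None) -> Optional[str]:
    """Return the full path of an executable found on PATH, else the fallback.

    Hypothetical helper for illustration; not part of xmlhtml2pandas.
    """
    found = shutil.which(name)
    return found if found is not None else fallback


# Fall back to the install locations used in the example above.
magick_path = find_tool(
    "magick", r"C:\Program Files\ImageMagick-7.1.1-Q16-HDRI\magick.exe"
)
tesseractpath = find_tool(
    "tesseract", r"C:\Program Files\Tesseract-OCR\tesseract.exe"
)
print(magick_path)
print(tesseractpath)
```

The two resulting strings can then be passed as the `magick_path` and `tesseractpath` arguments shown earlier.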
# on Android
import os
import subprocess

# limit the number of threads before importing the package
os.environ["OMP_THREAD_LIMIT"] = "1"
os.environ["MAGICK_THREAD_LIMIT"] = "1"
os.environ["KMP_ALL_THREADS"] = "1"
os.environ["KMP_TEAMS_THREAD_LIMIT"] = "1"
os.environ["KMP_DEVICE_THREAD_LIMIT"] = "1"

from xmlhtml2pandas import parse_xmlhtml, preprocess_images_and_run_tesseract

# take a screenshot and save it to /sdcard/shot.png
subprocess.run("screencap -p > /sdcard/shot.png", shell=True)

for picture in preprocess_images_and_run_tesseract(
    density=200,
    resize_percentage=100,
    tesser_cpus=1,
    image_magick_cpus=1,
    path_in=r"/sdcard/shot.png",
    path_out=r"/sdcard/Downloadsout",
    magick_options="""-colorspace LinearGray -normalize -auto-level -alpha deactivate -adaptive-blur 1 -adaptive-sharpen 1 -trim -fuzz 60 -antialias -auto-gamma -auto-level -black-point-compensation -normalize -enhance -white-balance -antialias -black-threshold 4 -mean-shift 1x5+17%""",
    magick_path=r"/data/data/com.termux/files/usr/bin/magick",
    tesseractpath=r"/data/data/com.termux/files/usr/bin/tesseract",
    tessdata_dir=r"/data/data/com.termux/files/usr/share/tessdata_fast",
    tesser_options_str="-l por+eng --oem 3 --psm 6 -c tessedit_create_hocr=1 -c hocr_font_info=1 -c tessedit_pageseg_mode=6",
    debug=False,
    subprocess_kwargs_tesser=None,
    subprocess_kwargs_magick=None,
    include_screenshots=False,
):
    print(picture)
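Both examples pass `-c tessedit_create_hocr=1`, so Tesseract writes hOCR. In hOCR, each element carries a `title` attribute with properties such as `bbox 36 92 96 116; x_wconf 95`. The sketch below shows, under the assumption that you are reading such a raw hOCR `title` string yourself, how it could be split into numeric properties. `parse_hocr_title` is a hypothetical helper; how xmlhtml2pandas exposes these values in its own output is not shown here.

```python
from typing import Dict, List


def parse_hocr_title(title: str) -> Dict[str, List[float]]:
    """Split an hOCR title attribute into {property: [numbers]}.

    Hypothetical helper for illustration; not part of xmlhtml2pandas.
    Properties whose values are not numeric (for example image "file.png")
    are skipped.
    """
    props: Dict[str, List[float]] = {}
    for part in title.split(";"):
        tokens = part.split()
        if not tokens:
            continue
        key, values = tokens[0], tokens[1:]
        try:
            props[key] = [float(v) for v in values]
        except ValueError:
            continue  # non-numeric property, ignore it
    return props


sample = "bbox 36 92 96 116; x_wconf 95"
parsed = parse_hocr_title(sample)
print(parsed["bbox"])     # left, top, right, bottom in pixels
print(parsed["x_wconf"])  # word confidence
```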
Project details
Download files
Source Distribution
xmlhtml2pandas-0.14.tar.gz (32.2 kB)
Built Distribution
xmlhtml2pandas-0.14-py3-none-any.whl (32.6 kB)
File details
Details for the file xmlhtml2pandas-0.14.tar.gz.
File metadata
- Download URL: xmlhtml2pandas-0.14.tar.gz
- Upload date:
- Size: 32.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.7
File hashes
Algorithm | Hash digest
---|---
SHA256 | 2a886f61968e593475a0f6c9c9ec25cbf3519fb6102744f35d3f548739771d57
MD5 | 89ce7465fc41973939d6829778109e45
BLAKE2b-256 | d4da93ae71fd08abc74d2b85bc7c4b48e99a0baa673628a1207cecda9afd3836
File details
Details for the file xmlhtml2pandas-0.14-py3-none-any.whl.
File metadata
- Download URL: xmlhtml2pandas-0.14-py3-none-any.whl
- Upload date:
- Size: 32.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.7
File hashes
Algorithm | Hash digest
---|---
SHA256 | 3a60b80fd1551256a4df4a1f4906b11ea13a120c17a69ad79a742296252c10da
MD5 | d6d6638837c47146475d88ba5fc6a2a9
BLAKE2b-256 | 40a7671e769ebd4e1dafdb872f618ab3f6f625af8fe6f1470f0f185091ab5ea0
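The published digests can be used to check a downloaded file. A minimal sketch, assuming the source distribution was saved to the current directory under its original name (the location is an assumption), using only `hashlib` from the standard library. `sha256_of` is a hypothetical helper name.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# SHA256 listed above for xmlhtml2pandas-0.14.tar.gz
EXPECTED = "2a886f61968e593475a0f6c9c9ec25cbf3519fb6102744f35d3f548739771d57"

archive = Path("xmlhtml2pandas-0.14.tar.gz")  # assumed download location
if archive.exists():
    print("match" if sha256_of(archive) == EXPECTED else "MISMATCH")
else:
    print(f"{archive} not found; download it first")
```

The same function works for the wheel by swapping in its file name and the SHA256 value from its table.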