OCR and Arabic text correction tools with WordBank

# TRT Tarek Tools

***PDF extraction, Arabic text correction, and WordBank tools***

## Description

This package provides tools for:

* Extracting text from PDF files (with OCR fallback using PyMuPDF and Tesseract)

* Cleaning and correcting Arabic text

* Checking words against a WordBank and applying corrections

It is useful for processing Arabic PDFs, preparing text for NLP, or building word databases.
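The cleaning and WordBank steps listed above can be pictured with a small standard-library sketch. This is not the package's implementation: `clean_arabic`, `correct_words`, and the `cutoff` value are hypothetical names and settings chosen for illustration, and the normalization rules used here (dropping diacritics and tatweel, unifying alef forms) are one common convention among several.

```python
import re
from difflib import get_close_matches

# Arabic diacritics (U+064B to U+0652), superscript alef (U+0670), tatweel (U+0640).
_DIACRITICS = re.compile(r"[\u064B-\u0652\u0670\u0640]")
# Alef with madda, hamza above, or hamza below (U+0622, U+0623, U+0625).
_ALEF_VARIANTS = re.compile(r"[\u0622\u0623\u0625]")


def clean_arabic(text: str) -> str:
    """Strip diacritics and tatweel, unify alef forms, collapse whitespace."""
    text = _DIACRITICS.sub("", text)
    text = _ALEF_VARIANTS.sub("\u0627", text)  # plain alef
    return re.sub(r"\s+", " ", text).strip()


def correct_words(text: str, word_bank: set[str], cutoff: float = 0.75) -> str:
    """Replace each unknown word with its closest word-bank entry, if one is close enough."""
    candidates = sorted(word_bank)  # sorted for deterministic tie-breaking
    corrected = []
    for word in text.split():
        if word in word_bank:
            corrected.append(word)
            continue
        matches = get_close_matches(word, candidates, n=1, cutoff=cutoff)
        # Leave the word untouched when nothing in the bank is similar enough.
        corrected.append(matches[0] if matches else word)
    return " ".join(corrected)
```

Cleaning before lookup matters in this sketch: a word carrying diacritics would otherwise miss an exact match against an undiacritized word bank and be sent through fuzzy matching unnecessarily.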

## Installation

    # Using pip
    pip install trk-mmr-tools

## Usage

Here is a simple example using the `process_pdfs` function:

    from pathlib import Path

    from trk_mmr_tools.pdf.processor import process_pdfs
    from trk_mmr_tools.text.correction import TextCorrection

    pdf_input = Path("tests/sample.pdf")  # a single PDF, or a folder of PDFs
    output_dir = Path("output")

    # Create the output folder if it does not exist yet
    output_dir.mkdir(exist_ok=True)

    # Corrector passed to process_pdfs below
    corrector = TextCorrection()

    process_pdfs(
        source=pdf_input,
        output_dir=output_dir,
        method="ocr",
        lang="ara",
        clean=True,
        corrector=corrector,
    )
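For readers curious what an OCR fallback of the kind mentioned in the description might look like, here is a rough sketch that calls PyMuPDF and pytesseract directly. It does not show how `process_pdfs` works internally. The helper names and the 20-character threshold are assumptions, and running the OCR path requires `pymupdf`, `pytesseract`, `Pillow`, and a Tesseract install with Arabic language data.

```python
import io


def needs_ocr(extracted_text: str, min_chars: int = 20) -> bool:
    """Treat a page as scanned when its text layer is empty or nearly empty."""
    return len(extracted_text.strip()) < min_chars


def extract_page_text(page, lang: str = "ara", dpi: int = 300) -> str:
    """Return the page's text layer, or OCR the rendered page if that layer is too short."""
    text = page.get_text()
    if not needs_ocr(text):
        return text

    # Imported lazily so the text-layer path works without the OCR stack installed.
    import pytesseract
    from PIL import Image

    pix = page.get_pixmap(dpi=dpi)  # render the page to a raster image
    image = Image.open(io.BytesIO(pix.tobytes("png")))
    return pytesseract.image_to_string(image, lang=lang)


def extract_pdf_text(pdf_path: str, lang: str = "ara") -> str:
    """Extract text from every page, falling back to OCR page by page."""
    import fitz  # PyMuPDF

    with fitz.open(pdf_path) as doc:
        return "\n".join(extract_page_text(page, lang=lang) for page in doc)
```

Deciding per page rather than per file keeps mixed documents (some pages with a text layer, some scanned) from being OCR'd in full, which is usually the slow part.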

## License

MIT License

## Author

Tarek

## Download files

* Source distribution: trk_mmr_tools-0.1.0.tar.gz (2.1 MB)

* Built distribution: trk_mmr_tools-0.1.0-py3-none-any.whl (2.1 MB, Python 3)

### File details: trk_mmr_tools-0.1.0.tar.gz

* Size: 2.1 MB

* Tags: Source

* Uploaded using Trusted Publishing: No

* Uploaded via: uv/0.11.2

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | bdf52c4711ac42530b3573e1b1ee2affaa42d303c7cee86214390604d1298c0c |
| MD5 | a8c17b90e4ea9333407b9738baff4e20 |
| BLAKE2b-256 | ebb3022fe02b4100402e41dcf9fc0aba75a09746e3a12aae0b3d200813843388 |

### File details: trk_mmr_tools-0.1.0-py3-none-any.whl

* Size: 2.1 MB

* Tags: Python 3

* Uploaded using Trusted Publishing: No

* Uploaded via: uv/0.11.2

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | da0e91ef0e2358d2c9a68cebf57ce86a9ba8f41bd3879dd0632ed716188cdae2 |
| MD5 | c6f45da8b735eefd7885d567eaf8c055 |
| BLAKE2b-256 | a1d86f5f6f126025300b5c9b87632ae5a2cdbb1fc7e7c79c99d905be9d5f530b |
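The digests listed for each file can be used to check a download before installing it. The following is a minimal sketch using only the standard library; the function names and the local download path are hypothetical, and the expected value is the SHA256 digest listed for the source distribution.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: Path, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest (case-insensitive)."""
    return sha256_of(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    # Hypothetical location of the downloaded source distribution.
    archive = Path("downloads/trk_mmr_tools-0.1.0.tar.gz")
    expected = "bdf52c4711ac42530b3573e1b1ee2affaa42d303c7cee86214390604d1298c0c"
    if archive.exists():
        print("hash OK" if matches_published_hash(archive, expected) else "hash MISMATCH")
```

pip can perform the same check itself when a requirements file pins hashes and is installed with `--require-hashes`.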
