OCR and Arabic text correction tools with WordBank
# TRT Tarek Tools
***PDF extraction, Arabic text correction, and WordBank tools***
## Description
This package provides tools for:
* Extracting text from PDF files (with OCR fallback using PyMuPDF and Tesseract)
* Cleaning and correcting Arabic text
* Checking words against a WordBank and applying corrections
It is useful for processing Arabic PDFs, preparing text for NLP, or building word databases.
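The OCR fallback mentioned above can be pictured roughly as follows. This is a minimal sketch, not the package's actual implementation: the function names (`page_text_with_ocr_fallback`, `tesseract_ocr`, `extract_pdf_text`) and the `min_chars` threshold are hypothetical, and it assumes PyMuPDF, `pytesseract`, Pillow, and Tesseract's Arabic language data are installed.

```python
import io


def page_text_with_ocr_fallback(page, ocr, min_chars=20):
    """Return the page's embedded text, or the OCR result if too little text is found.

    `page` only needs a get_text() method (as PyMuPDF pages have);
    `ocr` is any callable that takes the page and returns a string.
    `min_chars` is an arbitrary illustrative threshold.
    """
    text = page.get_text().strip()
    if len(text) >= min_chars:
        return text
    return ocr(page).strip()


def tesseract_ocr(page, lang="ara", dpi=300):
    """Render a PyMuPDF page to an image and run Tesseract on it."""
    # Imported lazily so the fallback logic above works without these packages.
    import pytesseract
    from PIL import Image

    pix = page.get_pixmap(dpi=dpi)
    image = Image.open(io.BytesIO(pix.tobytes("png")))
    return pytesseract.image_to_string(image, lang=lang)


def extract_pdf_text(pdf_path, lang="ara"):
    """Extract text from every page, using OCR only where a page has no text layer."""
    import fitz  # PyMuPDF

    with fitz.open(str(pdf_path)) as doc:
        return "\n".join(
            page_text_with_ocr_fallback(page, lambda p: tesseract_ocr(p, lang=lang))
            for page in doc
        )
```

Keeping the fallback decision separate from the OCR call makes it easy to swap in a different OCR engine or to test the logic without any PDF at hand.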
## Installation
    # Using pip
    pip install trk-mmr-tools
## Usage
Here is a simple example using the `process_pdfs` function:

    from pathlib import Path

    from trk_mmr_tools.pdf.processor import process_pdfs
    from trk_mmr_tools.text.correction import TextCorrection

    pdf_input = Path("tests/sample.pdf")  # or a folder of PDFs
    output_dir = Path("output")
    output_dir.mkdir(exist_ok=True)

    corrector = TextCorrection()

    process_pdfs(
        source=pdf_input,
        output_dir=output_dir,
        method="ocr",
        lang="ara",  # Tesseract language code for Arabic
        clean=True,
        corrector=corrector,
    )
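The behaviour of `clean=True` and `corrector` is not documented here. As an illustration only of the kind of cleaning and WordBank checking listed in the description, here is a standard-library sketch. The names `clean_arabic` and `correct_words`, and the idea of a word bank as a plain set with a dictionary of corrections, are assumptions and not the package's API.

```python
import re
import unicodedata

# Arabic short-vowel marks and related signs (fathatan through sukun, superscript alef).
HARAKAT = re.compile("[\u064B-\u0652\u0670]")
TATWEEL = "\u0640"  # elongation character used for justification


def clean_arabic(text):
    """Strip diacritics and tatweel, then collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = HARAKAT.sub("", text)
    text = text.replace(TATWEEL, "")
    return re.sub(r"\s+", " ", text).strip()


def correct_words(text, word_bank, corrections):
    """Apply known corrections and report words missing from the word bank.

    word_bank:   set of accepted words (hypothetical structure)
    corrections: dict mapping a known misspelling to its fix
    Returns (corrected_text, unknown_words).
    """
    corrected, unknown = [], []
    for word in text.split():
        word = corrections.get(word, word)
        if word not in word_bank:
            unknown.append(word)
        corrected.append(word)
    return " ".join(corrected), unknown


if __name__ == "__main__":
    word_bank = {"كتاب", "مدرسة"}
    corrections = {"مدرسه": "مدرسة"}  # a common final-letter confusion
    cleaned = clean_arabic("كتـــاب   مدرسه")
    print(correct_words(cleaned, word_bank, corrections))
```

Words that are neither in the bank nor in the corrections map are returned separately, so they can be reviewed and added to the bank later.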
## License
MIT License
## Author
Tarek
File details
Details for the file trk_mmr_tools-0.1.0.tar.gz.
File metadata
- Download URL: trk_mmr_tools-0.1.0.tar.gz
- Upload date:
- Size: 2.1 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.11.2 {"installer":{"name":"uv","version":"0.11.2","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":null,"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `bdf52c4711ac42530b3573e1b1ee2affaa42d303c7cee86214390604d1298c0c` |
| MD5 | `a8c17b90e4ea9333407b9738baff4e20` |
| BLAKE2b-256 | `ebb3022fe02b4100402e41dcf9fc0aba75a09746e3a12aae0b3d200813843388` |
File details
Details for the file trk_mmr_tools-0.1.0-py3-none-any.whl.
File metadata
- Download URL: trk_mmr_tools-0.1.0-py3-none-any.whl
- Upload date:
- Size: 2.1 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.11.2 {"installer":{"name":"uv","version":"0.11.2","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":null,"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `da0e91ef0e2358d2c9a68cebf57ce86a9ba8f41bd3879dd0632ed716188cdae2` |
| MD5 | `c6f45da8b735eefd7885d567eaf8c055` |
| BLAKE2b-256 | `a1d86f5f6f126025300b5c9b87632ae5a2cdbb1fc7e7c79c99d905be9d5f530b` |