OCR and Arabic text correction tools with WordBank
# trk-mmr-tools
PDF extraction, Arabic text correction, and WordBank tools
## Description
trk-mmr-tools provides a streamlined workflow for handling complex Arabic document processing and automated text cleanup. It specializes in:
- Hybrid PDF Extraction: High-performance text extraction using PyMuPDF with an automatic OCR fallback via Tesseract.
- Arabic Text Normalization: Advanced cleaning and correction tailored for Arabic script.
- WordBank Integration: Automated spell-checking and dictionary-based validation using high-performance indexing for large datasets.
It is designed for researchers, developers, and data scientists processing Arabic PDFs for NLP, media ethnography, or building academic word databases.
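The hybrid strategy in the first bullet can be pictured as a per-page decision: keep the embedded text layer when it yields enough characters, otherwise rasterize the page and run OCR. The sketch below is a minimal illustration under stated assumptions, not the package's actual implementation. `choose_text`, `extract_page_text`, and the `min_chars` threshold are hypothetical names, and the second function assumes `pymupdf`, `pytesseract`, and `Pillow` are installed.

```python
import io
from typing import Callable


def choose_text(native_text: str, run_ocr: Callable[[], str], min_chars: int = 20) -> str:
    """Return the embedded text if it looks usable, otherwise fall back to OCR.

    `min_chars` is an illustrative threshold, not a value taken from trk-mmr-tools.
    """
    if len(native_text.strip()) >= min_chars:
        return native_text
    return run_ocr()


def extract_page_text(page, lang: str = "ara", dpi: int = 300) -> str:
    """Extract text from a PyMuPDF page, calling Tesseract only when needed."""
    # Imported lazily so the decision logic above works without the OCR dependencies.
    import pytesseract
    from PIL import Image

    def run_ocr() -> str:
        pixmap = page.get_pixmap(dpi=dpi)  # rasterize the page
        image = Image.open(io.BytesIO(pixmap.tobytes("png")))
        return pytesseract.image_to_string(image, lang=lang)

    return choose_text(page.get_text(), run_ocr)
```

With PyMuPDF, `extract_page_text(page)` would be called for each `page` of a document opened with `fitz.open(path)`. Passing OCR as a callable means Tesseract runs only for pages whose text layer is missing or nearly empty.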
## Prerequisites
The OCR features rely on the Tesseract OCR engine, which is not installed by pip. Install the engine and its Arabic language data on your operating system before using OCR:
* **Ubuntu / Google Colab:**
```bash
sudo apt-get update
sudo apt-get install tesseract-ocr tesseract-ocr-ara
```
* **macOS:**
```bash
brew install tesseract tesseract-lang
```
* **Windows:** Download the installer from [UB Mannheim](https://github.com/UB-Mannheim/tesseract/wiki). During installation, check the box for **Arabic** script data, then add the Tesseract directory to your system PATH.
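
After installing, it can help to confirm that the `tesseract` binary is on the PATH and that the Arabic data (`ara`) is available. The following is a minimal sketch using only the standard library and the `tesseract --list-langs` command. The helper names `parse_langs` and `arabic_ocr_available` are hypothetical and not part of trk-mmr-tools.

```python
import shutil
import subprocess


def parse_langs(output: str) -> set[str]:
    """Parse the output of `tesseract --list-langs` into a set of language codes."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # Skip the header line, e.g. 'List of available languages in "..." (3):'
    return {line for line in lines if not line.lower().startswith("list of")}


def arabic_ocr_available() -> bool:
    """Return True if tesseract is on PATH and reports the 'ara' language."""
    if shutil.which("tesseract") is None:
        return False
    result = subprocess.run(
        ["tesseract", "--list-langs"], capture_output=True, text=True, check=False
    )
    # Some Tesseract releases may print the list to stderr, so check both streams.
    return "ara" in parse_langs(result.stdout + "\n" + result.stderr)
```

Calling `arabic_ocr_available()` before a long batch run gives an early, readable failure instead of an error partway through processing.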
-----
## Installation
```bash
# Using pip
pip install trk-mmr-tools
```
> **Note:** If your environment pins an older NumPy (as some versions of JAX or OpenCV do), upgrade to **NumPy 1.26.0 or higher** before installing.
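
One way to check the NumPy requirement without importing NumPy itself is to read the installed version from package metadata. This is a small sketch using the standard library. `release_tuple` and `numpy_is_recent_enough` are hypothetical helper names, and the parsing is deliberately simple (it ignores suffixes such as `rc1`).

```python
from importlib.metadata import PackageNotFoundError, version


def release_tuple(version_string: str) -> tuple[int, ...]:
    """Turn '1.26.4' into (1, 26, 4), ignoring suffixes such as 'rc1' or '.dev0'."""
    parts = []
    for piece in version_string.split(".")[:3]:
        digits = ""
        for char in piece:
            if not char.isdigit():
                break
            digits += char
        if not digits:
            break
        parts.append(int(digits))
    # Pad short versions such as '1.26' so tuple comparison behaves as expected.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def numpy_is_recent_enough(minimum: tuple[int, ...] = (1, 26, 0)) -> bool:
    """Return True if NumPy is installed and its version is at least `minimum`."""
    try:
        installed = version("numpy")
    except PackageNotFoundError:
        return False
    return release_tuple(installed) >= minimum
```

If the check returns `False`, running `pip install "numpy>=1.26.0"` should bring the environment up to the required version.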
-----
## Usage
The following example demonstrates how to process a PDF (or a folder of PDFs) using the OCR method with Arabic language support and WordBank corrections.
```python
from pathlib import Path

from trk_mmr_tools.pdf.processor import process_pdfs
from trk_mmr_tools.text.correction import TextCorrection

# Define input and output paths
pdf_input = Path("tests/sample.pdf")  # Can be a file or a directory
output_dir = Path("output")
output_dir.mkdir(exist_ok=True)

# Initialize the Arabic text corrector (loads WordBank assets)
corrector = TextCorrection()

# Process the PDFs
process_pdfs(
    source=pdf_input,
    output_dir=output_dir,
    method="ocr",         # Uses Tesseract for scanned/image-based PDFs
    lang="ara",           # Specifies Arabic language for OCR
    clean=True,           # Normalizes characters and removes noise
    corrector=corrector,  # Applies the dictionary-based validator
)
```
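
To give a feel for what the `clean=True` and `corrector` steps do conceptually, here is a standalone sketch of common Arabic normalization rules followed by a dictionary lookup with a fuzzy fallback. It is an illustration only: the rules, the `normalize_arabic` and `suggest` functions, and the `cutoff` value are assumptions, and the package's own cleaning and WordBank logic may differ.

```python
import difflib
import re

# Arabic diacritics (tashkeel, U+064B-U+0652) and the tatweel elongation mark (U+0640).
DIACRITICS_AND_TATWEEL = re.compile("[\u064B-\u0652\u0640]")
# Alef with madda or hamza (U+0622, U+0623, U+0625), mapped to bare alef (U+0627).
ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")


def normalize_arabic(text: str) -> str:
    """Strip diacritics and tatweel, unify alef forms, and collapse whitespace."""
    text = DIACRITICS_AND_TATWEEL.sub("", text)
    text = ALEF_VARIANTS.sub("\u0627", text)
    return re.sub(r"\s+", " ", text).strip()


def suggest(word: str, word_bank: set[str], cutoff: float = 0.75) -> str:
    """Return `word` if it is in the bank, else the closest entry above `cutoff`.

    Falls back to the unchanged word when nothing is similar enough.
    """
    if word in word_bank:
        return word
    matches = difflib.get_close_matches(word, sorted(word_bank), n=1, cutoff=cutoff)
    return matches[0] if matches else word
```

For instance, `normalize_arabic("كِتَابٌ")` returns `"كتاب"`. The linear scan in `difflib` is fine for a toy word list. A large WordBank would need an index, which is presumably why the package ships a prebuilt one.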
-----
## Project Structure
For your custom WordBank and assets to be recognized, ensure your package follows this structure:
```text
trk-mmr-tools/
├── pyproject.toml
├── README.md
└── src/
    └── trk_mmr_tools/
        ├── __init__.py
        ├── pdf/
        ├── text/
        └── assets/
            ├── bank.pkl
            └── data.xlsx
```
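
A quick way to confirm that the asset files actually ship with the installed package is to look them up through `importlib.resources` (Python 3.9+). This sketch is an assumption about how one might verify the layout, not a description of how trk-mmr-tools loads its assets. `asset_exists` is a hypothetical helper.

```python
from importlib.resources import files


def asset_exists(package: str, *relative_parts: str) -> bool:
    """Return True if a file bundled inside `package` exists at the relative path."""
    resource = files(package)
    for part in relative_parts:
        resource = resource / part
    return resource.is_file()
```

After installation, `asset_exists("trk_mmr_tools", "assets", "bank.pkl")` should return `True`. If it returns `False`, the build configuration may not be including non-Python files in the built distribution.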
-----
## License
MIT License
## Author
**Tarek**