
OCR and Arabic text correction tools with WordBank


# trk-mmr-tools

PDF extraction, Arabic text correction, and WordBank tools

## Description

trk-mmr-tools provides a streamlined workflow for handling complex Arabic document processing and automated text cleanup. It specializes in:

  * **Hybrid PDF Extraction:** High-performance text extraction using PyMuPDF with an automatic OCR fallback via Tesseract.
  * **Arabic Text Normalization:** Advanced cleaning and correction tailored for Arabic script.
  * **WordBank Integration:** Automated spell-checking and dictionary-based validation using high-performance indexing for large datasets.

It is designed for researchers, developers, and data scientists processing Arabic PDFs for NLP, media ethnography, or building academic word databases.
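
The hybrid extraction idea can be illustrated with a minimal sketch. This is not the package's internal code: `extract_hybrid`, its callable parameters, and the `min_chars` threshold are hypothetical names chosen for illustration. The sketch only shows the decision rule, which is to use the embedded text layer when a page has one and fall back to OCR when the layer is empty or nearly empty.

```python
from typing import Callable, Iterable, List, TypeVar

Page = TypeVar("Page")


def extract_hybrid(
    pages: Iterable[Page],
    extract_text: Callable[[Page], str],
    run_ocr: Callable[[Page], str],
    min_chars: int = 20,  # hypothetical threshold, tune for your documents
) -> List[str]:
    """Return one string per page, running OCR only where the text layer is too thin."""
    results: List[str] = []
    for page in pages:
        text = extract_text(page).strip()
        # Scanned pages usually have an empty or near-empty text layer.
        if len(text) < min_chars:
            text = run_ocr(page).strip()
        results.append(text)
    return results
```

With PyMuPDF, `extract_text` could be `lambda page: page.get_text()`, and `run_ocr` could render the page to an image and pass it to an OCR binding of your choice. Which binding the package itself uses is not stated here, so treat that part as an assumption.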


## Prerequisites

This package calls the Tesseract OCR engine for its OCR features. You must install the engine, along with its Arabic language data, on your operating system for those features to work:

  * **Ubuntu / Google Colab:**
    ```bash
    sudo apt-get update
    sudo apt-get install tesseract-ocr tesseract-ocr-ara
    ```

  * **macOS:**
    ```bash
    brew install tesseract tesseract-lang
    ```
  * **Windows:** Download the installer from [UB Mannheim](https://github.com/UB-Mannheim/tesseract/wiki). Ensure you check the box for **Arabic** script data during installation and add the Tesseract directory to your System PATH.
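
To confirm the installation before running any OCR, a short check like the following may help. It is a sketch that is not part of trk-mmr-tools: `parse_langs` and `installed_langs` are hypothetical helpers. It relies only on the `tesseract --list-langs` command, which prints the installed language codes.

```python
import shutil
import subprocess
from typing import Optional, Set


def parse_langs(output: str) -> Set[str]:
    """Parse the output of `tesseract --list-langs` into a set of language codes."""
    langs: Set[str] = set()
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages ..." header.
        if not line or line.lower().startswith("list of"):
            continue
        langs.add(line)
    return langs


def installed_langs() -> Optional[Set[str]]:
    """Return the installed language codes, or None if tesseract is not on PATH."""
    exe = shutil.which("tesseract")
    if exe is None:
        return None
    proc = subprocess.run(
        [exe, "--list-langs"], capture_output=True, text=True, check=False
    )
    # Some Tesseract versions write the list to stderr, so read both streams.
    return parse_langs(proc.stdout + "\n" + proc.stderr)


if __name__ == "__main__":
    langs = installed_langs()
    if langs is None:
        print("tesseract not found on PATH")
    elif "ara" not in langs:
        print("tesseract found, but Arabic data (ara) is missing")
    else:
        print("tesseract with Arabic data is available")
```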

-----

## Installation

```bash
# Using pip
pip install trk-mmr-tools
```

> **Note:** If your environment pins an older NumPy release (as some versions of JAX or OpenCV do), upgrade to **NumPy 1.26.0 or higher** before installing.
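
A quick way to check the NumPy requirement is sketched below. The helper names `version_tuple` and `numpy_ok` are hypothetical. The comparison is simplified: it looks only at the leading numeric parts, so a pre-release such as `1.26.0rc1` is treated as `1.26.0`.

```python
import re
from importlib import metadata
from typing import Optional, Tuple

MIN_NUMPY = (1, 26, 0)


def version_tuple(version: str) -> Tuple[int, ...]:
    """Turn '1.26.4' or '2.0.0rc1' into a tuple of its leading numeric parts."""
    match = re.match(r"(\d+)(?:\.(\d+))?(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    # Missing minor or patch components count as 0.
    return tuple(int(part or 0) for part in match.groups())


def numpy_ok(installed: Optional[str] = None) -> bool:
    """True if the given (or currently installed) NumPy is at least MIN_NUMPY."""
    if installed is None:
        try:
            installed = metadata.version("numpy")
        except metadata.PackageNotFoundError:
            return False
    return version_tuple(installed) >= MIN_NUMPY
```

Calling `numpy_ok()` with no argument inspects the active environment; passing a version string checks that string instead.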

-----

## Usage

The following example demonstrates how to process a PDF (or a folder of PDFs) using the OCR method with Arabic language support and WordBank corrections.

```python
from pathlib import Path
from trk_mmr_tools.pdf.processor import process_pdfs
from trk_mmr_tools.text.correction import TextCorrection

# Define input and output paths
pdf_input = Path("tests/sample.pdf")  # Can be a file or a directory
output_dir = Path("output")
output_dir.mkdir(exist_ok=True)

# Initialize the Arabic text corrector (loads WordBank assets)
corrector = TextCorrection()

# Process the PDFs
process_pdfs(
    source=pdf_input,
    output_dir=output_dir,
    method="ocr",       # Uses Tesseract for scanned/image-based PDFs
    lang="ara",         # Specifies Arabic language for OCR
    clean=True,         # Normalizes characters and removes noise
    corrector=corrector  # Applies the dictionary-based validator
)
```
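
To give a sense of what the `clean` and `corrector` steps are for, here is a minimal standard-library sketch of Arabic normalization followed by dictionary-based correction. It does not reproduce the package's actual rules or the `TextCorrection` internals: the functions `normalize_arabic` and `correct_words`, the choice of normalizations, and the `cutoff` value are all assumptions made for illustration.

```python
import difflib
import re
from typing import Iterable, List

# Characters to strip: diacritics U+064B..U+0652, superscript alef U+0670, tatweel U+0640.
_STRIP = re.compile("[\u064B-\u0652\u0670\u0640]")
# Map hamza-carrying and madda forms of alef to bare alef (U+0627).
_ALEF_FORMS = str.maketrans({"\u0623": "\u0627", "\u0625": "\u0627", "\u0622": "\u0627"})


def normalize_arabic(text: str) -> str:
    """Strip diacritics and tatweel, unify alef forms, and collapse whitespace."""
    text = _STRIP.sub("", text)
    text = text.translate(_ALEF_FORMS)
    return re.sub(r"\s+", " ", text).strip()


def correct_words(text: str, wordbank: Iterable[str], cutoff: float = 0.8) -> str:
    """Replace each unknown word with its closest wordbank entry, if one is close enough."""
    known = set(wordbank)
    candidates = sorted(known)
    corrected: List[str] = []
    for word in text.split():
        if word in known:
            corrected.append(word)
            continue
        match = difflib.get_close_matches(word, candidates, n=1, cutoff=cutoff)
        # Keep the original word when nothing in the wordbank is similar enough.
        corrected.append(match[0] if match else word)
    return " ".join(corrected)
```

A linear scan with `difflib` is fine for a few thousand words. A large word list would need an index, which is presumably what the package's WordBank provides.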

-----

## Project Structure

For your custom WordBank and assets to be recognized, ensure your package follows this structure:

```text
trk-mmr-tools/
├── pyproject.toml
├── README.md
└── src/
    └── trk_mmr_tools/
        ├── __init__.py
        ├── pdf/
        ├── text/
        └── assets/
            ├── bank.pkl
            └── data.xlsx
```
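
To verify that the asset files ended up inside the installed package, a check along these lines can be used. It is a sketch: `find_asset` is a hypothetical helper, and whether the package itself loads its assets through `importlib.resources` is an assumption. It requires Python 3.9 or newer.

```python
from importlib import resources


def find_asset(package: str, *parts: str):
    """Return a handle to a file inside an installed package, or None if it is absent."""
    try:
        node = resources.files(package)
    except ModuleNotFoundError:
        # The package is not installed in this environment.
        return None
    for part in parts:
        node = node.joinpath(part)
    return node if node.is_file() else None


bank = find_asset("trk_mmr_tools", "assets", "bank.pkl")
print("bank.pkl found" if bank is not None else "bank.pkl not found in the installed package")
```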

-----

## License

MIT License

## Author

**Tarek**
