OCR toolkit for Arabic PDFs and directories of PDFs

TRK FTZ OCR

TRK FTZ OCR is a lightweight OCR toolkit for extracting text from PDFs and folders of PDFs.
It supports single-page extraction, full PDF processing, directory processing, parallel execution, and multiple output formats.


🚀 Features

  • 📄 Extract text from a single page
  • 📚 Process full PDF files
  • 📁 Process directories of PDFs
  • ⚡ Parallel processing for speed
  • 🧠 Adaptive OCR pipeline (based on Tesseract page segmentation modes, PSM)
  • 📊 Export results to:
    • Dictionary (Python)
    • TXT files (per page / per file)
    • Excel (.xlsx)

📦 Installation

pip install trk-ftz-ocr

🖥️ CLI Usage

After installation, use:

trk-ftz-ocr --help

📄 Process a single PDF

Return text (default)

trk-ftz-ocr file.pdf --mode dict

Save to a folder (per-page TXT files)

trk-ftz-ocr file.pdf --mode folder --output output/

Save as a single Excel file

trk-ftz-ocr file.pdf --mode excel --output output.xlsx

📁 Process a directory

trk-ftz-ocr pdfs/ --mode folder --output output/

This creates:

output/
  file1/
    file1_page_1.txt
    file1_page_2.txt
  file2/
    file2_page_1.txt

⚙️ Options

Option      Description
--page      Extract a single page
--output    Output path (file or folder)
--lang      OCR language (default: ara)
--zoom      Render quality (default: 6)
--parallel  Enable parallel processing
--mode      Output mode: dict, file, folder, excel
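
As a rough guide to what a render scale such as --zoom implies, the sketch below assumes (this README does not confirm it) that the value is applied as a scale factor on the PDF's native 72 points per inch, which is how PyMuPDF scale matrices behave. `effective_dpi` and `pixel_size` are hypothetical helper names used only for illustration.

```python
PDF_POINTS_PER_INCH = 72  # PDF user space uses 72 units per inch


def effective_dpi(zoom):
    """Resolution of a page rendered with the given scale factor."""
    return PDF_POINTS_PER_INCH * zoom


def pixel_size(width_pt, height_pt, zoom):
    """Pixel dimensions of a page of the given size (in points) after scaling."""
    return round(width_pt * zoom), round(height_pt * zoom)


# Assumption: --zoom 6 is used as a plain scale factor.
print(effective_dpi(6))        # 432
print(pixel_size(595, 842, 6))  # (3570, 5052) for an A4-sized page
```

Under that assumption, a higher zoom gives Tesseract more pixels per glyph at the cost of larger images and slower processing. In PyMuPDF, a scale factor like this is typically passed as `fitz.Matrix(zoom, zoom)` to `page.get_pixmap(matrix=...)`.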

📊 Output Modes

dict

Returns a Python dictionary that maps page numbers to the extracted text:

{page_number: text}
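
A dictionary of this shape is easy to post-process. The following sketch joins the pages into one string in page order; `join_pages` is a hypothetical helper and the sample data is made up for illustration.

```python
def join_pages(pages, separator="\n\n"):
    """Concatenate page texts in ascending page order."""
    return separator.join(pages[number] for number in sorted(pages))


# Made-up sample data in the {page_number: text} shape described above.
sample = {2: "second page", 1: "first page"}
print(join_pages(sample))
```

Sorting the keys matters because pages processed in parallel may not be inserted in order.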

file

Saves all text into a single .txt file.

folder

Saves each page as a separate .txt file.

excel

Saves results to an .xlsx workbook with one row per page and the columns:

file | page | text
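
To make the row layout concrete, here is a minimal sketch that flattens nested results into (file, page, text) rows. It illustrates the table shape only and is not the package's implementation; `to_rows` is a hypothetical name, and the exact header spelling in the generated workbook is not confirmed by this README.

```python
def to_rows(results):
    """Flatten {file_name: {page_number: text}} into (file, page, text) rows."""
    rows = []
    for file_name in sorted(results):
        for page in sorted(results[file_name]):
            rows.append((file_name, page, results[file_name][page]))
    return rows


# Made-up sample data for illustration.
sample = {"file2": {1: "c"}, "file1": {2: "b", 1: "a"}}
for row in to_rows(sample):
    print(row)
```

Rows like these can be written with `pandas.DataFrame(rows, columns=["file", "page", "text"]).to_excel("output.xlsx", index=False)`, which relies on openpyxl for .xlsx output.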

🧠 Example (Python API)

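from trk_ftz_ocr.pipeline import process

# OCR every page of file.pdf, processing pages in parallel.
result = process(
    path="file.pdf",
    output_path="output/",
    mode="dict",  # return the text as {page_number: text}
    parallel=True,
)

print(result)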

⚡ Performance

  • Parallel page processing (ThreadPool)
  • Parallel directory processing
  • Lightweight preprocessing pipeline
  • Adaptive OCR configuration
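
The thread-pool approach to page-level parallelism can be sketched as follows. This is a generic illustration under stated assumptions, not the package's code: `ocr_page` is a placeholder standing in for the real render-and-recognize step, and `process_pages` and `max_workers=4` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def ocr_page(page_number):
    """Placeholder for per-page OCR; a real pipeline would render and recognize the page here."""
    return f"text of page {page_number}"


def process_pages(page_numbers, max_workers=4):
    """Run ocr_page over the pages in a thread pool and return {page_number: text}."""
    page_numbers = list(page_numbers)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with page_numbers.
        texts = list(pool.map(ocr_page, page_numbers))
    return dict(zip(page_numbers, texts))


print(process_pages(range(1, 4)))
```

Threads are a reasonable fit for this kind of workload because pytesseract runs the Tesseract binary as a separate process, so the Python threads spend most of their time waiting rather than competing for the interpreter.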

📌 Requirements

  • Python >= 3.12
  • Tesseract OCR installed on the system, including the language data for the language passed to --lang (default: ara)
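
A quick way to verify these requirements before running OCR is sketched below. It is a standalone check, not something the package provides: `check_requirements` is a hypothetical helper, and it assumes the Tesseract executable is named `tesseract` and is discoverable on PATH.

```python
import shutil
import sys


def check_requirements(min_python=(3, 12), binary="tesseract"):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if sys.version_info < min_python:
        problems.append(f"Python {min_python[0]}.{min_python[1]}+ required")
    # Assumption: the OCR engine is reachable as an executable on PATH.
    if shutil.which(binary) is None:
        problems.append(f"{binary} not found on PATH")
    return problems


for problem in check_requirements():
    print(problem)
```

The check does not confirm that the Arabic language data is installed; running `tesseract --list-langs` in a shell shows which languages are available.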

🛠️ Dependencies

  • PyMuPDF
  • pytesseract
  • pillow
  • tqdm
  • pandas
  • openpyxl

📄 License

MIT License
