OCR toolkit for Arabic PDFs and directories
TRK FTZ OCR
TRK FTZ OCR is a lightweight OCR toolkit for extracting text from PDFs and folders of PDFs.
It supports single-page extraction, full PDF processing, directory processing, parallel execution, and multiple output formats.
🚀 Features
- 📄 Extract text from a single page
- 📚 Process full PDF files
- 📁 Process directories of PDFs
- ⚡ Parallel processing for speed
- 🧠 Adaptive OCR pipeline (PSM-based)
- 📊 Export results to:
  - Dictionary (Python)
  - TXT files (per page / per file)
  - Excel (.xlsx)
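The feature list does not say what "adaptive" means for the PSM-based pipeline. One plausible reading is that several Tesseract page segmentation modes are tried and the best result is kept. The sketch below illustrates that idea only: `adaptive_ocr`, the `run_ocr` callable, the candidate modes, and the length-based scoring are all assumptions, not the package's actual logic. The `--psm` flag itself is a standard Tesseract option.

```python
from typing import Callable, Iterable

# Hypothetical candidate modes: 6 = single uniform block of text,
# 4 = single column of text, 3 = fully automatic segmentation (Tesseract default).
CANDIDATE_PSMS = (6, 4, 3)


def adaptive_ocr(
    run_ocr: Callable[[str], str],
    psms: Iterable[int] = CANDIDATE_PSMS,
    min_chars: int = 20,
) -> str:
    """Try each PSM in order and return early once a result is long enough.

    run_ocr receives a Tesseract config string such as "--psm 6" and returns
    the recognized text. With pytesseract it could wrap
    pytesseract.image_to_string(image, lang="ara", config=config).
    """
    best = ""
    for psm in psms:
        text = run_ocr(f"--psm {psm}").strip()
        if len(text) >= min_chars:
            return text  # good enough, skip the remaining modes
        if len(text) > len(best):
            best = text  # remember the longest result seen so far
    return best
```

Passing the OCR call in as a function keeps the fallback logic testable without Tesseract installed.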
📦 Installation
pip install trk-ftz-ocr
🖥️ CLI Usage
After installation, use:
trk-ftz-ocr --help
📄 Process a single PDF
Return text (default)
trk-ftz-ocr file.pdf --mode dict
Save to a folder (per-page TXT files)
trk-ftz-ocr file.pdf --mode folder --output output/
Save as a single Excel file
trk-ftz-ocr file.pdf --mode excel --output output.xlsx
📁 Process a directory
trk-ftz-ocr pdfs/ --mode folder --output output/
This creates:
output/
    file1/
        file1_page_1.txt
        file1_page_2.txt
    file2/
        file2_page_1.txt
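The layout above follows the pattern `<output>/<pdf name>/<pdf name>_page_<n>.txt`, with pages numbered from 1. As a minimal sketch of that naming rule (the helper `page_txt_path` is hypothetical and not part of the package), the path for any page could be computed like this:

```python
from pathlib import Path


def page_txt_path(output_dir: str, pdf_path: str, page_number: int) -> Path:
    """Build <output_dir>/<stem>/<stem>_page_<n>.txt for one page of one PDF."""
    stem = Path(pdf_path).stem  # "pdfs/file1.pdf" -> "file1"
    return Path(output_dir) / stem / f"{stem}_page_{page_number}.txt"


# Example: where page 2 of pdfs/file1.pdf would be written.
print(page_txt_path("output/", "pdfs/file1.pdf", 2).as_posix())
```

This is handy for locating the text of a given page after a directory run.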
⚙️ Options
| Option | Description |
|---|---|
| --page | Extract a single page |
| --output | Output path (file or folder) |
| --lang | OCR language (default: ara) |
| --zoom | Render quality (default: 6) |
| --parallel | Enable parallel processing |
| --mode | Output mode: dict, file, folder, excel |
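To give a feel for what `--zoom` costs, here is a rough calculation. It assumes the value is used as a PyMuPDF scale factor (for example `fitz.Matrix(zoom, zoom)`), which the README does not confirm. PDF pages are measured in points of 1/72 inch, so a scale factor maps directly to a DPI value and to an approximate image size. Both function names are hypothetical.

```python
PDF_POINTS_PER_INCH = 72  # PDF user space: 1 point = 1/72 inch


def zoom_to_dpi(zoom: float) -> float:
    """Effective render resolution if zoom is a plain scale factor."""
    return zoom * PDF_POINTS_PER_INCH


def rendered_size(width_pt: float, height_pt: float, zoom: float) -> tuple[int, int]:
    """Approximate pixel size of a page rendered at the given zoom."""
    return round(width_pt * zoom), round(height_pt * zoom)


# An A4 page is about 595 x 842 points.
print(zoom_to_dpi(6))              # 432
print(rendered_size(595, 842, 6))  # (3570, 5052)
```

Under this assumption the default of 6 corresponds to roughly 432 DPI, so lowering `--zoom` may be worth trying when speed or memory matters more than accuracy on small print.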
📊 Output Modes
dict
Returns a Python dictionary:
{page_number: text}
file
Saves all text into a single .txt file.
folder
Saves each page as a separate .txt file.
excel
Saves results as a table with the columns:
file | page | text
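The other modes can all be derived from the `{page_number: text}` dictionary. The sketch below shows one way to do that by hand; `to_single_text` and `to_rows` are hypothetical helpers, and the blank-line separator between pages is an assumption, not necessarily what the `file` mode writes.

```python
def to_single_text(pages: dict[int, str]) -> str:
    """Join page texts in page order, separated by a blank line."""
    return "\n\n".join(pages[n] for n in sorted(pages))


def to_rows(file_name: str, pages: dict[int, str]) -> list[dict[str, object]]:
    """Build file | page | text rows, one per page, in page order."""
    return [
        {"file": file_name, "page": n, "text": pages[n]}
        for n in sorted(pages)
    ]


pages = {2: "second page", 1: "first page"}
print(to_single_text(pages))
print(to_rows("file.pdf", pages))
```

Rows in this shape can be handed to `pandas.DataFrame(rows).to_excel("output.xlsx", index=False)` to produce a spreadsheet with the same three columns.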
🧠 Example (Python API)
from trk_ftz_ocr.pipeline import process
result = process(
    path="file.pdf",
    output_path="output/",
    mode="dict",
    parallel=True
)
print(result)
⚡ Performance
- Parallel page processing (ThreadPool)
- Parallel directory processing
- Lightweight preprocessing pipeline
- Adaptive OCR configuration
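As an illustration of the thread-pool approach to page-level parallelism, here is a minimal sketch using the standard library. It is not the package's implementation: `ocr_pages_parallel`, the `ocr_page` callable, and the worker count are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def ocr_pages_parallel(
    ocr_page: Callable[[int], str],
    page_count: int,
    max_workers: int = 4,
) -> dict[int, str]:
    """Run ocr_page(n) for pages 1..page_count on a thread pool.

    Returns {page_number: text}, matching the dict output mode.
    """
    page_numbers = range(1, page_count + 1)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, whatever the finish order.
        texts = pool.map(ocr_page, page_numbers)
        return dict(zip(page_numbers, texts))


# Usage with a stand-in for the real per-page OCR call.
print(ocr_pages_parallel(lambda n: f"text of page {n}", page_count=3))
```

Threads are a reasonable fit here because pytesseract runs the Tesseract binary as a subprocess, so worker threads spend most of their time waiting rather than competing for the interpreter.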
📌 Requirements
- Python >= 3.12
- Tesseract OCR installed on system
🛠️ Dependencies
- PyMuPDF
- pytesseract
- pillow
- tqdm
- pandas
- openpyxl
📄 License
MIT License
File details
Details for the file trk_ftz_ocr-0.1.0.tar.gz.
File metadata
- Download URL: trk_ftz_ocr-0.1.0.tar.gz
- Upload date:
- Size: 6.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.11.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 850339f8eed3e30ea89efca2b94b671de8375a549d09ddb563e6cb8b4f739e50 |
| MD5 | 7711e675c253649e89f623494664f0fd |
| BLAKE2b-256 | db1216d8e7acfcc088bc80b5db6aafb48bcc2155f2529cfed8a8620f80cb89b3 |
File details
Details for the file trk_ftz_ocr-0.1.0-py3-none-any.whl.
File metadata
- Download URL: trk_ftz_ocr-0.1.0-py3-none-any.whl
- Upload date:
- Size: 10.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.11.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ede7ccecff615acea65b19b101b394861e537420dca7114146afdf5cdb75e4d9 |
| MD5 | b403f33335cbeaabe0ebde22cf27030d |
| BLAKE2b-256 | a42235c95a5f0a7808adf991d95f0fdc6678ce0ab74eb34d1bfb6a3fd2db57cc |