Warraqa (ورّاقة) — Document Scribe Agent. Converts PDFs and Word/PowerPoint files to clean Markdown with self-scoring.

Warraqa (ورّاقة)

The Document Scribe Agent

Named after the Warrāqūn — the master scribes and paper-makers of the Islamic Golden Age.

Warraqa converts PDF, Word, and PowerPoint documents into clean, accurate Markdown — and scores her own work.

Why Warraqa?

Most document-to-Markdown tools are one-trick ponies: great at clean PDFs, terrible at scans; great at .docx, blind to .doc; or they silently produce garbage and let you discover it three pipelines later.

Warraqa is a specialist agent. She picks the right engine for each file, falls back gracefully, scores her output from 0–100 with letter grades, and tells you which conversions to trust. She's built to feed RAG pipelines, knowledge bases, and downstream agents — where Markdown quality directly determines retrieval quality.

Features

- Dual-engine architecture: best specialized tool for each format
  - Marker (deep learning) for scanned PDFs: tables, equations, multi-column, OCR
  - PyMuPDF4LLM (fast, CPU-only) for native-text PDFs
  - MarkItDown (Microsoft) for .docx and .pptx
  - MS Office COM auto-converts legacy .doc and .ppt to modern formats first
  - Pandoc fallback for .docx resilience
- Smart triage: every PDF is pre-scanned to detect native vs. scanned content; routing is automatic
- Two-phase batch processing: fast files (native PDFs, Word, PowerPoint) run first; slow OCR work is deferred to a single trailing pass so you don't wait on Marker mid-batch
- Quality scoring: every conversion gets a 0–100 confidence score with an A–F grade across 5 dimensions (completeness, structure, encoding, density, readability)
- Crash-resistant: sanitizes invalid Unicode from upstream engines so a single bad PDF can't kill a 1000-file run
- Folder workflow: input → convert → output + move originals to processed/ or failed/
- Watch mode: continuous monitoring for new files
- Inter-agent API: designed for other agents to call programmatically
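
To make the triage and two-phase ideas concrete, here is a minimal sketch of how such routing could work. It is not Warraqa's actual implementation: the names `looks_scanned`, `Job`, and `plan_batch`, and the 50-characters-per-page threshold, are assumptions chosen for illustration. The per-page character counts would come from whatever PDF library performs the pre-scan.

```python
from dataclasses import dataclass
from pathlib import Path

# Hypothetical threshold: pages with fewer extractable characters than this
# are treated as image-only. The real cutoff used by Warraqa is not documented.
MIN_CHARS_PER_PAGE = 50


def looks_scanned(chars_per_page, min_chars=MIN_CHARS_PER_PAGE):
    """Classify a PDF as scanned when most pages carry almost no text layer."""
    if not chars_per_page:
        return True  # no text-layer data at all: leave it to OCR
    sparse = sum(1 for count in chars_per_page if count < min_chars)
    return sparse > len(chars_per_page) / 2


@dataclass
class Job:
    path: Path
    chars_per_page: list  # text-layer character count per page (PDFs only)


def plan_batch(jobs):
    """Split jobs into a fast first phase and a deferred OCR phase."""
    fast, deferred = [], []
    for job in jobs:
        is_pdf = job.path.suffix.lower() == ".pdf"
        if is_pdf and looks_scanned(job.chars_per_page):
            deferred.append(job)  # Phase 2: slow OCR work, run last
        else:
            fast.append(job)  # Phase 1: native PDFs, Word, PowerPoint
    return fast, deferred


jobs = [
    Job(Path("scan.pdf"), [0, 3, 12]),
    Job(Path("paper.pdf"), [1800, 2100]),
    Job(Path("notes.docx"), []),
]
fast, deferred = plan_batch(jobs)
print("Phase 1:", [job.path.name for job in fast])
print("Phase 2:", [job.path.name for job in deferred])
```

Deferring the OCR-bound files keeps the quick conversions from queuing behind a single slow scan.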

Quick Start

Step 1 — Install Python 3.11 or 3.12

Important: Warraqa supports Python 3.10 through 3.13; 3.11 or 3.12 is the recommended choice. Python 3.14 is not yet supported by some upstream dependencies (Pillow, regex), and installation will fail on it.

Windows:

1. Go to python.org/downloads and download the Python 3.12 installer (look for "Python 3.12.x" under Stable Releases).
2. Run the installer.
3. On the very first screen, check the box that says "Add python.exe to PATH". This is the most commonly missed step.
4. Click "Install Now".

Verify — open a new PowerShell window and run:

python --version

You should see Python 3.12.x. If you see an error, close and reopen PowerShell and try again.
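
If you'd rather check the version against the supported range from a script, something like the following would do. It is a small sketch, and `is_supported_python` is a made-up helper name, not part of Warraqa.

```python
import sys

# Supported range stated in Step 1: Python 3.10 through 3.13 (inclusive).
MIN_SUPPORTED = (3, 10)
MAX_SUPPORTED = (3, 13)


def is_supported_python(version=None):
    """Return True if (major, minor) lies within the documented range."""
    major_minor = tuple((version or sys.version_info)[:2])
    return MIN_SUPPORTED <= major_minor <= MAX_SUPPORTED


current = ".".join(str(part) for part in sys.version_info[:3])
verdict = "supported" if is_supported_python() else "NOT supported"
print(f"Python {current} is {verdict} by Warraqa")
```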

Step 2 — Install Pandoc (for Word document fallback)

Windows (recommended):

winget install --id JohnMacFarlane.Pandoc -e

Alternative: download the .msi installer from pandoc.org/installing.html.

Verify:

pandoc --version

Step 3 — Install Warraqa

Open PowerShell (not the Python prompt — if you see >>>, type exit() first) and run:

pip install warraqa

This downloads Warraqa and all its dependencies (~300–400 MB including PyTorch).

Verify:

warraqa --help

Step 4 — Run

warraqa --folder "C:\path\to\your\documents"

The first time you convert a scanned PDF, Marker downloads its deep-learning models (~2–3 GB). This is a one-time download; subsequent runs use the cached models.

Option B — Clone + bootstrap script

git clone https://github.com/AALAM-Studio/warraqa.git
cd warraqa
python bootstrap.py        # creates .venv, installs deps, auto-installs Pandoc on Windows
.venv\Scripts\activate     # Linux/macOS: source .venv/bin/activate
python run.py

Option C — Docker (for cloud / headless use)

docker build -t warraqa .
docker run --rm -v "/path/to/docs:/data" warraqa --folder /data

The Docker image is CPU-only and does not include MS Office, so legacy .doc/.ppt will be skipped with a clean error message.

Usage

warraqa                              # Manual mode — opens a folder picker dialog
warraqa --folder "C:\path"           # Process a specific folder
warraqa --file path/to/document.pdf  # Convert a single file
warraqa --watch --folder "C:\path"   # Watch mode — continuously monitor
warraqa --folder "C:\path" --no-save --no-move    # Dry run
warraqa --help                       # All options

Output Structure

output/
├── md_files/        # Converted Markdown files
├── processed/       # Successfully converted originals
├── failed/          # Failed conversion originals
├── reports/         # JSON reports with scores and metadata
├── scanned_pdfs/    # Staging area for OCR-bound PDFs (auto-cleaned per run)
└── warraqa.log
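
Because the layout is fixed, a downstream script can take a quick inventory of a run. The sketch below only counts files in the subfolders listed above; `summarize_output` is a hypothetical helper, and it deliberately does not parse the JSON reports, whose schema is not described here.

```python
from pathlib import Path

# Subfolder names taken from the layout above.
SUBFOLDERS = ("md_files", "processed", "failed", "reports", "scanned_pdfs")


def summarize_output(output_dir):
    """Count the regular files in each documented subfolder of output/."""
    root = Path(output_dir)
    counts = {}
    for name in SUBFOLDERS:
        folder = root / name
        if folder.is_dir():
            counts[name] = sum(1 for item in folder.iterdir() if item.is_file())
        else:
            counts[name] = 0  # folder missing, e.g. nothing has failed yet
    return counts


for name, count in summarize_output("output").items():
    print(f"{name:>13}: {count} file(s)")
```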

Quality Scoring

Every conversion is scored across 5 weighted dimensions:

Dimension	Weight	What It Measures
Text Completeness	30%	Word count vs. expected density for file size
Structure Integrity	25%	Headings, lists, tables, formatting
Encoding Quality	20%	Garbled text, mojibake, Unicode issues
Content Density	15%	Meaningful text vs. noise
Readability	10%	Line length, paragraph structure

Grades: A (90–100) → B (75–89) → C (60–74) → D (40–59) → F (0–39). Files scoring below 40 are moved to output/failed/ automatically.
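
As a worked example of the arithmetic, the sketch below combines five per-dimension scores using the weights and grade bands above. It assumes the overall score is a plain weighted average rounded to the nearest integer; how Warraqa actually aggregates and rounds is not stated here, and the function names are illustrative.

```python
# Weights in percent, from the table above (they sum to 100).
WEIGHTS = {
    "completeness": 30,
    "structure": 25,
    "encoding": 20,
    "density": 15,
    "readability": 10,
}

# Lower bound of each grade band, highest first.
GRADE_BANDS = ((90, "A"), (75, "B"), (60, "C"), (40, "D"), (0, "F"))


def overall_score(dimension_scores):
    """Combine per-dimension scores (each 0-100) into one weighted score."""
    weighted_sum = sum(WEIGHTS[name] * dimension_scores[name] for name in WEIGHTS)
    return round(weighted_sum / 100)


def grade_for(score):
    """Map a 0-100 score to its letter grade."""
    for lower_bound, letter in GRADE_BANDS:
        if score >= lower_bound:
            return letter
    raise ValueError(f"score must be between 0 and 100, got {score}")


example = {
    "completeness": 90,
    "structure": 80,
    "encoding": 95,
    "density": 85,
    "readability": 70,
}
score = overall_score(example)  # (2700 + 2000 + 1900 + 1275 + 700) / 100 = 85.75
print(score, grade_for(score))
```

Under these assumptions the example lands at 86, a B; a file would need to drop below 40 overall to be routed to output/failed/.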

Inter-Agent API

from warraqa import Warraqa

agent = Warraqa()

# Convert a single file
result = agent.convert_file("document.pdf")
print(result.confidence_score)    # 87
print(result.grade)               # "B"
print(result.markdown_content)    # "# Title\n\n..."
print(result.output_path)         # Path to saved .md file

# Process a folder
results = agent.process_folder("C:/Users/you/Academia")
for r in results:
    print(f"{r.source_file.filename}: {r.grade} ({r.confidence_score}/100)")
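
A calling agent will usually want to act on the grade rather than just print it. The sketch below partitions results into those safe to index and those worth a manual look. The A/B cutoff and the `split_by_trust` helper are illustrative choices, and the sample objects are stand-ins that carry only the documented `grade` and `confidence_score` attributes so the example runs without Warraqa installed.

```python
from types import SimpleNamespace

# Grades treated as trustworthy enough for indexing. This cutoff is an
# illustrative choice, not a Warraqa default.
TRUSTED_GRADES = {"A", "B"}


def split_by_trust(results, trusted_grades=TRUSTED_GRADES):
    """Partition conversion results into (trusted, needs_review) lists."""
    trusted, needs_review = [], []
    for result in results:
        target = trusted if result.grade in trusted_grades else needs_review
        target.append(result)
    return trusted, needs_review


# Stand-ins for real results. In practice, pass the list returned by
# agent.process_folder(...) instead.
sample = [
    SimpleNamespace(grade="A", confidence_score=94),
    SimpleNamespace(grade="C", confidence_score=66),
    SimpleNamespace(grade="B", confidence_score=81),
]
trusted, needs_review = split_by_trust(sample)
print(len(trusted), "trusted;", len(needs_review), "need review")
```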

Configuration

Edit config.yaml to customize:

Default mode (manual / watch)
Output directories
Engine preferences (primary / fallback per format)
Scoring thresholds
Logging level

Supported Formats

Extension	Engine	Notes
`.pdf` (native text)	PyMuPDF4LLM	Fast, CPU-only
`.pdf` (scanned)	Marker	Deferred to Phase 2 OCR pass
`.docx`	MarkItDown → Pandoc	—
`.doc`	MS Office COM → MarkItDown	Windows + Office required
`.pptx`	MarkItDown	—
`.ppt`	MS Office COM → MarkItDown	Windows + Office required
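
The table can be read as a lookup from extension to an ordered engine chain. The sketch below encodes it that way; it mirrors the table only and is not Warraqa's internal routing code. The `engines_for` function and the `scanned` flag are assumptions for illustration.

```python
from pathlib import Path

# Engine chains copied from the table above, in the order they are applied.
# For .docx the second entry is a fallback; for .doc/.ppt the first entry is
# a conversion step whose result is then handed to the second.
ENGINE_CHAINS = {
    ".docx": ["MarkItDown", "Pandoc"],
    ".doc": ["MS Office COM", "MarkItDown"],
    ".pptx": ["MarkItDown"],
    ".ppt": ["MS Office COM", "MarkItDown"],
}


def engines_for(filename, scanned=False):
    """Return the engine chain for a file, based on its extension."""
    suffix = Path(filename).suffix.lower()
    if suffix == ".pdf":
        # PDFs are routed by content type rather than by extension alone.
        return ["Marker"] if scanned else ["PyMuPDF4LLM"]
    if suffix not in ENGINE_CHAINS:
        raise ValueError(f"unsupported format: {suffix or filename}")
    return list(ENGINE_CHAINS[suffix])


print(engines_for("Thesis.DOCX"))
print(engines_for("scan.pdf", scanned=True))
```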

Troubleshooting

`pip install warraqa` fails with "Failed building wheel for Pillow" or "Microsoft Visual C++ required"

Cause: You are running Python 3.14. Warraqa's OCR engine requires Pillow 10.x, which has no pre-built Windows package for Python 3.14.

Fix: Install Python 3.12 from python.org/downloads. You can have multiple Python versions installed. Then run:

py -3.12 -m pip install warraqa

`pip install warraqa` gives `SyntaxError: invalid syntax`

Cause: You typed pip install warraqa inside the Python REPL (the >>> prompt). pip is a terminal command, not a Python command.

Fix: Type exit() to leave Python, then run pip install warraqa in PowerShell.

`warraqa` is not recognized after install

Cause: Either the install failed, or Python's Scripts folder is not on your PATH.

Check if installed:

pip show warraqa
# If it shows version info, the scripts folder isn't on PATH — run via:
python -m warraqa --help

Permanent PATH fix: search Windows for "Edit the system environment variables" → Environment Variables → User Path → add C:\Users\<YourName>\AppData\Local\Programs\Python\Python312\Scripts.
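
The exact Scripts folder depends on how Python was installed, so it can help to ask the interpreter instead of guessing the path. The sketch below uses the standard library's `sysconfig.get_path("scripts")`; `scripts_dir_on_path` is a hypothetical helper name. Note that packages installed with `pip install --user` place their scripts in a different, per-user folder that this check does not cover.

```python
import os
import sysconfig


def scripts_dir_on_path(path_value=None, scripts_dir=None):
    """Report whether the interpreter's scripts folder appears on PATH."""
    scripts_dir = scripts_dir or sysconfig.get_path("scripts")
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    wanted = os.path.normcase(os.path.normpath(scripts_dir))
    entries = [
        os.path.normcase(os.path.normpath(entry))
        for entry in path_value.split(os.pathsep)
        if entry
    ]
    return wanted in entries


print("Scripts folder:", sysconfig.get_path("scripts"))
print("On PATH:", scripts_dir_on_path())
```

Run it with the same interpreter you used for the install, for example `py -3.12 check_path.py` if you saved it under that (arbitrary) file name.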

`pandoc` is not recognized

Reinstall via winget install --id JohnMacFarlane.Pandoc -e and open a new PowerShell window. Warraqa still converts .docx without Pandoc — it just loses the Pandoc fallback if MarkItDown fails.
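
To confirm from Python whether Pandoc is visible to the current process, a check along these lines is enough. It relies only on the standard library's `shutil.which`; `find_tool` is an illustrative wrapper, not a Warraqa function.

```python
import shutil


def find_tool(name):
    """Return the full path of an executable on PATH, or None if absent."""
    return shutil.which(name)


pandoc_path = find_tool("pandoc")
if pandoc_path:
    print(f"Pandoc found at {pandoc_path}")
else:
    print("Pandoc not found; .docx conversion would rely on MarkItDown alone")
```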

Scanned PDF conversion is very slow (hours per file)

Cause: pip install warraqa installs the CPU-only version of PyTorch. If you have an NVIDIA GPU, Marker will ignore it and run entirely on your CPU — which can take 30–60 minutes per file instead of 2–4 minutes.

Warraqa will print a warning at startup if it detects this situation.

Fix — enable GPU acceleration:

Check your CUDA version:
```
nvidia-smi
```
Look for "CUDA Version: XX.X" in the top-right of the output.

Uninstall the CPU torch:

pip uninstall torch torchvision torchaudio -y

Install the CUDA-enabled torch (use the line matching your CUDA version):

# CUDA 12.1 or newer (most common on modern drivers)
pip install torch --index-url https://download.pytorch.org/whl/cu121

# CUDA 11.8
pip install torch --index-url https://download.pytorch.org/whl/cu118

Verify GPU is detected:

python -c "import torch; print(torch.cuda.get_device_name(0))"

After this, Warraqa will show your GPU name at startup and Marker will use it automatically. A file that took 1 hour on CPU typically takes 2–5 minutes on a modern GPU.
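
If the one-line check above raises an error, a slightly more forgiving version can tell you which situation you are in. This is a sketch using standard PyTorch attributes (`torch.cuda.is_available()`, `torch.version.cuda`); the function name and the wording of the messages are made up for illustration.

```python
import importlib.util


def describe_torch_device():
    """Return a one-line description of the device PyTorch would use."""
    if importlib.util.find_spec("torch") is None:
        return "PyTorch is not installed in this environment"
    import torch

    if torch.cuda.is_available():
        return f"CUDA GPU: {torch.cuda.get_device_name(0)}"
    # torch.version.cuda is None for CPU-only builds of PyTorch.
    if torch.version.cuda is None:
        return "CPU-only PyTorch build (reinstall from a CUDA wheel index)"
    return "CUDA-enabled PyTorch build, but no usable GPU or driver was found"


print(describe_torch_device())
```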

First scanned-PDF conversion still slow after GPU fix? That's normal for the very first run — Marker downloads ~2–3 GB of model weights once. After that, models are cached and startup is fast.

Legacy `.doc` / `.ppt` files are skipped

These require Microsoft Word / PowerPoint (Windows only). If you see a "COM not available" warning, install Microsoft Office or convert the files to .docx/.pptx format first.

License

Warraqa is published under the PolyForm Noncommercial License 1.0.0 — a source-available license that allows free use for:

Personal projects, research, study, and experimentation
Academic and educational institutions
Charitable, public-safety, health, and government organizations
Internal evaluation by any organization

Commercial use — including using Warraqa as part of a product or service offered to paying customers, internal business operations at a for-profit company, or any revenue-generating workflow — requires a separate commercial license. Contact contact@aalam.consulting to discuss licensing.

Note on terminology: PolyForm Noncommercial is source-available, not open source in the OSI sense (which by definition allows commercial use). The full text is in LICENSE.

Versioning Policy

This repository contains the public 1.x line of Warraqa. Future major versions are developed privately and available under commercial license terms. Critical bug fixes are backported to 1.x at AALAM Studio's discretion.

See CHANGELOG.md for the release history.

Citation

If Warraqa contributes to academic research, please cite it. A machine-readable CITATION.cff is provided, or use the GitHub "Cite this repository" button.

Acknowledgements

Warraqa stands on the shoulders of excellent open-source projects:

Marker — Vik Paruchuri's deep-learning PDF parser
PyMuPDF4LLM — Artifex's LLM-optimized PDF extraction
MarkItDown — Microsoft's universal-to-markdown converter
Pandoc — John MacFarlane's document conversion swiss-army knife
Rich — Will McGugan's terminal beautifier

Part of AALAM Studio

Warraqa is the first publicly released agent in the AALAM Studio ecosystem. Other agents access her output at a predictable path:

WARRAQA_OUTPUT = "c:/projects/aalam-studio/warraqa/output/"

She reads. She transcribes. She scores her own work.

Built with care by AALAM Studio.

Release history

Version	Released
1.0.3 (this version)	May 19, 2026
1.0.2	May 18, 2026
1.0.1	May 18, 2026
1.0.0	May 17, 2026

File details

Details for the file warraqa-1.0.3.tar.gz.

File metadata

Download URL: warraqa-1.0.3.tar.gz
Upload date: May 19, 2026
Size: 39.5 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for warraqa-1.0.3.tar.gz
Algorithm	Hash digest
SHA256	`0a97e595623fad5ad5cd2af6982e3d11da154f98f591461b98a3a3f546edcf9e`
MD5	`0e69a19430fadb2cc1e8e56087bf37fa`
BLAKE2b-256	`1c8532b9659ea3d849a3041c659863c348ff9a46f799a79962a6c842f7e520b9`

Provenance

The following attestation bundles were made for warraqa-1.0.3.tar.gz:

Publisher: publish.yml on aalthaqafi-ai/warraqa-pub

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: warraqa-1.0.3.tar.gz
- Subject digest: 0a97e595623fad5ad5cd2af6982e3d11da154f98f591461b98a3a3f546edcf9e
- Sigstore transparency entry: 1574622268
- Sigstore integration time: May 19, 2026
Source repository:
- Permalink: aalthaqafi-ai/warraqa-pub@753a02a870c5d8e3a082d1fadd6b5c26b58b7945
- Branch / Tag: refs/tags/v1.0.3
- Owner: https://github.com/aalthaqafi-ai
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@753a02a870c5d8e3a082d1fadd6b5c26b58b7945
- Trigger Event: release

File details

Details for the file warraqa-1.0.3-py3-none-any.whl.

File metadata

Download URL: warraqa-1.0.3-py3-none-any.whl
Upload date: May 19, 2026
Size: 43.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for warraqa-1.0.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0452277f0ac01bfb98d9b87969b5beafb080ed3486e0a61c93c9455be4c7b608`
MD5	`c9495c0b8ce520c9818ce6472d994bda`
BLAKE2b-256	`61d54824ba6e0d088c858ba22c9a7eb93abbc859af4ce2b2ba9e39f1e0160f8a`

Provenance

The following attestation bundles were made for warraqa-1.0.3-py3-none-any.whl:

Publisher: publish.yml on aalthaqafi-ai/warraqa-pub

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: warraqa-1.0.3-py3-none-any.whl
- Subject digest: 0452277f0ac01bfb98d9b87969b5beafb080ed3486e0a61c93c9455be4c7b608
- Sigstore transparency entry: 1574622390
- Sigstore integration time: May 19, 2026
Source repository:
- Permalink: aalthaqafi-ai/warraqa-pub@753a02a870c5d8e3a082d1fadd6b5c26b58b7945
- Branch / Tag: refs/tags/v1.0.3
- Owner: https://github.com/aalthaqafi-ai
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@753a02a870c5d8e3a082d1fadd6b5c26b58b7945
- Trigger Event: release
