Warraqa (ورّاقة) — Document Scribe Agent. Converts PDFs and Word/PowerPoint files to clean Markdown with self-scoring.
Warraqa (ورّاقة)
The Document Scribe Agent
Named after the Warrāqūn — the master scribes and paper-makers of the Islamic Golden Age.
Warraqa converts PDF, Word, and PowerPoint documents into clean, accurate Markdown — and scores her own work.
Why Warraqa?
Most document-to-Markdown tools are one-trick ponies: great at clean PDFs, terrible at scans; great at .docx, blind to .doc; or they silently produce garbage and let you discover it three pipelines later.
Warraqa is a specialist agent. She picks the right engine for each file, falls back gracefully, scores her output from 0–100 with letter grades, and tells you which conversions to trust. She's built to feed RAG pipelines, knowledge bases, and downstream agents — where Markdown quality directly determines retrieval quality.
Features
- Dual-engine architecture — best specialized tool for each format
- Marker (deep learning) for scanned PDFs: tables, equations, multi-column, OCR
- PyMuPDF4LLM (fast, CPU-only) for native-text PDFs
- MarkItDown (Microsoft) for .docx and .pptx
- MS Office COM auto-converts legacy .doc and .ppt to modern formats first
- Pandoc fallback for .docx resilience
- Smart triage — every PDF is pre-scanned to detect native vs. scanned content; routing is automatic
- Two-phase batch processing — fast files (native PDFs, Word, PowerPoint) run first; slow OCR work is deferred to a single trailing pass so you don't wait on Marker mid-batch
- Quality scoring — every conversion gets a 0–100 confidence score with an A–F grade across 5 dimensions (completeness, structure, encoding, density, readability)
- Crash-resistant — sanitizes invalid Unicode from upstream engines so a single bad PDF can't kill a 1000-file run
- Folder workflow: input → convert → output, with originals moved to processed/ or failed/
- Watch mode: continuous monitoring for new files
- Inter-agent API — designed for other agents to call programmatically
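The triage and two-phase ordering described above can be sketched roughly as follows. This is not Warraqa's actual implementation: the function names, the 50-characters-per-page threshold, and the idea of judging a PDF by the amount of extractable text per page are all assumptions made for illustration.

```python
from __future__ import annotations

from pathlib import Path

# Assumed threshold: PDFs averaging fewer extractable characters per page
# than this are treated as scanned. The real heuristic is not documented here.
MIN_CHARS_PER_PAGE = 50


def is_scanned(page_texts: list[str], min_chars: int = MIN_CHARS_PER_PAGE) -> bool:
    """Guess whether a PDF is scanned from the text extracted from each page."""
    if not page_texts:
        return True  # nothing extractable at all: send it to OCR
    average = sum(len(text.strip()) for text in page_texts) / len(page_texts)
    return average < min_chars


def plan_batch(files: dict[str, list[str]]) -> tuple[list[str], list[str]]:
    """Split a batch into (phase 1: fast, phase 2: OCR), keeping input order.

    `files` maps a filename to its per-page extracted text. Non-PDF files
    are never inspected, so an empty list is fine for them.
    """
    fast, slow = [], []
    for name, page_texts in files.items():
        if Path(name).suffix.lower() == ".pdf" and is_scanned(page_texts):
            slow.append(name)  # deferred to the trailing OCR pass
        else:
            fast.append(name)  # native PDFs, Word, PowerPoint
    return fast, slow


batch = {
    "report.pdf": ["Quarterly results " * 20, "Revenue grew " * 30],
    "scan.pdf": ["", "  "],
    "notes.docx": [],
}
fast, slow = plan_batch(batch)
```

With this input, `fast` holds `report.pdf` and `notes.docx`, and `slow` holds only `scan.pdf`, so the OCR work runs once at the end of the batch.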
Quick Start
Option 1 — pip (recommended)
pip install warraqa
warraqa --folder "C:\path\to\documents"
You still need Pandoc on PATH for .docx fallback, and MS Office (Windows) for legacy .doc/.ppt. The Marker engine downloads its ML models on first use (~2–3 GB).
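Before a long run it may be worth checking these prerequisites up front. The sketch below uses only the Python standard library; the `preflight` function and its result keys are hypothetical and not part of Warraqa. Being on Windows is necessary for the MS Office COM path but does not prove Office is installed.

```python
import shutil
import sys


def preflight() -> dict:
    """Report which optional prerequisites appear to be available."""
    return {
        # Pandoc is only found if its executable is on PATH.
        "pandoc_on_path": shutil.which("pandoc") is not None,
        # Legacy .doc/.ppt conversion needs Windows (and MS Office on top).
        "windows_host": sys.platform == "win32",
    }


status = preflight()
for name, available in status.items():
    print(f"{name}: {'yes' if available else 'no'}")
```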
Option 2 — Clone + bootstrap script
git clone https://github.com/AALAM-Studio/warraqa.git
cd warraqa
python bootstrap.py # creates .venv, installs deps, auto-installs Pandoc on Windows
.venv\Scripts\activate # Linux/macOS: source .venv/bin/activate
python run.py
Option 3 — Docker (for cloud / headless use)
docker build -t warraqa .
docker run --rm -v "/path/to/docs:/data" warraqa --folder /data
Note: the Docker image is CPU-only and does not include MS Office, so legacy .doc/.ppt will be skipped with a clean error message.
Usage
warraqa # Manual mode — opens a folder picker dialog
warraqa --folder "C:\path" # Process a specific folder
warraqa --file path/to/document.pdf # Convert a single file
warraqa --watch --folder "C:\path" # Watch mode — continuously monitor
warraqa --folder "C:\path" --no-save --no-move # Dry run
warraqa --help # All options
Output Structure
output/
├── md_files/ # Converted Markdown files
├── processed/ # Successfully converted originals
├── failed/ # Failed conversion originals
├── reports/ # JSON reports with scores and metadata
├── scanned_pdfs/ # Staging area for OCR-bound PDFs (auto-cleaned per run)
└── warraqa.log
Quality Scoring
Every conversion is scored across 5 weighted dimensions:
| Dimension | Weight | What It Measures |
|---|---|---|
| Text Completeness | 30% | Word count vs. expected density for file size |
| Structure Integrity | 25% | Headings, lists, tables, formatting |
| Encoding Quality | 20% | Garbled text, mojibake, Unicode issues |
| Content Density | 15% | Meaningful text vs. noise |
| Readability | 10% | Line length, paragraph structure |
Grades: A (90–100) → B (75–89) → C (60–74) → D (40–59) → F (0–39).
Files scoring below 40 are moved to output/failed/ automatically.
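As a worked example of how the weights and grade bands above could combine, here is a minimal sketch. It assumes each dimension is itself scored 0–100 and that the overall score is a rounded weighted average; the dictionary keys and function names are illustrative, and how Warraqa computes each dimension internally is not shown here.

```python
# Weights from the table above, as integer percentages.
WEIGHTS = {
    "completeness": 30,
    "structure": 25,
    "encoding": 20,
    "density": 15,
    "readability": 10,
}


def confidence_score(dimensions: dict) -> int:
    """Weighted average of per-dimension scores (each assumed to be 0-100)."""
    total = sum(WEIGHTS[name] * dimensions[name] for name in WEIGHTS)
    return round(total / 100)


def grade(score: int) -> str:
    """Map a 0-100 score onto the A-F bands listed above."""
    if score >= 90:
        return "A"
    if score >= 75:
        return "B"
    if score >= 60:
        return "C"
    if score >= 40:
        return "D"
    return "F"


example = {
    "completeness": 90,
    "structure": 85,
    "encoding": 95,
    "density": 80,
    "readability": 70,
}
score = confidence_score(example)  # (2700 + 2125 + 1900 + 1200 + 700) / 100 = 86.25 -> 86
print(score, grade(score))         # 86 B
```

Under these assumptions, a file graded F is exactly one that scores below 40, which matches the rule for moving files to output/failed/.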
Inter-Agent API
from warraqa import Warraqa
agent = Warraqa()
# Convert a single file
result = agent.convert_file("document.pdf")
print(result.confidence_score) # 87
print(result.grade) # "B"
print(result.markdown_content) # "# Title\n\n..."
print(result.output_path) # Path to saved .md file
# Process a folder
results = agent.process_folder("C:/Users/you/Academia")
for r in results:
    print(f"{r.source_file.filename}: {r.grade} ({r.confidence_score}/100)")
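A downstream agent will often want to act only on conversions it can trust. The sketch below shows one way to gate on the score. It relies only on the `grade` and `confidence_score` attributes shown in the example above; `StubResult`, `split_by_trust`, and the cut-off of 75 (the bottom of the B band) are assumptions for illustration, with the stub standing in for real results so the snippet runs without Warraqa installed.

```python
from dataclasses import dataclass


@dataclass
class StubResult:
    """Hypothetical stand-in exposing two fields the example above reads."""
    grade: str
    confidence_score: int


def split_by_trust(results, min_score=75):
    """Separate results into (trusted, needs_review) by confidence score."""
    trusted = [r for r in results if r.confidence_score >= min_score]
    needs_review = [r for r in results if r.confidence_score < min_score]
    return trusted, needs_review


results = [StubResult("A", 93), StubResult("C", 61), StubResult("B", 75)]
trusted, needs_review = split_by_trust(results)
print(len(trusted), "trusted,", len(needs_review), "for review")  # 2 trusted, 1 for review
```

The threshold is a policy choice: a RAG pipeline might index only the trusted list and queue the rest for manual inspection.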
Configuration
Edit config.yaml to customize:
- Default mode (manual / watch)
- Output directories
- Engine preferences (primary / fallback per format)
- Scoring thresholds
- Logging level
Supported Formats
| Extension | Engine | Notes |
|---|---|---|
| .pdf (native text) | PyMuPDF4LLM | Fast, CPU-only |
| .pdf (scanned) | Marker | Deferred to Phase 2 OCR pass |
| .docx | MarkItDown → Pandoc | - |
| .doc | MS Office COM → MarkItDown | Windows + Office required |
| .pptx | MarkItDown | - |
| .ppt | MS Office COM → MarkItDown | Windows + Office required |
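The routing in this table can be expressed as a simple lookup. The following is a sketch, not Warraqa's internal code: `ENGINE_CHAINS` and `engine_chain` are hypothetical names, and the engines are represented as plain strings. Note that the arrow means two different things in the table: for .docx it is a fallback (Pandoc if MarkItDown fails), while for .doc and .ppt it is a pipeline (Office converts first, then MarkItDown runs).

```python
from pathlib import Path

# Ordered engine steps per extension, taken from the table above.
ENGINE_CHAINS = {
    ".docx": ["MarkItDown", "Pandoc"],        # primary, then fallback
    ".doc": ["MS Office COM", "MarkItDown"],  # pre-convert, then convert
    ".pptx": ["MarkItDown"],
    ".ppt": ["MS Office COM", "MarkItDown"],  # pre-convert, then convert
}


def engine_chain(filename: str, scanned: bool = False) -> list:
    """Return the ordered engine steps for a file, or raise if unsupported."""
    ext = Path(filename).suffix.lower()
    if ext == ".pdf":
        # PDFs are routed by triage result rather than by extension alone.
        return ["Marker"] if scanned else ["PyMuPDF4LLM"]
    if ext not in ENGINE_CHAINS:
        raise ValueError(f"unsupported extension: {ext!r}")
    return list(ENGINE_CHAINS[ext])


print(engine_chain("thesis.pdf"))                # ['PyMuPDF4LLM']
print(engine_chain("archive.PDF", scanned=True)) # ['Marker']
print(engine_chain("slides.ppt"))                # ['MS Office COM', 'MarkItDown']
```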
License
Warraqa is published under the PolyForm Noncommercial License 1.0.0 — a source-available license that allows free use for:
- Personal projects, research, study, and experimentation
- Academic and educational institutions
- Charitable, public-safety, health, and government organizations
- Internal evaluation by any organization
Commercial use — including using Warraqa as part of a product or service offered to paying customers, internal business operations at a for-profit company, or any revenue-generating workflow — requires a separate commercial license. Contact contact@aalam.consulting to discuss licensing.
Note on terminology: PolyForm Noncommercial is source-available, not open source in the OSI sense (which by definition allows commercial use). The full text is in LICENSE.
Versioning Policy
This repository contains Warraqa v1.0.0 — the inaugural public, source-available release. Future versions of Warraqa are developed privately and available under commercial license terms. Critical bug fixes may be backported to v1.x at AALAM Studio's discretion.
See CHANGELOG.md for the release history.
Citation
If Warraqa contributes to academic research, please cite it. A machine-readable CITATION.cff is provided, or use the GitHub "Cite this repository" button.
Acknowledgements
Warraqa stands on the shoulders of excellent open-source projects:
- Marker — Vik Paruchuri's deep-learning PDF parser
- PyMuPDF4LLM — Artifex's LLM-optimized PDF extraction
- MarkItDown — Microsoft's universal-to-markdown converter
- Pandoc — John MacFarlane's document conversion swiss-army knife
- Rich — Will McGugan's terminal beautifier
Part of AALAM Studio
Warraqa is the first publicly released agent in the AALAM Studio ecosystem. Other agents access her output at a predictable path:
WARRAQA_OUTPUT = "c:/projects/aalam-studio/warraqa/output/"
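A consuming agent could pick up converted files from that location along these lines. This is a sketch under assumptions: it takes the md_files/ subfolder from the Output Structure section, assumes the Markdown files sit directly inside it, and `list_markdown` is a hypothetical helper rather than part of the Warraqa API.

```python
from pathlib import Path


def list_markdown(output_root) -> list:
    """Return the .md files under <output_root>/md_files, sorted by name."""
    md_dir = Path(output_root) / "md_files"
    if not md_dir.is_dir():
        return []  # nothing converted yet, or a different layout
    return sorted(md_dir.glob("*.md"))


WARRAQA_OUTPUT = "c:/projects/aalam-studio/warraqa/output/"
for path in list_markdown(WARRAQA_OUTPUT):
    print(path.name)
```

Returning an empty list when the folder is missing lets callers poll the path without special-casing a first run.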
She reads. She transcribes. She scores her own work.
Built with care by AALAM Studio.