The AI bureaucrat that fills forms so you don't have to (the `pencilpusher` Python package)
pencilpusher
The AI bureaucrat that fills forms so you don't have to.
Install with `pip install paperpusher`. The `pencilpusher` name was taken on PyPI by an unrelated project, so the distribution ships as `paperpusher` while the CLI and import remain `pencilpusher`.
Drop your documents into a folder. pencilpusher reads them and builds a personal + company knowledge wiki. When you drop a form into the inbox, it fills it. PDFs, Word docs, government forms, company registrations. No re-typing, no re-extracting.
Built on Karpathy's LLM-wiki pattern + Microsoft MarkItDown for document conversion.
How it works
sources/                    wiki/                       inbox/ → outbox/
(drop your docs here)       (auto-built by LLM)         (drop forms → filled forms)

passport.pdf    ──ingest──► identity.md          ┐
id_card.pdf     ──ingest──► contacts.md          │
pacra_printout  ──ingest──► companies/acme.md    ├──► Form10_filled.docx
bank_letter.pdf ──ingest──► banking.md           │    KYC_filled.pdf
company_reg.pdf ──ingest──► companies/mycorp.md  ┘
Three layers (Karpathy's architecture):
- sources/ — drop your documents here (IDs, passports, company docs, bank letters)
- wiki/ — LLM-maintained knowledge base (personal data + per-company pages)
- inbox/outbox/ — drop forms to fill, get filled forms back
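The folder layout above can be sketched in a few lines of Python. This is only an illustration of the directory structure, not the code behind `pencilpusher init`; the function name `make_vault` is hypothetical, and only the four folder names come from this README.

```python
from pathlib import Path

# Folder names as listed in this README; everything else is illustrative.
VAULT_FOLDERS = ("sources", "wiki", "inbox", "outbox")


def make_vault(root: Path) -> list[Path]:
    """Create the four vault folders under `root` and return their paths."""
    created = []
    for name in VAULT_FOLDERS:
        folder = root / name
        # exist_ok=True makes the call safe to repeat on an existing vault.
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created
```

Calling `make_vault(Path.home() / ".pencilpusher")` would produce the same folder names that the Quick start refers to.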
Quick start
pip install paperpusher # PyPI distribution name (the `pencilpusher` name was taken)
# 1. Initialize your vault
pencilpusher init
# 2. Drop documents into ~/.pencilpusher/sources/
# (IDs, passports, company registrations, bank letters, etc.)
# 3. Build your knowledge wiki
pencilpusher ingest-all
# 4. Check what it extracted
pencilpusher show identity
pencilpusher show companies/acme
# 5. Drop forms into ~/.pencilpusher/inbox/
# 6. Fill them all!
pencilpusher fill-all
Or fill individual files:
pencilpusher ingest passport.pdf
pencilpusher fill application.docx -o filled.docx
Features
- Folder-based workflow — drop docs in, get filled forms out
- PDF forms — AcroForm fields (exact fill) and flat PDFs (text position detection)
- Word docs — SDT content controls (zipfile+lxml), table cells, placeholders
- MarkItDown powered — converts any document to Markdown (cheaper than vision API)
- Company pages — auto-creates per-company wiki pages from PACRA/CIPC printouts
- Smart matching — Claude matches "Applicant Full Name" → your name from the wiki
- Style preservation — fills values without touching fonts, sizes, colors, or layout
- Manifest tracking — skips already-ingested files automatically
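To make the manifest-tracking idea concrete, here is a minimal sketch of one way to skip already-ingested files: store a content hash per file name in a JSON manifest and compare on each run. The manifest format, the file name `manifest.json`, and the helper names are assumptions for illustration; pencilpusher's actual manifest may work differently.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def load_manifest(manifest_path: Path) -> dict[str, str]:
    """Load the name -> digest mapping, or an empty one if none exists yet."""
    if manifest_path.exists():
        return json.loads(manifest_path.read_text(encoding="utf-8"))
    return {}


def new_files(sources: Path, manifest_path: Path) -> list[Path]:
    """List files in `sources` that are missing from the manifest or changed."""
    seen = load_manifest(manifest_path)
    return [
        p for p in sorted(sources.iterdir())
        if p.is_file() and seen.get(p.name) != file_digest(p)
    ]


def mark_ingested(paths: list[Path], manifest_path: Path) -> None:
    """Record each file's digest so later runs skip it."""
    seen = load_manifest(manifest_path)
    for p in paths:
        seen[p.name] = file_digest(p)
    manifest_path.write_text(json.dumps(seen, indent=2), encoding="utf-8")
```

Hashing content rather than checking names alone means a replaced document with the same file name is picked up again on the next run.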
Real-world tested
Successfully produced a 10-document Zambian PACRA compliance package:
- 4 filled PACRA forms (Form 10, 20, 22, 24)
- 6 supporting documents (board resolution, notices, minutes, consent, cover letter)
- From scattered data: company printouts, ID card, passport, chat history
Technical stack
| Component | Tool | Why |
|---|---|---|
| Document → Markdown | MarkItDown (Microsoft, 96K stars) | Structured text from any format |
| DOCX SDT filling | zipfile + lxml (first principles) | Bypasses python-docx limitations |
| DOCX table filling | python-docx table cell access | Government form tables |
| PDF AcroForm filling | PyMuPDF widget API | Direct field value setting |
| PDF flat filling | PyMuPDF insert_text | Clean text at detected positions |
| Data extraction | Claude API (text) | Structured extraction from markdown |
| Field matching | Claude API (text) | Semantic matching to wiki data |
| CLI | Click + Rich | Clean command interface |
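The "DOCX SDT filling" row can be illustrated with a small sketch: open the `.docx` as a zip archive, edit `word/document.xml`, and write every other part back unchanged so styles stay untouched. This is not pencilpusher's code. It swaps in the standard library's `xml.etree.ElementTree` for lxml so it runs without extra dependencies, and it assumes the keys of `values` match each control's `w:tag` or `w:alias`. ElementTree may rewrite namespace prefixes it does not know, which can upset real Word files, so treat this as a demonstration of the approach only.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}
ET.register_namespace("w", W)  # keep the familiar "w:" prefix on output


def fill_sdt_xml(document_xml: bytes, values: dict[str, str]) -> bytes:
    """Set the text of each content control whose tag or alias is in `values`."""
    root = ET.fromstring(document_xml)
    for sdt in root.iter(f"{{{W}}}sdt"):
        labels = [
            node.get(f"{{{W}}}val")
            for kind in ("tag", "alias")
            if (node := sdt.find(f"w:sdtPr/w:{kind}", NS)) is not None
        ]
        key = next((label for label in labels if label in values), None)
        texts = sdt.findall("w:sdtContent//w:t", NS)
        if key is None or not texts:
            continue
        # Put the value in the first text run and blank any remaining runs,
        # leaving run properties (fonts, sizes) as they were.
        texts[0].text = values[key]
        for extra in texts[1:]:
            extra.text = ""
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


def fill_docx(docx_bytes: bytes, values: dict[str, str]) -> bytes:
    """Return a copy of the DOCX with matching content controls filled."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "word/document.xml":
                data = fill_sdt_xml(data, values)
            dst.writestr(item, data)  # all other parts are copied as-is
    return out.getvalue()
```

Working at the zip and XML level touches only the text nodes inside each control, which is one way to get the style preservation listed under Features.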
Commands
| Command | Description |
|---|---|
| `pencilpusher init` | Create vault with sources/inbox/outbox/wiki folders |
| `pencilpusher ingest <file>` | Ingest a single document (API) |
| `pencilpusher ingest-all` | Ingest all new files from sources/ (API) |
| `pencilpusher fill <form>` | Fill a single form (API) |
| `pencilpusher fill-all` | Fill all forms in inbox/ → outbox/ (API) |
| `pencilpusher show [page]` | Display vault index or specific wiki page |
| `pencilpusher lint` | Health-check the wiki |
| `pencilpusher files` | List stored source documents |
| `pencilpusher read <file>` | Convert any document to Markdown (no API) |
| `pencilpusher detect <form>` | Detect form fields as JSON (no API for AcroForm/DOCX) |
| `pencilpusher write-wiki <page> <content>` | Write directly to a vault wiki page (no API) |
| `pencilpusher fill <form> --field-map '{...}'` | Fill with explicit mapping (no API) |
| `pencilpusher fill <form> --field-map '{...}' --fields-json '[...]'` | Fill flat PDF with agent-provided field positions (no API) |
Agent-driven mode (no API key needed)
pencilpusher can be used by AI coding agents (Claude Code, OpenAI Codex, etc.) without an Anthropic API key. The agent does the LLM reasoning; pencilpusher does the document manipulation.
# 1. Read a document — agent gets Markdown back
pencilpusher read passport.pdf
# 2. Agent reasons about the data, then writes to vault
pencilpusher write-wiki identity "# Identity\nName: Jane Moyo\nDOB: 1990-03-15"
# 3. Detect form fields — agent gets JSON back
pencilpusher detect application.pdf
# 4. Agent matches fields to vault data, then fills
pencilpusher fill application.pdf --field-map '{"Full Name": "Jane Moyo", "Date of Birth": "15 March 1990"}'
# For flat PDFs (no AcroForm), the agent also provides field positions:
pencilpusher fill flat.pdf \
--field-map '{"Full Name": "Jane Moyo"}' \
--fields-json '[{"name": "Full Name", "bbox": [15, 20, 50, 3], "page": 0}]'
The read, detect, write-wiki, and fill --field-map commands make zero API calls. For flat PDFs, pass --fields-json with field positions from the agent's own vision analysis. The existing ingest and fill commands still work standalone with an API key.
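An agent written in Python might assemble the `fill` invocation shown above as follows. This is a sketch under stated assumptions: `build_fill_command` is a hypothetical helper, and it relies only on the `--field-map` and `--fields-json` flags shown in this section. Serializing with `json.dumps` and passing an argument list avoids hand-quoting JSON for the shell.

```python
import json
import shlex


def build_fill_command(form, field_map, fields=None):
    """Build the argv list for `pencilpusher fill` in agent-driven mode.

    `field_map` maps field names to values; `fields` is the optional list of
    field positions needed for flat PDFs (passed as --fields-json).
    """
    cmd = ["pencilpusher", "fill", form, "--field-map", json.dumps(field_map)]
    if fields is not None:
        cmd += ["--fields-json", json.dumps(fields)]
    return cmd


if __name__ == "__main__":
    cmd = build_fill_command(
        "flat.pdf",
        {"Full Name": "Jane Moyo"},
        [{"name": "Full Name", "bbox": [15, 20, 50, 3], "page": 0}],
    )
    # Show the shell-quoted form; to execute it, pass `cmd` to
    # subprocess.run(cmd, check=True) with pencilpusher installed.
    print(shlex.join(cmd))
```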
Requirements
- Python 3.10+
- Anthropic API key (`ANTHROPIC_API_KEY`), only needed for `ingest`, `fill` (without `--field-map`), and `lint`
Limitations & known issues
pencilpusher is alpha (v0.1.0). It works well on the document classes it has been tested against; it will not handle every form in the wild yet.
What works well today:
- AcroForm PDFs with named fields
- Flat PDFs where field labels are present as selectable text
- DOCX with SDT content controls, table cells, or simple `{{placeholder}}` markers
- Latin-script (English) form labels and values
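For the `{{placeholder}}` case, the substitution idea can be sketched on plain text as below. The function name and the choice to leave unknown placeholders untouched are assumptions for illustration, not pencilpusher's documented behavior. In a real DOCX a marker can also be split across several text runs, which this sketch does not handle.

```python
import re

# Matches {{ name }} with optional whitespace inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*([^{}]+?)\s*\}\}")


def fill_placeholders(text: str, values: dict[str, str]) -> str:
    """Replace each {{name}} marker with values[name] when it is known."""
    def substitute(match: re.Match) -> str:
        key = match.group(1)
        # Unknown markers are kept verbatim so they remain visible for review.
        return values.get(key, match.group(0))

    return PLACEHOLDER.sub(substitute, text)
```

Keeping unmatched markers in place makes missing data easy to spot in the filled form instead of silently producing blanks.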
What's flaky or unsupported:
- Scanned / image-only PDFs: no OCR in the default install. The optional `[ocr]` extra installs `pytesseract`, but the pipeline doesn't yet feed OCR output into field detection. Expect to do this manually for now.
- Non-Latin scripts in field labels: Claude can match them, but PyMuPDF text insertion uses the document's default font, which may not contain the needed glyphs. Workaround: AcroForm PDFs only.
- Checkboxes and radio groups in flat PDFs — only AcroForm checkboxes are reliably filled. Flat-PDF checkbox detection is not yet implemented.
- Multi-page forms with repeating sections (e.g. "list each director on a separate row") — the field matcher treats each row independently and may duplicate values.
- Forms with overlapping or rotated text — flat-PDF position detection assumes labels are upright and non-overlapping.
- Encrypted / password-protected PDFs — must be unlocked before passing to pencilpusher.
- Excel (XLSX) and OpenDocument (ODT) — not supported. PRs welcome.
- Windows path edge cases — generally works, but report anything you hit.
If you have a form that fails, please open an issue with an anonymised sample. Real-world failures are the main thing driving v0.2.
Development
git clone https://github.com/Loodt/pencilpusher.git
cd pencilpusher
pip install -e ".[dev]"
pytest
See CONTRIBUTING.md for the full guide.
License
MIT — see LICENSE.