The AI bureaucrat that fills forms so you don't have to (the `pencilpusher` Python package)

Reason this release was yanked:

Issue

pencilpusher

License: MIT · Status: alpha

The AI bureaucrat that fills forms so you don't have to.

Drop your documents into a folder. pencilpusher reads them and builds a personal and company knowledge wiki. When you later drop a form into the inbox, it fills the form from that wiki. It handles PDFs, Word docs, government forms, and company registrations, with no re-typing and no re-extracting.

Built on Karpathy's LLM-wiki pattern + Microsoft MarkItDown for document conversion.

How it works

sources/                         wiki/                          inbox/ → outbox/
(drop your docs here)           (auto-built by LLM)            (drop forms → filled forms)

  passport.pdf     ──ingest──►  identity.md              ┐
  id_card.pdf      ──ingest──►  contacts.md              │
  pacra_printout   ──ingest──►  companies/acme.md        ├──►  Form10_filled.docx
  bank_letter.pdf  ──ingest──►  banking.md               │     KYC_filled.pdf
  company_reg.pdf  ──ingest──►  companies/mycorp.md      ┘

Three layers (Karpathy's architecture):

  1. sources/ — drop your documents here (IDs, passports, company docs, bank letters)
  2. wiki/ — LLM-maintained knowledge base (personal data + per-company pages)
  3. inbox/outbox/ — drop forms to fill, get filled forms back
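
The folder layout above can be sketched in a few lines of Python. This is an illustration only, not pencilpusher's actual `init` implementation: the helper name `create_vault` is hypothetical, and only the four folder names come from the description above.

```python
from pathlib import Path

# Folder names taken from the layout described above.
VAULT_FOLDERS = ("sources", "wiki", "inbox", "outbox")


def create_vault(root: Path) -> list[Path]:
    """Create the vault folders under `root` and return their paths.

    Hypothetical helper: the real `pencilpusher init` may do more
    (index files, manifests, and so on).
    """
    created = []
    for name in VAULT_FOLDERS:
        folder = root / name
        folder.mkdir(parents=True, exist_ok=True)  # safe to call repeatedly
        created.append(folder)
    return created


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        for folder in create_vault(Path(tmp) / ".pencilpusher"):
            print(folder.name)
```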

Quick start

# Install from source (PyPI release coming soon)
git clone https://github.com/Loodt/pencilpusher.git
cd pencilpusher
pip install -e .

# 1. Initialize your vault
pencilpusher init

# 2. Drop documents into ~/.pencilpusher/sources/
#    (IDs, passports, company registrations, bank letters, etc.)

# 3. Build your knowledge wiki
pencilpusher ingest-all

# 4. Check what it extracted
pencilpusher show identity
pencilpusher show companies/acme

# 5. Drop forms into ~/.pencilpusher/inbox/

# 6. Fill them all!
pencilpusher fill-all

Or fill individual files:

pencilpusher ingest passport.pdf
pencilpusher fill application.docx -o filled.docx

Features

  • Folder-based workflow — drop docs in, get filled forms out
  • PDF forms — AcroForm fields (exact fill) and flat PDFs (text position detection)
  • Word docs — SDT content controls (zipfile+lxml), table cells, placeholders
  • MarkItDown powered — converts any document to Markdown (cheaper than vision API)
  • Company pages — auto-creates per-company wiki pages from PACRA/CIPC printouts
  • Smart matching — Claude matches "Applicant Full Name" → your name from the wiki
  • Style preservation — fills values without touching fonts, sizes, colors, or layout
  • Manifest tracking — skips already-ingested files automatically
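
To make the manifest-tracking idea concrete, here is a minimal sketch of how skipping already-ingested files could work. It assumes a JSON manifest that maps file names to SHA-256 digests; the manifest format and the helper names (`pending_files`, `mark_ingested`) are assumptions for illustration, not pencilpusher's actual implementation.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def load_manifest(manifest_path: Path) -> dict[str, str]:
    """Load the name -> digest mapping, or an empty one if none exists yet."""
    if manifest_path.exists():
        return json.loads(manifest_path.read_text(encoding="utf-8"))
    return {}


def pending_files(sources: Path, manifest_path: Path) -> list[Path]:
    """List files that are new, or whose content changed since ingestion."""
    manifest = load_manifest(manifest_path)
    return [
        p for p in sorted(sources.iterdir())
        if p.is_file() and manifest.get(p.name) != file_digest(p)
    ]


def mark_ingested(path: Path, manifest_path: Path) -> None:
    """Record a file's current digest so later runs skip it."""
    manifest = load_manifest(manifest_path)
    manifest[path.name] = file_digest(path)
    manifest_path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
```

Hashing the content rather than trusting the file name means a replaced document (for example, a renewed passport saved under the same name) is picked up again on the next run.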

Real-world tested

Successfully produced a 10-document Zambian PACRA compliance package:

  • 4 filled PACRA forms (Form 10, 20, 22, 24)
  • 6 supporting documents (board resolution, notices, minutes, consent, cover letter)
  • From scattered data: company printouts, ID card, passport, chat history

Technical stack

| Component | Tool | Why |
|---|---|---|
| Document → Markdown | MarkItDown (Microsoft, 96K stars) | Structured text from any format |
| DOCX SDT filling | zipfile + lxml (first principles) | Bypasses python-docx limitations |
| DOCX table filling | python-docx table cell access | Government form tables |
| PDF AcroForm filling | PyMuPDF widget API | Direct field value setting |
| PDF flat filling | PyMuPDF insert_text | Clean text at detected positions |
| Data extraction | Claude API (text) | Structured extraction from markdown |
| Field matching | Claude API (text) | Semantic matching to wiki data |
| CLI | Click + Rich | Clean command interface |
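
The "zipfile + lxml" approach to SDT content controls can be illustrated with a small sketch. To stay dependency-free it swaps lxml for the standard library's `xml.etree.ElementTree`, so it is not the project's code. The function name `fill_sdt_fields`, the choice to match controls by their `w:tag` value, and the demo document are all assumptions. One caveat: ElementTree may rename namespace prefixes on output, which can matter for real Word files, whereas lxml preserves them.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = f"{{{W_NS}}}"
ET.register_namespace("w", W_NS)


def fill_sdt_fields(docx_bytes: bytes, values: dict[str, str]) -> bytes:
    """Return a copy of the DOCX with SDT text replaced, matched by w:tag."""
    src = zipfile.ZipFile(io.BytesIO(docx_bytes))
    root = ET.fromstring(src.read("word/document.xml"))

    for sdt in root.iter(f"{W}sdt"):
        tag = sdt.find(f"{W}sdtPr/{W}tag")
        if tag is None or tag.get(f"{W}val") not in values:
            continue
        texts = list(sdt.iter(f"{W}t"))
        if not texts:
            continue
        # Write into the first text node and blank the rest, leaving the
        # run properties (fonts, sizes, colors) untouched.
        texts[0].text = values[tag.get(f"{W}val")]
        for extra in texts[1:]:
            extra.text = ""

    out = io.BytesIO()
    with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "word/document.xml":
                data = ET.tostring(root, encoding="utf-8", xml_declaration=True)
            dst.writestr(item, data)  # every other part is copied unchanged
    return out.getvalue()


def make_demo_docx() -> bytes:
    """Build a tiny archive with one SDT (not a complete, openable DOCX)."""
    xml = (
        f'<w:document xmlns:w="{W_NS}"><w:body><w:sdt>'
        '<w:sdtPr><w:tag w:val="full_name"/></w:sdtPr>'
        "<w:sdtContent><w:p><w:r><w:t>Click to enter</w:t></w:r></w:p>"
        "</w:sdtContent></w:sdt></w:body></w:document>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", xml)
    return buf.getvalue()


if __name__ == "__main__":
    filled = fill_sdt_fields(make_demo_docx(), {"full_name": "Jane Moyo"})
    print(zipfile.ZipFile(io.BytesIO(filled)).read("word/document.xml").decode())
```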

Commands

| Command | Description |
|---|---|
| pencilpusher init | Create vault with sources/inbox/outbox/wiki folders |
| pencilpusher ingest <file> | Ingest a single document (API) |
| pencilpusher ingest-all | Ingest all new files from sources/ (API) |
| pencilpusher fill <form> | Fill a single form (API) |
| pencilpusher fill-all | Fill all forms in inbox/ → outbox/ (API) |
| pencilpusher show [page] | Display vault index or specific wiki page |
| pencilpusher lint | Health-check the wiki |
| pencilpusher files | List stored source documents |
| pencilpusher read <file> | Convert any document to Markdown (no API) |
| pencilpusher detect <form> | Detect form fields as JSON (no API for AcroForm/DOCX) |
| pencilpusher write-wiki <page> <content> | Write directly to a vault wiki page (no API) |
| pencilpusher fill <form> --field-map '{...}' | Fill with explicit mapping (no API) |
| pencilpusher fill <form> --field-map '{...}' --fields-json '[...]' | Fill flat PDF with agent-provided field positions (no API) |

Agent-driven mode (no API key needed)

pencilpusher can be used by AI coding agents (Claude Code, OpenAI Codex, etc.) without an Anthropic API key. The agent does the LLM reasoning; pencilpusher does the document manipulation.

# 1. Read a document — agent gets Markdown back
pencilpusher read passport.pdf

# 2. Agent reasons about the data, then writes to vault
pencilpusher write-wiki identity "# Identity\nName: Jane Moyo\nDOB: 1990-03-15"

# 3. Detect form fields — agent gets JSON back
pencilpusher detect application.pdf

# 4. Agent matches fields to vault data, then fills
pencilpusher fill application.pdf --field-map '{"Full Name": "Jane Moyo", "Date of Birth": "15 March 1990"}'

# For flat PDFs (no AcroForm), the agent also provides field positions:
pencilpusher fill flat.pdf \
  --field-map '{"Full Name": "Jane Moyo"}' \
  --fields-json '[{"name": "Full Name", "bbox": [15, 20, 50, 3], "page": 0}]'

The read, detect, write-wiki, and fill --field-map commands make zero API calls. For flat PDFs, pass --fields-json with field positions from the agent's own vision analysis. The existing ingest and fill commands still work standalone with an API key.
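
An agent written in Python would typically build these command lines programmatically rather than hand-quoting JSON in a shell. The sketch below is one way to do that. It uses only the `fill`, `--field-map`, and `--fields-json` options shown above; the helper name `build_fill_command` is hypothetical. Passing the arguments as a list avoids shell-quoting problems with the embedded JSON.

```python
import json
import shlex


def build_fill_command(form, field_map, fields=None):
    """Build the argv list for an API-free `pencilpusher fill` call.

    `field_map` maps field names to values; `fields` is the optional list of
    field positions needed for flat PDFs (the --fields-json argument).
    """
    cmd = [
        "pencilpusher", "fill", str(form),
        "--field-map", json.dumps(field_map, ensure_ascii=False),
    ]
    if fields is not None:
        cmd += ["--fields-json", json.dumps(fields, ensure_ascii=False)]
    return cmd


if __name__ == "__main__":
    cmd = build_fill_command(
        "flat.pdf",
        {"Full Name": "Jane Moyo"},
        fields=[{"name": "Full Name", "bbox": [15, 20, 50, 3], "page": 0}],
    )
    # Print a shell-safe rendering; to run it for real, pass the list to
    # subprocess.run(cmd, check=True) with pencilpusher installed.
    print(shlex.join(cmd))
```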

Requirements

  • Python 3.10+
  • Anthropic API key (ANTHROPIC_API_KEY): only needed for the commands that call the API, namely ingest, ingest-all, fill (without --field-map), fill-all, and lint
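
A wrapper script can check these requirements before invoking a command. The sketch below encodes the rule as stated in this README. The set of API-backed commands is read off the list above and the Commands table; the function names are hypothetical and not part of pencilpusher.

```python
import os

# Commands this README marks as calling the Claude API.
API_COMMANDS = {"ingest", "ingest-all", "fill", "fill-all", "lint"}


def needs_api_key(command: str, has_field_map: bool = False) -> bool:
    """Return True if the command needs ANTHROPIC_API_KEY, per the README."""
    if command == "fill" and has_field_map:
        return False  # an explicit --field-map makes fill API-free
    return command in API_COMMANDS


def check_environment(command, has_field_map=False, env=None):
    """Return an error message if the key is required but missing, else None."""
    env = os.environ if env is None else env
    if needs_api_key(command, has_field_map) and not env.get("ANTHROPIC_API_KEY"):
        return f"'{command}' needs ANTHROPIC_API_KEY to be set"
    return None


if __name__ == "__main__":
    for name in ("read", "ingest", "fill"):
        print(name, "->", check_environment(name) or "ok")
```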

Limitations & known issues

pencilpusher is alpha (v0.1.0). It works well on the document classes it has been tested against; it will not handle every form in the wild yet.

What works well today:

  • AcroForm PDFs with named fields
  • Flat PDFs where field labels are present as selectable text
  • DOCX with SDT content controls, table cells, or simple {{placeholder}} markers
  • Latin-script (English) form labels and values
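
As an illustration of the simplest case, {{placeholder}} markers, the sketch below substitutes values into plain text with a regular expression. It works on a string rather than a real DOCX, and the function name `fill_placeholders` is hypothetical. In actual Word files a marker can be split across several runs, which is one reason real filling is harder than this.

```python
import re

# Matches {{ name }} with optional whitespace inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*([^{}]+?)\s*\}\}")


def fill_placeholders(text: str, values: dict[str, str]) -> str:
    """Replace known {{placeholder}} markers; leave unknown ones intact."""

    def substitute(match: re.Match) -> str:
        key = match.group(1)
        return values.get(key, match.group(0))

    return PLACEHOLDER.sub(substitute, text)


if __name__ == "__main__":
    template = "Name: {{ full_name }}, ID: {{id_number}}"
    print(fill_placeholders(template, {"full_name": "Jane Moyo"}))
```

Leaving unmatched markers in place, rather than blanking them, makes it easy to see which fields still need data.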

What's flaky or unsupported:

  • Scanned / image-only PDFs — no OCR in the default install. The optional [ocr] extra installs pytesseract but the pipeline doesn't yet feed OCR output into field detection. Expect to do this manually for now.
  • Non-Latin scripts in field labels — Claude can match them, but PyMuPDF text insertion uses the document's default font, which may not contain the needed glyphs. Workaround: AcroForm PDFs only.
  • Checkboxes and radio groups in flat PDFs — only AcroForm checkboxes are reliably filled. Flat-PDF checkbox detection is not yet implemented.
  • Multi-page forms with repeating sections (e.g. "list each director on a separate row") — the field matcher treats each row independently and may duplicate values.
  • Forms with overlapping or rotated text — flat-PDF position detection assumes labels are upright and non-overlapping.
  • Encrypted / password-protected PDFs — must be unlocked before passing to pencilpusher.
  • Excel (XLSX) and OpenDocument (ODT) — not supported. PRs welcome.
  • Windows path edge cases — generally works, but report anything you hit.
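
Some of these cases can be spotted before a PDF is handed to pencilpusher. The sketch below is a rough, standard-library-only pre-flight check and is not part of the tool; the function name `triage_pdf` is hypothetical. It only scans raw bytes for well-known PDF dictionary keys, so it can miss markers stored inside compressed object streams. Treat its output as a hint, not a verdict.

```python
def triage_pdf(data: bytes) -> list[str]:
    """Return heuristic warnings for a PDF, based on a raw byte scan."""
    if not data.startswith(b"%PDF-"):
        return ["not a PDF file"]
    warnings = []
    if b"/Encrypt" in data:
        # Encrypted PDFs must be unlocked first (see the list above).
        warnings.append("looks encrypted: unlock before filling")
    if b"/AcroForm" not in data:
        # No form dictionary found: probably a flat PDF that needs
        # position detection or agent-provided field positions.
        warnings.append("no AcroForm marker: likely a flat PDF")
    return warnings


if __name__ == "__main__":
    sample = b"%PDF-1.7\n1 0 obj << /Type /Catalog >> endobj\n"
    for warning in triage_pdf(sample):
        print(warning)
```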

If you have a form that fails, please open an issue with an anonymised sample. Real-world failures are the main thing driving v0.2.

Development

git clone https://github.com/Loodt/pencilpusher.git
cd pencilpusher
pip install -e ".[dev]"
pytest

See CONTRIBUTING.md for the full guide.

License

MIT — see LICENSE.
