The AI bureaucrat that fills forms so you don't have to (the `pencilpusher` Python package)

These details have not been verified by PyPI

Project links

Project description

pencilpusher

The AI bureaucrat that fills forms so you don't have to.

Drop your documents into a folder. pencilpusher reads them, builds a personal + company knowledge wiki, and when you drop a form into the inbox — it fills it. PDFs, Word docs, government forms, company registrations. No re-typing, no re-extracting.

Built on Karpathy's LLM-wiki pattern + Microsoft MarkItDown for document conversion.

How it works

sources/                         wiki/                          inbox/ → outbox/
(drop your docs here)           (auto-built by LLM)            (drop forms → filled forms)

  passport.pdf     ──ingest──►  identity.md              ┐
  id_card.pdf      ──ingest──►  contacts.md              │
  pacra_printout   ──ingest──►  companies/acme.md        ├──►  Form10_filled.docx
  bank_letter.pdf  ──ingest──►  banking.md               │     KYC_filled.pdf
  company_reg.pdf  ──ingest──►  companies/mycorp.md      ┘

Three layers (Karpathy's architecture):

sources/ — drop your documents here (IDs, passports, company docs, bank letters)
wiki/ — LLM-maintained knowledge base (personal data + per-company pages)
inbox/outbox/ — drop forms to fill, get filled forms back

Quick start

# Install from source (PyPI release coming soon)
git clone https://github.com/Loodt/pencilpusher.git
cd pencilpusher
pip install -e .

# 1. Initialize your vault
pencilpusher init

# 2. Drop documents into ~/.pencilpusher/sources/
#    (IDs, passports, company registrations, bank letters, etc.)

# 3. Build your knowledge wiki
pencilpusher ingest-all

# 4. Check what it extracted
pencilpusher show identity
pencilpusher show companies/acme

# 5. Drop forms into ~/.pencilpusher/inbox/

# 6. Fill them all!
pencilpusher fill-all

Or fill individual files:

pencilpusher ingest passport.pdf
pencilpusher fill application.docx -o filled.docx

Features

Folder-based workflow — drop docs in, get filled forms out
PDF forms — AcroForm fields (exact fill) and flat PDFs (text position detection)
Word docs — SDT content controls (zipfile+lxml), table cells, placeholders
MarkItDown powered — converts any document to Markdown (cheaper than vision API)
Company pages — auto-creates per-company wiki pages from PACRA/CIPC printouts
Smart matching — Claude matches "Applicant Full Name" → your name from the wiki
Style preservation — fills values without touching fonts, sizes, colors, or layout
Manifest tracking — skips already-ingested files automatically

Real-world tested

Successfully produced a 10-document Zambian PACRA compliance package:

4 filled PACRA forms (Form 10, 20, 22, 24)
6 supporting documents (board resolution, notices, minutes, consent, cover letter)
From scattered data: company printouts, ID card, passport, chat history

Technical stack

Component	Tool	Why
Document → Markdown	MarkItDown (Microsoft, 96K stars)	Structured text from any format
DOCX SDT filling	zipfile + lxml (first principles)	Bypasses python-docx limitations
DOCX table filling	python-docx table cell access	Government form tables
PDF AcroForm filling	PyMuPDF widget API	Direct field value setting
PDF flat filling	PyMuPDF insert_text	Clean text at detected positions
Data extraction	Claude API (text)	Structured extraction from markdown
Field matching	Claude API (text)	Semantic matching to wiki data
CLI	Click + Rich	Clean command interface

Commands

Command	Description
`pencilpusher init`	Create vault with sources/inbox/outbox/wiki folders
`pencilpusher ingest <file>`	Ingest a single document (API)
`pencilpusher ingest-all`	Ingest all new files from sources/ (API)
`pencilpusher fill <form>`	Fill a single form (API)
`pencilpusher fill-all`	Fill all forms in inbox/ → outbox/ (API)
`pencilpusher show [page]`	Display vault index or specific wiki page
`pencilpusher lint`	Health-check the wiki
`pencilpusher files`	List stored source documents
`pencilpusher read <file>`	Convert any document to Markdown (no API)
`pencilpusher detect <form>`	Detect form fields as JSON (no API for AcroForm/DOCX)
`pencilpusher write-wiki <page> <content>`	Write directly to a vault wiki page (no API)
`pencilpusher fill <form> --field-map '{...}'`	Fill with explicit mapping (no API)
`pencilpusher fill <form> --field-map '{...}' --fields-json '[...]'`	Fill flat PDF with agent-provided field positions (no API)

Agent-driven mode (no API key needed)

pencilpusher can be used by AI coding agents (Claude Code, OpenAI Codex, etc.) without an Anthropic API key. The agent does the LLM reasoning; pencilpusher does the document manipulation.

# 1. Read a document — agent gets Markdown back
pencilpusher read passport.pdf

# 2. Agent reasons about the data, then writes to vault
pencilpusher write-wiki identity "# Identity\nName: Jane Moyo\nDOB: 1990-03-15"

# 3. Detect form fields — agent gets JSON back
pencilpusher detect application.pdf

# 4. Agent matches fields to vault data, then fills
pencilpusher fill application.pdf --field-map '{"Full Name": "Jane Moyo", "Date of Birth": "15 March 1990"}'

# For flat PDFs (no AcroForm), the agent also provides field positions:
pencilpusher fill flat.pdf \
  --field-map '{"Full Name": "Jane Moyo"}' \
  --fields-json '[{"name": "Full Name", "bbox": [15, 20, 50, 3], "page": 0}]'

The read, detect, write-wiki, and fill --field-map commands make zero API calls. For flat PDFs, pass --fields-json with field positions from the agent's own vision analysis. The existing ingest and fill commands still work standalone with an API key.

Requirements

Python 3.10+
Anthropic API key (ANTHROPIC_API_KEY) — only needed for ingest, fill (without --field-map), and lint

Limitations & known issues

pencilpusher is alpha (v0.1.0). It works well on the document classes it has been tested against; it will not handle every form in the wild yet.

What works well today:

AcroForm PDFs with named fields
Flat PDFs where field labels are present as selectable text
DOCX with SDT content controls, table cells, or simple {{placeholder}} markers
Latin-script (English) form labels and values

What's flaky or unsupported:

Scanned / image-only PDFs — no OCR in the default install. The optional [ocr] extra installs pytesseract but the pipeline doesn't yet feed OCR output into field detection. Expect to do this manually for now.
Non-Latin scripts in field labels — Claude can match them, but PyMuPDF text insertion uses the document's default font, which may not contain the needed glyphs. Workaround: AcroForm PDFs only.
Checkboxes and radio groups in flat PDFs — only AcroForm checkboxes are reliably filled. Flat-PDF checkbox detection is not yet implemented.
Multi-page forms with repeating sections (e.g. "list each director on a separate row") — the field matcher treats each row independently and may duplicate values.
Forms with overlapping or rotated text — flat-PDF position detection assumes labels are upright and non-overlapping.
Encrypted / password-protected PDFs — must be unlocked before passing to pencilpusher.
Excel (XLSX) and OpenDocument (ODT) — not supported. PRs welcome.
Windows path edge cases — generally works, but report anything you hit.

If you have a form that fails, please open an issue with an anonymised sample. Real-world failures are the main thing driving v0.2.

Development

git clone https://github.com/Loodt/pencilpusher.git
cd pencilpusher
pip install -e ".[dev]"
pytest

See CONTRIBUTING.md for the full guide.

License

MIT — see LICENSE.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.1.1

Apr 13, 2026

This version

0.1.0 yanked

Apr 13, 2026

Reason this release was yanked:

Issue

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

paperpusher-0.1.0.tar.gz (156.4 kB view details)

Uploaded Apr 13, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

paperpusher-0.1.0-py3-none-any.whl (34.1 kB view details)

Uploaded Apr 13, 2026 Python 3

File details

Details for the file paperpusher-0.1.0.tar.gz.

File metadata

Download URL: paperpusher-0.1.0.tar.gz
Upload date: Apr 13, 2026
Size: 156.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for paperpusher-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`fb86abe3387a6bba91d048e300030f1fb8458ea7af1978b22c8b02527c152367`
MD5	`6b02cd92f03c294dad33825ccd74b834`
BLAKE2b-256	`0e85023c7b638c5880be3bc16d6557f079403dc19853043169a720907ac05017`

See more details on using hashes here.

File details

Details for the file paperpusher-0.1.0-py3-none-any.whl.

File metadata

Download URL: paperpusher-0.1.0-py3-none-any.whl
Upload date: Apr 13, 2026
Size: 34.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for paperpusher-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ea9567548a0ac6e00cdba7df55047390f9e978ac2db327396ca6a6d963896ffb`
MD5	`edcf2fe7e13eff83fcde43904f922546`
BLAKE2b-256	`721468f9f2279c544cfcf879342ad806484f11a1b455606fbef58fe98cc76d6a`

See more details on using hashes here.

paperpusher 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

pencilpusher

How it works

Quick start

Features

Real-world tested

Technical stack

Commands

Agent-driven mode (no API key needed)

Requirements

Limitations & known issues

Development

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes