Skip to main content

Helps you turn a pile of scanned PDFs into a searchable, browsable categorized digital archive - automatically

Project description

PDF Shelver

PDF Shelver helps you turn a pile of scanned PDFs into a searchable, browsable categorized digital archive - automatically!

pdfshelver is a command-line tool that helps you organize scanned PDF documents using OCR (Optical Character Recognition) and LLM (Large Language Model), via models running locally with Ollama. It extracts key metadata (sender, subject, document type, and category) from your PDFs, stores the original PDF files and metadata, and creates a "knowledge base" of softlinks is a separate directory for easy human retrieval.

Important:

  • at the moment macOS 14+ (Sonoma and later) only, as it uses the new Apple LiveText OCR framework
  • LLMs can be memory intensive. 16 GiB memory are recommended when using the default model "qwen3:8b", the fallback model "gemma3:12b" will probably need 24 GiB
  • "good enough" for personal use, but not hardened to catch each and every potentially failure case as would be needed when using in commercial production environments

Background (and alternatives)

pdfshelver was written as a small exercise and showcase for the question "What does one need to keep in mind when being asked Can we do something like this with AI and state-of-the-art frameworks?" which one encounters frequently nowadays in any company, bigger or smaller.

I chose a simple problem in my quest for making my personal life easier: automating archiving (and making findable) of paper mail I receive. I do have a scanner which is able to scan stacks of paper both sides and store them as PDFs on a NAS (network attached storage), but those PDFs are neither searchable (no OCR having been performed by the scanner) nor easily findable by topic, or sender, or ... by me once they are somewhere on the filesystem. This is where pdfshelver comes in.

Turns out that while a quick proof-of-concept can be thrown together in an hour or two, basic hardening for 'most important' failure cases will still need quite a bit more programming time even when using AI supported coding. Besides typical error possibilities like, e.g., file handling, a whole new category of error checking needs to be reserved for the LLM operation, ranging from "runaway" LLMs - where the LLM simply does not stop generating output - to failures of the LLM to generate the output you asked for - no matter how intricate or detailed you formulate your prompts.

I am well aware of other tools like OCRmyPDF or even complete systems like Paperless-ngx (which in turn relies on OCRmyPDF).

Both tools are fantastic, and I encourage you to check them out in case you want to build anything more or less "enterprise-ready." However, preliminary tests I did on a couple of scans showed me that the macOS OCR framework gave me quite a bit better OCRed text than OCRmyPDF. Not several orders of magnitude better, but better enough for me to choose to not rely on OCRmyPDF for this small finger exercise. Besides, most of the work needed to make paper-mail archiving (error checking, LLM failure checking, etc.) would be exactly the same with or without OCRmyPDF.

Key Features

  • Organized Storage: Stores the original PDF and extracted metadata in a designated directory.
  • OCR Processing: Extracts text from scanned PDFs using OCR.
  • Metadata Extraction: Uses an LLM to identify sender, subject, document type, and category from the document content.
  • Knowledge Base: Creates human-readable softlinks in a "sortedby" directory, making it easy to browse and find documents by sender, category, etc. within the filesystem.
  • Customizable: Supports multiple Ollama models, custom system/user prompts, and flexible directory setup.
  • Rebuild Capability: Can rebuild the knowledge base from existing stored PDFs and metadata without re-running OCR or LLM.

Extracted metadata

pdfshelver in its default configuration will extract the following data from a PDF:

  • Sender (as free text): who is the sender / author of a Document.
  • Subject (as free text): a short summary one-liner regarding what the PDF is about
  • Document type (choices): is this PDF an "invoice", a "contract", some "info" (information), or "other"?
  • Category (choices): does this concern "social", "health", "job", "finance", "pension", "insurance", "taxes", "living", or "other"?

CAVEAT: LLMs are sometimes hit and miss when it comes to the things they will give back as answer. I put in-place basic error checking to make sure that what is extracted is correct regarding syntax, e.g., under category you will get exactly one of the choices the LLM has. However, there is no way to check the LLM actually made the right choice.

Requirements and installation

Requirements

  • Running at least on macOS Sonoma (macOS 14 and later) with at least 16 GiB RAM.
  • Ollama running locally or accessible via network. See below for installation.
  • poppler tools, this should be handled by brew, see below for installation.
  • Python 3.13+. This should be handled by uv, see below for installation.
  • Required Python packages: ollama, pdf2image, ocrmac, etc. ... and their dependencies. Also handled by uv.

Installation

Step 1: Ollama and Ollama models

I assume you already have Ollama installed locally on your machine, or that it is available to you via network. If not, head over to Ollama to install it. Then install the default models used by pdfshelver like so: ollama pull qwen3:8b and ollama pull gemma3:12b.

Step 2: poppler

pdfshelver needs utilities from the poppler library. The easiest way to install these is via Homebrew. If not installed, do it now. Then a simple brew install poppler will do the trick.

Step 3: PDF Shelver itself

To install pdfshelver itself, I recommend uv as this Python package and project manager basically makes all headaches of Python package management go away in an instant. If not installed, do this now. Then simply type uv tool install pdfshelver and your are good to go!

Quick Start

1. Set Up Directories

You need two directories:

  • Store Directory: Where original PDFs and metadata will be saved.
  • Sortedby Directory: Where categorized softlinks are created for easy browsing.

You can specify these via command-line options or environment variables. Using environment variables, put them into your shell startup script (e.g. .bashrc if using bash) or do this:

export PDFSHELVER_DIR_STORE=~/pdfshelver/store
export PDFSHELVER_DIR_SORTEDBY=~/pdfshelver/sortedby

Or use --dir_store and --dir_sortedby options on the pdfshelver command line call.

2. Process a PDF

pdfshelver myscan.pdf

This will:

  • copy myscan.pdf to the store directory.
  • run OCR and extract text.
  • use Ollama to extract metadata.
  • save metadata and OCR text.
  • create categorized softlinks in the sortedby directory.

3. Browse Your Knowledge Base

Navigate the sortedby directory to find your documents organized by sender, category, etc. In the default organisation, you will find your PDF in all of the the following 'sortedby' directories:

  • 'from'. That is, organised just by sender/author
  • 'fromcat'. That is, first organised by sender/author, then by the category (e.g. "health") the content belongs to.
  • 'catfrom', i.e., first organised by by category ("e.g. "health"), then by sender

Command-Line Options

  • PDFfile (positional): The PDF to process.
  • --dir_store DIR: Directory to store PDFs and metadata.
  • --dir_sortedby DIR: Directory for the knowledge base (softlinks).
  • --replstr STR: String in filenames to replace with metadata (default: autoscan).
  • --sysin FILE: Custom SYSTEM prompt for Ollama.
  • --usrin FILE: Custom USER prompt template for Ollama.
  • --model NAME: Ollama model(s) to use (comma-separated). Default atm is qwen3:8b,gemma3:12b
  • --opts OPTS: Ollama options (e.g., temperature=0.0;num_ctx=32768).
  • --host HOST: Ollama server host (default: localhost:11434).
  • --rebuildkb: Rebuild the knowledge base from existing store (no OCR/LLM).
  • --opthelp: Show available Ollama options.
  • --optdesc: Show Ollama options with descriptions.
  • --msgs: Show default SYSTEM and USER messages used by pdfshelver for the LLM models.

Example

pdfshelver --dir_store ~/pdfshelver/store --dir_sortedby ~/pdfshelver/sortedby 20250510_autoscan_181036.pdf

As the filename contains the string autoscan as in the example above, it will be replaced in the sortedby directory with a string like:

20250510_John Doe LLC -- invoice -- living -- Delivery cupboard_181036.pdf

Rebuilding the Knowledge Base

If you reorganize or lose your sortedby directory, you can rebuild it from the store:

pdfshelver --rebuildkb

Limitations

  • Due to simplistic OCR interpretation, is restricted to documents where the writing is left-to-right, top-to-bottom
  • Only the first two pages of each PDF are used for LLM metadata extraction as the metadata to be extracted is often found there. Using more pages often confuses small LLMs and leads to worse performance regarding the quality of their answer.
  • All metadata and OCR text are stored alongside the original PDF for future reference, but not stored within the PDF. I.e., no PDF/A is created.
  • LLMs sometimes take ... astonishing decisions when processing a document. Expect the results to be "mostly right", but not always 100% correct.
  • Especially the "sender" information extracted could use a 2nd step polishing as it may vary wildly. E.g., in one document the LLM might extract "Big Company LLC" as sender, while in a document with same headers but different textual content it might be extracted as "Big Company", or "BigCo", or "BigCo Shipping", or, or, or ...

Troubleshooting

  • Make sure the store and sortedby directories exist and are writable.
  • Ensure Ollama is running and accessible.
  • For help on Ollama options, use --opthelp or --optdesc.

License

MIT License

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pdfshelver-0.1.3.tar.gz (34.5 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

pdfshelver-0.1.3-py3-none-any.whl (21.8 kB view details)

Uploaded Python 3

File details

Details for the file pdfshelver-0.1.3.tar.gz.

File metadata

  • Download URL: pdfshelver-0.1.3.tar.gz
  • Upload date:
  • Size: 34.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.7.3

File hashes

Hashes for pdfshelver-0.1.3.tar.gz
Algorithm Hash digest
SHA256 59729b6c9d92db96922af19e692cf12c415c3d2ca491129bf9c678f115f5e843
MD5 c457e8f546964217926b7600c6142016
BLAKE2b-256 9fa62ad16da358560a6b90dc83c20b84f5daef4ac4db632d2c9048358248de70

See more details on using hashes here.

File details

Details for the file pdfshelver-0.1.3-py3-none-any.whl.

File metadata

  • Download URL: pdfshelver-0.1.3-py3-none-any.whl
  • Upload date:
  • Size: 21.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.7.3

File hashes

Hashes for pdfshelver-0.1.3-py3-none-any.whl
Algorithm Hash digest
SHA256 3b16b637315d7875d510304a40a427551a7a2d8f2eef2a085289c48f5aff6d0b
MD5 2c7af08f592f24fd37f1eba937165fcd
BLAKE2b-256 d18fbf2c1e3b2e5737892c49d72e757ec8ee7584d0cd99bfb25bdba5f113f62e

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page