
Watch a scans folder and auto-rename receipt images/PDFs (and course/document scans) from their OCR'd text.

receipt-renamer

Point it at the folder your scanner app drops files into. For every receipt image or PDF, it OCRs the page and renames the file to something searchable:

Receipt 2024-03-14 1042 Safeway SNAP EBT Adobe Scan.jpg

Non-receipt scans (lecture notes, handouts) are handled by a separate rule layer: it reads the page-top title and expands course codes into every form you might later search for, so CHE2A Midterm Review.pdf becomes:

CHE2A Midterm Review CHE 2A Chem 2A Genius Scan.pdf

and OChem Lecture Notes.pdf becomes:

OChem Lecture Notes Organic Chemistry CHE8A CHE 8A Chem 8A.pdf

Everything — the store list, the SNAP/EBT patterns, the receipt heuristics, the scanner-app markers, the course rules, and the filename templates — lives in one YAML file you can edit.

How the name is built

Receipts use the template:

Receipt {datetime} {store} {SNAP EBT if present} {scanner-app} {your notes}
  • {datetime} — parsed from the receipt (YYYY-MM-DD HHMM); many date/time formats are understood (US MM/DD/YYYY, ISO YYYY-MM-DD, Jan 5, 2024, 12h/24h times). If no date is found it falls back to Receipt {store} ….
  • {store} — matched against a known-store list (Safeway, Costco, Trader Joe's, Whole Foods, Target, Walmart, Kroger, CVS, Walgreens, Aldi, Sprouts, and more — all editable).
  • SNAP EBT — inserted only when a SNAP/EBT line is detected.
  • {scanner-app} — detected from the OCR text or the original filename (Genius Scan, Adobe Scan, CamScanner, Microsoft Lens, …).
  • {your notes} — anything you pass with --notes.

Empty fields collapse cleanly — no double spaces, no dangling separators.
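
As an illustration of how empty fields can collapse, here is a minimal sketch of the receipt template. The helper name build_receipt_name and its parameters are hypothetical and are not taken from the package; the real assembly lives in the rename module and may differ.

```python
import re


def build_receipt_name(datetime_str="", store="", snap=False,
                       scanner_app="", notes="", ext=".jpg"):
    """Hypothetical helper: join the non-empty template fields with single spaces."""
    fields = [
        "Receipt",
        datetime_str,
        store,
        "SNAP EBT" if snap else "",
        scanner_app,
        notes,
    ]
    # Drop empty or whitespace-only fields so no double spaces appear.
    stem = " ".join(f.strip() for f in fields if f and f.strip())
    # Collapse any whitespace runs that came in with the values themselves.
    stem = re.sub(r"\s+", " ", stem)
    return stem + ext


print(build_receipt_name("2024-03-14 1042", "Safeway", True, "Adobe Scan"))
# Receipt 2024-03-14 1042 Safeway SNAP EBT Adobe Scan.jpg
print(build_receipt_name(store="Safeway"))
# Receipt Safeway.jpg
```

Filtering before joining is what keeps the result free of dangling separators: a missing date or scanner app simply never enters the list.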

Documents (anything not detected as a receipt) use:

{title} {course-code aliases} {scanner-app} {your notes}
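
The course-code aliases can be pictured with the sketch below. It assumes a subject map and an explicit-rule table shaped like plain dictionaries; the names SUBJECT_MAP, EXPLICIT_RULES, COURSE_RE and course_aliases are illustrative only, and the real YAML schema may look different.

```python
import re

# Assumed shapes, for illustration only.
SUBJECT_MAP = {"CHE": "Chem"}
EXPLICIT_RULES = {"OChem": ["Organic Chemistry", "CHE8A", "CHE 8A", "Chem 8A"]}

# Generic DEPT + number pattern, e.g. CHE2A or CHE 2A.
COURSE_RE = re.compile(r"\b([A-Z]{2,4})\s?(\d{1,3}[A-Z]?)\b")


def course_aliases(title):
    """Return the extra search forms to append after the title."""
    aliases = []
    # Explicit rules first: a keyword in the title pulls in a fixed alias family.
    for keyword, extra in EXPLICIT_RULES.items():
        if re.search(rf"\b{re.escape(keyword)}\b", title, re.IGNORECASE):
            aliases.extend(extra)
    # Then the generic pattern: compact, spaced, and subject-name forms.
    for dept, number in COURSE_RE.findall(title):
        aliases.append(f"{dept}{number}")
        aliases.append(f"{dept} {number}")
        if dept in SUBJECT_MAP:
            aliases.append(f"{SUBJECT_MAP[dept]} {number}")
    # Keep order, drop duplicates and forms the title already contains.
    result = []
    for alias in aliases:
        if alias not in result and alias not in title:
            result.append(alias)
    return result


print(course_aliases("CHE2A Midterm Review"))
# ['CHE 2A', 'Chem 2A']
print(course_aliases("OChem Lecture Notes"))
# ['Organic Chemistry', 'CHE8A', 'CHE 8A', 'Chem 8A']
```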

Install

Requires the Tesseract OCR engine on your PATH, plus poppler if you want to OCR PDFs.

# macOS
brew install tesseract poppler
# Debian/Ubuntu
sudo apt-get install tesseract-ocr poppler-utils

pip install -e .

Usage

# Dry run over a folder (prints planned renames, changes nothing):
receipt-renamer batch ~/Scans

# Apply the renames:
receipt-renamer batch ~/Scans --commit

# Move renamed files into a tidy archive instead of renaming in place:
receipt-renamer batch ~/Scans --commit --dest ~/Receipts

# One file, with a note:
receipt-renamer one ~/Scans/IMG_0001.jpg --commit --notes "reimburse work"

# Watch the folder and rename new scans as they arrive:
receipt-renamer watch ~/Scans --commit

# Print the default config so you can copy and tweak it:
receipt-renamer dump-config > my-rules.yaml
receipt-renamer batch ~/Scans --config my-rules.yaml --commit

Dry run is the default. Nothing is renamed until you pass --commit.
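
The dry-run/commit split can be sketched as follows. This is not the package's code: the function names are hypothetical, and the " (2)" suffix used to avoid collisions is an assumption about how a collision-safe target might be chosen.

```python
import tempfile
from pathlib import Path
from typing import Optional


def collision_safe_target(dest_dir: Path, name: str) -> Path:
    """Assumed scheme: append ' (2)', ' (3)', ... until the name is free."""
    candidate = dest_dir / name
    stem, suffix = candidate.stem, candidate.suffix
    counter = 2
    while candidate.exists():
        candidate = dest_dir / f"{stem} ({counter}){suffix}"
        counter += 1
    return candidate


def apply_plan(src: Path, new_name: str,
               dest_dir: Optional[Path] = None, commit: bool = False) -> Path:
    """Print the planned rename; touch the filesystem only when commit is True."""
    target = collision_safe_target(dest_dir or src.parent, new_name)
    if commit:
        target.parent.mkdir(parents=True, exist_ok=True)
        src.rename(target)
    else:
        print(f"[dry run] {src.name} -> {target.name}")
    return target


# Small demonstration in a throwaway folder.
with tempfile.TemporaryDirectory() as tmp:
    scans = Path(tmp)
    first = scans / "IMG_0001.jpg"
    second = scans / "IMG_0002.jpg"
    first.write_bytes(b"")
    second.write_bytes(b"")

    apply_plan(first, "Receipt Safeway.jpg")               # dry run only
    dry_run_left_file_alone = first.exists()

    moved = apply_plan(first, "Receipt Safeway.jpg", commit=True)
    committed = moved.exists() and not first.exists()

    clash_name = apply_plan(second, "Receipt Safeway.jpg").name
```

Because the plan is computed the same way in both modes, what a dry run prints is what --commit would do, as long as the folder has not changed in between.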

Configuration

Run receipt-renamer dump-config to see the full annotated default. Highlights:

  • stores — canonical name + aliases/spellings. Whole-word matching, so Target won't match "targeting". Longest matching alias wins.
  • snap_patterns — regexes that flag a SNAP/EBT receipt.
  • receipt_signals / receipt_min_signals — a page is a receipt if a known store is found, or if it hits at least this many signals (TOTAL, TAX, $x.xx, card brands, …). This catches receipts from stores not in your list.
  • courses — both a generic DEPT + number pattern (with a subject_map so CHE → Chem) and explicit rules (OChem → Organic Chemistry + the CHE8A/CHE 8A/Chem 8A family).
  • templates / datetime_format — the output filename shapes.
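
To make the stores and receipt_signals rules concrete, here is a rough sketch of whole-word matching with longest-alias-wins, followed by the signal-count fallback. The dictionaries and regexes below are made-up stand-ins for the YAML contents, and find_store and is_receipt are hypothetical names.

```python
import re

# Stand-ins for the YAML rule table (assumed shape and values).
STORES = {
    "Target": ["Target"],
    "CVS": ["CVS", "CVS Pharmacy"],
    "Whole Foods": ["Whole Foods", "Whole Foods Market"],
}
RECEIPT_SIGNALS = [
    r"\bTOTAL\b",
    r"\bTAX\b",
    r"\$\d+\.\d{2}",
    r"\b(VISA|MASTERCARD|AMEX)\b",
]
RECEIPT_MIN_SIGNALS = 2


def find_store(text):
    """Return the canonical store whose longest alias appears as a whole word."""
    best = None  # (alias length, canonical name)
    for canonical, aliases in STORES.items():
        for alias in aliases:
            # Lookarounds enforce whole-word matching even for aliases with spaces.
            pattern = rf"(?<!\w){re.escape(alias)}(?!\w)"
            if re.search(pattern, text, re.IGNORECASE):
                if best is None or len(alias) > best[0]:
                    best = (len(alias), canonical)
    return best[1] if best else None


def is_receipt(text):
    """A known store is enough; otherwise count distinct signal hits."""
    if find_store(text):
        return True
    hits = sum(1 for p in RECEIPT_SIGNALS if re.search(p, text, re.IGNORECASE))
    return hits >= RECEIPT_MIN_SIGNALS


print(find_store("TARGET Store #123"))            # Target
print(find_store("targeting new customers"))      # None
print(find_store("CVS Pharmacy inside Target"))   # CVS
print(is_receipt("SUBTOTAL 4.00\nTAX 0.36\nTOTAL $4.36"))  # True
print(is_receipt("Lecture 3: reaction kinetics"))           # False
```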

Architecture

The core operates entirely on text, never on pixels, so it's fast to test and the OCR backend is swappable:

module     responsibility
config     load + validate the YAML rule table
ocr        pluggable OCR (OcrFn); default backend = Tesseract via pytesseract / pdf2image
stores     whole-word store recognition
receipts   date/time parsing + SNAP/EBT detection
courses    course-code expansion
classify   receipt vs document decision
rename     filename assembly, sanitization, collision-safe targets (pure)
processor  OCR → plan → (optional) rename, never throws on a bad file
watcher    watchdog folder watcher with a write-settle delay
cli        batch / one / watch / dump-config
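
The swappable OCR backend might look roughly like this. The signature of OcrFn is an assumption (a callable from a path to text), and plan_name is a deliberately tiny stand-in for the real processor. The pytesseract and Pillow calls in tesseract_ocr are real APIs, but they are imported lazily so the rest of the sketch runs without them.

```python
from pathlib import Path
from typing import Callable

# Assumed signature: take a file path, return the recognised text.
OcrFn = Callable[[Path], str]


def tesseract_ocr(path: Path) -> str:
    """Image backend using pytesseract; needs the Tesseract binary on PATH."""
    import pytesseract
    from PIL import Image

    with Image.open(path) as img:
        return pytesseract.image_to_string(img)


def plan_name(path: Path, ocr: OcrFn) -> str:
    """Toy stand-in for the processor: it only ever sees text, never pixels."""
    text = ocr(path)
    prefix = "Receipt" if "TOTAL" in text.upper() else "Document"
    return f"{prefix} {path.stem}{path.suffix}"


def fake_ocr(path: Path) -> str:
    """Test double that returns canned text instead of running OCR."""
    return "SAFEWAY\nTOTAL $4.36"


print(plan_name(Path("IMG_0001.jpg"), fake_ocr))
# Receipt IMG_0001.jpg
```

Passing the OCR function in as an argument is what lets tests substitute canned text for a real Tesseract run.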

Tests

pip install -e ".[test]"
pytest

The bulk of the suite runs against saved OCR-text fixtures (in tests/fixtures/) and a mocked OCR function, so it needs no Tesseract. tests/test_ocr_end_to_end.py renders a synthetic receipt with Pillow and runs the real Tesseract pipeline; it skips automatically if the binary is absent.

Notes & limitations

  • OCR quality is bounded by Tesseract and the scan. Faded thermal receipts and skewed photos will parse worse; the heuristics are deliberately forgiving.
  • Date parsing favours the first plausible date on the page. Very unusual layouts may pick the wrong one — review a dry run before --commit.
  • For PDFs, only the first couple of pages are OCR'd (configurable via --pdf-pages).
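
The "first plausible date" behaviour, and why it can pick the wrong one, can be seen in a small sketch. The patterns below cover only three of the formats mentioned earlier and are assumptions, not the package's actual rules; %b also depends on an English locale.

```python
import re
from datetime import datetime

# (regex, strptime format) pairs; an illustrative subset only.
DATE_PATTERNS = [
    (r"\b\d{4}-\d{2}-\d{2}\b", "%Y-%m-%d"),               # ISO
    (r"\b\d{1,2}/\d{1,2}/\d{4}\b", "%m/%d/%Y"),           # US
    (r"\b[A-Z][a-z]{2} \d{1,2}, \d{4}\b", "%b %d, %Y"),   # Jan 5, 2024
]
TIME_RE = re.compile(r"\b(\d{1,2}):(\d{2})(?::\d{2})?(?:\s?([AP]M)\b)?", re.IGNORECASE)


def first_date(text):
    """Return the earliest-positioned match that parses as a real date."""
    candidates = []
    for pattern, fmt in DATE_PATTERNS:
        for m in re.finditer(pattern, text):
            try:
                candidates.append((m.start(), datetime.strptime(m.group(), fmt).date()))
            except ValueError:
                continue  # looks like a date but is not one, e.g. 99/99/2024
    return min(candidates, key=lambda c: c[0])[1] if candidates else None


def first_time(text):
    """Return the first valid 12h or 24h time as HHMM, or None."""
    for m in TIME_RE.finditer(text):
        hour, minute, meridiem = int(m.group(1)), int(m.group(2)), m.group(3)
        if meridiem:
            if not 1 <= hour <= 12:
                continue
            hour = hour % 12 + (12 if meridiem.upper() == "PM" else 0)
        if hour < 24 and minute < 60:
            return f"{hour:02d}{minute:02d}"
    return None


def receipt_datetime(text):
    """Format as YYYY-MM-DD HHMM, or date only when no time is found."""
    date = first_date(text)
    if date is None:
        return None
    time = first_time(text)
    return f"{date.isoformat()} {time}" if time else date.isoformat()


print(receipt_datetime("SAFEWAY\n03/14/2024 10:42 AM\nTOTAL $4.36"))
# 2024-03-14 1042
print(receipt_datetime("Member since 01/02/2020\nDate 03/14/2024"))
# 2020-01-02
```

The second call shows the limitation: an earlier, unrelated date on the page is chosen over the purchase date, which is why a dry run is worth reviewing.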

License

MIT — see LICENSE.
