Watch a scans folder and auto-rename receipt images/PDFs (and course/document scans) from their OCR'd text.
receipt-renamer
Point it at the folder your scanner app drops files into. For every receipt image/PDF it OCRs the page and renames it to something searchable:
Receipt 2024-03-14 1042 Safeway SNAP EBT Adobe Scan.jpg
Non-receipt scans (lecture notes, handouts) are handled by a separate rule
layer: it reads the page-top title and expands course codes into every form
you might later search for, so CHE2A Midterm Review.pdf becomes:
CHE2A Midterm Review CHE 2A Chem 2A Genius Scan.pdf
and OChem Lecture Notes.pdf becomes:
OChem Lecture Notes Organic Chemistry CHE8A CHE 8A Chem 8A.pdf
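The two expansions above can be approximated with a small rule lookup. This is a minimal sketch, not the package's actual code: `SUBJECT_MAP`, `EXPLICIT_RULES`, `COURSE_RE`, and `expand_title` are hypothetical names, and the rule data is hard-coded here, whereas the real tool reads its rules from the YAML file.

```python
import re

# Hypothetical rule data; the real tool loads its equivalents from the YAML file.
SUBJECT_MAP = {"CHE": "Chem"}
EXPLICIT_RULES = {"OChem": ["Organic Chemistry", "CHE8A", "CHE 8A", "Chem 8A"]}

# Assumed generic pattern: 2-4 capital letters, then a number with an optional letter.
COURSE_RE = re.compile(r"\b([A-Z]{2,4})\s?(\d{1,3}[A-Z]?)\b")


def expand_title(title: str) -> str:
    """Append searchable aliases for any course reference found in the title."""
    aliases = []
    # Explicit keyword rules come first (e.g. OChem).
    for keyword, extra in EXPLICIT_RULES.items():
        if re.search(rf"\b{re.escape(keyword)}\b", title, re.IGNORECASE):
            aliases.extend(extra)
    # Then the generic DEPT + number pattern (e.g. CHE2A).
    for dept, number in COURSE_RE.findall(title):
        aliases.append(f"{dept} {number}")
        if dept in SUBJECT_MAP:
            aliases.append(f"{SUBJECT_MAP[dept]} {number}")
    # Keep order; drop aliases that repeat or already appear in the title.
    kept = []
    for alias in aliases:
        if alias not in title and alias not in kept:
            kept.append(alias)
    return " ".join([title, *kept])


print(expand_title("CHE2A Midterm Review"))
print(expand_title("OChem Lecture Notes"))
```

The scanner-app marker and the file extension are left out of the sketch; they would be appended by the filename template.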
Everything — the store list, the SNAP/EBT patterns, the receipt heuristics, the scanner-app markers, the course rules, and the filename templates — lives in one YAML file you can edit.
How the name is built
Receipts use the template:
Receipt {datetime} {store} {SNAP EBT if present} {scanner-app} {your notes}
- {datetime}: parsed from the receipt and written as YYYY-MM-DD HHMM. Many date/time formats are understood (US MM/DD/YYYY, ISO YYYY-MM-DD, Jan 5, 2024, 12h/24h times). If no date is found, the name falls back to Receipt {store} ….
- {store}: matched against a known-store list (Safeway, Costco, Trader Joe's, Whole Foods, Target, Walmart, Kroger, CVS, Walgreens, Aldi, Sprouts, and more; all editable).
- SNAP EBT: inserted only when a SNAP/EBT line is detected.
- {scanner-app}: detected from the OCR text or the original filename (Genius Scan, Adobe Scan, CamScanner, Microsoft Lens, …).
- {your notes}: anything you pass with --notes.
Empty fields collapse cleanly — no double spaces, no dangling separators.
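One way to get this collapsing behaviour is to fill the template first and normalise whitespace afterwards. The sketch below is illustrative only: `build_name` and the placeholder names `snap`, `scanner`, and `notes` are assumptions, not the package's real template keys.

```python
import re

# Hypothetical template; the placeholder names are made up for this sketch.
RECEIPT_TEMPLATE = "Receipt {datetime} {store} {snap} {scanner} {notes}"


def build_name(template: str, fields: dict) -> str:
    """Fill {placeholders}, then collapse the gaps left by empty fields."""
    # Missing or empty fields are substituted with an empty string.
    filled = re.sub(r"\{([^{}]+)\}", lambda m: fields.get(m.group(1), ""), template)
    # Runs of whitespace become one space; leading/trailing space is dropped.
    return re.sub(r"\s+", " ", filled).strip()


full = {
    "datetime": "2024-03-14 1042",
    "store": "Safeway",
    "snap": "SNAP EBT",
    "scanner": "Adobe Scan",
}
print(build_name(RECEIPT_TEMPLATE, full))
print(build_name(RECEIPT_TEMPLATE, {"store": "Costco"}))
```

Filling first and cleaning up afterwards keeps the template readable, since it never has to encode which separators are optional.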
Documents (anything not detected as a receipt) use:
{title} {course-code aliases} {scanner-app} {your notes}
Install
Requires the Tesseract OCR engine on your PATH, plus poppler if you want to OCR PDFs.
# macOS
brew install tesseract poppler
# Debian/Ubuntu
sudo apt-get install tesseract-ocr poppler-utils
pip install -e .
Usage
# Dry run over a folder (prints planned renames, changes nothing):
receipt-renamer batch ~/Scans
# Apply the renames:
receipt-renamer batch ~/Scans --commit
# Move renamed files into a tidy archive instead of renaming in place:
receipt-renamer batch ~/Scans --commit --dest ~/Receipts
# One file, with a note:
receipt-renamer one ~/Scans/IMG_0001.jpg --commit --notes "reimburse work"
# Watch the folder and rename new scans as they arrive:
receipt-renamer watch ~/Scans --commit
# Print the default config so you can copy and tweak it:
receipt-renamer dump-config > my-rules.yaml
receipt-renamer batch ~/Scans --config my-rules.yaml --commit
Dry run is the default. Nothing is renamed until you pass --commit.
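The dry-run-by-default behaviour can be pictured as a function that always computes the target path but only touches the filesystem when asked. This is a rough sketch under assumptions: `unique_target` and `plan_rename` are hypothetical helpers, and the " (2)" suffix for name collisions is an invented convention, not necessarily what the tool does.

```python
import shutil
from pathlib import Path
from typing import Optional


def unique_target(directory: Path, name: str) -> Path:
    """Return a path in `directory` that does not clobber an existing file."""
    candidate = directory / name
    stem, suffix = candidate.stem, candidate.suffix
    counter = 2
    while candidate.exists():
        # Assumed convention: "name (2).ext", "name (3).ext", ...
        candidate = directory / f"{stem} ({counter}){suffix}"
        counter += 1
    return candidate


def plan_rename(source: Path, new_name: str, commit: bool = False,
                dest: Optional[Path] = None) -> Path:
    """Compute the target path; move the file only when commit is True."""
    target_dir = dest if dest is not None else source.parent
    target = unique_target(target_dir, new_name)
    if commit:
        target_dir.mkdir(parents=True, exist_ok=True)
        shutil.move(str(source), str(target))
    return target
```

Calling `plan_rename(path, name)` without `commit=True` corresponds to the dry run: it reports where the file would go and leaves it in place.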
Configuration
Run receipt-renamer dump-config to see the full annotated default. Highlights:
- stores: canonical name + aliases/spellings. Whole-word matching, so Target won't match "targeting". Longest matching alias wins.
- snap_patterns: regexes that flag a SNAP/EBT receipt.
- receipt_signals / receipt_min_signals: a page is a receipt if a known store is found, or if it hits at least this many signals (TOTAL, TAX, $x.xx, card brands, …). This catches receipts from stores not in your list.
- courses: both a generic DEPT + number pattern (with a subject_map so CHE → Chem) and explicit rules (OChem → Organic Chemistry + the CHE8A / CHE 8A / Chem 8A family).
- templates / datetime_format: the output filename shapes.
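To make the store and signal rules concrete, here is a small sketch of whole-word, longest-alias matching and the signal-count fallback. The rule data, the regexes, and the function names `match_store` and `is_receipt` are assumptions for illustration; the shipped defaults live in the YAML file and may differ.

```python
import re
from typing import Optional

# Hypothetical, trimmed-down rule data.
STORES = {
    "Target": ["Target"],
    "Trader Joe's": ["Trader Joe's", "Trader Joes"],
    "Whole Foods": ["Whole Foods", "Whole Foods Market"],
}
RECEIPT_SIGNALS = [r"\bTOTAL\b", r"\bTAX\b", r"\$\d+\.\d{2}", r"\b(VISA|MASTERCARD)\b"]
RECEIPT_MIN_SIGNALS = 2


def match_store(text: str) -> Optional[str]:
    """Return the canonical store whose longest alias appears as a whole word."""
    best_name, best_len = None, 0
    for canonical, aliases in STORES.items():
        for alias in aliases:
            # Lookarounds reject matches embedded in a longer word.
            pattern = rf"(?<!\w){re.escape(alias)}(?!\w)"
            if len(alias) > best_len and re.search(pattern, text, re.IGNORECASE):
                best_name, best_len = canonical, len(alias)
    return best_name


def is_receipt(text: str) -> bool:
    """A known store, or enough generic receipt signals, marks a receipt."""
    if match_store(text) is not None:
        return True
    hits = sum(1 for sig in RECEIPT_SIGNALS if re.search(sig, text, re.IGNORECASE))
    return hits >= RECEIPT_MIN_SIGNALS
```

With this data, `match_store("targeting new customers")` finds nothing, while a page with TAX and TOTAL lines from an unlisted shop still counts as a receipt.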
Architecture
The core operates entirely on text, never on pixels, so it's fast to test and the OCR backend is swappable:
| module | responsibility |
|---|---|
| config | load + validate the YAML rule table |
| ocr | pluggable OCR (OcrFn); default backend = Tesseract via pytesseract / pdf2image |
| stores | whole-word store recognition |
| receipts | date/time parsing + SNAP/EBT detection |
| courses | course-code expansion |
| classify | receipt vs document decision |
| rename | filename assembly, sanitization, collision-safe targets (pure) |
| processor | OCR → plan → (optional) rename, never throws on a bad file |
| watcher | watchdog folder watcher with a write-settle delay |
| cli | batch / one / watch / dump-config |
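The swappable backend can be pictured as a plain callable that is passed into the text-only core. The signature below is an assumption (the source names OcrFn but not its shape), and `plan_name` and `fake_ocr` are hypothetical stand-ins rather than functions from the package.

```python
from pathlib import Path
from typing import Callable

# Assumed shape of the pluggable backend: a path in, recognised text out.
OcrFn = Callable[[Path], str]


def plan_name(path: Path, ocr: OcrFn) -> str:
    """Derive a new filename from OCR text only; the backend is injected."""
    try:
        text = ocr(path)
    except Exception:
        # Mirror the "never throws on a bad file" idea: keep the old name.
        return path.name
    # Use the first non-blank line as a stand-in for the real naming rules.
    first_line = next((ln.strip() for ln in text.splitlines() if ln.strip()), "")
    if not first_line:
        return path.name
    return f"{first_line}{path.suffix}"


def fake_ocr(path: Path) -> str:
    """Test backend: returns canned text instead of running Tesseract."""
    return "CHE2A Midterm Review\nQuestion 1 ...\n"


print(plan_name(Path("scan_001.pdf"), fake_ocr))
```

Because the core only ever sees text, a canned function like `fake_ocr` is enough to exercise it without Tesseract installed.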
Tests
pip install -e ".[test]"
pytest
The bulk of the suite runs against saved OCR-text fixtures (in
tests/fixtures/) and a mocked OCR function, so it needs no Tesseract.
tests/test_ocr_end_to_end.py renders a synthetic receipt with Pillow and runs
the real Tesseract pipeline; it skips automatically if the binary is absent.
Notes & limitations
- OCR quality is bounded by Tesseract and the scan. Faded thermal receipts and skewed photos will parse worse; the heuristics are deliberately forgiving.
- Date parsing favours the first plausible date on the page. Very unusual layouts may pick the wrong one, so review a dry run before --commit.
- For PDFs, only the first couple of pages are OCR'd (configurable via --pdf-pages).
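The "first plausible date" behaviour, and why it can pick the wrong one, is easier to see in code. This is a simplified sketch that assumes only three date formats and ignores times; `DATE_PATTERNS` and `first_date` are hypothetical names, and the package's own parser handles more formats. Month-name parsing with `%b` also depends on an English locale.

```python
import re
from datetime import datetime
from typing import Optional

# (regex, strptime format) pairs; a small assumed subset of the supported formats.
DATE_PATTERNS = [
    (re.compile(r"\b\d{4}-\d{2}-\d{2}\b"), "%Y-%m-%d"),
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"), "%m/%d/%Y"),
    (re.compile(r"\b[A-Z][a-z]{2} \d{1,2}, \d{4}\b"), "%b %d, %Y"),
]


def first_date(text: str) -> Optional[datetime]:
    """Return the earliest-positioned candidate that parses as a real date."""
    candidates = []
    for regex, fmt in DATE_PATTERNS:
        for match in regex.finditer(text):
            candidates.append((match.start(), match.group(), fmt))
    # Walk candidates in page order and stop at the first valid one.
    for _, raw, fmt in sorted(candidates):
        try:
            return datetime.strptime(raw, fmt)
        except ValueError:
            continue  # looks like a date but is not one, e.g. 13/45/2024
    return None


page = "SAFEWAY\n03/14/2024 10:42 AM\nReturn by 2024-04-13"
print(first_date(page))
```

Here the purchase date wins because it appears first. A receipt that prints a return-by or promotion date above the purchase date would be misread, which is the case a dry run helps catch.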
License
MIT — see LICENSE.