Receipt and bill parser using OCR
receiptparser
Summary
A receipt and bill parser written in Python. Can be used as a Python module or CLI tool.
It was originally based on receipt-parser, but has since been almost completely rewritten.
So far, only German receipts are supported, but other countries can be added using a simple YAML configuration file.
Recognition rate
To develop this tool, I used a set of 182 receipts of varying quality. Some of them were crumpled, most had been folded, etc. The results on this set of receipts are:
Total: 182
Company found: 171
Postal code found: 158
Date found: 159
Amount found: 114
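For easier comparison with the percentages quoted elsewhere in this README, the counts above can be converted into per-field recognition rates. The snippet below is only a small arithmetic sketch; the dictionary keys are illustrative labels, not names used by receiptparser.

```python
# Counts taken from the results listed above (182 receipts in total).
total = 182
counts = {"Company": 171, "Postal code": 158, "Date": 159, "Amount": 114}

# Recognition rate per field, in percent, rounded to one decimal place.
rates = {field: round(100 * found / total, 1) for field, found in counts.items()}

for field, rate in rates.items():
    print(f"{field}: {rate}%")
# Company: 94.0%
# Postal code: 86.8%
# Date: 87.4%
# Amount: 62.6%
```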
If your receipts are sharp, uncrumpled, and have good contrast, I would expect a 97%-99% success rate, except for the total amount, which is harder to identify correctly. That may be closer to 75%.
Where applicable, I chose automation and quality over performance. For example, receiptparser scans every image twice, once unsharpened and once sharpened, which raises the recognition rate by around 6% but doubles the scan time.
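The two-pass idea can be illustrated with a minimal sketch. This is not receiptparser's actual implementation: the function name scan_twice, the callables ocr_extract and sharpen, and the rule of keeping the first non-empty value per field are all assumptions made for illustration. How receiptparser really combines the two passes is not described here.

```python
from typing import Any, Callable, Dict, Optional

Fields = Dict[str, Optional[str]]


def scan_twice(
    image: Any,
    ocr_extract: Callable[[Any], Fields],  # hypothetical: OCR + field extraction
    sharpen: Callable[[Any], Any],         # hypothetical: returns a sharpened copy
) -> Fields:
    """Extract fields from the original and a sharpened copy of an image.

    For each field, the value from the unsharpened pass is preferred;
    the sharpened pass fills in anything the first pass missed.
    """
    plain = ocr_extract(image)
    sharp = ocr_extract(sharpen(image))
    merged: Fields = {}
    for key in plain.keys() | sharp.keys():
        merged[key] = plain.get(key) or sharp.get(key)
    return merged


# Stand-in functions so the sketch runs without an OCR engine.
def _fake_ocr(img: str) -> Fields:
    if img == "raw":
        return {"company": "Example Markt", "date": None}
    return {"company": None, "date": "2020-03-14", "sum": "12.34"}


result = scan_twice("raw", _fake_ocr, lambda img: "sharpened")
print(result["company"], result["date"], result["sum"])
```

The cost is visible in the sketch: ocr_extract runs twice per image, which matches the doubled scan time mentioned above.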
Installation
Prerequisites
- Python 3
- pip3
- tesseract
Install via PIP
pip3 install receiptparser
Install via Git
pip3 install -r requirements.txt
pip3 install .
Python usage
from receiptparser.config import read_config
from receiptparser.parser import process_receipt

# Load the parser configuration from a YAML file
config = read_config('my_config.yml')
# Run OCR on a single image and extract the receipt fields
receipt = process_receipt(config, 'my_receipt.jpg', out_dir=None, verbosity=0)

print("Filename:   ", receipt.filename)
print("Company:    ", receipt.company)
print("Postal code:", receipt.postal)
print("Date:       ", receipt.date)
print("Amount:     ", receipt.sum)
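When processing many receipts from Python, it can be handy to collect the fields in a CSV file. The sketch below assumes that a field which could not be recognized is None on the receipt object (this is an assumption, not documented behaviour). It uses SimpleNamespace stand-ins instead of real results from process_receipt, so the helper names receipt_to_row and write_csv are hypothetical and not part of receiptparser.

```python
import csv
import io
from types import SimpleNamespace

# Attribute names as used in the example above.
FIELDS = ("filename", "company", "postal", "date", "sum")


def receipt_to_row(receipt):
    """Turn a receipt-like object into a list of strings; missing fields become ''."""
    row = []
    for name in FIELDS:
        value = getattr(receipt, name, None)
        row.append("" if value is None else str(value))
    return row


def write_csv(receipts, stream):
    """Write a header line plus one line per receipt to a text stream."""
    writer = csv.writer(stream)
    writer.writerow(FIELDS)
    for receipt in receipts:
        writer.writerow(receipt_to_row(receipt))


# Stand-in object; in real use this would come from process_receipt().
sample = SimpleNamespace(
    filename="my_receipt.jpg", company="Example Markt",
    postal=None, date=None, sum="12.34",
)
buffer = io.StringIO()
write_csv([sample], buffer)
print(buffer.getvalue())
```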
CLI Usage
Examples
A simple example to read all images (.jpg) from a directory and print the recognized data to stdout:
receiptparser tests/data/germany/img/
You can customize the output as follows:
receiptparser -v0 --format "{date:%Y-%m-%d} - {company} - {postal} - {sum}.jpg" tests/data/germany/img/
In this case, -v0 suppresses all output except what you specify in the --format FORMAT parameter. FORMAT is a Python format string (see the Format String Syntax section of the Python documentation).
The following values can be used in the format string:
- company: The recognized name of the company
- postal: The recognized postal code of the company
- date: The recognized date of the bill or receipt
- sum: The dollar (or Euro, or other currency) amount of the bill or receipt
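To see how such a format string expands, the following sketch applies the format from the example above using plain str.format. The values are made up, and the types are assumptions: the date is taken to be a datetime.date (the %Y-%m-%d spec suggests a date-like object), and the amount is shown as a string.

```python
from datetime import date

fmt = "{date:%Y-%m-%d} - {company} - {postal} - {sum}.jpg"

# Made-up example values; real values would come from a parsed receipt.
values = {
    "date": date(2020, 3, 14),
    "company": "Example Markt",
    "postal": "10115",
    "sum": "12.34",
}

filename = fmt.format(**values)
print(filename)
# 2020-03-14 - Example Markt - 10115 - 12.34.jpg
```

A format like this is convenient for generating descriptive file names for scanned receipts.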
Syntax
usage: receiptparser [-h] [-c CONFIG] [--config-file CONFIG_FILE] [-t TESSERACT] [-f FORMAT] [-v {0,1,2}] input

positional arguments:
  input                 file or directory from which images will be read

optional arguments:
  -h, --help            show this help message and exit
  -c CONFIG, --config CONFIG
                        built-in config to use
  --config-file CONFIG_FILE
                        like -c, but point to a file instead
  -t TESSERACT, --tesseract TESSERACT
                        output directory for OCR recognized text (default is to discard)
  -f FORMAT, --format FORMAT
                        format of the recognized output. default is pretty-printing
  -v {0,1,2}, --verbosity {0,1,2}
                        increase output verbosity