Receipt and bill parser using OCR

receiptparser

Summary

A receipt and bill parser written in Python. Can be used as a Python module or CLI tool.

It was originally based on receipt-parser, but has effectively been completely rewritten/replaced.

So far, only German receipts are supported, but other countries can be added using a simple YAML configuration file.

Recognition rate

To develop this tool, I used a set of 182 receipts of varying quality. Some of them were crumpled, most had been folded, etc. The results on this set of receipts are:

Total:             182
Company found:     171
Postal code found: 158
Date found:        159
Amount found:      114
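Expressed as percentages, the counts above work out to roughly 94% (company), 87% (postal code), 87% (date), and 63% (amount). A short sketch of that arithmetic, using only the numbers from the table:

```python
# Counts copied from the results table above.
total = 182
found = {
    "Company": 171,
    "Postal code": 158,
    "Date": 159,
    "Amount": 114,
}

# Recognition rate per field, as a percentage of all receipts in the set.
rates = {field: 100 * count / total for field, count in found.items()}

for field, rate in rates.items():
    print(f"{field + ':':<13}{rate:5.1f}%")
```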

If your receipts are sharp, uncrumpled, and have good contrast, I would expect a 97%-99% success rate, except for the total amount, which is harder to identify correctly. That may be closer to 75%.

Where applicable, I chose automation and quality over performance. For example, receiptparser scans every image twice, once unsharpened and once sharpened, which raises the recognition rate by around 6% but doubles the scan time.
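The two-pass idea can be sketched roughly as follows. This is not receiptparser's actual code: the names scan_twice, recognize, and sharpen are hypothetical, and the merge rule (prefer the unsharpened result, fill gaps from the sharpened one) is an assumption made for illustration only.

```python
from typing import Callable, Dict, Optional, TypeVar

Image = TypeVar("Image")
Fields = Dict[str, Optional[str]]


def scan_twice(
    image: Image,
    recognize: Callable[[Image], Fields],
    sharpen: Callable[[Image], Image],
) -> Fields:
    """Run recognition on the plain and the sharpened image, then merge."""
    plain = recognize(image)
    sharp = recognize(sharpen(image))
    # Assumed merge rule: start from the sharpened result, then let every
    # field that the plain pass did recognize take precedence.
    merged = dict(sharp)
    merged.update({key: value for key, value in plain.items() if value is not None})
    return merged


# Toy stand-ins so the sketch runs without an OCR engine (hypothetical).
def fake_sharpen(img: str) -> str:
    return img + "+sharp"


def fake_recognize(img: str) -> Fields:
    if img.endswith("+sharp"):
        return {"company": "ACME", "date": "2020-01-31"}
    return {"company": "ACME", "date": None}


result = scan_twice("receipt.jpg", fake_recognize, fake_sharpen)
print(result)
```

In this toy run the date is missed on the plain pass and recovered from the sharpened pass, which is the kind of gain the second scan is meant to buy at the cost of doing the OCR work twice.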

Installation

Prerequisites

  • Python 3
  • PIP3
  • tesseract

Install via PIP

pip3 install receiptparser

Install via Git

pip3 install -r requirements.txt
pip3 install .

Python usage

from receiptparser.config import read_config
from receiptparser.parser import process_receipt

config = read_config('my_config.yml')
receipt = process_receipt(config, "my_receipt.jpg", out_dir=None, verbosity=0)

print("Filename:   ", receipt.filename)
print("Company:    ", receipt.company)
print("Postal code:", receipt.postal)
print("Date:       ", receipt.date)
print("Amount:     ", receipt.sum)

CLI Usage

Examples

A simple example to read all images (.jpg) from a directory and print the recognized data to stdout:

receiptparser tests/data/germany/img/

You can customize the output as follows:

receiptparser -v0 --format "{date:%Y-%m-%d} - {company} - {postal} - {sum}.jpg" tests/data/germany/img/

In this case, -v0 suppresses all output except for what you specify in the --format FORMAT parameter. FORMAT is a Python format string, as described in the Python documentation on format string syntax. The following values can be used in the format string:

  • company: The recognized name of the company
  • postal: The recognized postal code of the company
  • date: The recognized date of the bill or receipt
  • sum: The dollar (or Euro, or other currency) amount of the bill or receipt
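To see how such a format string expands, here is a small standalone sketch that uses only Python's built-in str.format. The values are made up, and it assumes the date is available as a datetime.date (which the {date:%Y-%m-%d} specifier suggests) and that the other fields behave like plain strings; how receiptparser represents them internally may differ.

```python
import datetime

fmt = "{date:%Y-%m-%d} - {company} - {postal} - {sum}.jpg"

# Hypothetical recognized values, for illustration only.
values = {
    "company": "ExampleMarkt",
    "postal": "10115",
    "date": datetime.date(2020, 3, 14),
    "sum": "12.34",
}

# A date object hands its format spec to strftime, so %Y-%m-%d works here.
line = fmt.format(**values)
print(line)  # 2020-03-14 - ExampleMarkt - 10115 - 12.34.jpg
```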

Syntax

usage: receiptparser [-h] [-c CONFIG] [--config-file CONFIG_FILE] [-t TESSERACT] [-f FORMAT] [-v {0,1,2}] input

positional arguments:
  input                 file or directory from which images will be read

optional arguments:
  -h, --help            show this help message and exit
  -c CONFIG, --config CONFIG
                        built-in config to use
  --config-file CONFIG_FILE
                        like -c, but point to a file instead
  -t TESSERACT, --tesseract TESSERACT
                        output directory for OCR recognized text (default is to discard)
  -f FORMAT, --format FORMAT
                        format of the recognized output. default is pretty-printing
  -v {0,1,2}, --verbosity {0,1,2}
                        increase output verbosity
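For readers who want to see how these options fit together, the following is a hypothetical argparse reconstruction of the syntax above. It is not taken from the receiptparser sources; it only mirrors the documented flags and parses the arguments from the earlier example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical reconstruction based only on the usage text above.
    parser = argparse.ArgumentParser(prog="receiptparser")
    parser.add_argument("input", help="file or directory from which images will be read")
    parser.add_argument("-c", "--config", help="built-in config to use")
    parser.add_argument("--config-file", help="like -c, but point to a file instead")
    parser.add_argument("-t", "--tesseract", help="output directory for OCR recognized text")
    parser.add_argument("-f", "--format", help="format of the recognized output")
    parser.add_argument("-v", "--verbosity", type=int, choices=[0, 1, 2],
                        help="output verbosity")
    return parser


# Parse the arguments from the customized-output example.
args = build_parser().parse_args(
    ["-v0", "--format", "{date:%Y-%m-%d} - {company}.jpg", "tests/data/germany/img/"]
)
print(args.verbosity, args.format, args.input)
```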
