
Parse and convert REWE eBons (digital receipts) to JSON.


REWE eBon Parser

The REWE eBon Parser is a Python package that parses REWE eBons (digital receipts) from PDF files and converts them into structured JSON. It can also output the raw text extracted from the PDFs for debugging purposes. This project is a rewrite of the rewe-ebon-parser TypeScript library; the example PDFs are borrowed from that library.

Features

  • Parse individual PDF files or entire folders containing PDF files.
  • Output parsed data as JSON.
  • Extract and output raw text from PDF files (basically, the output of the underlying pdfplumber library).
  • Concurrent processing of multiple PDF files with adjustable threading.
  • Detailed logging of processing results in CSV format.

Installation

You can install the package using pip:

pip install rewe-ebon-parser

Usage

Sample PDF receipts for testing are available in the examples/eBons folder of this repository; they are borrowed from rewe-ebon-parser.

Command Line Interface (CLI)

Parsing a Single PDF File and Saving to JSON

rewe-ebon-parser [--file] <input_pdf_path> [output_json_path]

Example:

rewe-ebon-parser examples/eBons/1.pdf

Parsing Multiple PDF Files in a Folder

rewe-ebon-parser [--folder] <input_folder> [output_folder] [--nthreads <number_of_threads>] 

Example:

rewe-ebon-parser examples/eBons/

Optional Arguments

  • --file: Explicitly specify that the input and output paths are files.
  • --folder: Explicitly specify that the input and output paths are folders.
  • --nthreads: Number of concurrent threads to use for processing files.
  • --rawtext-file: Write the raw text extracted from the PDF files to .txt files (mostly for debugging).
  • --rawtext-stdout: Print the raw text extracted from the PDF files to the console (mostly for debugging).
  • --version: Show the module version.
  • -h, --help: Show the help message.

Auto-detection Mode

If neither --file nor --folder is specified, the script will automatically detect if the input path is a file or a folder and process accordingly.
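The auto-detection described above can be pictured with a few lines of pathlib. This is only a sketch of the idea, not the package's actual code; the helper name detect_mode is hypothetical.

```python
from pathlib import Path


def detect_mode(input_path: str) -> str:
    """Return "file" or "folder" for an existing path (hypothetical helper)."""
    path = Path(input_path)
    if path.is_dir():
        return "folder"
    if path.is_file():
        return "file"
    # Neither a file nor a folder: the path does not exist.
    raise FileNotFoundError(f"Input path does not exist: {input_path}")
```

Passing --file or --folder explicitly skips this kind of check.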

Output

  • If output_json_path is not specified for a single file, the output will be saved in the same directory as the input file with a .json extension.
  • If output_folder is not specified for a folder, a subfolder named rewe_json_out will be created in the input folder, and the output JSON files will be saved there.
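As an illustration of these defaults, the two rules can be written with pathlib as follows. The function names are hypothetical and the package may compute the paths differently; only the resulting locations are taken from the description above.

```python
from pathlib import Path


def default_json_path(input_pdf: str) -> Path:
    # Same directory and file name as the input, with a .json extension.
    return Path(input_pdf).with_suffix(".json")


def default_output_folder(input_folder: str) -> Path:
    # A rewe_json_out subfolder inside the input folder.
    return Path(input_folder) / "rewe_json_out"
```

For example, examples/eBons/1.pdf maps to examples/eBons/1.json, and the folder examples/eBons/ maps to examples/eBons/rewe_json_out.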

Logging

A detailed log of processing results will be saved in the output folder as processing_log.csv, containing information on which files were successfully processed and which failed, along with error messages if any.
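To inspect the log from Python, something like the following should work. It is a minimal sketch that assumes the CSV starts with a header row; the column names are not documented here, so the code does not rely on any particular ones, and read_processing_log is a hypothetical helper.

```python
import csv
from pathlib import Path


def read_processing_log(output_folder: str) -> list[dict[str, str]]:
    """Load processing_log.csv from the output folder as a list of row dicts."""
    log_path = Path(output_folder) / "processing_log.csv"
    with log_path.open(newline="", encoding="utf-8") as f:
        # Assumption: the first row holds the column names.
        return list(csv.DictReader(f))
```

Each returned row is a dictionary keyed by whatever column names the log file contains.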

Use as a Python module in your own Python code

Direct use on files

from rewe_ebon_parser.parse import parse_pdf_ebon

# Parse the receipt and keep the returned data for further use.
result = parse_pdf_ebon("examples/eBons/1.pdf")
print(result)
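If you want to write the parsed data to JSON from your own code, the standard json module is enough. The sketch below assumes the parser returns a dictionary that may contain values such as datetime objects, which json cannot serialize by default; the sample dictionary and its field names are made up for illustration and do not describe the package's real output.

```python
import json
from datetime import datetime


def receipt_to_json(receipt: dict) -> str:
    # default=str converts non-JSON types (e.g. datetime) to strings;
    # ensure_ascii=False keeps umlauts in product names readable.
    return json.dumps(receipt, default=str, ensure_ascii=False, indent=2)


# Stand-in for a parsed receipt (hypothetical field names).
sample = {
    "date": datetime(2024, 1, 31, 18, 5),
    "total": 12.34,
    "items": [{"name": "Bio Äpfel", "subTotal": 2.99}],
}
print(receipt_to_json(sample))
```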

Passing a data_buffer: bytes

from rewe_ebon_parser.parse import parse_ebon

# This example reads the bytes from a file,
# but the buffer can come from any source.
def process_pdf(pdf_path):
    with open(pdf_path, 'rb') as f:
        data = f.read()
    # The file can be closed before parsing; only the bytes are needed.
    return parse_ebon(data)

result = process_pdf("examples/eBons/1.pdf")
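The bytes-based entry point also makes it straightforward to process many receipts concurrently from your own code, similar in spirit to the CLI's --nthreads option. The following is a sketch under assumptions, not the package's implementation: parse_folder is a hypothetical helper, and the parser is passed in as parse_fn so the helper does not depend on a specific function.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Callable


def parse_folder(folder: str, parse_fn: Callable[[bytes], object], nthreads: int = 4):
    """Parse every .pdf in folder with parse_fn; return (results, errors) keyed by file name."""

    def _parse_one(pdf: Path):
        return parse_fn(pdf.read_bytes())

    results, errors = {}, {}
    pdf_files = sorted(Path(folder).glob("*.pdf"))
    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        futures = {pool.submit(_parse_one, pdf): pdf for pdf in pdf_files}
        for future in as_completed(futures):
            name = futures[future].name
            try:
                results[name] = future.result()
            except Exception as exc:
                # Record the failure and keep processing the other receipts.
                errors[name] = str(exc)
    return results, errors
```

With the package installed, a call such as parse_folder("examples/eBons", parse_ebon) would collect the parsed receipts in the first dictionary and any error messages in the second.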

License

This project is licensed under the MIT License. For details, see the LICENSE file.

Caveats

So far, the module parses the items reliably but sometimes fails on PAYBACK points, because these are often presented differently from one REWE receipt to another.

Future Work

  • Dump all shopping items into a single CSV file with purchase dates.
