Parse and convert REWE eBons (digital receipts) to JSON.
REWE eBon Parser
The REWE eBon Parser is a Python package that parses REWE eBons (digital receipts) from PDF files and converts them into structured JSON. It can also output the raw text extracted from the PDFs for debugging. This project is a rewrite of the rewe-ebon-parser TypeScript library; the example PDFs are borrowed from that library.
Features
- Parse individual PDF files or entire folders containing PDF files.
- Output parsed data as JSON.
- Extract and output raw text from PDF files (basically, the output of the underlying pdfplumber library).
- Concurrent processing of multiple PDF files with an adjustable number of threads.
- Detailed logging of processing results in CSV format.
Installation
You can install the package using pip:
pip install rewe-ebon-parser
Usage
You can find PDF receipt files to test with in the examples/eBons folder of this repo; they are borrowed from rewe-ebon-parser.
Command Line Interface (CLI)
Parse a Single PDF File and Save to JSON
rewe-ebon-parser [--file] <input_pdf_path> [output_json_path]
Example:
rewe-ebon-parser examples/eBons/1.pdf
Parse Multiple PDF Files in a Folder
rewe-ebon-parser [--folder] <input_folder> [output_folder] [--nthreads <number_of_threads>]
Example:
rewe-ebon-parser examples/eBons/
Optional Arguments
- --file: Explicitly specify that the input and output paths are files.
- --folder: Explicitly specify that the input and output paths are folders.
- --nthreads: Number of concurrent threads to use for processing files.
- --rawtext-file: Output raw text extracted from the PDF files to .txt files (mostly for debugging).
- --rawtext-stdout: Print raw text extracted from the PDF files to the console (mostly for debugging).
- --version: Show the module version.
- -h, --help: Show help.
Auto-detection Mode
If neither --file nor --folder is specified, the script automatically detects whether the input path is a file or a folder and processes it accordingly.
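The auto-detection described above can be approximated with a few lines of pathlib. This is only a minimal sketch of the idea, not the package's actual implementation; the function name detect_mode is hypothetical.

```python
from pathlib import Path


def detect_mode(input_path: str) -> str:
    """Return "file" or "folder" depending on what input_path points to.

    Hypothetical helper: it mirrors the documented behaviour, but the
    package may implement the check differently.
    """
    path = Path(input_path)
    if path.is_file():
        return "file"
    if path.is_dir():
        return "folder"
    # Neither a file nor a folder: the path most likely does not exist.
    raise FileNotFoundError(f"Input path does not exist: {input_path}")
```

Passing --file or --folder explicitly would skip a check like this one.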
Output
- If output_json_path is not specified for a single file, the output is saved in the same directory as the input file with a .json extension.
- If output_folder is not specified for a folder, a subfolder named rewe_json_out is created in the input folder, and the output JSON files are saved there.
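The two default locations can be sketched as follows. The helper names are hypothetical, and the sketch assumes that the .json extension replaces .pdf (1.pdf becomes 1.json) rather than being appended; the tool itself may behave differently.

```python
from pathlib import Path


def default_json_path(input_pdf: str) -> Path:
    """Default output for a single file: same directory, .json extension.

    Assumption: the .pdf suffix is replaced, not extended.
    """
    return Path(input_pdf).with_suffix(".json")


def default_output_folder(input_folder: str) -> Path:
    """Default output for a folder: a rewe_json_out subfolder inside it."""
    return Path(input_folder) / "rewe_json_out"
```

Under these assumptions, examples/eBons/1.pdf maps to examples/eBons/1.json, and examples/eBons/ maps to examples/eBons/rewe_json_out.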
Logging
A detailed log of processing results is saved in the output folder as processing_log.csv. It records which files were processed successfully and which failed, along with any error messages.
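To inspect the log from Python, something like the following may be enough. The column names are not documented here, so this sketch assumes only that the CSV has a header row and returns each line as a dictionary; load_processing_log is a hypothetical helper, not part of the package.

```python
import csv
from pathlib import Path


def load_processing_log(output_folder: str) -> list[dict]:
    """Read processing_log.csv from the output folder into a list of dicts.

    Assumption: the first row of the CSV is a header. Check the keys of
    the returned rows to see which columns the tool actually writes.
    """
    log_path = Path(output_folder) / "processing_log.csv"
    with log_path.open(newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))
```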
Use as a Python module in your own Python code
Direct use on files
from rewe_ebon_parser.parse import parse_pdf_ebon
parse_pdf_ebon("examples/eBons/1.pdf")
Passing a data_buffer: bytes
from rewe_ebon_parser.parse import parse_ebon
# here the function is once again getting the data from a file,
# but input can come from anywhere
def process_pdf(pdf_path):
    with open(pdf_path, 'rb') as f:
        data = f.read()
    result = parse_ebon(data)
    return result

process_pdf("examples/eBons/1.pdf")
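Because parse_ebon accepts bytes, it can also be driven from your own code over a whole folder, much like the CLI's --nthreads option. The sketch below is one possible approach using the standard library's ThreadPoolExecutor; parse_folder is a hypothetical helper, and the parser is passed in as an argument so that the example does not depend on how the package is installed.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Any, Callable, Optional


def parse_folder(
    folder: str,
    parse_func: Callable[[bytes], Any],
    nthreads: int = 4,
) -> list[tuple[str, Any, Optional[str]]]:
    """Parse every *.pdf in folder with parse_func, using nthreads threads.

    Returns (file name, parsed result, error message) tuples in sorted
    file-name order. On failure the result is None and the error message
    is set; on success the error message is None.
    """
    pdf_paths = sorted(Path(folder).glob("*.pdf"))

    def parse_one(path: Path) -> tuple[str, Any, Optional[str]]:
        try:
            return path.name, parse_func(path.read_bytes()), None
        except Exception as exc:  # keep going if a single receipt fails
            return path.name, None, str(exc)

    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        # pool.map preserves the order of pdf_paths.
        return list(pool.map(parse_one, pdf_paths))
```

With the package installed, this could be called as parse_folder("examples/eBons/", parse_ebon) after importing parse_ebon as shown above.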
License
This project is licensed under the MIT License. For details, see the LICENSE file.
Caveats
So far, the module parses items reliably but sometimes fails on PAYBACK points, because these are presented in varying formats across REWE receipts.
Future Work
- Dump all shopping items into a single CSV file with purchase dates.
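Until such a feature exists, a rough version can be built on top of the parsed output. The sketch below is not part of the package, and the field names it uses ("date", "items", "name", "subTotal") are assumptions about the structure of the parsed receipts; adjust them to match the JSON the parser actually produces.

```python
import csv

# Assumed field names; verify them against real parser output.
CSV_COLUMNS = ["date", "name", "subTotal"]


def items_to_rows(receipts: list[dict]) -> list[dict]:
    """Flatten parsed receipts into one row per item, tagged with the date."""
    rows = []
    for receipt in receipts:
        for item in receipt.get("items", []):
            rows.append(
                {
                    "date": receipt.get("date"),
                    "name": item.get("name"),
                    "subTotal": item.get("subTotal"),
                }
            )
    return rows


def write_items_csv(receipts: list[dict], csv_path: str) -> None:
    """Write all items from all receipts into a single CSV file."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=CSV_COLUMNS)
        writer.writeheader()
        writer.writerows(items_to_rows(receipts))
```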