Python parser to extract data from PDF invoices
Project description
This project has been selected for GSoC 2018.
A modular Python library to support your accounting process. Tested on Python 2.7 and 3.4+. Main steps:
extracts text from PDF files using different techniques, like pdftotext, pdfminer or OCR – tesseract, tesseract4 or gvision (Google Cloud Vision).
searches for regex in the result using a YAML-based template system
saves results as CSV, JSON or XML or renames PDF files to match the content.
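The three steps above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: `SAMPLE_TEXT` stands in for the output of a reader such as pdftotext, and `FIELD_PATTERNS` and `search_fields` are hypothetical names chosen for this example.

```python
import json
import re

# Hypothetical text, standing in for what a reader such as pdftotext returns.
SAMPLE_TEXT = """Amazon Web Services, Inc.
Invoice Number: 42183017
TOTAL AMOUNT DUE ON August 3 , 2014 $4.11
"""

# Hypothetical field patterns, in the spirit of a template's "fields" section.
FIELD_PATTERNS = {
    "invoice_number": r"Invoice Number:\s+(\d+)",
    "amount": r"TOTAL AMOUNT DUE ON.*\$(\d+\.\d+)",
}


def search_fields(text, patterns):
    """Return the first captured group for every pattern that matches."""
    result = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, text)
        if match:
            result[name] = match.group(1)
    return result


# Step 2: search the extracted text for the regexes.
fields = search_fields(SAMPLE_TEXT, FIELD_PATTERNS)
fields["amount"] = float(fields["amount"])  # store the amount as a number

# Step 3: serialise the result, here as JSON.
print(json.dumps(fields, sort_keys=True))
# prints {"amount": 4.11, "invoice_number": "42183017"}
```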
With the flexible template system you can:
precisely match content in PDF files
use plugins to match line items and tables
define static fields that are the same for every invoice
define custom fields needed in your organisation or process
have multiple regexes per field (if layout or wording changes)
define currency
extract invoice-items using the lines-plugin developed by Holger Brunn
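Two of these features, static fields and multiple regexes per field, can be illustrated with a short sketch. It assumes that the patterns for a field are tried in order and that the first match wins; the helper `first_match`, the second pattern and the sample strings are hypothetical and not taken from the library.

```python
import re


def first_match(text, patterns):
    """Try each pattern in order; return the first captured group or None."""
    for pattern in patterns:
        match = re.search(pattern, text)
        if match:
            return match.group(1)
    return None


# Hypothetical patterns: two wordings of the same field on different layouts.
INVOICE_NUMBER_PATTERNS = [
    r"Invoice Number:\s+(\d+)",
    r"Invoice No\.\s+(\d+)",
]

# Hypothetical static fields, identical for every invoice of this issuer.
STATIC_FIELDS = {"issuer": "Amazon Web Services, Inc.", "currency": "HKD"}


def extract(text):
    """Combine the static fields with the regex-based invoice number."""
    result = dict(STATIC_FIELDS)
    result["invoice_number"] = first_match(text, INVOICE_NUMBER_PATTERNS)
    return result


old_layout = extract("Invoice Number: 42183017")
new_layout = extract("Invoice No. 42183017")
print(old_layout["invoice_number"] == new_layout["invoice_number"])  # prints True
```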
Go from PDF files to this:
{'date': (2014, 5, 7), 'invoice_number': '30064443', 'amount': 34.73, 'desc': 'Invoice 30064443 from QualityHosting', 'lines': [{'price': 42.0, 'desc': u'Small Business StandardExchange 2010\nGrundgeb\xfchr pro Einheit\nDienst: OUDJQ_office\n01.05.14-31.05.14\n', 'pos': u'7', 'qty': 1.0}]}
{'date': (2014, 6, 4), 'invoice_number': 'EUVINS1-OF5-DE-120725895', 'amount': 35.24, 'desc': 'Invoice EUVINS1-OF5-DE-120725895 from Amazon EU'}
{'date': (2014, 8, 3), 'invoice_number': '42183017', 'amount': 4.11, 'desc': 'Invoice 42183017 from Amazon Web Services'}
{'date': (2015, 1, 28), 'invoice_number': '12429647', 'amount': 101.0, 'desc': 'Invoice 12429647 from Envato'}
Installation
Install pdftotext
If possible get the latest xpdf/poppler-utils version. It’s included with macOS Homebrew, Debian and Ubuntu. Without it, pdftotext won’t parse tables in PDF correctly.
Install invoice2data using pip
pip install invoice2data
Usage
Basic usage. Process PDF files and write result to CSV.
invoice2data invoice.pdf
invoice2data *.pdf
Choose any of the following input readers:
pdftotext: invoice2data --input-reader pdftotext invoice.pdf
tesseract: invoice2data --input-reader tesseract invoice.pdf
pdfminer: invoice2data --input-reader pdfminer invoice.pdf
tesseract4: invoice2data --input-reader tesseract4 invoice.pdf
gvision: invoice2data --input-reader gvision invoice.pdf (needs GOOGLE_APPLICATION_CREDENTIALS env var)
Choose any of the following output formats:
csv: invoice2data --output-format csv invoice.pdf
json: invoice2data --output-format json invoice.pdf
xml: invoice2data --output-format xml invoice.pdf
Save the output file with a custom name or in a specific folder:
invoice2data --output-format csv --output-name myinvoices/invoices.csv invoice.pdf
Note: You must specify --output-format in order to use --output-name.
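When the command line is driven from a script, the reader and format options listed above can be assembled and checked before anything runs. The helper below is a hypothetical sketch (`build_command` is not part of invoice2data); it only uses the flags shown in this section and enforces the rule that an output name needs an output format.

```python
import os

INPUT_READERS = ("pdftotext", "tesseract", "pdfminer", "tesseract4", "gvision")
OUTPUT_FORMATS = ("csv", "json", "xml")


def build_command(pdf_paths, input_reader=None, output_format=None, output_name=None):
    """Assemble an argument list for the invoice2data command line."""
    command = ["invoice2data"]
    if input_reader is not None:
        if input_reader not in INPUT_READERS:
            raise ValueError("unknown input reader: %s" % input_reader)
        if input_reader == "gvision" and "GOOGLE_APPLICATION_CREDENTIALS" not in os.environ:
            raise RuntimeError("gvision needs GOOGLE_APPLICATION_CREDENTIALS to be set")
        command += ["--input-reader", input_reader]
    if output_format is not None:
        if output_format not in OUTPUT_FORMATS:
            raise ValueError("unknown output format: %s" % output_format)
        command += ["--output-format", output_format]
    if output_name is not None:
        if output_format is None:
            raise ValueError("--output-name requires --output-format")
        command += ["--output-name", output_name]
    return command + list(pdf_paths)


command = build_command(
    ["invoice.pdf"],
    input_reader="pdftotext",
    output_format="csv",
    output_name="myinvoices/invoices.csv",
)
print(" ".join(command))
```

With invoice2data installed, the resulting list can be passed to `subprocess.run(command, check=True)`.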
Specify a folder with YAML templates (e.g. for your suppliers):
invoice2data --template-folder ACME-templates invoice.pdf
Only use your own templates and exclude the built-in ones:
invoice2data --exclude-built-in-templates --template-folder ACME-templates invoice.pdf
Process a folder of invoices and copy the renamed invoices to a new folder:
invoice2data --copy new_folder folder_with_invoices/*.pdf
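Renaming a file to match its content amounts to building a file name from the extracted fields. The sketch below shows one way to do that; the naming scheme (date, invoice number, description) and the function `renamed_filename` are assumptions made for illustration, and the date is assumed to be a (year, month, day) tuple as in the sample output earlier in this document.

```python
import datetime
import re


def renamed_filename(data):
    """Build a file name from extracted invoice data (hypothetical scheme)."""
    date = datetime.date(*data["date"])
    name = "{} {} {}.pdf".format(date.isoformat(), data["invoice_number"], data["desc"])
    # Replace characters that are not safe in file names on common platforms.
    return re.sub(r'[\\/:*?"<>|]', "_", name)


invoice = {
    "date": (2014, 8, 3),
    "invoice_number": "42183017",
    "amount": 4.11,
    "desc": "Invoice 42183017 from Amazon Web Services",
}
print(renamed_filename(invoice))
# prints 2014-08-03 42183017 Invoice 42183017 from Amazon Web Services.pdf
```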
Process a single file and dump the whole file for debugging (useful when adding new templates in templates.py):
invoice2data --debug my_invoice.pdf
Recognize the test invoices:
invoice2data invoice2data/test/pdfs/* --debug
If you want to use it as a library, just do:
from invoice2data import extract_data
result = extract_data('path/to/my/file.pdf')
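For more than one file, a thin wrapper around the extraction call can separate recognised invoices from unrecognised ones. This is a sketch under one assumption: that the extraction function returns a falsy value when no template matches. The names `extract_many` and `fake_extract` are hypothetical; `fake_extract` only stands in for `extract_data` so that the example runs without PDF files.

```python
def extract_many(paths, extract):
    """Run `extract` on every path; split results into matched and unmatched."""
    matched, unmatched = {}, []
    for path in paths:
        data = extract(path)
        if data:
            matched[path] = data
        else:
            unmatched.append(path)
    return matched, unmatched


def fake_extract(path):
    """Stand-in for invoice2data.extract_data (assumed falsy when nothing matches)."""
    known = {"aws.pdf": {"invoice_number": "42183017", "amount": 4.11}}
    return known.get(path, False)


matched, unmatched = extract_many(["aws.pdf", "unknown.pdf"], fake_extract)
print(sorted(matched))  # prints ['aws.pdf']
print(unmatched)        # prints ['unknown.pdf']
```

With the library installed, `extract_data` imported from invoice2data can be passed in place of `fake_extract`.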
Template system
See invoice2data/extract/templates for existing templates. Just extend the list to add your own. If deployed by a bigger organisation, there should be an interface to edit templates for new suppliers. The 80-20 rule applies here: templates for a few frequent suppliers tend to cover most invoices. For a short tutorial on how to add new templates, see TUTORIAL.rst.
Templates are based on YAML. They define one or more keywords to find the right template and regular expressions for the fields to be extracted. A field can also be a static value, like the full company name.
Template files are tried in alphabetical order.
We may extend them to feature options to be used during invoice processing.
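Template selection as described above can be sketched as follows. The sketch assumes that a template matches when all of its keywords occur in the extracted text and that the first match in alphabetical order of file names wins; the file names and the function `select_template` are hypothetical.

```python
def select_template(text, templates):
    """Return the name of the first template, in alphabetical order, whose
    keywords all occur in the text, or None if no template matches."""
    for name in sorted(templates):
        keywords = templates[name]["keywords"]
        if all(keyword in text for keyword in keywords):
            return name
    return None


# Hypothetical templates, keyed by file name.
TEMPLATES = {
    "com.amazon.aws.yml": {"keywords": ["Amazon Web Services"]},
    "a_generic.yml": {"keywords": ["Invoice"]},
}

text = "Invoice 42183017 from Amazon Web Services"
print(select_template(text, TEMPLATES))  # prints a_generic.yml
```

Both templates match the sample text here, so the alphabetical order decides. Under these assumptions, broad keywords in an early file name can shadow a more specific template.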
Example:
issuer: Amazon Web Services, Inc.
keywords:
  - Amazon Web Services
fields:
  amount: TOTAL AMOUNT DUE ON.*\$(\d+\.\d+)
  amount_untaxed: TOTAL AMOUNT DUE ON.*\$(\d+\.\d+)
  date: Invoice Date:\s+([a-zA-Z]+ \d+ , \d+)
  invoice_number: Invoice Number:\s+(\d+)
  partner_name: (Amazon Web Services, Inc\.)
options:
  remove_whitespace: false
  currency: HKD
  date_formats:
    - '%d/%m/%Y'
lines:
  start: Detail
  end: \* May include estimated US sales tax
  first_line: ^ (?P<description>\w+.*)\$(?P<price_unit>\d+\.\d+)
  line: (.*)\$(\d+\.\d+)
  last_line: VAT \*\*
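To make the role of the start, end and line patterns more concrete, here is a simplified sketch of how line items could be cut out of the text between the two markers. It is not the lines plugin itself: `parse_lines` and the sample text are hypothetical, the line pattern is adapted from the example to use named groups, and first_line and last_line handling is left out.

```python
import re

# Hypothetical invoice text with a detail block between the two markers.
SAMPLE_TEXT = """Invoice Number: 42183017
Detail
AWS Data Transfer $0.03
Amazon Simple Storage Service $4.08
* May include estimated US sales tax
"""

START = r"Detail"
END = r"\* May include estimated US sales tax"
LINE = r"(?P<description>\w+.*)\$(?P<price_unit>\d+\.\d+)"


def parse_lines(text, start, end, line):
    """Return one dict per row matching `line` between `start` and `end`."""
    start_match = re.search(start, text)
    if not start_match:
        return []
    rest = text[start_match.end():]
    end_match = re.search(end, rest)
    if not end_match:
        return []
    items = []
    for row in rest[:end_match.start()].splitlines():
        match = re.search(line, row)
        if match:
            # Strip the whitespace left over between description and price.
            items.append({key: value.strip() for key, value in match.groupdict().items()})
    return items


items = parse_lines(SAMPLE_TEXT, START, END, LINE)
for item in items:
    print(item["description"], item["price_unit"])
```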
Development
If you are interested in improving this project, have a look at our developer guide to get you started quickly.
Roadmap and open tasks
integrate with online OCR?
try to ‘guess’ parameters for new invoice formats
apply machine learning to guess new parameters?
Maintainers
Contributors
Harshit Joshi: As Google Summer of Code student.
Holger Brunn: Add support for parsing invoice items.