Python parser to extract data from pdf invoice
Data extractor for PDF invoices - invoice2data
A command line tool and Python library to support your accounting process.
- extracts text from PDF files using different techniques, like pdftotext, pdfminer or OCR: tesseract, tesseract4 or gvision (Google Cloud Vision).
- searches for regex in the result using a YAML-based template system
- saves results as CSV, JSON or XML or renames PDF files to match the content.
With the flexible template system you can:
- precisely match content in PDF files
- use plugins to match line items and tables
- define static fields that are the same for every invoice
- define custom fields needed in your organisation or process
- have multiple regex per field (if layout or wording changes)
- define currency
- extract invoice items using the lines plugin developed by Holger Brunn
Go from PDF files to this:
{'date': (2014, 5, 7), 'invoice_number': '30064443', 'amount': 34.73, 'desc': 'Invoice 30064443 from QualityHosting', 'lines': [{'price': 42.0, 'desc': u'Small Business StandardExchange 2010\nGrundgeb\xfchr pro Einheit\nDienst: OUDJQ_office\n01.05.14-31.05.14\n', 'pos': u'7', 'qty': 1.0}]}
{'date': (2014, 6, 4), 'invoice_number': 'EUVINS1-OF5-DE-120725895', 'amount': 35.24, 'desc': 'Invoice EUVINS1-OF5-DE-120725895 from Amazon EU'}
{'date': (2014, 8, 3), 'invoice_number': '42183017', 'amount': 4.11, 'desc': 'Invoice 42183017 from Amazon Web Services'}
{'date': (2015, 1, 28), 'invoice_number': '12429647', 'amount': 101.0, 'desc': 'Invoice 12429647 from Envato'}
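Results of this shape are plain Python dictionaries, so they are easy to post-process. The following is a minimal sketch, assuming the date value is a (year, month, day) tuple as printed above (the library itself may return a different date type). The helper name results_to_csv is hypothetical and not part of invoice2data.

```python
import csv
import io
from datetime import date


def results_to_csv(results, fields=("date", "invoice_number", "amount", "desc")):
    """Render a list of result dicts as CSV text (hypothetical helper)."""
    buffer = io.StringIO()
    # extrasaction="ignore" drops keys such as 'lines' that are not in `fields`.
    writer = csv.DictWriter(buffer, fieldnames=fields,
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    for result in results:
        row = dict(result)
        # Assumption: dates arrive as (year, month, day) tuples, as in the sample output.
        if isinstance(row.get("date"), tuple):
            row["date"] = date(*row["date"]).isoformat()
        writer.writerow(row)
    return buffer.getvalue()


sample = [
    {"date": (2014, 8, 3), "invoice_number": "42183017", "amount": 4.11,
     "desc": "Invoice 42183017 from Amazon Web Services"},
    {"date": (2015, 1, 28), "invoice_number": "12429647", "amount": 101.0,
     "desc": "Invoice 12429647 from Envato"},
]
csv_text = results_to_csv(sample)
print(csv_text)
```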
Installation
- Install pdftotext
If possible get the latest xpdf/poppler-utils version. It's included with macOS Homebrew, Debian and Ubuntu. Without it, pdftotext won't parse tables in PDF correctly.
- Install invoice2data using pip
pip install invoice2data
Usage
Basic usage. Process PDF files and write result to CSV.
invoice2data invoice.pdf
invoice2data *.pdf
Choose any of the following input readers:
- pdftotext
invoice2data --input-reader pdftotext invoice.pdf
- tesseract
invoice2data --input-reader tesseract invoice.pdf
- pdfminer
invoice2data --input-reader pdfminer invoice.pdf
- tesseract4
invoice2data --input-reader tesseract4 invoice.pdf
- gvision
invoice2data --input-reader gvision invoice.pdf
(needs the GOOGLE_APPLICATION_CREDENTIALS environment variable)
Choose any of the following output formats:
- csv
invoice2data --output-format csv invoice.pdf
- json
invoice2data --output-format json invoice.pdf
- xml
invoice2data --output-format xml invoice.pdf
Save the output file with a custom name or in a specific folder
invoice2data --output-format csv --output-name myinvoices/invoices.csv invoice.pdf
Note: You must specify --output-format in order to use --output-name.
Specify a folder with YAML templates (e.g. for your suppliers)
invoice2data --template-folder ACME-templates invoice.pdf
Only use your own templates and exclude built-ins
invoice2data --exclude-built-in-templates --template-folder ACME-templates invoice.pdf
Process a folder of invoices and copy the renamed invoices to a new folder.
invoice2data --copy new_folder folder_with_invoices/*.pdf
Process a single file and dump the whole file for debugging (useful when adding new templates in templates.py)
invoice2data --debug my_invoice.pdf
Recognize test invoices: invoice2data invoice2data/test/pdfs/* --debug
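When the command line tool is driven from a script, the options above can be assembled programmatically. This is a minimal sketch that only uses the flags shown in this section. The function build_command and its parameter names are hypothetical, and the sketch builds the argument list without running it.

```python
import shlex


def build_command(pdf_paths, input_reader=None, output_format=None,
                  output_name=None, template_folder=None,
                  exclude_built_in=False, debug=False):
    """Assemble an invoice2data command line (hypothetical helper)."""
    if output_name and not output_format:
        # Mirrors the note above: --output-name needs --output-format.
        raise ValueError("output_name requires output_format")
    command = ["invoice2data"]
    if input_reader:
        command += ["--input-reader", input_reader]
    if output_format:
        command += ["--output-format", output_format]
    if output_name:
        command += ["--output-name", output_name]
    if exclude_built_in:
        command.append("--exclude-built-in-templates")
    if template_folder:
        command += ["--template-folder", template_folder]
    if debug:
        command.append("--debug")
    command += list(pdf_paths)
    return command


command = build_command(["invoice.pdf"], input_reader="pdftotext",
                        output_format="json", output_name="invoices.json")
print(shlex.join(command))
# To execute it (requires invoice2data on the PATH):
#   import subprocess
#   subprocess.run(command, check=True)
```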
Use as Python Library
You can easily add invoice2data to your own Python scripts as a library.
from invoice2data import extract_data
result = extract_data('path/to/my/file.pdf')
Using in-house templates
from invoice2data import extract_data
from invoice2data.extract.loader import read_templates
templates = read_templates('/path/to/your/templates/')
result = extract_data('path/to/my/file.pdf', templates=templates)
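For more than one file, the call can be wrapped in a small loop. The sketch below is not part of invoice2data: extract_many is a hypothetical helper that takes the extraction function as an argument, and it assumes the extractor returns a falsy value when no template matches. A stand-in extractor is used so the example runs without PDF files.

```python
def extract_many(paths, extract, templates=None):
    """Run `extract` on each path and return (results, unmatched)."""
    results, unmatched = {}, []
    for path in paths:
        # Pass templates only when given, so the extractor's default applies otherwise.
        kwargs = {"templates": templates} if templates is not None else {}
        data = extract(path, **kwargs)
        if data:
            results[path] = data
        else:
            # Assumption: a falsy return value means no template matched.
            unmatched.append(path)
    return results, unmatched


def fake_extract(path, templates=None):
    """Stand-in for the real extractor, with invented data."""
    known = {"a.pdf": {"invoice_number": "42183017", "amount": 4.11}}
    return known.get(path, False)


results, unmatched = extract_many(["a.pdf", "b.pdf"], fake_extract)
print(results)
print(unmatched)
```

In real use, pass extract_data imported from invoice2data in place of fake_extract.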
Template system
See invoice2data/extract/templates for existing templates. Just extend the list to add your own. If deployed by a bigger organisation, there should be an interface to edit templates for new suppliers. 80-20 rule.
For a short tutorial on how to add new templates, see TUTORIAL.md.
Templates are based on YAML. They define one or more keywords to find the right template, one or more exclude_keywords to further narrow it down, and regexes for the fields to be extracted. A field can also be a static value, like the full company name.
Template files are tried in alphabetical order.
We may extend them to feature options to be used during invoice processing.
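The selection rule described above can be illustrated with a simplified sketch. It is not the library's implementation: it assumes that every keyword must appear in the text, that any exclude keyword rules a template out, and that keywords are plain substrings. The file names and the function pick_template are hypothetical.

```python
def pick_template(text, templates):
    """Return the name of the first matching template, or None.

    `templates` maps a file name to a dict with `keywords` and,
    optionally, `exclude_keywords`.
    """
    for name in sorted(templates):  # templates are tried in alphabetical order
        template = templates[name]
        # Assumption: every keyword must be present in the text.
        if not all(k in text for k in template["keywords"]):
            continue
        # Assumption: a single exclude keyword is enough to skip the template.
        if any(k in text for k in template.get("exclude_keywords", [])):
            continue
        return name
    return None


templates = {
    "com.example.generic.yml": {"keywords": ["Invoice"]},
    "com.amazon.aws.yml": {"keywords": ["Amazon Web Services"],
                           "exclude_keywords": ["San Jose"]},
}

print(pick_template("Amazon Web Services Invoice Number: 42183017", templates))
print(pick_template("Amazon Web Services, San Jose. Invoice", templates))
print(pick_template("Receipt", templates))
```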
Example:
issuer: Amazon Web Services, Inc.
keywords:
- Amazon Web Services
exclude_keywords:
- San Jose
fields:
  amount: TOTAL AMOUNT DUE ON.*\$(\d+\.\d+)
  amount_untaxed: TOTAL AMOUNT DUE ON.*\$(\d+\.\d+)
  date: Invoice Date:\s+([a-zA-Z]+ \d+ , \d+)
  invoice_number: Invoice Number:\s+(\d+)
  partner_name: (Amazon Web Services, Inc\.)
options:
  remove_whitespace: false
  currency: HKD
  date_formats:
    - '%d/%m/%Y'
lines:
  start: Detail
  end: \* May include estimated US sales tax
  first_line: ^ (?P<description>\w+.*)\$(?P<price_unit>\d+\.\d+)
  line: (.*)\$(\d+\.\d+)
  last_line: VAT \*\*
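To see what these patterns capture, they can be tried with Python's re module on a piece of text. The sketch below copies the field, start, end and line patterns from the example. The sample text is invented for illustration, and the two functions are a simplified stand-in for the template engine, not its actual code. In particular, reading the line pattern's groups by position is an assumption of this sketch.

```python
import re

# Patterns copied from the example template above.
FIELDS = {
    "amount": r"TOTAL AMOUNT DUE ON.*\$(\d+\.\d+)",
    "date": r"Invoice Date:\s+([a-zA-Z]+ \d+ , \d+)",
    "invoice_number": r"Invoice Number:\s+(\d+)",
    "partner_name": r"(Amazon Web Services, Inc\.)",
}
LINES = {
    "start": r"Detail",
    "end": r"\* May include estimated US sales tax",
    "line": r"(.*)\$(\d+\.\d+)",
}

# Hypothetical extracted text, invented for this sketch.
TEXT = """Amazon Web Services, Inc.
Invoice Number: 42183017
Invoice Date: August 3 , 2014
TOTAL AMOUNT DUE ON August 3 , 2014 $4.11
Detail
AWS Data Transfer $0.11
Amazon Simple Storage Service $4.00
* May include estimated US sales tax
"""


def extract_fields(text, fields):
    """Return the first capture group of every field pattern that matches."""
    found = {}
    for name, pattern in fields.items():
        match = re.search(pattern, text)
        if match:
            found[name] = match.group(1)
    return found


def extract_lines(text, spec):
    """Apply the `line` pattern to each row between `start` and `end`."""
    start = re.search(spec["start"], text)
    if not start:
        return []
    body = text[start.end():]
    end = re.search(spec["end"], body)
    if not end:
        return []
    items = []
    for row in body[:end.start()].splitlines():
        match = re.search(spec["line"], row)
        if match:
            items.append({"desc": match.group(1).strip(),
                          "price": float(match.group(2))})
    return items


fields = extract_fields(TEXT, FIELDS)
items = extract_lines(TEXT, LINES)
print(fields)
print(items)
```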
Development
If you are interested in improving this project, have a look at our developer guide to get you started quickly.
Roadmap and open tasks
- integrate with online OCR?
- try to 'guess' parameters for new invoice formats.
- apply machine learning to guess new parameters?
Maintainers
Contributors
- Harshit Joshi: As Google Summer of Code student.
- Holger Brunn: Add support for parsing invoice items.
Related Projects
- OCR-Invoice (FOSS | C#)
- DeepLogic AI (Commercial | SaaS)
- Docparser (Commercial | Web Service)
- A-PDF (Commercial)
- PDFdeconstruct (Commercial)
- CVision (Commercial)