
Python parser to extract data from PDF invoices

Project description


This project was selected for GSoC 2018.

A modular Python library to support your accounting process, tested on Python 2.7 and 3.4+. It works in three main steps:

  1. Extract text from PDF files using one of several techniques: pdftotext, pdfminer or tesseract OCR.

  2. Search the extracted text for regex matches using a YAML-based template system.

  3. Save the results as CSV, JSON or XML, or rename the PDF files to match their content.
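The three steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's implementation: the sample text, the `FIELDS` patterns and the helpers `parse_invoice` and `to_csv` are hypothetical, and the text is assumed to have already been extracted from the PDF.

```python
import csv
import io
import re

# Hypothetical text, standing in for the output of step 1 (e.g. pdftotext).
SAMPLE_TEXT = """Amazon Web Services
Invoice Number: 42183017
TOTAL AMOUNT DUE ON August 3, 2014 $4.11
"""

# Hypothetical field patterns, standing in for a YAML template (step 2).
FIELDS = {
    "invoice_number": r"Invoice Number:\s+(\d+)",
    "amount": r"TOTAL AMOUNT DUE ON.*\$(\d+\.\d+)",
}


def parse_invoice(text, fields):
    """Return a dict holding the first capture group of each pattern."""
    result = {}
    for name, pattern in fields.items():
        match = re.search(pattern, text)
        result[name] = match.group(1) if match else None
    return result


def to_csv(rows):
    """Serialise a list of result dicts as CSV text (step 3)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=sorted(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


parsed = parse_invoice(SAMPLE_TEXT, FIELDS)
csv_text = to_csv([parsed])
print(csv_text)
```

Every value stays a string here; converting amounts and dates to typed values is left out to keep the sketch short.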

With the flexible template system you can:

  • precisely match content in PDF files

  • define static fields that are the same for every invoice

  • define custom fields needed in your organisation or process

  • have multiple regexes per field (in case the layout or wording changes)

  • define currency

  • extract invoice-items using the lines-plugin developed by Holger Brunn
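To illustrate why several regexes per field help when a supplier changes its wording, here is a small sketch. It assumes a "first pattern that matches wins" rule, which is an illustrative choice and not necessarily how the library combines patterns. The helper `first_match` and both patterns are hypothetical.

```python
import re


def first_match(text, patterns):
    """Try each pattern in order; return the first capture found, else None."""
    for pattern in patterns:
        match = re.search(pattern, text)
        if match:
            return match.group(1)
    return None


# Hypothetical patterns covering an old and a new wording of the same field.
INVOICE_NUMBER_PATTERNS = [
    r"Invoice Number:\s+(\d+)",
    r"Invoice No\.\s+(\d+)",
]

old_layout = "Invoice Number: 12429647"
new_layout = "Invoice No. 12429647"

print(first_match(old_layout, INVOICE_NUMBER_PATTERNS))
print(first_match(new_layout, INVOICE_NUMBER_PATTERNS))
```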

Go from PDF files to this:

{'date': (2014, 5, 7), 'invoice_number': '30064443', 'amount': 34.73, 'desc': 'Invoice 30064443 from QualityHosting', 'lines': [{'price': 42.0, 'desc': u'Small Business StandardExchange 2010\nGrundgeb\xfchr pro Einheit\nDienst: OUDJQ_office\n01.05.14-31.05.14\n', 'pos': u'7', 'qty': 1.0}]}
{'date': (2014, 6, 4), 'invoice_number': 'EUVINS1-OF5-DE-120725895', 'amount': 35.24, 'desc': 'Invoice EUVINS1-OF5-DE-120725895 from Amazon EU'}
{'date': (2014, 8, 3), 'invoice_number': '42183017', 'amount': 4.11, 'desc': 'Invoice 42183017 from Amazon Web Services'}
{'date': (2015, 1, 28), 'invoice_number': '12429647', 'amount': 101.0, 'desc': 'Invoice 12429647 from Envato'}

Installation

  1. Install pdftotext

If possible, get the latest xpdf/poppler-utils version. It is available through Homebrew on macOS and in the Debian and Ubuntu package repositories. Without a recent version, pdftotext won’t parse tables in PDFs correctly.

  2. Install invoice2data using pip

pip install invoice2data
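Before running the tool, it can be useful to confirm that the pdftotext executable is reachable. The sketch below uses only the standard library (`shutil.which`, Python 3.3+). The helper name `find_tool` is hypothetical and not part of invoice2data.

```python
import shutil


def find_tool(name="pdftotext"):
    """Return the full path of an executable found on PATH, or None."""
    return shutil.which(name)


path = find_tool()
print(path if path else "pdftotext not found on PATH")
```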

Usage

Basic usage: process PDF files and write the result to CSV.

  • invoice2data invoice.pdf

  • invoice2data *.pdf

Choose any of the following three input readers:

  • pdftotext: invoice2data --input-reader pdftotext invoice.pdf

  • tesseract: invoice2data --input-reader tesseract invoice.pdf

  • pdfminer: invoice2data --input-reader pdfminer invoice.pdf

Choose any of the following output formats:

  • csv: invoice2data --output-format csv invoice.pdf

  • json: invoice2data --output-format json invoice.pdf

  • xml: invoice2data --output-format xml invoice.pdf

Save the output file with a custom name or in a specific folder:

invoice2data --output-format csv --output-name myinvoices/invoices.csv invoice.pdf

Note: you must specify --output-format in order to use --output-name.

Specify a folder with YAML templates (e.g. for your suppliers):

invoice2data --template-folder ACME-templates invoice.pdf

Use only your own templates and exclude the built-ins:

invoice2data --exclude-built-in-templates --template-folder ACME-templates invoice.pdf

Process a folder of invoices and copy the renamed invoices to a new folder:

invoice2data --copy new_folder folder_with_invoices/*.pdf
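Renaming a PDF to match its content amounts to filling a name pattern with the extracted fields. The sketch below shows the idea only: the pattern string, the character sanitising rule and the helper `build_filename` are assumptions made for illustration, not the tool's documented behaviour.

```python
import datetime
import re


def build_filename(fields, pattern="{date} {invoice_number} {desc}.pdf"):
    """Fill a (hypothetical) name pattern and replace unsafe characters."""
    values = dict(fields)
    values["date"] = values["date"].strftime("%Y-%m-%d")
    name = pattern.format(**values)
    # Replace characters that are not allowed in file names on common systems.
    return re.sub(r'[\\/:*?"<>|]', "_", name)


fields = {
    "date": datetime.date(2014, 8, 3),
    "invoice_number": "42183017",
    "desc": "Invoice 42183017 from Amazon Web Services",
}
new_name = build_filename(fields)
print(new_name)
```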

Process a single file and dump the whole file for debugging (useful when adding new templates in templates.py):

invoice2data --debug my_invoice.pdf

Recognize the test invoices:

invoice2data invoice2data/test/pdfs/* --debug

To use it as a library:

from invoice2data import extract_data

result = extract_data('path/to/my/file.pdf')
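What to do with the result is up to the caller. The following sketch assumes the result is a dict shaped like the sample output shown earlier, and that a falsy value means no template matched. Both points are assumptions to check against the installed version. The helper `result_to_json` is hypothetical, and the sample dict is used so the snippet runs without a PDF.

```python
import json


def result_to_json(result):
    """Serialise an extraction result; a falsy result means nothing matched."""
    if not result:
        return None
    # default=str covers values json cannot encode natively, such as dates.
    return json.dumps(result, sort_keys=True, default=str)


# Sample shaped like the output shown earlier in this document.
sample = {
    "date": (2014, 8, 3),
    "invoice_number": "42183017",
    "amount": 4.11,
    "desc": "Invoice 42183017 from Amazon Web Services",
}

print(result_to_json(sample))
```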

Template system

See invoice2data/templates for the existing templates; extend that list to add your own. In a bigger organisation, there should be an interface for editing templates for new suppliers. The 80-20 rule applies. For a short tutorial on how to add new templates, see TUTORIAL.rst.

Templates are written in YAML. Each one defines one or more keywords used to find the right template, plus regular expressions for the fields to be extracted. A field can also be a static value, such as the full company name.

Template files are tried in alphabetical order.

We may extend them to feature options to be used during invoice processing.
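The keyword matching and alphabetical ordering described above can be sketched as follows. It assumes a template matches when all of its keywords occur in the text. That rule, the file names and the helper `pick_template` are illustrative assumptions, not the library's code.

```python
def pick_template(text, templates):
    """Return the name of the first template, in alphabetical order,
    whose keywords all occur in the text; None if no template matches."""
    for name in sorted(templates):
        if all(keyword in text for keyword in templates[name]):
            return name
    return None


# Hypothetical template file names mapped to their keywords.
TEMPLATES = {
    "com.envato.yml": ["Envato"],
    "com.amazon.aws.yml": ["Amazon Web Services"],
    "com.amazon.eu.yml": ["Amazon"],
}

text = "Invoice from Amazon Web Services, Inc."
print(pick_template(text, TEMPLATES))
```

Both Amazon templates match this text, so the alphabetical order decides which one is used. This is why specific keywords matter.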

Example template:

issuer: Amazon Web Services, Inc.
keywords:
- Amazon Web Services
fields:
  amount: TOTAL AMOUNT DUE ON.*\$(\d+\.\d+)
  amount_untaxed: TOTAL AMOUNT DUE ON.*\$(\d+\.\d+)
  date: Invoice Date:\s+([a-zA-Z]+ \d+ , \d+)
  invoice_number: Invoice Number:\s+(\d+)
  partner_name: (Amazon Web Services, Inc\.)
options:
  remove_whitespace: false
  currency: HKD
  date_formats:
    - '%d/%m/%Y'
lines:
    start: Detail
    end: \* May include estimated US sales tax
    first_line: ^    (?P<description>\w+.*)\$(?P<price_unit>\d+\.\d+)
    line: (.*)\$(\d+\.\d+)
    last_line: VAT \*\*
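To give a feel for how the start, end and line settings could be applied, here is a simplified sketch. It is not the lines plugin itself: it ignores first_line and last_line, uses named groups in a single line pattern, and runs on hypothetical sample text. The helper `extract_lines` is likewise hypothetical.

```python
import re


def extract_lines(text, start, end, line):
    """Collect named groups from every row between the start and end markers."""
    begin = re.search(start, text)
    if not begin:
        return []
    body = text[begin.end():]
    finish = re.search(end, body)
    if finish:
        body = body[:finish.start()]
    rows = []
    for raw in body.splitlines():
        match = re.search(line, raw)
        if match:
            rows.append({key: value.strip() for key, value in match.groupdict().items()})
    return rows


# Hypothetical invoice text, already extracted from a PDF.
SAMPLE = """Detail
AWS Data Transfer $0.10
Amazon Simple Storage Service $4.01
* May include estimated US sales tax
"""

rows = extract_lines(
    SAMPLE,
    start=r"Detail",
    end=r"\* May include estimated US sales tax",
    line=r"(?P<description>.*)\$(?P<price_unit>\d+\.\d+)",
)
print(rows)
```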

Development

If you are interested in improving this project, have a look at our developer guide to get you started quickly.

Roadmap and open tasks

  • integrate with online OCR?

  • try to ‘guess’ parameters for new invoice formats.

  • apply machine learning to guess new parameters?
