Python parser to extract data from pdf invoice

# Data extractor for PDF invoices - invoice2data

[![Circle CI](https://circleci.com/gh/m3nu/invoice2data.svg?style=svg)](https://circleci.com/gh/m3nu/invoice2data)

A Python library to support your accounting process. Tested on Python 2.7, 3.4 and 3.5.

- extracts text from PDF files
- searches for regex in the result
- saves results as CSV
- optionally renames PDF files to match the content
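
The steps above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the sample text, the `FIELD_PATTERNS` dictionary and the `extract_fields` and `to_csv` helpers are hypothetical, and the text is assumed to have already been extracted from the PDF.

```python
import csv
import io
import re

# Hypothetical text, standing in for the output of a PDF-to-text step.
SAMPLE_TEXT = """Invoice Number: 12429647
Invoice Date: 2015-01-28
Total: 101.00
"""

# Hypothetical patterns: one capturing group per field.
FIELD_PATTERNS = {
    "invoice_number": r"Invoice Number:\s+(\d+)",
    "date": r"Invoice Date:\s+(\d{4}-\d{2}-\d{2})",
    "amount": r"Total:\s+(\d+\.\d+)",
}


def extract_fields(text, patterns):
    """Search the text for each pattern and collect the first captured group."""
    result = {}
    for field, pattern in patterns.items():
        match = re.search(pattern, text)
        result[field] = match.group(1) if match else None
    return result


def to_csv(rows, fieldnames):
    """Serialise a list of result dicts as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


row = extract_fields(SAMPLE_TEXT, FIELD_PATTERNS)
csv_text = to_csv([row], list(FIELD_PATTERNS))
print(csv_text)
```

Fields whose pattern does not match are stored as `None`, which `csv.DictWriter` writes as an empty cell.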

With the flexible template system you can:

- precisely match PDF files
- define static fields that are the same for every invoice
- have multiple regex per field (if layout or wording changes)
- define currency
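
To illustrate static fields and multiple regexes per field, here is a small sketch. The `TEMPLATE` layout with its `static` and `patterns` keys is an assumption made for this example and is not the library's template schema; `apply_template` is likewise a hypothetical helper.

```python
import re

# Hypothetical template layout, not the library's actual schema.
TEMPLATE = {
    # Static fields are copied into every result unchanged.
    "static": {"issuer": "ACME Ltd.", "currency": "EUR"},
    "patterns": {
        # Two wordings for the same field; the first pattern that matches wins.
        "invoice_number": [r"Invoice Number:\s+(\d+)", r"Invoice No\.\s+(\d+)"],
        "amount": [r"Total:\s+(\d+\.\d+)", r"Amount due:\s+(\d+\.\d+)"],
    },
}


def apply_template(text, template):
    """Copy static fields, then fill each regex field from the first matching pattern."""
    result = dict(template["static"])
    for field, patterns in template["patterns"].items():
        result[field] = None
        for pattern in patterns:
            match = re.search(pattern, text)
            if match:
                result[field] = match.group(1)
                break
    return result


# The same supplier, before and after a change of wording.
old_layout = "Invoice Number: 1001\nTotal: 34.73\n"
new_layout = "Invoice No. 1002\nAmount due: 35.24\n"

print(apply_template(old_layout, TEMPLATE))
print(apply_template(new_layout, TEMPLATE))
```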

Go from PDF files to this:

```
{'date': (2014, 5, 7), 'invoice_number': '30064443', 'amount': 34.73, 'desc': 'Invoice 30064443 from QualityHosting'}
{'date': (2014, 6, 4), 'invoice_number': 'EUVINS1-OF5-DE-120725895', 'amount': 35.24, 'desc': 'Invoice EUVINS1-OF5-DE-120725895 from Amazon EU'}
{'date': (2014, 8, 3), 'invoice_number': '42183017', 'amount': 4.11, 'desc': 'Invoice 42183017 from Amazon Web Services'}
{'date': (2015, 1, 28), 'invoice_number': '12429647', 'amount': 101.0, 'desc': 'Invoice 12429647 from Envato'}
```

## Installation

1. Install pdftotext

If possible, get the latest [xpdf/poppler-utils](https://poppler.freedesktop.org/) version. It's included with OSX Homebrew, Debian Sid and Ubuntu 16.04. Without it, `pdftotext` won't parse tables in PDFs correctly.

2. Install `invoice2data` using pip

```
pip install invoice2data
```

`invoice2data` can optionally use `pdfminer` instead, but `pdftotext` works better. You can choose which module to use. No additional Python packages are necessary at the moment; the only external dependency is `pdftotext`.
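
Since `pdftotext` is an external command-line tool, it can help to check that it is on your `PATH` before running anything. The sketch below uses the standard library's `shutil.which` (available from Python 3.3, so not on Python 2.7); `find_tool` is a hypothetical helper and not part of `invoice2data`.

```python
import shutil


def find_tool(name="pdftotext"):
    """Return the full path of a command-line tool, or None if it is not on PATH."""
    return shutil.which(name)


path = find_tool()
if path is None:
    print("pdftotext not found; install xpdf/poppler-utils first")
else:
    print("pdftotext found at", path)
```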

There is also `tesseract` integration as a fallback, if no text can be extracted. But it may be more reliable to use

## Usage

Basic usage. Process PDF files and write the result to CSV:

- `invoice2data invoice.pdf`
- `invoice2data *.pdf`

Specify a folder with YAML templates (e.g. for your suppliers):

`invoice2data --template-folder ACME-templates invoice.pdf`

Use only your own templates and exclude the built-in ones:

`invoice2data --exclude-built-in-templates --template-folder ACME-templates invoice.pdf`

Process a folder of invoices and copy the renamed invoices to a new folder:

`invoice2data --copy new_folder folder_with_invoices/*.pdf`

Process a single file and dump the whole file for debugging (useful when adding new templates in templates.py):

`invoice2data --debug my_invoice.pdf`

Recognize the test invoices:

`invoice2data invoice2data/test/pdfs/* --debug`
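
How `--copy` names the renamed files is not described here. Purely as an illustration, a filename could be derived from the extracted fields as in the sketch below; the naming scheme and the `build_filename` helper are assumptions, not the tool's behaviour.

```python
import re


def build_filename(fields):
    """Build a filesystem-friendly PDF name from the date and description fields."""
    year, month, day = fields["date"]
    date_part = "{:04d}-{:02d}-{:02d}".format(year, month, day)
    # Replace anything that is not a letter, digit, dot or hyphen with "_".
    desc_part = re.sub(r"[^A-Za-z0-9.-]+", "_", fields["desc"]).strip("_")
    return "{} {}.pdf".format(date_part, desc_part)


fields = {
    "date": (2014, 5, 7),
    "invoice_number": "30064443",
    "amount": 34.73,
    "desc": "Invoice 30064443 from QualityHosting",
}
print(build_filename(fields))
```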

To use it as a library:

```
from invoice2data import extract_data

result = extract_data('path/to/my/file.pdf')
```
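
When processing several files from Python, a small loop around the extraction call is usually enough. In the sketch below, `process_files` and `fake_extractor` are hypothetical, and it is an assumption that `extract_data` returns a falsy value when nothing could be extracted; check the behaviour of your installed version.

```python
def process_files(paths, extractor):
    """Run the extractor on each path; collect results and the paths that failed."""
    results, failed = [], []
    for path in paths:
        data = extractor(path)
        if data:
            results.append(data)
        else:
            failed.append(path)
    return results, failed


# With the library installed, pass its function (assumed to return a falsy
# value when no data could be extracted):
#   from invoice2data import extract_data
#   results, failed = process_files(["a.pdf", "b.pdf"], extract_data)


# Stand-in extractor so the sketch runs without the library or any PDFs.
def fake_extractor(path):
    return {"desc": "Invoice from " + path} if path.startswith("ok") else None


results, failed = process_files(["ok1.pdf", "bad.pdf"], fake_extractor)
print(results, failed)
```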

## Template system

See `invoice2data/templates` for existing templates; extend the list to add your own. If deployed by a bigger organisation, there should be an interface to edit templates for new suppliers (80-20 rule). For a short tutorial on how to add new templates, see [TUTORIAL.md](TUTORIAL.md).

Templates are written in YAML. Each template defines one or more keywords used to find the right template, plus a regex for each field to be extracted. A field can also be a static value, like the full company name.

We may extend them to feature options to be used during invoice processing.

Example:

```
issuer: Amazon Web Services, Inc.
keywords:
- Amazon Web Services
fields:
  amount: TOTAL AMOUNT DUE ON.*\$(\d+\.\d+)
  amount_untaxed: TOTAL AMOUNT DUE ON.*\$(\d+\.\d+)
  date: Invoice Date:\s+([a-zA-Z]+ \d+ , \d+)
  invoice_number: Invoice Number:\s+(\d+)
  partner_name: (Amazon Web Services, Inc\.)
options:
  remove_whitespace: false
  currency: HKD
  date_formats:
    - '%d/%m/%Y'
```
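
To show how keywords and field regexes work together, the sketch below rewrites part of the example template above as a Python dict and applies it to a made-up text. The sample text, the second template and the helper names are hypothetical. Requiring every keyword to be present is an assumption about the matching rule, not a statement of how the library decides.

```python
import re

# Part of the example template, written as a Python dict.
AWS_TEMPLATE = {
    "issuer": "Amazon Web Services, Inc.",
    "keywords": ["Amazon Web Services"],
    "fields": {
        "amount": r"TOTAL AMOUNT DUE ON.*\$(\d+\.\d+)",
        "invoice_number": r"Invoice Number:\s+(\d+)",
    },
}
# A second, hypothetical template that should not match the sample text.
OTHER_TEMPLATE = {"issuer": "ACME Ltd.", "keywords": ["ACME"], "fields": {}}
TEMPLATES = [OTHER_TEMPLATE, AWS_TEMPLATE]

# Made-up invoice text, standing in for the PDF-to-text output.
SAMPLE_TEXT = (
    "Amazon Web Services, Inc.\n"
    "Invoice Number: 42183017\n"
    "TOTAL AMOUNT DUE ON August 3 , 2014 $4.11\n"
)


def select_template(text, templates):
    """Return the first template whose keywords all occur in the text, else None."""
    for template in templates:
        if all(keyword in text for keyword in template["keywords"]):
            return template
    return None


def extract(text, template):
    """Apply each field regex and keep the first captured group (None if no match)."""
    result = {"issuer": template["issuer"]}
    for field, pattern in template["fields"].items():
        match = re.search(pattern, text)
        result[field] = match.group(1) if match else None
    return result


template = select_template(SAMPLE_TEXT, TEMPLATES)
result = extract(SAMPLE_TEXT, template)
print(result)
```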

## Roadmap and open tasks

- tutorial and documentation for template options.
- parse invoice items.
- integrate with online OCR?
- try to 'guess' parameters for new invoice formats.
- apply machine learning to guess new parameters?

## Maintainers
- [Manuel Riel](https://github.com/m3nu)
- [Alexis de Lattre](https://github.com/alexis-via): Add setup.py for Pypi, fix locale bug, add templates for new invoice types.

## Other Contributors
- [Holger Brunn](https://github.com/hbrunn): Add support for parsing invoice items.
