Python parser to extract data from pdf invoice
# Data extractor for PDF invoices - invoice2data
[![Circle CI](https://circleci.com/gh/m3nu/invoice2data.svg?style=svg)](https://circleci.com/gh/m3nu/invoice2data)
A Python library to support your accounting process.
- extracts text from PDF files
- searches for regex in the result
- saves results as CSV
- optionally renames PDF files to match the content
With the flexible template system you can:
- precisely match PDF files
- define static fields that are the same for every invoice
- have multiple regex per field (if layout or wording changes)
- define currency
Go from PDF files to this:
```
{'date': (2014, 5, 7), 'invoice_number': '30064443', 'amount': 34.73, 'desc': 'Invoice 30064443 from QualityHosting'}
{'date': (2014, 6, 4), 'invoice_number': 'EUVINS1-OF5-DE-120725895', 'amount': 35.24, 'desc': 'Invoice EUVINS1-OF5-DE-120725895 from Amazon EU'}
{'date': (2014, 8, 3), 'invoice_number': '42183017', 'amount': 4.11, 'desc': 'Invoice 42183017 from Amazon Web Services'}
{'date': (2015, 1, 28), 'invoice_number': '12429647', 'amount': 101.0, 'desc': 'Invoice 12429647 from Envato'}
```
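The "multiple regex per field" idea from the feature list can be sketched with the standard `re` module. This is only an illustration, not the library's actual code: the helper `first_match` and the second pattern (`Invoice No.`) are hypothetical names and wording invented for the example.

```python
import re


def first_match(patterns, text):
    """Return the first capture group of the first pattern that matches, else None."""
    for pattern in patterns:
        match = re.search(pattern, text)
        if match:
            return match.group(1)
    return None


# Hypothetical patterns covering two wordings of the same field.
INVOICE_NUMBER_PATTERNS = [
    r"Invoice Number:\s+(\d+)",
    r"Invoice No\.\s+(\d+)",
]

old_layout = "Invoice Number:  42183017"
new_layout = "Invoice No. 42183017"

# Both layouts yield the same invoice number.
print(first_match(INVOICE_NUMBER_PATTERNS, old_layout))
print(first_match(INVOICE_NUMBER_PATTERNS, new_layout))
```

Trying the patterns in order means an older pattern keeps working after a supplier changes its layout, and a new pattern only has to be appended to the list.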
## Installation
1. Install pdftotext
We need the latest version of `xpdf` because support for table layouts was only recently added. You can download the binary files from www.foolabs.com/xpdf/download.html
2. Install `invoice2data` using pip
```
pip install invoice2data
```
`invoice2data` can optionally use `pdfminer`, but `pdftotext` works better. You can choose which module to use. Apart from `pdftotext`, no special Python packages are necessary at the moment.
There is also `tesseract` integration as a fallback, if no text can be extracted. But it may be more reliable to use
## Usage
Basic usage: process PDF files and write the results to CSV.
- `invoice2data invoice.pdf`
- `invoice2data *.pdf`
Specify a folder with YAML templates (e.g. for your suppliers):
`invoice2data --template-folder ACME-templates invoice.pdf`
Process a folder of invoices and copy the renamed invoices to a new folder:
`invoice2data --copy new_folder folder_with_invoices/*.pdf`
Process a single file and dump the whole file for debugging (useful when adding new templates in templates.py):
`invoice2data --debug my_invoice.pdf`
Recognize test invoices:
`invoice2data invoice2data/test/pdfs/* --debug`
To use it as a library:
```
from invoice2data import extract_data
result = extract_data('path/to/my/file.pdf')
```
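If you collect several results yourself, writing them to CSV might look like the sketch below. It assumes each result is a dict shaped like the sample output near the top of this README (with `date` as a `(year, month, day)` tuple). That shape is an assumption based on the sample, not a documented return type of `extract_data`, and `results_to_csv` is a hypothetical helper that is not part of the library.

```python
import csv
import io


def results_to_csv(results, fieldnames):
    """Render a list of result dicts as CSV text (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for result in results:
        row = dict(result)
        # Turn the (year, month, day) tuple into an ISO date string.
        row["date"] = "%04d-%02d-%02d" % row["date"]
        writer.writerow(row)
    return buffer.getvalue()


# Sample rows copied from the example output in this README.
results = [
    {'date': (2014, 5, 7), 'invoice_number': '30064443', 'amount': 34.73,
     'desc': 'Invoice 30064443 from QualityHosting'},
    {'date': (2015, 1, 28), 'invoice_number': '12429647', 'amount': 101.0,
     'desc': 'Invoice 12429647 from Envato'},
]

csv_text = results_to_csv(results, ["date", "invoice_number", "amount", "desc"])
print(csv_text)
```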
## Template system
See `invoice2data/templates` for existing templates. To add your own, just extend the list. If deployed by a bigger organisation, there should be an interface to edit templates for new suppliers. 80-20 rule.
Templates are based on YAML. Each one defines one or more keywords used to find the right template, plus regular expressions for the fields to be extracted. A field can also be a static value, like the full company name.
We may extend them with options to be used during invoice processing.
Example:
```
issuer: Amazon Web Services, Inc.
keywords:
  - Amazon Web Services
fields:
  amount: TOTAL AMOUNT DUE ON.*\$(\d+\.\d+)
  amount_untaxed: TOTAL AMOUNT DUE ON.*\$(\d+\.\d+)
  date: Invoice Date:\s+([a-zA-Z]+ \d+ , \d+)
  invoice_number: Invoice Number:\s+(\d+)
  partner_name: (Amazon Web Services, Inc\.)
options:
  remove_whitespace: false
  currency: HKD
  date_formats:
    - '%d/%m/%Y'
```
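To make the roles of `keywords` and `fields` concrete, here is a minimal sketch of how such a template could be applied to extracted text. It is not the library's implementation: the template is written as a plain dict to avoid a YAML dependency, `matches_template` and `extract_fields` are hypothetical helpers, the sample text is invented, and the rule that every keyword must appear is an assumption. Type conversion (amounts, dates) is left out.

```python
import re

# Part of the template above, as a plain dict so the sketch needs no YAML parser.
TEMPLATE = {
    "issuer": "Amazon Web Services, Inc.",
    "keywords": ["Amazon Web Services"],
    "fields": {
        "amount": r"TOTAL AMOUNT DUE ON.*\$(\d+\.\d+)",
        "invoice_number": r"Invoice Number:\s+(\d+)",
    },
}


def matches_template(template, text):
    """Assumption: a template applies when every keyword occurs in the text."""
    return all(keyword in text for keyword in template["keywords"])


def extract_fields(template, text):
    """Run each field regex and keep its first capture group (None if no match)."""
    output = {"issuer": template["issuer"]}  # static value taken from the template
    for name, pattern in template["fields"].items():
        match = re.search(pattern, text)
        output[name] = match.group(1) if match else None
    return output


# Invented text standing in for the output of the PDF-to-text step.
SAMPLE_TEXT = (
    "Amazon Web Services, Inc.\n"
    "Invoice Number: 42183017\n"
    "TOTAL AMOUNT DUE ON August 3 , 2014 $4.11\n"
)

if matches_template(TEMPLATE, SAMPLE_TEXT):
    print(extract_fields(TEMPLATE, SAMPLE_TEXT))
```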
## Roadmap
Currently this is a proof of concept. If you scan your invoices, this could easily be connected to an OCR system. The biggest weakness is the need to enter new regexes manually. I don't see an easy way to make it "learn" new patterns.
Planned features:
- integrate with online OCR
- try to 'guess' parameters for new invoice formats
- possibly apply machine learning to guess new parameters
## Contributors
- Alexis de Lattre: Add setup.py for PyPI, fix locale bug, add templates for new invoice types.