Python parser to extract data from PDF invoices
# Data extractor for PDF invoices - invoice2data
[![Circle CI](https://circleci.com/gh/m3nu/invoice2data.svg?style=svg)](https://circleci.com/gh/m3nu/invoice2data)
A Python library to support your accounting process.
- extracts text from PDF files
- searches for regex in the result
- saves results as CSV
- optionally renames PDF files to match the content
With the flexible template system you can:
- precisely match PDF files
- define static fields that are the same for every invoice
- use multiple regexes per field (in case layout or wording changes)
- define currency
Go from PDF files to this:
```
{'date': (2014, 5, 7), 'invoice_number': '30064443', 'amount': 34.73, 'desc': 'Invoice 30064443 from QualityHosting'}
{'date': (2014, 6, 4), 'invoice_number': 'EUVINS1-OF5-DE-120725895', 'amount': 35.24, 'desc': 'Invoice EUVINS1-OF5-DE-120725895 from Amazon EU'}
{'date': (2014, 8, 3), 'invoice_number': '42183017', 'amount': 4.11, 'desc': 'Invoice 42183017 from Amazon Web Services'}
{'date': (2015, 1, 28), 'invoice_number': '12429647', 'amount': 101.0, 'desc': 'Invoice 12429647 from Envato'}
```
## Installation
1. Install pdftotext
We need the latest version of `xpdf` because support for table layouts was only recently added. You can download the binary files from www.foolabs.com/xpdf/download.html
2. Install `invoice2data` using pip
```
pip install invoice2data
```
`pdfminer` can be used optionally, but `pdftotext` works better; you can choose which module to use. No special Python packages are necessary at the moment; the only external requirement is `pdftotext`.
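As a rough illustration of the text extraction step, here is a minimal sketch of calling `pdftotext` from Python with its `-layout` flag, which keeps the physical layout of tables. The helper names `build_pdftotext_command` and `pdf_to_text` are made up for this example and are not part of `invoice2data`; the sketch also assumes Python 3.7 or newer.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path):
    """Return the argument list for a layout-preserving pdftotext call.

    Hypothetical helper, not part of invoice2data.
    """
    # "-layout" keeps the physical layout (useful for tables);
    # "-" as the output file sends the extracted text to stdout.
    return ["pdftotext", "-layout", pdf_path, "-"]


def pdf_to_text(pdf_path):
    """Run pdftotext and return the extracted text as a string."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH; install xpdf first")
    result = subprocess.run(
        build_pdftotext_command(pdf_path),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Calling `pdf_to_text('invoice.pdf')` should return the plain text that the regexes are then run against, provided the `pdftotext` binary is installed.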
There is also `tesseract` integration as a fallback, if no text can be extracted. But it may be more reliable to use
## Usage
Basic usage: process PDF files and write the result to CSV.
- `invoice2data invoice.pdf`
- `invoice2data *.pdf`
Specify a folder with yml templates (e.g. for your suppliers):
`invoice2data --template-folder ACME-templates invoice.pdf`
Process a folder of invoices and copy the renamed invoices to a new folder:
`invoice2data --copy new_folder folder_with_invoices/*.pdf`
Process a single file and dump the whole file for debugging (useful when adding new templates in templates.py):
`invoice2data --debug my_invoice.pdf`
Recognize test invoices:
`invoice2data invoice2data/test/pdfs/* --debug`
To use it as a library:
```
from invoice2data import extract_data
result = extract_data('path/to/my/file.pdf')
```
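If you collect several results yourself, writing them to CSV could look roughly like the sketch below. It assumes each result is a dict shaped like the sample output near the top of this README, which may not hold for every template; the function name `results_to_csv` and the column order are choices made for this example, not the library's own CSV writer.

```python
import csv
import io

# Sample rows shaped like the output shown earlier in this README.
results = [
    {"date": (2014, 5, 7), "invoice_number": "30064443", "amount": 34.73,
     "desc": "Invoice 30064443 from QualityHosting"},
    {"date": (2015, 1, 28), "invoice_number": "12429647", "amount": 101.0,
     "desc": "Invoice 12429647 from Envato"},
]


def results_to_csv(rows, out):
    """Write result dicts to a file-like object as CSV (hypothetical helper)."""
    fieldnames = ["date", "invoice_number", "amount", "desc"]
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    for row in rows:
        flat = dict(row)
        # Turn the (year, month, day) tuple into an ISO-style string.
        flat["date"] = "%04d-%02d-%02d" % row["date"]
        writer.writerow(flat)


buffer = io.StringIO()
results_to_csv(results, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

An in-memory buffer is used here so the example has no side effects; passing an open file instead would write the CSV to disk.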
## Template system
See `invoice2data/templates` for existing templates. Just extend the list to add your own. If deployed by a bigger organisation, there should be an interface to edit templates for new suppliers. 80-20 rule.
Templates are based on YAML. Each one defines one or more keywords used to find the right template, plus a regexp for each field to be extracted. A field can also be a static value, such as the full company name.
We may extend them to feature options to be used during invoice processing.
Example:
```
issuer: Amazon Web Services, Inc.
keywords:
- Amazon Web Services
fields:
  amount: TOTAL AMOUNT DUE ON.*\$(\d+\.\d+)
  amount_untaxed: TOTAL AMOUNT DUE ON.*\$(\d+\.\d+)
  date: Invoice Date:\s+([a-zA-Z]+ \d+ , \d+)
  invoice_number: Invoice Number:\s+(\d+)
  partner_name: (Amazon Web Services, Inc\.)
options:
  remove_whitespace: false
  currency: HKD
  date_formats:
    - '%d/%m/%Y'
```
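To make the idea concrete, here is a small sketch of how keywords and field regexes from the template above might be applied to extracted text. It is not the library's actual implementation: the invoice text is made up, the template is written as a plain dict so no YAML parser is needed, requiring all keywords to be present is an assumption, and the date format `%B %d , %Y` was chosen to fit the sample text rather than taken from the template's `date_formats`.

```python
import re
from datetime import datetime

# Part of the template above, written out as a plain dict.
template = {
    "issuer": "Amazon Web Services, Inc.",
    "keywords": ["Amazon Web Services"],
    "fields": {
        "amount": r"TOTAL AMOUNT DUE ON.*\$(\d+\.\d+)",
        "date": r"Invoice Date:\s+([a-zA-Z]+ \d+ , \d+)",
        "invoice_number": r"Invoice Number:\s+(\d+)",
    },
}

# Made-up invoice text, only for illustration.
sample_text = """Amazon Web Services, Inc.
Invoice Number: 42183017
Invoice Date: August 3 , 2014
TOTAL AMOUNT DUE ON August 3 , 2014 $4.11
"""


def matches(template, text):
    """Assumption: a template applies when all its keywords occur in the text."""
    return all(keyword in text for keyword in template["keywords"])


def extract_fields(template, text):
    """Return the first capture group of every field regex that matches."""
    found = {}
    for name, pattern in template["fields"].items():
        match = re.search(pattern, text)
        if match:
            found[name] = match.group(1)
    return found


if matches(template, sample_text):
    fields = extract_fields(template, sample_text)
    # Convert the raw strings into typed values.
    fields["amount"] = float(fields["amount"])
    # %B parses English month names under the default C locale.
    fields["date"] = datetime.strptime(fields["date"], "%B %d , %Y")
    print(fields)
```

Keeping matching and extraction in separate functions mirrors the two roles a template plays: first selecting the right supplier, then pulling out the fields.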
## Roadmap
Currently this is a proof of concept. If you scan your invoices, this could easily be connected to an OCR system. The biggest weakness is the need to enter new regexes manually. I don't see an easy way to make it "learn" new patterns.
Planned features:
- integrate with online OCR
- try to 'guess' parameters for new invoice formats
- possibly apply machine learning to guess new parameters
## Contributors
- Alexis de Lattre: Add setup.py for PyPI, fix locale bug, add templates for new invoice types.