Python parser to extract data from pdf invoice
Data extractor for PDF invoices - ninvoice2data
===============================================
|Circle CI|
This project has been selected for `GSoC
2018 <https://developers.google.com/open-source/gsoc/>`__. Read more
`here <https://wiki.debian.org/SummerOfCode2018/Projects/ExtractingDataFromPDFInvoicesAndBillsDetails>`__.
A modular Python library to support your accounting process. Tested on
Python 2.7 and 3.4+. Main steps:
1. extracts text from PDF files using different techniques, like
``pdftotext``, ``pdfminer`` or OCR – ``tesseract``, ``tesseract4`` or
``gvision`` (Google Cloud Vision).
2. searches for regex in the result using a YAML-based template system
3. saves results as CSV, JSON or XML or renames PDF files to match the
content.
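The three steps above can be illustrated with a minimal, standard-library-only sketch. It is not the library's implementation: ``SAMPLE_TEXT`` is a hypothetical stand-in for the text a reader such as ``pdftotext`` would return, and ``search_fields`` and ``to_csv`` are illustrative helper names.

```python
import csv
import io
import re

# Step 1 (assumed): text as it might come out of pdftotext, pdfminer or OCR.
SAMPLE_TEXT = """Amazon Web Services, Inc.
Invoice Number: 42183017
TOTAL AMOUNT DUE ON August 3 , 2014 $4.11
"""

# Step 2: one regex per field, each with a single capture group.
FIELDS = {
    "invoice_number": r"Invoice Number:\s+(\d+)",
    "amount": r"TOTAL AMOUNT DUE ON.*\$(\d+\.\d+)",
}


def search_fields(text, fields):
    """Return the first captured group for every field regex that matches."""
    result = {}
    for name, pattern in fields.items():
        match = re.search(pattern, text)
        if match:
            result[name] = match.group(1)
    return result


def to_csv(rows):
    """Step 3: serialise a list of result dicts as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=sorted(rows[0]), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


extracted = search_fields(SAMPLE_TEXT, FIELDS)
csv_text = to_csv([extracted])
print(csv_text)
```

The real tool adds template selection, type conversion and further output formats on top of this basic flow.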
With the flexible template system you can:
- precisely match the content of PDF files
- use plugins to match line items and tables
- define static fields that are the same for every invoice
- define custom fields needed in your organisation or process
- define multiple regexes per field (in case layout or wording changes)
- define currency
- extract invoice-items using the ``lines``-plugin developed by `Holger
Brunn <https://github.com/hbrunn>`__
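As a rough sketch of how static fields and multiple regexes per field could be resolved, consider the following. The ``static_`` key prefix, the ``resolve_fields`` helper and the sample values are assumptions made for this illustration, not the library's documented behaviour.

```python
import re

# Hypothetical template fragment: a field is either one regex or a list of
# alternatives; keys starting with "static_" hold fixed values (the prefix
# is an assumption of this sketch).
FIELDS = {
    "invoice_number": [r"Invoice No\.\s+(\d+)", r"Invoice Number:\s+(\d+)"],
    "amount": r"Total:\s+(\d+\.\d+)",
    "static_vat": "DE123456789",
}


def resolve_fields(text, fields):
    """Apply each field definition to the text and collect the values found."""
    result = {}
    for name, spec in fields.items():
        if name.startswith("static_"):
            # Static fields are copied as-is, without searching the text.
            result[name[len("static_"):]] = spec
            continue
        patterns = spec if isinstance(spec, list) else [spec]
        for pattern in patterns:
            match = re.search(pattern, text)
            if match:
                result[name] = match.group(1)
                break  # the first alternative that matches wins
    return result


text = "Invoice Number: 30064443\nTotal: 34.73\n"
fields_found = resolve_fields(text, FIELDS)
print(fields_found)
```

Listing alternatives per field lets one template keep working when a supplier changes the wording or layout of an invoice.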
Go from PDF files to this:
::

    {'date': (2014, 5, 7), 'invoice_number': '30064443', 'amount': 34.73, 'desc': 'Invoice 30064443 from QualityHosting', 'lines': [{'price': 42.0, 'desc': u'Small Business StandardExchange 2010\nGrundgeb\xfchr pro Einheit\nDienst: OUDJQ_office\n01.05.14-31.05.14\n', 'pos': u'7', 'qty': 1.0}]}
    {'date': (2014, 6, 4), 'invoice_number': 'EUVINS1-OF5-DE-120725895', 'amount': 35.24, 'desc': 'Invoice EUVINS1-OF5-DE-120725895 from Amazon EU'}
    {'date': (2014, 8, 3), 'invoice_number': '42183017', 'amount': 4.11, 'desc': 'Invoice 42183017 from Amazon Web Services'}
    {'date': (2015, 1, 28), 'invoice_number': '12429647', 'amount': 101.0, 'desc': 'Invoice 12429647 from Envato'}

Installation
------------
1. Install pdftotext

   If possible, get the latest
   `xpdf/poppler-utils <https://poppler.freedesktop.org/>`__ version. It’s
   included with macOS Homebrew, Debian and Ubuntu. Without a recent
   version, ``pdftotext`` won’t parse tables in PDFs correctly.

2. Install ``ninvoice2data`` using pip

   ::

       pip install ninvoice2data

Usage
-----
Basic usage. Process PDF files and write result to CSV.
- ``ninvoice2data invoice.pdf``
- ``ninvoice2data *.pdf``
Choose any of the following input readers:
- pdftotext ``ninvoice2data --input-reader pdftotext invoice.pdf``
- tesseract ``ninvoice2data --input-reader tesseract invoice.pdf``
- pdfminer ``ninvoice2data --input-reader pdfminer invoice.pdf``
- tesseract4 ``ninvoice2data --input-reader tesseract4 invoice.pdf``
- gvision ``ninvoice2data --input-reader gvision invoice.pdf`` (needs ``GOOGLE_APPLICATION_CREDENTIALS`` env var)
Choose any of the following output formats:
- csv ``ninvoice2data --output-format csv invoice.pdf``
- json ``ninvoice2data --output-format json invoice.pdf``
- xml ``ninvoice2data --output-format xml invoice.pdf``
Save the output file under a custom name or in a specific folder:

``ninvoice2data --output-format csv --output-name myinvoices/invoices.csv invoice.pdf``

**Note:** You must specify ``--output-format`` for ``--output-name`` to
take effect.

Specify a folder with YAML templates (e.g. for your suppliers):

``ninvoice2data --template-folder ACME-templates invoice.pdf``

Use only your own templates and exclude the built-in ones:

``ninvoice2data --exclude-built-in-templates --template-folder ACME-templates invoice.pdf``

Process a folder of invoices and copy the renamed invoices to a new
folder:

``ninvoice2data --copy new_folder folder_with_invoices/*.pdf``

Process a single file and dump the whole file for debugging (useful when
adding new templates in templates.py):

``ninvoice2data --debug my_invoice.pdf``

Recognize test invoices:
``ninvoice2data ninvoice2data/test/pdfs/* --debug``
To use it as a library:

::

    from ninvoice2data import extract_data
    result = extract_data('path/to/my/file.pdf')

Template system
---------------
See ``ninvoice2data/extract/templates`` for existing templates. Just extend the
list to add your own. If deployed by a bigger organisation, there should
be an interface to edit templates for new suppliers. 80-20 rule. For a
short tutorial on how to add new templates, see
`TUTORIAL.rst <TUTORIAL.rst>`__.
Templates are based on YAML. They define one or more keywords to find
the right template and regexes for the fields to be extracted. A field
can also be a static value, like the full company name.
Template files are tried in alphabetical order.
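The keyword lookup described above might look roughly like the following sketch. The file names, the in-memory ``TEMPLATES`` dict (standing in for parsed YAML files) and the ``find_template`` helper are hypothetical, and the sketch assumes that every keyword of a template must occur in the extracted text.

```python
# Hypothetical templates keyed by file name; on disk these would be YAML files.
TEMPLATES = {
    "com.envato.yml": {"issuer": "Envato", "keywords": ["Envato"]},
    "com.amazon.aws.yml": {
        "issuer": "Amazon Web Services, Inc.",
        "keywords": ["Amazon Web Services"],
    },
}


def find_template(text, templates):
    """Return the first template, in file name order, whose keywords all occur."""
    for filename in sorted(templates):  # alphabetical order, as described above
        template = templates[filename]
        if all(keyword in text for keyword in template["keywords"]):
            return template
    return None  # no template matched this invoice


matched = find_template("Invoice from Amazon Web Services", TEMPLATES)
print(matched["issuer"])
```

Because the first match wins, a template with very generic keywords that sorts early can shadow a more specific one, so keywords are best kept distinctive.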
We may extend them to feature options to be used during invoice
processing.
Example:
::

    issuer: Amazon Web Services, Inc.
    keywords:
    - Amazon Web Services
    fields:
      amount: TOTAL AMOUNT DUE ON.*\$(\d+\.\d+)
      amount_untaxed: TOTAL AMOUNT DUE ON.*\$(\d+\.\d+)
      date: Invoice Date:\s+([a-zA-Z]+ \d+ , \d+)
      invoice_number: Invoice Number:\s+(\d+)
      partner_name: (Amazon Web Services, Inc\.)
    options:
      remove_whitespace: false
      currency: HKD
      date_formats:
        - '%d/%m/%Y'
    lines:
      start: Detail
      end: \* May include estimated US sales tax
      first_line: ^ (?P<description>\w+.*)\$(?P<price_unit>\d+\.\d+)
      line: (.*)\$(\d+\.\d+)
      last_line: VAT \*\*

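To give an idea of what the ``lines`` block in the example does, here is a deliberately simplified sketch that only honours ``start``, ``end`` and ``line``. It is not the plugin's actual code: the ``parse_lines`` helper, the sample text and the named groups in the ``line`` regex are assumptions made for illustration.

```python
import re

# Simplified, hypothetical version of the lines block from the template.
LINES = {
    "start": r"Detail",
    "end": r"\* May include estimated US sales tax",
    "line": r"(?P<description>.*)\$(?P<price_unit>\d+\.\d+)",
}

SAMPLE_TEXT = """Invoice Number: 42183017
Detail
AWS Data Transfer $0.02
Amazon Simple Storage Service $4.09
* May include estimated US sales tax
"""


def parse_lines(text, spec):
    """Collect one dict per line item found between the start and end markers."""
    start = re.search(spec["start"], text)
    if not start:
        return []
    rest = text[start.end():]
    end = re.search(spec["end"], rest)
    body = rest[:end.start()] if end else rest
    items = []
    for raw in body.splitlines():
        match = re.search(spec["line"], raw)
        if match:
            items.append({
                "description": match.group("description").strip(),
                "price_unit": float(match.group("price_unit")),
            })
    return items


items = parse_lines(SAMPLE_TEXT, LINES)
print(items)
```

Restricting the search to the region between ``start`` and ``end`` keeps the permissive ``line`` regex from picking up totals or header text elsewhere in the invoice.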
Development
-----------
If you are interested in improving this project, have a look at our
`developer guide <DEVELOP.rst>`__ to get you started quickly.
Roadmap and open tasks
----------------------
- integrate with online OCR?
- try to ‘guess’ parameters for new invoice formats.
- apply machine learning to guess new parameters?
Maintainers
-----------
- `Manuel Riel <https://github.com/m3nu>`__
- `Alexis de Lattre <https://github.com/alexis-via>`__
Contributors
------------
- `Harshit Joshi <https://github.com/duskybomb>`__: As Google Summer of
Code student.
- `Holger Brunn <https://github.com/hbrunn>`__: Add support for parsing
invoice items.
Related Projects
----------------
- `OCR-Invoice <https://github.com/robela/OCR-Invoice>`__ (FOSS \| C#)
- `Docparser <https://docparser.com/>`__ (Commercial \| Web Service)
- `A-PDF <http://www.a-pdf.com/data-extractor/index.htm>`__
(Commercial)
- `PDFdeconstruct <http://www.glyphandcog.com/PDFdeconstruct.html?g6>`__
(Commercial)
- `CVision <http://www.cvisiontech.com/library/document-automation/forms-processing/extract-data-from-invoice.html>`__
(Commercial)
.. |Circle CI| image:: https://circleci.com/gh/invoice-x/ninvoice2data.svg?style=svg
   :target: https://circleci.com/gh/invoice-x/ninvoice2data