ninvoice2data

Python parser to extract data from PDF invoices

Data extractor for PDF invoices - ninvoice2data
==============================================

|Circle CI|

This project has been selected for `GSoC
2018 <https://developers.google.com/open-source/gsoc/>`__. Read more
`here <https://wiki.debian.org/SummerOfCode2018/Projects/ExtractingDataFromPDFInvoicesAndBillsDetails>`__.

A modular Python library to support your accounting process. Tested on
Python 2.7 and 3.4+. It works in three main steps:

1. Extracts text from PDF files using different techniques, such as
   ``pdftotext``, ``pdfminer`` or OCR (``tesseract``, ``tesseract4`` or
   ``gvision``, i.e. Google Cloud Vision).
2. Searches the extracted text with regular expressions defined in a
   YAML-based template system.
3. Saves the results as CSV, JSON or XML, or renames the PDF files to
   match their content.
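
The three steps can be pictured with a small, self-contained sketch. It
is not the library's implementation: the text is hard-coded where an
input reader such as ``pdftotext`` would supply it, and the names
``TEXT``, ``TEMPLATE`` and ``extract_fields`` are hypothetical.

::

    import json
    import re

    # Step 1 (assumed): text as an input reader such as pdftotext might return it.
    TEXT = "Invoice Number: 42183017\nTOTAL AMOUNT DUE ON August 3: $4.11\n"

    # Step 2: a template maps field names to regexes with one capture group.
    TEMPLATE = {
        "invoice_number": r"Invoice Number:\s+(\d+)",
        "amount": r"TOTAL AMOUNT DUE ON.*\$(\d+\.\d+)",
    }


    def extract_fields(text, template):
        """Return the first captured group for every field whose regex matches."""
        result = {}
        for field, pattern in template.items():
            match = re.search(pattern, text)
            if match:
                result[field] = match.group(1)
        return result


    fields = extract_fields(TEXT, TEMPLATE)
    fields["amount"] = float(fields["amount"])

    # Step 3: serialise the result, here as JSON.
    print(json.dumps(fields, sort_keys=True))

Fields whose regex does not match are simply left out of the result in
this sketch; a real pipeline would probably want to report them.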

With the flexible template system you can:

- precisely match content in PDF files
- use plugins to match line items and tables
- define static fields that are the same for every invoice
- define custom fields needed in your organisation or process
- have multiple regexes per field (if the layout or wording changes)
- define the currency
- extract invoice items using the ``lines`` plugin developed by `Holger
  Brunn <https://github.com/hbrunn>`__
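
As a rough illustration of static fields and of several regexes per
field, the sketch below tries alternative patterns in order and falls
back to the next one when the wording changes. The dictionary layout
(``static`` and ``regexes`` keys), the sample texts and the function
``resolve_field`` are assumptions made for this example; they are not
the template syntax used by the library.

::

    import re

    # Hypothetical template: a field is either a static value or a list of
    # alternative regexes that are tried in order.
    TEMPLATE = {
        "issuer": {"static": "ACME Ltd."},
        "invoice_number": {
            "regexes": [
                r"Invoice Number:\s+(\d+)",  # current layout
                r"Invoice No\.\s+(\d+)",     # older wording
            ]
        },
    }


    def resolve_field(spec, text):
        """Return the static value, or the first regex capture that matches."""
        if "static" in spec:
            return spec["static"]
        for pattern in spec["regexes"]:
            match = re.search(pattern, text)
            if match:
                return match.group(1)
        return None


    old_layout = "ACME Ltd.\nInvoice No. 1001\n"
    new_layout = "ACME Ltd.\nInvoice Number: 1002\n"

    for text in (old_layout, new_layout):
        print({name: resolve_field(spec, text) for name, spec in TEMPLATE.items()})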

Go from PDF files to this:

::

    {'date': (2014, 5, 7), 'invoice_number': '30064443', 'amount': 34.73, 'desc': 'Invoice 30064443 from QualityHosting', 'lines': [{'price': 42.0, 'desc': u'Small Business StandardExchange 2010\nGrundgeb\xfchr pro Einheit\nDienst: OUDJQ_office\n01.05.14-31.05.14\n', 'pos': u'7', 'qty': 1.0}]}
    {'date': (2014, 6, 4), 'invoice_number': 'EUVINS1-OF5-DE-120725895', 'amount': 35.24, 'desc': 'Invoice EUVINS1-OF5-DE-120725895 from Amazon EU'}
    {'date': (2014, 8, 3), 'invoice_number': '42183017', 'amount': 4.11, 'desc': 'Invoice 42183017 from Amazon Web Services'}
    {'date': (2015, 1, 28), 'invoice_number': '12429647', 'amount': 101.0, 'desc': 'Invoice 12429647 from Envato'}

Installation
------------

1. Install ``pdftotext``

   If possible, get the latest
   `xpdf/poppler-utils <https://poppler.freedesktop.org/>`__ version. It’s
   included with macOS Homebrew, Debian and Ubuntu. Without a recent
   version, ``pdftotext`` won’t parse tables in PDFs correctly.

2. Install ``ninvoice2data`` using pip

   ::

       pip install ninvoice2data

Usage
-----

Basic usage: process PDF files and write the result to CSV.

- ``ninvoice2data invoice.pdf``
- ``ninvoice2data *.pdf``

Choose any of the following input readers:

- pdftotext: ``ninvoice2data --input-reader pdftotext invoice.pdf``
- tesseract: ``ninvoice2data --input-reader tesseract invoice.pdf``
- pdfminer: ``ninvoice2data --input-reader pdfminer invoice.pdf``
- tesseract4: ``ninvoice2data --input-reader tesseract4 invoice.pdf``
- gvision: ``ninvoice2data --input-reader gvision invoice.pdf`` (needs the
  ``GOOGLE_APPLICATION_CREDENTIALS`` environment variable)

Choose any of the following output formats:

- csv: ``ninvoice2data --output-format csv invoice.pdf``
- json: ``ninvoice2data --output-format json invoice.pdf``
- xml: ``ninvoice2data --output-format xml invoice.pdf``

Save the output file with a custom name or to a specific folder:
``ninvoice2data --output-format csv --output-name myinvoices/invoices.csv invoice.pdf``

**Note:** You must specify ``--output-format`` in order to use
``--output-name``.

Specify a folder with YAML templates (e.g. for your suppliers):
``ninvoice2data --template-folder ACME-templates invoice.pdf``

Use only your own templates and exclude the built-in ones:
``ninvoice2data --exclude-built-in-templates --template-folder ACME-templates invoice.pdf``

Process a folder of invoices and copy the renamed invoices to a new
folder:
``ninvoice2data --copy new_folder folder_with_invoices/*.pdf``

Process a single file and dump the whole file for debugging (useful when
adding new templates in templates.py):
``ninvoice2data --debug my_invoice.pdf``

Recognize the test invoices:
``ninvoice2data ninvoice2data/test/pdfs/* --debug``

To use it as a library:

::

    from ninvoice2data import extract_data

    result = extract_data('path/to/my/file.pdf')
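
When processing many files from Python, a thin wrapper around the call
above can collect the results into a CSV file. The following Python 3
sketch is only an illustration: ``write_results`` and ``fake_extract``
are hypothetical names, and it assumes that the extractor returns a
dict for a recognised invoice and a falsy value otherwise, which should
be checked against the actual behaviour of ``extract_data``.

::

    import csv
    import os
    import tempfile


    def write_results(paths, extract, csv_path, columns):
        """Run ``extract`` on each path and write the matches to a CSV file.

        ``extract`` is any callable that takes a path and returns a dict, or
        a falsy value when nothing matched (an assumption, see above).
        Returns the list of paths that produced no result.
        """
        unmatched = []
        with open(csv_path, "w", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=columns,
                                    extrasaction="ignore")
            writer.writeheader()
            for path in paths:
                result = extract(path)
                if result:
                    writer.writerow(result)
                else:
                    unmatched.append(path)
        return unmatched


    def fake_extract(path):
        """Stand-in for ``extract_data`` so the sketch runs without any PDFs."""
        known = {"a.pdf": {"invoice_number": "42183017", "amount": 4.11}}
        return known.get(path)


    target = os.path.join(tempfile.mkdtemp(), "invoices.csv")
    print(write_results(["a.pdf", "b.pdf"], fake_extract, target,
                        ["invoice_number", "amount"]))

To try it on real files, ``extract_data`` could be passed in place of
``fake_extract``, provided the assumption about its return value holds.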

Template system
---------------

See ``ninvoice2data/extract/templates`` for existing templates. Just extend the
list to add your own. If deployed by a bigger organisation, there should
be an interface to edit templates for new suppliers (80-20 rule). For a
short tutorial on how to add new templates, see
`TUTORIAL.rst <TUTORIAL.rst>`__.

Templates are written in YAML. They define one or more keywords to find
the right template and a regex for each field to be extracted. A field
can also be a static value, like the full company name.

Template files are tried in alphabetical order.
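
Keyword matching combined with alphabetical ordering could look roughly
like the sketch below. The file names, the ``TEMPLATES`` dictionary and
``pick_template`` are hypothetical, and requiring *all* keywords of a
template to occur in the text is an assumption of this example rather
than something stated here.

::

    def pick_template(templates, text):
        """Return the first template name, in alphabetical order, whose
        keywords all occur in ``text``; return None if none matches."""
        for name in sorted(templates):
            keywords = templates[name]["keywords"]
            if all(keyword in text for keyword in keywords):
                return name
        return None


    # Hypothetical template files, keyed by file name.
    TEMPLATES = {
        "com.amazon.eu.yml": {"keywords": ["Amazon EU"]},
        "com.amazon.aws.yml": {"keywords": ["Amazon Web Services"]},
    }

    print(pick_template(TEMPLATES, "Invoice from Amazon Web Services, Inc."))

Because the first match wins, two templates with overlapping keywords
are resolved by file name, so more specific templates may need names
that sort earlier.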

We may extend them to feature options to be used during invoice
processing.

Example:

::

    issuer: Amazon Web Services, Inc.
    keywords:
    - Amazon Web Services
    fields:
      amount: TOTAL AMOUNT DUE ON.*\$(\d+\.\d+)
      amount_untaxed: TOTAL AMOUNT DUE ON.*\$(\d+\.\d+)
      date: Invoice Date:\s+([a-zA-Z]+ \d+ , \d+)
      invoice_number: Invoice Number:\s+(\d+)
      partner_name: (Amazon Web Services, Inc\.)
    options:
      remove_whitespace: false
      currency: HKD
      date_formats:
        - '%d/%m/%Y'
    lines:
      start: Detail
      end: \* May include estimated US sales tax
      first_line: ^ (?P<description>\w+.*)\$(?P<price_unit>\d+\.\d+)
      line: (.*)\$(\d+\.\d+)
      last_line: VAT \*\*
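
To give an idea of how the ``start``, ``end`` and ``line`` patterns of
the ``lines`` section might cooperate, here is a simplified sketch. It
is not the plugin's code: ``extract_lines`` and the sample text are made
up, ``first_line`` and ``last_line`` are ignored, and the ``line`` regex
uses named groups (borrowed from ``first_line``) to keep the result
readable.

::

    import re

    LINES = {
        "start": r"Detail",
        "end": r"\* May include estimated US sales tax",
        "line": r"(?P<description>.*)\$(?P<price_unit>\d+\.\d+)",
    }

    # Made-up invoice text for illustration only.
    TEXT = "\n".join([
        "Invoice Number: 42183017",
        "Detail",
        "AWS Data Transfer $0.10",
        "Amazon Simple Storage Service $4.01",
        "* May include estimated US sales tax",
    ])


    def extract_lines(text, spec):
        """Apply the ``line`` regex to every row between ``start`` and ``end``."""
        start = re.search(spec["start"], text)
        end = re.search(spec["end"], text)
        if not start or not end:
            return []
        body = text[start.end():end.start()]
        items = []
        for row in body.splitlines():
            match = re.search(spec["line"], row)
            if match:
                items.append({
                    "description": match.group("description").strip(),
                    "price_unit": float(match.group("price_unit")),
                })
        return items


    for item in extract_lines(TEXT, LINES):
        print(item)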

Development
-----------

If you are interested in improving this project, have a look at our
`developer guide <DEVELOP.rst>`__ to get you started quickly.

Roadmap and open tasks
----------------------

- integrate with online OCR?
- try to ‘guess’ parameters for new invoice formats.
- apply machine learning to guess new parameters?

Maintainers
-----------

- `Manuel Riel <https://github.com/m3nu>`__
- `Alexis de Lattre <https://github.com/alexis-via>`__

Contributors
------------

- `Harshit Joshi <https://github.com/duskybomb>`__: As Google Summer of
  Code student.
- `Holger Brunn <https://github.com/hbrunn>`__: Add support for parsing
  invoice items.

Related Projects
----------------

- `OCR-Invoice <https://github.com/robela/OCR-Invoice>`__ (FOSS \| C#)
- `Docparser <https://docparser.com/>`__ (Commercial \| Web Service)
- `A-PDF <http://www.a-pdf.com/data-extractor/index.htm>`__
  (Commercial)
- `PDFdeconstruct <http://www.glyphandcog.com/PDFdeconstruct.html?g6>`__
  (Commercial)
- `CVision <http://www.cvisiontech.com/library/document-automation/forms-processing/extract-data-from-invoice.html>`__
  (Commercial)

.. |Circle CI| image:: https://circleci.com/gh/invoice-x/ninvoice2data.svg?style=svg
   :target: https://circleci.com/gh/invoice-x/ninvoice2data

Release history
---------------

- 0.4.16 (Feb 5, 2019), this version
- 0.4.15 (Feb 5, 2019)
- 0.4.14 (Feb 1, 2019)
- 0.4.13 (Feb 1, 2019)
- 0.3.13 (Feb 1, 2019)
- 0.3.12 (Feb 1, 2019)
- 0.3.11 (Feb 1, 2019)
- 0.3.10 (Feb 1, 2019)
- 0.3.9 (Feb 1, 2019)
- 0.3.8 (Feb 1, 2019)
- 0.3.7 (Feb 1, 2019)
- 0.3.6 (Feb 1, 2019)
- 0.3.3 (Feb 1, 2019)

Download files
--------------

- Source distribution: ``ninvoice2data-0.4.16.tar.gz`` (757.5 kB),
  uploaded Feb 5, 2019
- Built distribution: ``ninvoice2data-0.4.16-py2.7.egg`` (93.8 kB),
  uploaded Feb 5, 2019

File details
------------

Details for the file ``ninvoice2data-0.4.16.tar.gz``:

- Download URL: ninvoice2data-0.4.16.tar.gz
- Upload date: Feb 5, 2019
- Size: 757.5 kB
- Tags: Source
- Uploaded using Trusted Publishing: No
- Uploaded via: twine/1.12.1 pkginfo/1.5.0.1 requests/2.21.0
  setuptools/40.4.3 requests-toolbelt/0.9.1 tqdm/4.30.0 CPython/2.7.15

Hashes for ``ninvoice2data-0.4.16.tar.gz``:

=========== ====================================================================
Algorithm   Hash digest
=========== ====================================================================
SHA256      ``83be8084e8515a90468c560d83d8c4cde24b43b8362bb70a806e3481d12b4788``
MD5         ``4d1bce40c7cec0150ace7d39077506d6``
BLAKE2b-256 ``130ff8e60b48e431be0b39d541a41c2c174c113c03ab6aafb5098e649a1008ad``
=========== ====================================================================

Details for the file ``ninvoice2data-0.4.16-py2.7.egg``:

- Download URL: ninvoice2data-0.4.16-py2.7.egg
- Upload date: Feb 5, 2019
- Size: 93.8 kB
- Tags: Egg
- Uploaded using Trusted Publishing: No
- Uploaded via: twine/1.12.1 pkginfo/1.5.0.1 requests/2.21.0
  setuptools/40.4.3 requests-toolbelt/0.9.1 tqdm/4.30.0 CPython/2.7.15

Hashes for ``ninvoice2data-0.4.16-py2.7.egg``:

=========== ====================================================================
Algorithm   Hash digest
=========== ====================================================================
SHA256      ``d14fe1c8b6ab23ab0668d91753571c8d82171bd59bf3f19d1966e0551eac75e7``
MD5         ``6014856eca592dd9ec18e0eb0d70b60d``
BLAKE2b-256 ``de9e2fc5646fd016a6bfd87897eedcd08ea1950002877f964723d22ba25dacfd``
=========== ====================================================================
