Python parser to extract data from PDF invoices

Data extractor for PDF invoices - invoice2data

A command line tool and Python library to support your accounting process.

extracts text from PDF files using different techniques, like pdftotext, pdfminer or OCR -- tesseract, tesseract4 or gvision (Google Cloud Vision).
searches for regex in the result using a YAML-based template system
saves results as CSV, JSON or XML or renames PDF files to match the content.

With the flexible template system you can:

precisely match the content of PDF files
use plugins to match line items and tables
define static fields that are the same for every invoice
define custom fields needed in your organisation or process
have multiple regex per field (if layout or wording changes)
define currency
extract invoice-items using the lines-plugin developed by Holger Brunn

Go from PDF files to this:

{'date': (2014, 5, 7), 'invoice_number': '30064443', 'amount': 34.73, 'desc': 'Invoice 30064443 from QualityHosting', 'lines': [{'price': 42.0, 'desc': u'Small Business StandardExchange 2010\nGrundgeb\xfchr pro Einheit\nDienst: OUDJQ_office\n01.05.14-31.05.14\n', 'pos': u'7', 'qty': 1.0}]}
{'date': (2014, 6, 4), 'invoice_number': 'EUVINS1-OF5-DE-120725895', 'amount': 35.24, 'desc': 'Invoice EUVINS1-OF5-DE-120725895 from Amazon EU'}
{'date': (2014, 8, 3), 'invoice_number': '42183017', 'amount': 4.11, 'desc': 'Invoice 42183017 from Amazon Web Services'}
{'date': (2015, 1, 28), 'invoice_number': '12429647', 'amount': 101.0, 'desc': 'Invoice 12429647 from Envato'}
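The date tuples in the output above do not map directly onto formats such as JSON. As a minimal sketch (not invoice2data's own output code), the snippet below shows one way such a result dict could be serialized with the standard library. `to_jsonable` is a hypothetical helper name, and reading the three-integer `date` tuple as year, month, day is an assumption based on the sample output.

```python
import datetime
import json


def to_jsonable(result):
    """Return a copy of a result dict with the date tuple as an ISO string."""
    out = dict(result)
    date = out.get("date")
    if isinstance(date, tuple) and len(date) == 3:
        # Assumption: the tuple is (year, month, day), as in the sample output.
        out["date"] = datetime.date(*date).isoformat()
    return out


# One of the sample results shown above.
sample = {
    "date": (2014, 8, 3),
    "invoice_number": "42183017",
    "amount": 4.11,
    "desc": "Invoice 42183017 from Amazon Web Services",
}

print(json.dumps(to_jsonable(sample), indent=2))
```

The original dict is left untouched, so the same result can still be passed on to other code that expects the tuple form.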

Installation

Install pdftotext

If possible, get the latest xpdf/poppler-utils version. It's included with macOS Homebrew, Debian and Ubuntu. Without it, pdftotext won't parse tables in PDFs correctly.

Install invoice2data using pip

pip install invoice2data

Usage

Basic usage. Process PDF files and write result to CSV.

invoice2data invoice.pdf
invoice2data *.pdf

Choose any of the following input readers:

pdftotext: invoice2data --input-reader pdftotext invoice.pdf
tesseract: invoice2data --input-reader tesseract invoice.pdf
pdfminer: invoice2data --input-reader pdfminer invoice.pdf
tesseract4: invoice2data --input-reader tesseract4 invoice.pdf
gvision: invoice2data --input-reader gvision invoice.pdf (needs the GOOGLE_APPLICATION_CREDENTIALS environment variable)

Choose any of the following output formats:

csv: invoice2data --output-format csv invoice.pdf
json: invoice2data --output-format json invoice.pdf
xml: invoice2data --output-format xml invoice.pdf

Save the output file under a custom name or in a specific folder

invoice2data --output-format csv --output-name myinvoices/invoices.csv invoice.pdf

Note: You must specify --output-format when using --output-name.

Specify a folder with YAML templates (e.g. for your suppliers)

invoice2data --template-folder ACME-templates invoice.pdf

Only use your own templates and exclude built-ins

invoice2data --exclude-built-in-templates --template-folder ACME-templates invoice.pdf

Process a folder of invoices and copy the renamed invoices to a new folder.

invoice2data --copy new_folder folder_with_invoices/*.pdf

Process a single file and dump the whole file for debugging (useful when adding new templates in templates.py)

invoice2data --debug my_invoice.pdf

Recognize test invoices:

invoice2data invoice2data/test/pdfs/* --debug

Use as Python Library

You can easily add invoice2data to your own Python scripts as a library.

from invoice2data import extract_data
result = extract_data('path/to/my/file.pdf')

Using in-house templates

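from invoice2data import extract_data
from invoice2data.extract.loader import read_templates

# Path to the invoice to process
filename = 'path/to/my/file.pdf'

# Load only the templates in your own folder
templates = read_templates('/path/to/your/templates/')
result = extract_data(filename, templates=templates)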
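When processing many files from a script, it can help to separate the invoices that produced data from those that did not. The following is a minimal sketch under stated assumptions: `extract_many` and `fake_extractor` are hypothetical names, and the code assumes the extractor returns a falsy value when no data could be extracted, which should be checked against the installed version.

```python
def extract_many(paths, extractor):
    """Run extractor on each path and split the outcome into results and failures."""
    results = {}
    failed = []
    for path in paths:
        data = extractor(path)
        if data:
            results[path] = data
        else:
            # Assumption: a falsy return value means nothing was extracted.
            failed.append(path)
    return results, failed


# Stand-in extractor so the sketch runs without PDF files or the library.
def fake_extractor(path):
    if path == "known.pdf":
        return {"invoice_number": "42183017"}
    return False


results, failed = extract_many(["known.pdf", "unknown.pdf"], fake_extractor)
print(results)
print(failed)
```

With the library installed, passing `extract_data` as the `extractor` argument in place of the stand-in would be the intended use.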

Template system

See invoice2data/extract/templates for existing templates. To add your own, just extend the list. If invoice2data is deployed by a bigger organisation, there should be an interface to edit templates for new suppliers (80-20 rule). For a short tutorial on how to add new templates, see TUTORIAL.md.

Templates are written in YAML. Each template defines one or more keywords to find the right template, one or more exclude_keywords to narrow it down further, and a regex for each field to be extracted. A field can also be a static value, like the full company name.

Template files are tried in alphabetical order.
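The selection rule described above can be sketched in a few lines. This is an illustration, not the library's implementation: `pick_template` is a hypothetical function, the template file names are made up, and requiring all keywords to be present while rejecting on any exclude keyword is an assumption about the matching semantics.

```python
def pick_template(text, templates):
    """Return the name of the first matching template, or None.

    templates maps a template file name to a dict with "keywords" and an
    optional "exclude_keywords" list.
    """
    # Try template files in alphabetical order.
    for name in sorted(templates):
        template = templates[name]
        keywords = template.get("keywords", [])
        excludes = template.get("exclude_keywords", [])
        # Assumption: every keyword must occur in the text.
        if not keywords or not all(k in text for k in keywords):
            continue
        # Assumption: a single exclude keyword rules the template out.
        if any(k in text for k in excludes):
            continue
        return name
    return None


templates = {
    "b_generic_amazon.yml": {"keywords": ["Amazon"]},
    "a_aws.yml": {
        "keywords": ["Amazon Web Services"],
        "exclude_keywords": ["San Jose"],
    },
}

print(pick_template("Amazon Web Services, Inc. Invoice Number: 42183017", templates))
print(pick_template("Amazon Web Services, San Jose office", templates))
print(pick_template("Invoice 12429647 from Envato", templates))
```

Because the first match wins, a broad template whose name sorts early can shadow a more specific one, which is where exclude_keywords become useful.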

We may extend templates with further options to be used during invoice processing.

Example:

issuer: Amazon Web Services, Inc.
keywords:
- Amazon Web Services
exclude_keywords:
- San Jose
fields:
  amount: TOTAL AMOUNT DUE ON.*\$(\d+\.\d+)
  amount_untaxed: TOTAL AMOUNT DUE ON.*\$(\d+\.\d+)
  date: Invoice Date:\s+([a-zA-Z]+ \d+ , \d+)
  invoice_number: Invoice Number:\s+(\d+)
  partner_name: (Amazon Web Services, Inc\.)
options:
  remove_whitespace: false
  currency: HKD
  date_formats:
    - '%d/%m/%Y'
lines:
    start: Detail
    end: \* May include estimated US sales tax
    first_line: ^    (?P<description>\w+.*)\$(?P<price_unit>\d+\.\d+)
    line: (.*)\$(\d+\.\d+)
    last_line: VAT \*\*

Development

If you are interested in improving this project, have a look at our developer guide to get you started quickly.

Roadmap and open tasks

integrate with online OCR?
try to 'guess' parameters for new invoice formats.
apply machine learning to guess new parameters?

Maintainers

Contributors

Harshit Joshi: Contributed as a Google Summer of Code student.
Holger Brunn: Added support for parsing invoice items.

Related Projects

OCR-Invoice (FOSS | C#)
DeepLogic AI (Commercial | SaaS)
Docparser (Commercial | Web Service)
A-PDF (Commercial)
PDFdeconstruct (Commercial)
CVision (Commercial)

invoice2data 0.3.6
