PDF Statement Reader

Python library and command line tool for parsing pdf bank statements

Inspired by https://github.com/antonburger/pdf2csv

Objectives

Banks generally send account statements in pdf format. These pdfs are often encrypted, tables are difficult to extract from the pdf format, and once a table is finally extracted it is rarely in a tidy format. This package aims to help by providing a library of functions and a set of command line tools for converting these statements into more useful formats such as csv files and pandas dataframes.

Installation

pip install pdf-statement-reader

Troubleshooting

This package uses tabula-py under the hood, which is itself a wrapper for tabula-java. You therefore need Java installed for it to work. If you get errors complaining about java, check the tabula-py page for troubleshooting advice.
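Before digging into tabula-py's troubleshooting notes, it can help to confirm that a java executable is visible at all. The snippet below is a minimal sketch using only the Python standard library; the helper names find_java and java_version_text are hypothetical and are not part of this package or of tabula-py.

```python
import shutil
import subprocess


def find_java():
    """Return the full path of the `java` executable on PATH, or None."""
    return shutil.which("java")


def java_version_text(java_path):
    """Return whatever `java -version` prints (Java writes it to stderr)."""
    result = subprocess.run(
        [java_path, "-version"],
        capture_output=True,
        text=True,
        check=False,
        timeout=30,
    )
    return (result.stderr or result.stdout).strip()


java_path = find_java()
if java_path is None:
    print("java was not found on PATH; install a Java runtime first")
else:
    print(f"java found at {java_path}")
```

If a path is printed, calling java_version_text(java_path) shows which runtime would be picked up.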

In the future, we hope to move to a pure python implementation.

Usage

The package provides a command line application, psr:

Usage: psr [OPTIONS] COMMAND [ARGS]...

  Utility for reading bank and other statements in pdf form

Options:
  --help  Show this message and exit.

Commands:
  bulk      Bulk converts all files in a folder
  decrypt   Decrypts a pdf file Uses pikepdf to open an encrypted pdf file...
  pdf2csv   Converts a pdf statement to a csv file using a given format
  validate  Validates the csv statement rolling balance

Configuration

PDF files are notoriously difficult to extract data from. (Here's a nice blog post on why.) For a really good semi-manual GUI solution, check out tabula. In fact, this package uses tabula's pdf parsing library under the hood.

Since statements of a given type generally share the same (if inconvenient) format, we can set up a configuration that tells the tool how to grab the data.

The exact format differs for each type of bank statement, so a config file holds the instructions for processing the raw pdf. For now, the only supported config is for cheque account statements from Absa bank in South Africa.

To set up a different statement, you can simply add a new config file and then tell the psr tool to use it. These config files are stored in a folder structure as follows:

config > [country code] > [bank] > [statement type].json

So for example the default config is stored in

config > za > absa > cheque.json

The config spec is a code of the form

[country code].[bank].[statement type]

Once again for the default this will be

za.absa.cheque
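The relationship between a config spec and its file location can be sketched in a few lines. This is an illustration of the naming scheme described above, not the package's own lookup code; the function name config_path and the config_root parameter are assumptions made for the example.

```python
from pathlib import Path


def config_path(spec, config_root="config"):
    """Map a spec such as 'za.absa.cheque' to config/za/absa/cheque.json."""
    parts = spec.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(
            f"expected [country code].[bank].[statement type], got {spec!r}"
        )
    country, bank, statement_type = parts
    return Path(config_root) / country / bank / f"{statement_type}.json"


print(config_path("za.absa.cheque").as_posix())  # config/za/absa/cheque.json
```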

The configuration file itself is in JSON format. Here's the Absa cheque account one with some commentary to explain what each field does.

{
    // Describes the page layout that should be scanned
    "layout": { 
        // Default layout for all pages not otherwise defined
        "default": {
            // The page coordinates of the area containing the table, in pts
            // [top, left, bottom, right]
            "area": [280, 27, 763, 576],
            // The right x coordinate of each column in the table
            "columns": [83, 264, 344, 425, 485, 570]
        },
        // Layout for the first page
        "first": {
            "area": [480, 27, 763, 576],
            "columns": [83, 264, 344, 425, 485, 570]
        }
    },

    // The column names to be used, exactly as they appear
    // in the statement
    "columns": {
        "trans_date": "Date",
        "trans_type": "Transaction Description",
        "trans_detail": "Transaction Detail",
        "debit": "Debit Amount",
        "credit": "Credit Amount",
        "balance": "Balance"
    },

    // The order of the columns to be output in the csv
    "order": [
        "trans_date",
        "trans_type",
        "trans_detail",
        "debit",
        "credit",
        "balance"
    ],

    // Specifies any cleaning operations required
    "cleaning": {
        // Convert these columns to numeric
        "numeric": ["debit", "credit", "balance"],
        // Convert these columns to date
        "date": ["trans_date"],
        // Use this date format to parse any date columns
        "date_format": "%d/%m/%Y",
        // For cases where the transaction detail is stored
        // in the next line below the transaction type
        "trans_detail": "below",
        // Only keep the rows where these columns are populated
        "dropna": ["balance"]
    }
}
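Note that the // commentary above is there for explanation only: standard JSON has no comment syntax, so Python's json module will reject the annotated text as written. If you want to start from the annotated example, one option is to drop the comment lines before parsing. The sketch below assumes every comment sits on a line of its own, as in the example; load_commented_json is a hypothetical helper, not part of this package.

```python
import json


def load_commented_json(text):
    """Parse JSON after dropping lines that consist only of a // comment."""
    kept = [
        line for line in text.splitlines()
        if not line.lstrip().startswith("//")
    ]
    return json.loads("\n".join(kept))


example = """
{
    // Convert these columns to numeric
    "numeric": ["debit", "credit", "balance"],
    // Use this date format to parse any date columns
    "date_format": "%d/%m/%Y"
}
"""

cleaning = load_commented_json(example)
print(cleaning["date_format"])  # %d/%m/%Y
```

Only whole-line comments are removed, so a // that appears inside a string value is left untouched.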

These are the configuration options required for the default format. The list of options is expected to grow as more formats are added.
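To make the layout section more concrete, here is a sketch of how a per-page layout could be chosen and turned into keyword arguments of the kind tabula-py's read_pdf accepts (pages, area, columns, guess). The README does not show how the package itself calls tabula, so treat this as an assumption-laden illustration: the helper names are hypothetical, and it assumes "first" applies to page 1 only.

```python
def layout_for_page(config, page_number):
    """Return the layout for a 1-based page number.

    Assumption: "first" applies to page 1 only; every other page
    falls back to "default".
    """
    layouts = config["layout"]
    if page_number == 1 and "first" in layouts:
        return layouts["first"]
    return layouts["default"]


def read_pdf_kwargs(config, page_number):
    """Build keyword arguments in the shape tabula-py's read_pdf accepts."""
    layout = layout_for_page(config, page_number)
    return {
        "pages": page_number,
        "area": layout["area"],        # [top, left, bottom, right] in pts
        "columns": layout["columns"],  # x coordinates of column boundaries
        "guess": False,                # use the given area as-is
    }


config = {
    "layout": {
        "default": {
            "area": [280, 27, 763, 576],
            "columns": [83, 264, 344, 425, 485, 570],
        },
        "first": {
            "area": [480, 27, 763, 576],
            "columns": [83, 264, 344, 425, 485, 570],
        },
    }
}

print(read_pdf_kwargs(config, 1)["area"])  # [480, 27, 763, 576]
print(read_pdf_kwargs(config, 2)["area"])  # [280, 27, 763, 576]
```

With tabula-py and Java installed, the resulting dictionary could be passed as tabula.read_pdf("statement.pdf", **read_pdf_kwargs(config, 1)), where statement.pdf is a placeholder file name.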

CLI API

decrypt

Usage: psr decrypt [OPTIONS] INPUT_FILENAME [OUTPUT_FILENAME]

  Decrypts a pdf file

  Uses pikepdf to open an encrypted pdf file and then save the unencrypted
  version. If no output_filename is specified then overwrites the original
  file.

Options:
  -p, --password TEXT  The pdf encryption password. If not supplied, it will
                       be requested at the prompt
  --help               Show this message and exit.
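For library use, the same open-then-save approach can be sketched directly with pikepdf. This is an illustration of the technique described in the help text, not the package's own implementation; decrypt_pdf is a hypothetical name. It relies on pikepdf.open accepting password and allow_overwriting_input, and on Pdf.save writing an unencrypted file by default.

```python
def decrypt_pdf(input_filename, password, output_filename=None):
    """Open an encrypted pdf and save an unencrypted copy.

    If output_filename is None, the original file is overwritten.
    """
    # Imported here so this sketch can be loaded without pikepdf installed.
    import pikepdf

    target = output_filename or input_filename
    # allow_overwriting_input lets save() write back to the file being read.
    overwrite = target == input_filename
    with pikepdf.open(
        input_filename,
        password=password,
        allow_overwriting_input=overwrite,
    ) as pdf:
        # save() drops the existing encryption unless told otherwise.
        pdf.save(target)
    return target
```

A wrong password makes pikepdf.open raise pikepdf.PasswordError, which a caller may want to catch and report.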

pdf2csv

Usage: psr pdf2csv [OPTIONS] INPUT_FILENAME [OUTPUT_FILENAME]

  Converts a pdf statement to a csv file using a given format

Options:
  -c, --config TEXT  The configuration code defining how the file should be
                     parsed  [default: za.absa.cheque]
  --help             Show this message and exit.

validate

Usage: psr validate [OPTIONS] INPUT_FILENAME

  Validates the csv statement rolling balance

Options:
  -c, --config TEXT  The configuration code defining how the file should be
                     parsed  [default: za.absa.cheque]
  --help             Show this message and exit.
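The idea of a rolling balance check can be illustrated as follows. The README does not spell out how validate works internally, so this sketch rests on stated assumptions: each row is a dict with Decimal values under "debit", "credit" and "balance", missing amounts are None, debits reduce the balance and credits increase it. The function name first_balance_mismatch is hypothetical.

```python
from decimal import Decimal


def first_balance_mismatch(rows):
    """Return the index of the first row that breaks the running balance.

    Returns None when every row's balance equals the previous balance
    plus the row's credit minus the row's debit.
    """
    zero = Decimal("0")
    for i in range(1, len(rows)):
        previous, row = rows[i - 1], rows[i]
        debit = row["debit"] or zero
        credit = row["credit"] or zero
        expected = previous["balance"] + credit - debit
        if expected != row["balance"]:
            return i
    return None


rows = [
    {"debit": None, "credit": None, "balance": Decimal("1000.00")},
    {"debit": Decimal("250.00"), "credit": None, "balance": Decimal("750.00")},
    {"debit": None, "credit": Decimal("100.50"), "balance": Decimal("850.50")},
]

print(first_balance_mismatch(rows))  # None
```

Decimal is used instead of float so that amounts such as 100.50 compare exactly.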

bulk

Usage: psr bulk [OPTIONS] FOLDER

  Bulk converts all files in a folder

Options:
  -c, --config TEXT          The configuration code defining how the file
                             should be parsed  [default: za.absa.cheque]
  -p, --password TEXT        The pdf encryption password. If not supplied, it
                             will be requested at the prompt
  -d, --decrypt-suffix TEXT  The suffix to append to the decrypted pdf file
                             when created  [default: _decrypted]
  -k, --keep-decrypted       Keep the a copy of the decrypted file. It is
                             removed by default
  -v, --verbose              Print verbose output while running
  --help                     Show this message and exit.
