Skip to main content

Splits specific PDF files and stores the result in a custom directory layout

Project description

Split a large pdf file into several pieces.

Licence

This is a free software (GPLv3), see LICENCE.txt for licencing info.

What does it do

Provide a PDF file with all the salary sheets (or other type). In each page, you provide an ancode and a name (“name” of the employee and a specific key called “ancode”). This program splits the pdf into separated salary sheets.

It allows the people in charge of the salary sheet distribution to generate a file from their accounting software and to get an output with a structured directory structure.

enDI matches that directory structure to distribute the salary sheets through its interface.

By extension, it’s possible to configure splitting for several type of data.

How

The original file is named <filetype>_<year>_<month>.pdf (e.g : salaire_2020_11.pdf).

  • The filetype is used to fetch the appropriate parsing configuration.

  • The generated pdfs are placed under a directory named <filetype>/<year>/<month>/.

Using the configuration the tool searches for an “ancode” and a “name” into the pdf pages. It generates files in the form <ancode>_<name>.pdf containing the pages with the associated keywords, into the appropriate directory.

Installation

Distro dependencies

Install the pdftotext command line tool.

Debian based distros

apt-get install poppler-utils virtualenvwrapper

Install from pypi

mkvirtualenv splitpdf
pip install endi-split-pdf

Install from source

Setup a python virtual environment

mkvirtualenv splitpdf
pip install -r requirements.txt
pip install -e

Development mode

pip install -e .[dev]

Configuration

Main config file

Defaults to ~/.endi_split_pdf_config.yaml Specifiable by use -c <configfile>

Format is yaml.

Configure :

  • Filetypes

  • For each filetype, a parsing strategy describing the datatype to extract, the way they are extracted and the name of the file to generate. (Several parser can be provided, here we have three parsers for ancode and two for name)

Example

salaire:
    filename_template: {data1}_{data2}_{data3}.pdf
    datatypes:
       data1:
          - type: regex
            regex: '[\s]{30,80}(?P<ancode>[A-Z]{4,12})\s+'
            from_line: 11
            to_line: 14
            strict: True
            regex_group_name: ancode

          - type: coordinates
            line: 12
            column: 40
            prefix: "     "
          - type: coordinates
            line: 13
            column: 40
            prefix: "     "

        data2:
          - type: regex
            regex: '\s+(M|Mme|Mlle)\s+(?P<name>[\w\s]+)(\n|$)'
            from_line: 6
            to_line: 9
            regex_group_name: name
            strict: True

          - type: coordinates
            line: 8
            column: 50

        data3:
          - type: coordinates
            line: 2
            column: 60
            prefix: "Section :"

Here we configured the parsing of files named salaire_<year>_<month>.pdf

For each page, we’ll extract three data and group all pages by matching data1/data2/data3 3-uples. The pages will be grouped in a file named with the filename_template.

Finders

For each data you can cumulate several “finders” to achieve data retrieval. Finders are used in the specified order.

There are two types of finders.

RegexDataFinder

Use a regex to find the data in the page’s string representation (pdftotext output)

regex

The regex to use for matching the data (ancode or name) we’re looking for.

strict (default to False)

If True a result is returned if only one item matches the regexp

from_line / to_line (optionnal)

Specify a range of lines to restrict the amount of data we look at The Finder will only look between from_line and to_line line numbers (starts with 1).

regex_group_name (optionnal)

If the regexp uses groups, specify the group matching the data

CoordinateDataFinder

Use coordinates to find the data in the page’s string representation (pdftotext output)

line

The line where to look for the data

column

Where to start in the line

prefix (optionnal)

A prefix after which we should find the data

Logging

This program supports advanced logging configuration with the following options.

verbosity: DEBUG

Available options are DEBUG/INFO/WARNING/ERROR/CRITICAL

Syslog logging

use_syslog: true

Mail logging

log_to_mail: true
mail:
    host: localhost
    from: admin@host.fr
    to: contact@example.net
    subject: '[%(hostname)s] Log of the pdf splitter'

That program is smart enough to use syslog if the config specifies it.

It logs to mail if the config contains log_to_mail: True

Examples

A full run:

endi-split-pdf-run -c myconfig.yaml playground/salaires_2013_07.pdf

Test that the file is parseable on the 5 first pages:

endi-split-pdf-run  -c myconfig.yaml playground/salaires_2013_07.pdf -r 5

Use -v for debug messages.

Use -h to get a complete overview of options.

Known problems

  • for payrolls

    cannot handle some PDF files, especially if there is no outline and the charset is ‘binary’. Check this with:

    file -i $filename.pdf

Known problems

When the logs returns messages like :

CRITICAL - No page of output!

If the end user generates his files with Sage “Édition pilotée”, the problem may be that the end user used “print to PDF” export instead of “Save to PDF”.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

endi_split_pdf-1.0.10.tar.gz (29.2 kB view details)

Uploaded Source

Built Distribution

endi_split_pdf-1.0.10-py3-none-any.whl (31.1 kB view details)

Uploaded Python 3

File details

Details for the file endi_split_pdf-1.0.10.tar.gz.

File metadata

  • Download URL: endi_split_pdf-1.0.10.tar.gz
  • Upload date:
  • Size: 29.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.7.1 importlib_metadata/4.10.1 pkginfo/1.8.2 requests/2.27.1 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.10.8

File hashes

Hashes for endi_split_pdf-1.0.10.tar.gz
Algorithm Hash digest
SHA256 37e20b89339469fcaf0fdd00b32e636a48536224a90e0ff8ac6c470ca0cf0e7b
MD5 d6a9d896c4b76654fb375a5bd6d56c6c
BLAKE2b-256 bf4a4da43d92c2e5eea518bf0f7ed5ccb1988aed91312476b6a0576c34cad0ee

See more details on using hashes here.

File details

Details for the file endi_split_pdf-1.0.10-py3-none-any.whl.

File metadata

  • Download URL: endi_split_pdf-1.0.10-py3-none-any.whl
  • Upload date:
  • Size: 31.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.7.1 importlib_metadata/4.10.1 pkginfo/1.8.2 requests/2.27.1 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.10.8

File hashes

Hashes for endi_split_pdf-1.0.10-py3-none-any.whl
Algorithm Hash digest
SHA256 1a9a056df214149d933eb21d616d4b7ad108d52b33ab59d619b4b668701865f9
MD5 a4baaaced3bb580528deff315c784363
BLAKE2b-256 ec50bbbac1691a6750bc4c2a4d095b5cfeb083f6e4ea0b749d8668d07e199184

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page