Python 3 implementation of Kyle Cronan's pdftable module, with unit tests

This is a Python 3 module and command line utility that analyzes XML output from the
program `pdftohtml` in order to extract tables from PDF files and output the data as CSV.

For example:

    pdftohtml -xml -stdout file.pdf | pdftable -f file%d.csv

See also `pdftable -h` and http://sourceforge.net/projects/pdftable
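
To give a feel for the input, the sketch below parses a small hand-written sample in the style of `pdftohtml -xml` output and groups text fragments by their `top` coordinate. The sample XML, the function name `rows_by_top`, and the grouping rule are illustrative assumptions; this is not necessarily how pdftable itself detects tables.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hand-written sample imitating pdftohtml -xml output (not from a real PDF).
SAMPLE_XML = """<pdf2xml>
  <page number="1" top="0" left="0" height="1188" width="918">
    <text top="100" left="50" width="40" height="12" font="0">Name</text>
    <text top="100" left="200" width="30" height="12" font="0">Qty</text>
    <text top="120" left="50" width="45" height="12" font="0">Apples</text>
    <text top="120" left="200" width="10" height="12" font="0">3</text>
  </page>
</pdf2xml>"""


def rows_by_top(xml_text):
    """Group <text> fragments on each page by their 'top' attribute."""
    root = ET.fromstring(xml_text)
    rows = []
    for page in root.iter("page"):
        by_top = defaultdict(list)
        for text in page.iter("text"):
            # itertext() also picks up text nested in child elements.
            content = "".join(text.itertext()).strip()
            by_top[int(text.get("top"))].append((int(text.get("left")), content))
        for top in sorted(by_top):
            # Order the fragments of each row from left to right.
            rows.append([cell for _, cell in sorted(by_top[top])])
    return rows


print(rows_by_top(SAMPLE_XML))
```

Real PDFs rarely align this cleanly, so a practical extractor would need some tolerance when comparing coordinates.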

Original author: (c) 2009 Kyle Cronan <kyle@pbx.org>

This Python 3 implementation: (c) 2017 Phil Gooch

As per Kyle's code, this version is licensed under GPLv3. See LICENSE file.

# Installation

Install `pdftohtml` via `poppler-utils` (Linux) or `poppler` (macOS).

Then install the module from a source checkout:

    python setup.py install

or from PyPI:

    pip install pdftablr

## Command line usage

    pdftohtml -xml -stdout file.pdf | pdftable -f file%d.csv
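
The same pipeline can also be driven from Python with the standard `subprocess` module. The following is a minimal sketch, assuming both `pdftohtml` and `pdftable` are on the `PATH`; the helper names `build_commands` and `run_pipeline` are hypothetical and not part of this package.

```python
import shlex
import subprocess


def build_commands(pdf_path, csv_pattern):
    """Return the two argument lists for the pipeline shown above."""
    convert = ["pdftohtml", "-xml", "-stdout", pdf_path]
    extract = ["pdftable", "-f", csv_pattern]
    return convert, extract


def run_pipeline(pdf_path, csv_pattern):
    """Pipe pdftohtml's XML output into pdftable; needs both tools installed."""
    convert, extract = build_commands(pdf_path, csv_pattern)
    producer = subprocess.Popen(convert, stdout=subprocess.PIPE)
    consumer = subprocess.Popen(extract, stdin=producer.stdout)
    # Close our copy of the pipe so pdftohtml sees a broken pipe
    # if pdftable exits early.
    producer.stdout.close()
    consumer.wait()
    producer.wait()
    return consumer.returncode


if __name__ == "__main__":
    # Only print the equivalent shell command; nothing is executed here.
    convert, extract = build_commands("file.pdf", "file%d.csv")
    print(shlex.join(convert), "|", shlex.join(extract))
```

Passing argument lists rather than a shell string avoids quoting problems when file names contain spaces.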

## Module usage

    from pdftablr.table_extractor import Extractor

    # XML file created from pdftohtml
    input_path = '/path/to/file.xml'

    # Output CSV file
    output_path = '/path/to/output.csv'

    with open(output_path, 'w') as output_file:
        table_extractor = Extractor(output_file=output_file)

        with open(input_path) as f:
            table_extractor.read_file(f)

        tables = table_extractor.extract()
        for table in tables:
            table.output(writer=None)
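
Once a CSV file has been written, it can be checked with the standard `csv` module. This is a minimal sketch, assuming the output is ordinary comma-separated text; the helper `read_table` is hypothetical, and the demo writes its own stand-in file rather than using real pdftable output.

```python
import csv
import os
import tempfile


def read_table(csv_path):
    """Load a CSV file into a list of rows (each row a list of strings)."""
    with open(csv_path, newline="") as f:
        return list(csv.reader(f))


# Demo on a stand-in file; replace demo_path with your own output path.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "output.csv")
    with open(demo_path, "w", newline="") as f:
        csv.writer(f).writerows([["Name", "Qty"], ["Apples", "3"]])
    rows = read_table(demo_path)

print(rows)
```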
