pdftablr

Python 3 implementation of Kyle Cronan's pdftable module, with unit tests

This is a Python 3 module and command-line utility that parses the XML output of
`pdftohtml` to extract tables from PDF files and write them out as CSV.

For example:

    pdftohtml -xml -stdout file.pdf | pdftable -f file%d.csv

See also `pdftable -h` and http://sourceforge.net/projects/pdftable

Original author: (c) 2009 Kyle Cronan <kyle@pbx.org>

This Python 3 implementation: (c) 2017 Phil Gooch

As per Kyle's code, this version is licensed under GPLv3. See LICENSE file.

# Installation

Install `pdftohtml`, which is provided by the `poppler-utils` package on Linux or `poppler` on macOS.

Then install the module from a source checkout:

    python setup.py install

or from PyPI:

    pip install pdftablr
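
Before running the pipeline it can help to confirm that both executables are actually on the `PATH`. The following is a minimal sketch using only the standard library; the function name `missing_tools` is hypothetical and not part of pdftablr.

```python
import shutil


def missing_tools(names):
    """Return the subset of `names` that cannot be found on the PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    # pdftohtml comes from poppler; pdftable is the command used in this README.
    missing = missing_tools(["pdftohtml", "pdftable"])
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools are available.")
```

An empty result means both commands were found; anything listed still needs to be installed.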

## Command line usage

    pdftohtml -xml -stdout file.pdf | pdftable -f file%d.csv

## Module usage

    from pdftablr.table_extractor import Extractor

    # XML file created from pdftohtml
    input_path = '/path/to/file.xml'

    # Output CSV file
    output_path = '/path/to/output.csv'

    with open(output_path, 'w') as output_file:
        table_extractor = Extractor(output_file=output_file)

        with open(input_path) as f:
            table_extractor.read_file(f)

        tables = table_extractor.extract()
        for table in tables:
            table.output(writer=None)
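
The two steps (running `pdftohtml`, then extracting tables) could also be driven from a single Python script. This is a rough sketch rather than part of the package: the helper names `pdftohtml_command` and `pdf_to_csv` are hypothetical, the `Extractor` calls are exactly those shown in the module example, and it assumes Python 3.7+ for `capture_output`. The XML is saved to an intermediate file so that it is read the same way as in the example.

```python
import subprocess
from pathlib import Path


def pdftohtml_command(pdf_path):
    """Build the pdftohtml invocation used in the command line example."""
    return ["pdftohtml", "-xml", "-stdout", str(pdf_path)]


def pdf_to_csv(pdf_path, xml_path, csv_path):
    """Convert a PDF to XML with pdftohtml, then write its tables to a CSV file."""
    # Imported lazily so pdftohtml_command() works without pdftablr installed.
    from pdftablr.table_extractor import Extractor

    # Run pdftohtml and store its XML output in an intermediate file.
    result = subprocess.run(
        pdftohtml_command(pdf_path), check=True, capture_output=True
    )
    Path(xml_path).write_bytes(result.stdout)

    # Same sequence of calls as in the module usage example.
    with open(csv_path, "w") as output_file:
        extractor = Extractor(output_file=output_file)
        with open(xml_path) as f:
            extractor.read_file(f)
        for table in extractor.extract():
            table.output(writer=None)
```

A call such as `pdf_to_csv('file.pdf', 'file.xml', 'file.csv')` would then stand in for the shell pipeline, with `check=True` raising an error if `pdftohtml` fails.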

This version: 0.1.0 (released Nov 1, 2017)

Source distribution: pdftablr-0.1.0.tar.gz (7.4 kB), uploaded Nov 1, 2017. Not uploaded using Trusted Publishing.

Hashes for pdftablr-0.1.0.tar.gz:

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `485c4ce1df97176231ff6269eaaab3c1c58f578b5cea395c645ebb4f5e662a1f` |
| MD5 | `cd004355619ecbf2883360170ad71dfe` |
| BLAKE2b-256 | `2fb6e9c613e989f7e95d21e83d988fd27141e52460b1d973a4f6a74051d0af5a` |
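
A downloaded archive can be checked against the SHA256 digest listed above. The snippet below is a small sketch using the standard `hashlib` module; the function names are hypothetical and the file path in the usage note is only an example.

```python
import hashlib

# SHA256 digest published for pdftablr-0.1.0.tar.gz
EXPECTED_SHA256 = "485c4ce1df97176231ff6269eaaab3c1c58f578b5cea395c645ebb4f5e662a1f"


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected=EXPECTED_SHA256):
    """Check whether the file at `path` has the expected SHA256 digest."""
    return sha256_of_file(path) == expected
```

For instance, `matches_published_hash('pdftablr-0.1.0.tar.gz')` should return `True` for an intact download of this release.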
