Python 3 implementation of Kyle Cronan's pdftable module, with unit tests
This is a Python 3 module and command line utility that analyzes XML output from the
program `pdftohtml` in order to extract tables from PDF files and output the data as CSV.
For example:

    pdftohtml -xml -stdout file.pdf | pdftable -f file%d.csv

See also `pdftable -h` and http://sourceforge.net/projects/pdftable
Original author: (c) 2009 Kyle Cronan <kyle@pbx.org>
This Python 3 implementation: (c) 2017 Phil Gooch
As per Kyle's code, this version is licensed under GPLv3. See LICENSE file.
## Installation
Install `pdftohtml` via `poppler-utils` (Linux) or `poppler` (macOS).

Then install the module:

    python setup.py install

or

    pip install pdftablr

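Because the module depends on the external `pdftohtml` program, it can help to confirm that the executable is on the `PATH` before running anything. The following is a minimal sketch using only the standard library; the helper name `pdftohtml_available` is hypothetical and not part of `pdftablr`.

```python
import shutil


def pdftohtml_available(executable="pdftohtml"):
    """Return True if the named executable can be found on PATH.

    Hypothetical helper for illustration; pdftablr does not provide it.
    """
    return shutil.which(executable) is not None
```

Calling `pdftohtml_available()` returns `False` when poppler has not been installed or is not on the `PATH`.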
## Command line usage

    pdftohtml -xml -stdout file.pdf | pdftable -f file%d.csv

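The first stage of this pipeline can also be driven from Python, for instance to save the XML to a file for later use with the module. This is a rough sketch, assuming `pdftohtml` is installed and accepts the `-xml -stdout` flags shown above; the function names `build_pdftohtml_command` and `pdf_to_xml_file` are hypothetical and not part of `pdftablr`.

```python
import subprocess


def build_pdftohtml_command(pdf_path):
    """Build the argument list for the pdftohtml call used in this README."""
    return ["pdftohtml", "-xml", "-stdout", str(pdf_path)]


def pdf_to_xml_file(pdf_path, xml_path):
    """Run pdftohtml and write its XML output to xml_path.

    Hypothetical helper. Raises subprocess.CalledProcessError if
    pdftohtml exits with a non-zero status.
    """
    result = subprocess.run(
        build_pdftohtml_command(pdf_path),
        check=True,
        capture_output=True,
    )
    # stdout is bytes here, so write in binary mode to avoid re-encoding.
    with open(xml_path, "wb") as xml_file:
        xml_file.write(result.stdout)
    return xml_path
```

For example, `pdf_to_xml_file('file.pdf', 'file.xml')` would produce an XML file that can then be read by the extractor.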
## Module usage

    from pdftablr.table_extractor import Extractor

    # XML file created from pdftohtml
    input_path = '/path/to/file.xml'
    # Output CSV file
    output_path = '/path/to/output.csv'

    with open(output_path, 'w') as output_file:
        table_extractor = Extractor(output_file=output_file)
        with open(input_path) as f:
            table_extractor.read_file(f)
        tables = table_extractor.extract()
        for table in tables:
            table.output(writer=None)

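Once the CSV file has been written, its rows can be loaded back for inspection. The sketch below assumes the output is ordinary comma-separated text that the standard `csv` module can parse with its default settings; `read_table_rows` is a hypothetical helper, not part of `pdftablr`.

```python
import csv


def read_table_rows(csv_path):
    """Return the rows of a CSV file as a list of lists of strings.

    Hypothetical helper; assumes default csv dialect and platform encoding.
    """
    # newline="" lets the csv module handle line endings itself.
    with open(csv_path, newline="") as csv_file:
        return list(csv.reader(csv_file))
```

For example, `read_table_rows('/path/to/output.csv')` gives one inner list per table row.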