Python 3 implementation of Kyle Cronan's pdftable module, with unit tests
This is a Python 3 module and command line utility that analyzes XML output from the
program `pdftohtml` in order to extract tables from PDF files and output the data as CSV.
For example:

    pdftohtml -xml -stdout file.pdf | pdftable -f file%d.csv

See also `pdftable -h` and http://sourceforge.net/projects/pdftable
Original author: (c) 2009 Kyle Cronan <kyle@pbx.org>
This Python 3 implementation: (c) 2017 Phil Gooch
As per Kyle's code, this version is licensed under GPLv3. See LICENSE file.
# Installation
Install `pdftohtml` via `poppler-utils` (Linux) or `poppler` (macOS).

Then install the module from a source checkout:

    python setup.py install

or from PyPI:

    pip install pdftablr
## Command line usage

    pdftohtml -xml -stdout file.pdf | pdftable -f file%d.csv

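
The same pipeline can be driven from a Python script. The following is a minimal sketch, assuming both `pdftohtml` and `pdftable` are on the `PATH`; the helper names `build_commands` and `run_pipeline` are hypothetical and not part of pdftablr. It uses only the flags shown in the command above.

```python
import subprocess


def build_commands(pdf_path, csv_pattern):
    """Return the argument lists for the two stages of the shell pipeline."""
    producer = ["pdftohtml", "-xml", "-stdout", pdf_path]
    consumer = ["pdftable", "-f", csv_pattern]
    return producer, consumer


def run_pipeline(pdf_path, csv_pattern):
    """Pipe pdftohtml's XML output into pdftable (hypothetical helper)."""
    producer_cmd, consumer_cmd = build_commands(pdf_path, csv_pattern)
    producer = subprocess.Popen(producer_cmd, stdout=subprocess.PIPE)
    consumer = subprocess.Popen(consumer_cmd, stdin=producer.stdout)
    # Close our copy of the pipe so the producer sees a broken pipe
    # if the consumer exits early.
    producer.stdout.close()
    consumer_rc = consumer.wait()
    producer_rc = producer.wait()
    if producer_rc != 0 or consumer_rc != 0:
        raise RuntimeError(
            f"pipeline failed: pdftohtml={producer_rc}, pdftable={consumer_rc}"
        )
```

Calling `run_pipeline("file.pdf", "file%d.csv")` should then behave like the shell command, without going through a shell.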
## Module usage

    from pdftablr.table_extractor import Extractor

    # XML file created from pdftohtml
    input_path = '/path/to/file.xml'
    # Output CSV file
    output_path = '/path/to/output.csv'

    with open(output_path, 'w') as output_file:
        table_extractor = Extractor(output_file=output_file)
        with open(input_path) as f:
            table_extractor.read_file(f)
        tables = table_extractor.extract()
        for table in tables:
            table.output(writer=None)

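
To give a feel for the kind of input the extractor works on, here is a rough sketch of reading `pdftohtml -xml` output with the standard library. It is not the algorithm pdftablr uses. It assumes the usual layout of that XML (`<page>` elements containing `<text>` elements with `top` and `left` attributes); the sample document and the `rows_by_top` helper are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hand-written sample in the assumed pdftohtml -xml layout.
SAMPLE_XML = """<pdf2xml>
<page number="1" top="0" left="0" height="1188" width="918">
<text top="100" left="50" width="40" height="12" font="0">Name</text>
<text top="100" left="200" width="30" height="12" font="0">Qty</text>
<text top="120" left="50" width="45" height="12" font="0">Apple</text>
<text top="120" left="200" width="10" height="12" font="0">3</text>
</page>
</pdf2xml>"""


def rows_by_top(xml_text):
    """Group text fragments that share a page and `top` value into rows."""
    root = ET.fromstring(xml_text)
    rows = defaultdict(list)
    for page in root.iter("page"):
        page_number = int(page.get("number"))
        for text in page.iter("text"):
            key = (page_number, int(text.get("top")))
            content = "".join(text.itertext()).strip()
            rows[key].append((int(text.get("left")), content))
    # Order rows top to bottom, and cells within a row left to right.
    return [
        [content for _, content in sorted(cells)]
        for _, cells in sorted(rows.items())
    ]


print(rows_by_top(SAMPLE_XML))
```

Real PDFs rarely align this neatly: fragments on one visual row can differ by a pixel or two in `top`, and columns have to be inferred from `left` and `width`. That harder analysis is the part the `Extractor` class is there to handle.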