
Python3 implementation of Kyle Cronan's pdftable module, with unit tests


This is a Python 3 module and command line utility that analyzes XML output from the
program `pdftohtml` in order to extract tables from PDF files and output the data as CSV.

For example:

    pdftohtml -xml -stdout file.pdf | pdftable -f file%d.csv

See also `pdftable -h`.
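
To make the input format concrete, here is a minimal sketch of reading `pdftohtml -xml` output and grouping text fragments into rows by their `top` coordinate. It is not the algorithm used by this module; the sample XML is hand-written in the shape `pdftohtml` emits, and the function name `rows_from_xml` is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hand-written sample in the shape produced by `pdftohtml -xml`
# (coordinates and text are illustrative, not taken from a real PDF).
SAMPLE_XML = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="100" left="50" width="40" height="12" font="0">Name</text>
<text top="100" left="200" width="30" height="12" font="0">Qty</text>
<text top="120" left="50" width="40" height="12" font="0">Apple</text>
<text top="120" left="200" width="10" height="12" font="0">3</text>
</page>
</pdf2xml>"""


def rows_from_xml(xml_text):
    """Group <text> fragments that share a page and `top` value into rows.

    Hypothetical helper: a naive illustration only. Real tables need
    tolerance for slightly misaligned fragments and column detection.
    """
    root = ET.fromstring(xml_text)
    rows = defaultdict(list)
    for page in root.iter("page"):
        page_number = int(page.get("number"))
        for text in page.iter("text"):
            key = (page_number, int(text.get("top")))
            # itertext() also collects text nested in <b>/<i> children.
            cell = "".join(text.itertext())
            rows[key].append((int(text.get("left")), cell))
    # Order rows top to bottom, and cells within a row left to right.
    return [[cell for _, cell in sorted(cells)]
            for _, cells in sorted(rows.items())]


table = rows_from_xml(SAMPLE_XML)
```

Exact matching on `top` is the simplest possible grouping rule; it is used here only to show which attributes carry the layout information.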

Original author: (c) 2009 Kyle Cronan

This Python 3 implementation: (c) 2017 Phil Gooch

As per Kyle's code, this version is licensed under GPLv3. See LICENSE file.

# Installation

Install `pdftohtml` via the `poppler-utils` package (Linux) or the `poppler` package (Mac OS X).

Then install the module, either from a source checkout:

    python setup.py install

or from PyPI:

    pip install pdftablr

## Command line usage

    pdftohtml -xml -stdout file.pdf | pdftable -f file%d.csv
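
The same pipeline can be driven from a Python script. The following is a sketch that assumes both `pdftohtml` and `pdftable` are on the `PATH`; it uses only the flags shown above, and the helper names `build_commands` and `run_pipeline` are hypothetical.

```python
import subprocess


def build_commands(pdf_path, csv_pattern):
    """Return the two argument lists for the shell pipeline shown above."""
    pdftohtml_cmd = ["pdftohtml", "-xml", "-stdout", pdf_path]
    pdftable_cmd = ["pdftable", "-f", csv_pattern]
    return pdftohtml_cmd, pdftable_cmd


def run_pipeline(pdf_path, csv_pattern):
    """Pipe pdftohtml's XML into pdftable and return both exit codes."""
    pdftohtml_cmd, pdftable_cmd = build_commands(pdf_path, csv_pattern)
    producer = subprocess.Popen(pdftohtml_cmd, stdout=subprocess.PIPE)
    try:
        # pdftable reads the XML from the producer's stdout.
        consumer = subprocess.run(pdftable_cmd, stdin=producer.stdout)
    finally:
        # Close our copy of the pipe so the producer is not left blocked.
        producer.stdout.close()
        producer_rc = producer.wait()
    return producer_rc, consumer.returncode
```

Calling `run_pipeline("file.pdf", "file%d.csv")` is then equivalent to the shell command, with the exit codes available for error handling.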

## Module usage

    from pdftablr.table_extractor import Extractor

    # XML file created from pdftohtml
    input_path = '/path/to/file.xml'

    # Output CSV file
    output_path = '/path/to/output.csv'

    with open(output_path, 'w') as output_file:
        table_extractor = Extractor(output_file=output_file)

        with open(input_path) as f:
            ...  # body of this block is missing from the source

        tables = table_extractor.extract()
        for table in tables:
            ...  # body of this loop is missing from the source
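
Once a CSV file has been written, it can be read back with the standard library. This is a sketch that assumes the output is ordinary comma-delimited CSV; the helper name `read_table` and the sample data are hypothetical, with an in-memory buffer standing in for the output file.

```python
import csv
import io


def read_table(csv_file):
    """Read every row of an open CSV file into a list of lists of strings."""
    return [row for row in csv.reader(csv_file)]


# In-memory stand-in for a file such as /path/to/output.csv.
sample = io.StringIO("Name,Qty\r\nApple,3\r\n")
rows = read_table(sample)
```

For a real file, open it with `open(output_path, newline='')` and pass the file object to `read_table`.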


Files for pdftablr, version 0.1.0: `pdftablr-0.1.0.tar.gz` (7.4 kB, source distribution).
