Camelot: PDF Table Extraction for Humans

Camelot is a Python library that makes it easy for anyone to extract tables from PDF files!


Here's how you can extract tables from PDF files. Check out the PDF used in this example, here.

>>> import camelot
>>> tables = camelot.read_pdf('foo.pdf')
>>> tables
<TableList n=1>
>>> tables.export('foo.csv', f='csv', compress=True) # json, excel, html
>>> tables[0]
<Table shape=(7, 7)>
>>> tables[0].parsing_report
{
    'accuracy': 99.02,
    'whitespace': 12.24,
    'order': 1,
    'page': 1
}
>>> tables[0].to_csv('foo.csv') # to_json, to_excel, to_html
>>> tables[0].df # get a pandas DataFrame!
Cycle Name  KI (1/km)  Distance (mi)  Percent Fuel Savings
                                      Improved Speed  Decreased Accel  Eliminate Stops  Decreased Idle
2012_2      3.30       1.3            5.9%            9.5%             29.2%            17.4%
2145_1      0.68       11.2           2.4%            0.1%             9.5%             2.7%
4234_1      0.59       58.7           8.5%            1.3%             8.5%             3.3%
2032_2      0.17       57.8           21.7%           0.3%             2.7%             1.2%
4171_1      0.07       173.9          58.1%           1.6%             2.1%             0.5%
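
Values such as 5.9% in the table above are text, so they usually need converting before any analysis. The following is a minimal sketch of that step using only the standard library. It assumes the table was written out with to_csv and that the file contains just the five data rows; the exact layout of a real exported file (header rows, quoting) may differ. SAMPLE_CSV, percent_to_float and load_rows are hypothetical names for illustration, not part of Camelot.

```python
import csv
import io

# Assumed CSV text: the five data rows from the table above.
# A real file written by to_csv may also contain header rows.
SAMPLE_CSV = """\
2012_2,3.30,1.3,5.9%,9.5%,29.2%,17.4%
2145_1,0.68,11.2,2.4%,0.1%,9.5%,2.7%
4234_1,0.59,58.7,8.5%,1.3%,8.5%,3.3%
2032_2,0.17,57.8,21.7%,0.3%,2.7%,1.2%
4171_1,0.07,173.9,58.1%,1.6%,2.1%,0.5%
"""


def percent_to_float(cell):
    """Convert a string such as '5.9%' to the float 5.9."""
    return float(cell.strip().rstrip("%"))


def load_rows(text):
    """Parse CSV text into dicts with numeric values (hypothetical helper)."""
    rows = []
    # The first three cells are name, KI and distance; the rest are savings.
    for name, ki, distance, *savings in csv.reader(io.StringIO(text)):
        rows.append({
            "cycle": name,
            "ki_per_km": float(ki),
            "distance_mi": float(distance),
            "savings_pct": [percent_to_float(s) for s in savings],
        })
    return rows


rows = load_rows(SAMPLE_CSV)
print(rows[0]["savings_pct"])
```

To read a file on disk instead, pass its contents, for example load_rows(open('foo.csv').read()), after dropping any header rows the export contains.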

There's a command-line interface too!

Note: Camelot only works with text-based PDFs and not scanned documents. If you can click-and-drag to select text in your table in a PDF viewer, then your PDF is text-based.

Why Camelot?

  • You are in control: Unlike other libraries and tools, which either give nice output or fail miserably with no in-between, Camelot gives you the power to tweak table extraction, since everything in the real world, including PDF table extraction, is fuzzy.
  • Metrics: Bad tables can be discarded based on metrics like accuracy and whitespace, without ever having to manually look at each table.
  • Each table is a pandas DataFrame, which enables seamless integration into ETL and data analysis workflows.
  • Export to multiple formats, including json, excel and html.
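
The metrics point can be sketched as a small filter over each table's parsing_report, which has the accuracy and whitespace keys shown in the example earlier. This is only an illustration: keep_good_tables is a hypothetical helper, the threshold values are arbitrary assumptions rather than Camelot defaults, and the stand-in objects (the second report is made up) exist only so the sketch runs without a PDF.

```python
from types import SimpleNamespace


def keep_good_tables(tables, min_accuracy=90.0, max_whitespace=30.0):
    """Return the tables whose parsing report passes both thresholds.

    The default thresholds are illustrative assumptions, not Camelot defaults.
    """
    good = []
    for table in tables:
        report = table.parsing_report
        if (report["accuracy"] >= min_accuracy
                and report["whitespace"] <= max_whitespace):
            good.append(table)
    return good


# Stand-in objects so the sketch runs without Camelot or a PDF.
# The first report copies the README example; the second is made up.
fake_tables = [
    SimpleNamespace(parsing_report={
        "accuracy": 99.02, "whitespace": 12.24, "order": 1, "page": 1}),
    SimpleNamespace(parsing_report={
        "accuracy": 41.50, "whitespace": 63.10, "order": 2, "page": 1}),
]

good = keep_good_tables(fake_tables)
print(len(good))
```

With Camelot installed, the tables returned by camelot.read_pdf('foo.pdf') would take the place of fake_tables, assuming the returned list can be iterated.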

See the comparison with other PDF table extraction libraries and tools.

Installation

After installing the dependencies (tk and ghostscript), you can simply use pip to install Camelot:

$ pip install camelot-py
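
If the install or a later import fails, it can help to confirm that the two dependencies are visible on your system. The sketch below is a rough check using only the standard library, not something Camelot provides: check_dependencies is a hypothetical helper, the Ghostscript executable names are assumptions that vary by platform, and finding an executable on PATH is only a proxy for a working Ghostscript installation.

```python
import importlib.util
import shutil


def check_dependencies(gs_names=("gs", "gswin64c", "gswin32c")):
    """Report whether Tk bindings and a Ghostscript executable seem present."""
    return {
        # _tkinter is the C extension behind Python 3's tkinter module.
        "tk": importlib.util.find_spec("_tkinter") is not None,
        # Executable names are assumptions and differ between platforms.
        "ghostscript": any(shutil.which(name) for name in gs_names),
    }


status = check_dependencies()
print(status)
```

A False value only means the check could not find that dependency where it looked; a copy installed somewhere else will not be detected.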

Alternatively

After installing the dependencies, clone the repo using:

$ git clone https://www.github.com/socialcopsdev/camelot

and install Camelot using pip:

$ cd camelot
$ pip install .

Note: Use a virtualenv if you don't want to affect your global Python installation.

Documentation

Great documentation is available at http://camelot-py.readthedocs.io/.

Development

The Contributor's Guide has detailed information about contributing code, documentation, tests and more. We've included some basic information in this README.

Source code

You can check out the latest sources with:

$ git clone https://www.github.com/socialcopsdev/camelot

Setting up a development environment

You can install the development dependencies easily, using pip:

$ pip install "camelot-py[dev]"

Testing

After installation, you can run tests using:

$ python setup.py test

Versioning

Camelot uses Semantic Versioning. For the available versions, see the tags on this repository.

License

This project is licensed under the MIT License; see the LICENSE file for details.
