Skip to main content

PDF Table Extraction for Humans.

Reason this release was yanked:

Will cause an import error due to a breaking change in pdfminer.six

Project description

Camelot: PDF Table Extraction for Humans

Build Status Documentation Status codecov.io image image image Gitter chat image image

Camelot is a Python library that makes it easy for anyone to extract tables from PDF files!

Note: You can also check out Excalibur, which is a web interface for Camelot!


Here's how you can extract tables from PDF files. Check out the PDF used in this example here.

>>> import camelot
>>> tables = camelot.read_pdf('foo.pdf')
>>> tables
<TableList n=1>
>>> tables.export('foo.csv', f='csv', compress=True) # json, excel, html, sqlite
>>> tables[0]
<Table shape=(7, 7)>
>>> tables[0].parsing_report
{
    'accuracy': 99.02,
    'whitespace': 12.24,
    'order': 1,
    'page': 1
}
>>> tables[0].to_csv('foo.csv') # to_json, to_excel, to_html, to_sqlite
>>> tables[0].df # get a pandas DataFrame!
Cycle Name KI (1/km) Distance (mi) Percent Fuel Savings
Improved Speed Decreased Accel Eliminate Stops Decreased Idle
2012_2 3.30 1.3 5.9% 9.5% 29.2% 17.4%
2145_1 0.68 11.2 2.4% 0.1% 9.5% 2.7%
4234_1 0.59 58.7 8.5% 1.3% 8.5% 3.3%
2032_2 0.17 57.8 21.7% 0.3% 2.7% 1.2%
4171_1 0.07 173.9 58.1% 1.6% 2.1% 0.5%

There's a command-line interface too!

Note: Camelot only works with text-based PDFs and not scanned documents. (As Tabula explains, "If you can click and drag to select text in your table in a PDF viewer, then your PDF is text-based".)

Why Camelot?

  • You are in control.: Unlike other libraries and tools which either give a nice output or fail miserably (with no in-between), Camelot gives you the power to tweak table extraction. (This is important since everything in the real world, including PDF table extraction, is fuzzy.)
  • Bad tables can be discarded based on metrics like accuracy and whitespace, without ever having to manually look at each table.
  • Each table is a pandas DataFrame, which seamlessly integrates into ETL and data analysis workflows.
  • Export to multiple formats, including JSON, Excel, HTML and Sqlite.

See comparison with other PDF table extraction libraries and tools.

Installation

Using conda

The easiest way to install Camelot is to install it with conda, which is a package manager and environment management system for the Anaconda distribution.

$ conda install -c conda-forge camelot-py

Using pip

After installing the dependencies (tk and ghostscript), you can simply use pip to install Camelot:

$ pip install "camelot-py[cv]"

From the source code

After installing the dependencies, clone the repo using:

$ git clone https://www.github.com/camelot-dev/camelot

and install Camelot using pip:

$ cd camelot
$ pip install ".[cv]"

Documentation

Great documentation is available at http://camelot-py.readthedocs.io/.

Development

The Contributor's Guide has detailed information about contributing code, documentation, tests and more. We've included some basic information in this README.

Source code

You can check the latest sources with:

$ git clone https://www.github.com/camelot-dev/camelot

Setting up a development environment

You can install the development dependencies easily, using pip:

$ pip install "camelot-py[dev]"

Testing

After installation, you can run tests using:

$ python setup.py test

Wrappers

Versioning

Camelot uses Semantic Versioning. For the available versions, see the tags on this repository. For the changelog, you can check out HISTORY.md.

License

This project is licensed under the MIT License, see the LICENSE file for details.

Support the development

You can support our work on Camelot with a one-time or monthly donation on OpenCollective. Organizations who use camelot can also sponsor the project for an acknowledgement on our documentation site and this README.

Special thanks to all the users, organizations and contributors that support Camelot!

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

camelot-py-0.8.1.tar.gz (38.2 kB view details)

Uploaded Source

Built Distribution

camelot_py-0.8.1-py3-none-any.whl (42.8 kB view details)

Uploaded Python 3

File details

Details for the file camelot-py-0.8.1.tar.gz.

File metadata

  • Download URL: camelot-py-0.8.1.tar.gz
  • Upload date:
  • Size: 38.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/46.1.3 requests-toolbelt/0.9.1 tqdm/4.46.0 CPython/3.6.9

File hashes

Hashes for camelot-py-0.8.1.tar.gz
Algorithm Hash digest
SHA256 c18e6ed7a113fb3bf8f60a3c07e4df09c08e921f4eb27ba33d61733183304862
MD5 ce246c314b6d245d19a6d10326eefb6a
BLAKE2b-256 bb7c32597004651ea360c65ccffa53abd203d39aa21563425aef6ab510b1dbcc

See more details on using hashes here.

File details

Details for the file camelot_py-0.8.1-py3-none-any.whl.

File metadata

  • Download URL: camelot_py-0.8.1-py3-none-any.whl
  • Upload date:
  • Size: 42.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/46.1.3 requests-toolbelt/0.9.1 tqdm/4.46.0 CPython/3.6.9

File hashes

Hashes for camelot_py-0.8.1-py3-none-any.whl
Algorithm Hash digest
SHA256 42bf041cddd6accaaae53fce5a318fa2a175de29fe4ecd53ddb46bdf856c6270
MD5 50993f923180ce780bc943b2f2b3f449
BLAKE2b-256 b6cd1735f37d2e2fa44cb6be753bf353049272e7e4c71c1713953faacf210db7

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page