A web interface to extract tabular data from PDFs

These details have been verified by PyPI

Maintainers

vinayakmehta

Project description

Excalibur: A web interface to extract tabular data from PDFs

Excalibur is a web interface to extract tabular data from PDFs, written in Python 3! It is powered by Camelot.

Note: Excalibur only works with text-based PDFs and not scanned documents. (As Tabula explains, "If you can click and drag to select text in your table in a PDF viewer, then your PDF is text-based".)

Using Excalibur

Note: You need to install ghostscript before moving forward.

After installing Excalibur with pip, you need to initialize the metadata database using:

$ excalibur initdb

And then start the webserver using:

$ excalibur webserver

That's it! Now you can go to http://localhost:5000 and start extracting tabular data from your PDFs.
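If you script the two commands above, a small readiness check can confirm that the web server is answering before you open the browser. The following is a minimal sketch using only the Python standard library. The function name `is_up` is made up for illustration and is not part of Excalibur; the default URL is the one given above.

```python
import urllib.error
import urllib.request


def is_up(url: str = "http://localhost:5000", timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers at `url`, False otherwise.

    Hypothetical helper for illustration; not part of Excalibur.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just not with a success status.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, and similar.
        return False
```

Calling `is_up()` after `excalibur webserver` has started should return `True`; if it returns `False`, the server is probably not running or is listening on a different port.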

1. Upload a PDF and enter the page numbers you want to extract tables from.
2. Go to each page and select the table by drawing a box around it. (You can skip this step, since Excalibur can detect tables automatically. Click on "Autodetect tables" to see what Excalibur sees.)
3. Choose a flavor (Lattice or Stream) from "Advanced".

   a. Lattice: For tables formed with lines.

   b. Stream: For tables formed with whitespace.
4. Click on "View and download data" to see the extracted tables.
5. Select your favorite format (CSV/Excel/JSON/HTML) and click on "Download"!
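The first step asks for page numbers. This README does not say which syntax the page field accepts, so the sketch below only illustrates one common convention: comma-separated pages and inclusive ranges such as `1,3-5`. The helper `parse_pages` is hypothetical and not taken from Excalibur's code.

```python
def parse_pages(spec: str) -> list[int]:
    """Expand a page spec like "1,3-5" into a sorted list of page numbers.

    Hypothetical helper; the syntax is an assumption, not Excalibur's
    documented input format.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start, end = part.split("-", 1)
            low, high = int(start), int(end)
            if low > high:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(low, high + 1))  # ranges are inclusive
        else:
            pages.add(int(part))
    if any(page < 1 for page in pages):
        raise ValueError("page numbers start at 1")
    return sorted(pages)
```

Duplicates are collapsed and the result is sorted, so `"2, 2, 1"` and `"1-2"` describe the same pages.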

Note: You can also download executables for Windows and Linux from the releases page and run them directly!

Why Excalibur?

Extracting tables from PDFs is hard. A simple copy-and-paste from a PDF into an Excel spreadsheet doesn't preserve the table structure. Excalibur makes PDF table extraction easy by automatically detecting tables in PDFs and letting you save them as CSV and Excel files.
Excalibur uses Camelot under the hood, which gives you additional settings to tweak table extraction and get the best results. You can see how it performs better than other open-source tools and libraries in this comparison.
You can save table extraction settings (like table areas) for a PDF once, and apply them on new PDFs to extract tables with similar structures.
You get complete control over your data. All file storage and processing happens on your own local or remote machine.
Excalibur can be configured with MySQL and Celery for parallel and distributed workloads. By default, SQLite and multiprocessing are used for sequential workloads.
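Because Excalibur is built on Camelot, the same extraction can be run from a script. The sketch below assumes the `camelot` package is importable and uses its `read_pdf` function with the `pages`, `flavor`, and `table_areas` options. The wrapper `extract_tables` is a hypothetical name, and it deliberately accepts only the two flavors this README names. It shows how Camelot can be called directly, not how Excalibur calls it internally.

```python
VALID_FLAVORS = ("lattice", "stream")  # the two flavors named in this README


def extract_tables(pdf_path, pages="1", flavor="lattice", table_areas=None):
    """Run Camelot on `pdf_path` and return the tables it finds.

    Hypothetical wrapper for illustration; not part of Excalibur.
    """
    if flavor not in VALID_FLAVORS:
        raise ValueError(f"flavor must be one of {VALID_FLAVORS}, got {flavor!r}")

    # Imported lazily so the argument check above works without Camelot.
    import camelot  # third-party dependency

    kwargs = {"pages": pages, "flavor": flavor}
    if table_areas:
        # Each area is a string "x1,y1,x2,y2" in PDF coordinate space.
        kwargs["table_areas"] = table_areas
    return camelot.read_pdf(pdf_path, **kwargs)
```

With a text-based PDF at hand, `tables = extract_tables("report.pdf", pages="1-2", flavor="stream")` returns a list-like object; `tables[0].df` gives the first table as a pandas DataFrame and `tables.export("report.csv", f="csv")` writes the tables to disk. The file name `report.pdf` is a placeholder.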

Installation

Using pip

After installing ghostscript, which is one of the requirements for Camelot (see its install instructions), you can use pip to install Excalibur:

$ pip install excalibur-py

From the source code

After installing ghostscript, clone the repo using:

$ git clone https://www.github.com/camelot-dev/excalibur

and install Excalibur using pip:

$ cd excalibur
$ pip install .

Documentation

Fantastic documentation is available at http://excalibur-py.readthedocs.io/.

Development

The Contributor's Guide has detailed information about contributing code, documentation, tests and more. We've included some basic information in this README.

Source code

You can check the latest sources with:

$ git clone https://www.github.com/camelot-dev/excalibur

Setting up a development environment

You can install the development dependencies easily, using pip:

$ pip install "excalibur-py[dev]"

Testing (soon)

After installation, you can run tests using:

$ python setup.py test

Versioning

Excalibur uses Semantic Versioning. For the available versions, see the tags on this repository. For the changelog, you can check out HISTORY.md.

License

This project is licensed under the MIT License; see the LICENSE file for details.

Support the development

You can support our work on Excalibur with a one-time or monthly donation on OpenCollective. Organizations that use Excalibur can also sponsor the project for an acknowledgement on our official site and in this README.

Special thanks to all the users and organizations that support Excalibur!

Release history

1.0.1 (this version): Jan 3, 2025
1.0.0: Jan 3, 2025
0.4.4: Dec 25, 2024
0.4.3: Mar 21, 2020
0.4.2: Jan 9, 2019
0.4.1: Dec 27, 2018
0.4.0: Nov 25, 2018
0.3.0: Nov 12, 2018
0.2.1: Nov 6, 2018
0.2.0: Nov 5, 2018
0.1.1: Oct 22, 2018
0.1.0: Oct 21, 2018

Download files

Source Distribution

excalibur_py-1.0.1.tar.gz (1.5 MB)

Uploaded Jan 3, 2025 Source

Built Distribution

excalibur_py-1.0.1-py3-none-any.whl (1.5 MB)

Uploaded Jan 3, 2025 Python 3

File details

Details for the file excalibur_py-1.0.1.tar.gz.

File metadata

Download URL: excalibur_py-1.0.1.tar.gz
Upload date: Jan 3, 2025
Size: 1.5 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.0.1 CPython/3.12.8

File hashes

Hashes for excalibur_py-1.0.1.tar.gz
Algorithm	Hash digest
SHA256	`b0eabf92c5d02625e49599031c447d544978df9bb820a5a3fbb3935f9c5660bd`
MD5	`13c4c7e8925c42a6acf3ce0943077ccc`
BLAKE2b-256	`e36a8062fc85cdf59712e07c36c27a8c1ab3b63cb8796275ffb1508eaf676036`
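
A downloaded file can be checked against the SHA256 digest listed above. The following is a minimal sketch using Python's standard `hashlib` module. The helper names `sha256_of` and `matches_digest` are made up for illustration, and the local file path in the usage note is an assumption about where you saved the download.

```python
import hashlib

# SHA256 digest for excalibur_py-1.0.1.tar.gz, copied from the table above.
EXPECTED_SHA256 = "b0eabf92c5d02625e49599031c447d544978df9bb820a5a3fbb3935f9c5660bd"


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path, expected_hex=EXPECTED_SHA256):
    """Return True if the file's SHA256 digest equals `expected_hex`."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches_digest("excalibur_py-1.0.1.tar.gz")` should return `True` for an intact download in the current directory. Reading in chunks keeps memory use flat regardless of file size.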

Provenance

The following attestation bundles were made for excalibur_py-1.0.1.tar.gz:

Publisher: release.yml on camelot-dev/excalibur

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: excalibur_py-1.0.1.tar.gz
- Subject digest: b0eabf92c5d02625e49599031c447d544978df9bb820a5a3fbb3935f9c5660bd
- Sigstore transparency entry: 159303797
- Sigstore integration time: Jan 3, 2025
Source repository:
- Permalink: camelot-dev/excalibur@6bc1e13c705b7db1ff0031758e684dafcc283953
- Branch / Tag: refs/tags/v1.0.1
- Owner: https://github.com/camelot-dev
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@6bc1e13c705b7db1ff0031758e684dafcc283953
- Trigger Event: push

File details

Details for the file excalibur_py-1.0.1-py3-none-any.whl.

File metadata

Download URL: excalibur_py-1.0.1-py3-none-any.whl
Upload date: Jan 3, 2025
Size: 1.5 MB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.0.1 CPython/3.12.8

File hashes

Hashes for excalibur_py-1.0.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`02358e74c9a87e52ccd4bd5408c87739653e77d767a8348ed8213a29fb4e3ca4`
MD5	`38ca1dc37a27900d9a348899508321ad`
BLAKE2b-256	`ea3113123352f2805bcf478b50013435a30ae678947f736ebc0475e24bafa0e7`

Provenance

The following attestation bundles were made for excalibur_py-1.0.1-py3-none-any.whl:

Publisher: release.yml on camelot-dev/excalibur

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: excalibur_py-1.0.1-py3-none-any.whl
- Subject digest: 02358e74c9a87e52ccd4bd5408c87739653e77d767a8348ed8213a29fb4e3ca4
- Sigstore transparency entry: 159303798
- Sigstore integration time: Jan 3, 2025
Source repository:
- Permalink: camelot-dev/excalibur@6bc1e13c705b7db1ff0031758e684dafcc283953
- Branch / Tag: refs/tags/v1.0.1
- Owner: https://github.com/camelot-dev
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@6bc1e13c705b7db1ff0031758e684dafcc283953
- Trigger Event: push
