pdf2table is a powerful Python tool designed to streamline the extraction of tabular data from PDF documents.

pdf2table

pdf2table is a Python library for extracting tabular data from PDF files and images. It uses an enhanced version of the img2table library's algorithm for table detection and the TATR model from Microsoft's Table Transformer for table structure recognition and content extraction.

Features

  • High-Precision Detection: Compared to Table Transformer's DETR model, the rule-based algorithm is less likely to mistake text blocks for table regions.
  • Preserved Structural Information: Uses a state-of-the-art table structure recognition model (TATR) to preserve the structure of each table.
  • Flexible Input: Supports both PDF files and image formats for table extraction. (More file formats will be supported later.)
  • Easy to Use: Simple API allows for straightforward integration into Python projects.

Installation

Install pdf2table using pip:

pip install pdf2table
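
To confirm that the install succeeded without importing the library itself, a minimal sketch using the standard library's importlib.metadata is shown below. The helper name installed_version is hypothetical and not part of pdf2table.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of `package`, or None if it is missing.

    Hypothetical helper for illustration; not part of pdf2table.
    """
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints the installed version string, or None if pdf2table is not installed.
print(installed_version("pdf2table"))
```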

Usage

Here's a quick example of how to use pdf2table to extract tables from a PDF file:

from pdf2table import Driver

# Initialize the driver
driver = Driver()

# Extract tables from a PDF
# which returns a list of dataframes
tables = driver.extract_tables("sample.pdf")
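
Since extract_tables returns a list of dataframes, a common next step is writing each one to disk. The sketch below assumes the returned objects are pandas DataFrames (or anything with a compatible to_csv method). The helper save_tables and the file naming scheme are hypothetical, not part of pdf2table.

```python
from pathlib import Path


def save_tables(tables, out_dir, stem="table"):
    """Write each table to <out_dir>/<stem>_<n>.csv and return the paths.

    Hypothetical helper; assumes each table exposes a pandas-style
    to_csv(path, index=False) method.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    paths = []
    for index, table in enumerate(tables, start=1):
        path = out_dir / f"{stem}_{index}.csv"
        # index=False omits the DataFrame row index from the CSV file.
        table.to_csv(path, index=False)
        paths.append(path)
    return paths
```

With the driver from the example above, this could be called as save_tables(driver.extract_tables("sample.pdf"), "out").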

The Driver object encapsulates detection and extraction for both PDF and Image objects. If you only need table detection, refer to the following example:

from pdf2table.document import Image, PDF

# Initialize an Image object
img = Image("sample.jpg")

# Extract all tables from the image
# which returns a list of Table objects
img_tables = img.extract_tables()

# Initialize a PDF object
pdf = PDF("sample.pdf")
pdf_tables = pdf.extract_tables()
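
When inputs are a mix of PDFs and images, the choice between PDF and Image can be made from the file extension. The following is a minimal sketch: document_kind, detect_tables, and the IMAGE_SUFFIXES set are hypothetical names, and the list of image extensions is an assumption rather than the library's documented set of supported formats.

```python
from pathlib import Path

# Assumption: the image formats the library accepts are not listed here,
# so this set is illustrative only.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def document_kind(path):
    """Classify a path as "pdf" or "image" from its file extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        return "pdf"
    if suffix in IMAGE_SUFFIXES:
        return "image"
    raise ValueError(f"Unsupported file type: {suffix!r}")


def detect_tables(path):
    """Run table detection with the document class that matches the file."""
    # Imported lazily so document_kind can be used without pdf2table installed.
    from pdf2table.document import Image, PDF

    kind = document_kind(path)
    document = PDF(str(path)) if kind == "pdf" else Image(str(path))
    return document.extract_tables()
```

Raising ValueError for unknown extensions keeps unsupported files from being passed silently to the wrong document class.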

Refer to the tutorial for more details.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgements

Thanks to the creators of the img2table library and Microsoft's Table Transformer model for providing the robust foundations for this tool.
