
pdf2table is a powerful Python tool designed to streamline the extraction of tabular data from PDF documents.


pdf2table


pdf2table is a Python library for extracting tabular data from PDF files and images efficiently and accurately. It uses an enhanced version of the img2table library's algorithm for table detection, and the TATR model from Microsoft's Table Transformer for table structure recognition and content extraction.

Features

  • High Detection Precision: Compared to Table Transformer's DETR model, the rule-based detection algorithm is less likely to misidentify text blocks as table regions.
  • Preserved Structural Information: Uses state-of-the-art models for table structure recognition to preserve the structure of extracted tables.
  • Flexible Input: Supports both PDF files and image formats for table extraction. (More file formats will be supported later.)
  • Easy to Use: Simple API allows for straightforward integration into Python projects.

Installation

Install pdf2table using pip:

pip install pdf2table

Usage

Here's a quick example of how to use pdf2table to extract tables from a PDF file:

from pdf2table import Driver

# Initialize the driver
driver = Driver()

# Extract tables from a PDF
# which returns a list of dataframes
tables = driver.extract_tables("sample.pdf")
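
The list returned by driver.extract_tables can then be written to disk. The sketch below is one possible way to do that, assuming the returned dataframes are pandas DataFrames (or any object with a compatible to_csv method). The save_tables helper, its parameters, and the output file naming are hypothetical and not part of pdf2table.

```python
from pathlib import Path


def save_tables(tables, out_dir, stem="table"):
    """Write each table to <out_dir>/<stem>_<n>.csv and return the paths.

    Hypothetical helper, not part of pdf2table. It only assumes that each
    item has a pandas-style to_csv(path, index=...) method.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    paths = []
    for number, table in enumerate(tables, start=1):
        path = out / f"{stem}_{number}.csv"
        table.to_csv(path, index=False)  # skip the row index column
        paths.append(path)
    return paths


# Usage with the driver from the example above:
# tables = driver.extract_tables("sample.pdf")
# save_tables(tables, "extracted")
```

Numbering the files from 1 keeps the output order aligned with the order in which the tables were returned.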

The Driver object encapsulates detection and extraction for both PDF and Image objects. If detection alone is what you need, refer to the following example:

from pdf2table.document import Image, PDF

# Initialize an Image object
img = Image("sample.jpg")

# Extract all tables from the image
# which returns a list of Table objects
img_tables = img.extract_tables()

# Initialize a PDF object
pdf = PDF("sample.pdf")

# Extract all tables from the PDF
pdf_tables = pdf.extract_tables()
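
Because Image and PDF are separate classes, a caller has to pick the one that matches the input file. The following is a minimal sketch of choosing by file extension. The document_kind and detect_tables helpers are hypothetical, and the set of image suffixes is an assumption, since the formats pdf2table accepts are not listed here. Only PDF(path), Image(path), and extract_tables() come from the example above.

```python
from pathlib import Path

# Assumption: the exact image formats pdf2table accepts are not documented here.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def document_kind(path):
    """Return "pdf" or "image" based on the file extension (hypothetical helper)."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        return "pdf"
    if suffix in IMAGE_SUFFIXES:
        return "image"
    raise ValueError(f"Unsupported file type: {suffix or path!r}")


def detect_tables(path):
    """Build the matching document object and run detection on it."""
    # Imported lazily so document_kind works without pdf2table installed.
    from pdf2table.document import Image, PDF

    document = PDF(path) if document_kind(path) == "pdf" else Image(path)
    return document.extract_tables()


# Usage:
# pdf_tables = detect_tables("sample.pdf")
# img_tables = detect_tables("sample.jpg")
```

Raising ValueError for unknown extensions surfaces a mismatched input early instead of passing it to the wrong class.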

Refer to the tutorial for more details.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgements

Thanks to the creators of the img2table library and Microsoft's Table Transformer model for providing the robust foundations for this tool.
