img2table is a table identification and extraction Python Library for PDF and images, based on OpenCV image processing

These details have not been verified by PyPI

Project links

Homepage

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3 :: Only

Project description

img2table

img2table is a simple, easy-to-use Python library for table identification and extraction. It is based on OpenCV image processing and supports most common image file formats as well as PDF files.

Thanks to its design, it provides a practical and lighter alternative to neural-network-based solutions, especially for usage on CPU.

Installation
Features
Supported file formats
- Images
- PDF
Usage
Examples
Caveats / FYI

Installation

The library can be installed via pip.

# Standard installation, supporting Tesseract
pip install img2table

# For usage with Paddle OCR (Python <= 3.10 only)
pip install img2table[paddle]
# For usage with Paddle OCR - GPU (CUDA 9 / CUDA 10) (Python <= 3.10 only)
pip install img2table[paddle-gpu]

# For usage with Google Vision OCR
pip install img2table[gcp]

# For usage with AWS Textract OCR
pip install img2table[aws]

# For usage with Azure Cognitive Services OCR
pip install img2table[azure]

Features

- Table identification for images and PDF files, including bounding boxes at the table cell level
- Handling of complex table structures such as merged cells
- Handling of implicit rows (see the Implicit rows example)
- Table content extraction through support for OCR services and tools
- Extracted tables are returned as a simple object, including a Pandas DataFrame representation
- Export of extracted tables to an Excel file, preserving their original structure

Supported file formats

Images

Images are loaded using the opencv-python library. Supported formats are listed below.

- Windows bitmaps - *.bmp, *.dib
- JPEG files - *.jpeg, *.jpg, *.jpe
- JPEG 2000 files - *.jp2
- Portable Network Graphics - *.png
- WebP - *.webp
- Portable image format - *.pbm, *.pgm, *.ppm, *.pxm, *.pnm
- PFM files - *.pfm
- Sun rasters - *.sr, *.ras
- TIFF files - *.tiff, *.tif
- OpenEXR Image files - *.exr
- Radiance HDR - *.hdr, *.pic
- Raster and Vector geospatial data supported by GDAL

Source: OpenCV documentation, "Image file reading and writing"

Multi-page images are not supported.
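As a rough pre-check before handing a file to the library, the extensions listed above can be tested with a few lines of standard-library Python. This is only a sketch: the helper name `is_supported_image` is hypothetical and not part of img2table, and GDAL-backed formats are left out because they depend on how OpenCV was built.

```python
from pathlib import Path

# Extensions copied from the list above. GDAL-backed formats are omitted
# because their availability depends on the local OpenCV build.
SUPPORTED_IMAGE_EXTENSIONS = {
    ".bmp", ".dib", ".jpeg", ".jpg", ".jpe", ".jp2", ".png", ".webp",
    ".pbm", ".pgm", ".ppm", ".pxm", ".pnm", ".pfm", ".sr", ".ras",
    ".tiff", ".tif", ".exr", ".hdr", ".pic",
}


def is_supported_image(path):
    """Return True if the file extension appears in the list above.

    Hypothetical helper, not part of img2table. It only looks at the
    extension and does not inspect the file content.
    """
    return Path(path).suffix.lower() in SUPPORTED_IMAGE_EXTENSIONS
```

The check is case-insensitive, so `scan.PNG` passes while `report.pdf` does not. PDF files go through the PDF class instead.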

PDF

Searchable and non-searchable PDF files are supported.

Usage

Documents

Images

Images are instantiated as follows:

from img2table.document import Image

image = Image(src, 
              dpi=200,
              detect_rotation=False)

Parameters

src : str, pathlib.Path, bytes or io.BytesIO, required

Image source

dpi : int, optional, default 200

Estimated image dpi, used to adapt OpenCV algorithm parameters

detect_rotation : bool, optional, default False

Detect and correct skew/rotation of the image

Warning / Disclaimer
The implemented method for handling skewed or rotated images is approximate and might not work on every image. Prefer passing well-oriented images as inputs.
Moreover, when detect_rotation is set to True, image coordinates and bounding boxes returned by other methods might not correspond to the original image.

PDF

PDF files are instantiated as follows:

from img2table.document import PDF

pdf = PDF(src, dpi=200, pages=[0, 2])

Parameters

src : str, pathlib.Path, bytes or io.BytesIO, required

PDF source

dpi : int, optional, default 200

DPI used to convert PDF pages to images

pages : list, optional, default None

List of PDF page indexes to be processed. If None, all pages are processed
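People usually quote page numbers starting at 1, so a small converter can help when building the pages list. The sketch below assumes page indexes are zero-based, which the pages=[0, 2] example suggests but which this page does not state outright. The helper `parse_page_ranges` is hypothetical and not part of img2table.

```python
def parse_page_ranges(spec):
    """Convert a 1-based spec such as "1-3,5" into sorted zero-based indexes.

    Hypothetical helper. It assumes the `pages` parameter expects
    zero-based page indexes.
    """
    indexes = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(p) for p in part.split("-", 1))
        else:
            start = end = int(part)
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        # Shift from 1-based page numbers to zero-based indexes.
        indexes.update(range(start - 1, end))
    return sorted(indexes)
```

Under that assumption, `parse_page_ranges("1,3")` yields the same list as the `pages=[0, 2]` example above.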

OCR

img2table provides an interface to several OCR services and tools in order to parse table content.
Where possible (i.e. for searchable PDFs), text is extracted directly from the PDF file and the OCR service or tool is not called.

Tesseract

from img2table.ocr import TesseractOCR

ocr = TesseractOCR(n_threads=1, lang="eng", tessdata_dir="...")

Parameters

n_threads : int, optional, default 1

Number of concurrent threads used to call Tesseract

lang : str, optional, default "eng"

Lang parameter used in Tesseract for text extraction

tessdata_dir : str, optional, default None

Directory containing Tesseract traineddata files. If None, the TESSDATA_PREFIX env variable is used.

Tesseract-OCR must be installed before it can be used. Check the Tesseract documentation for installation instructions.
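Since Tesseract is an external program, it can be worth checking that it is visible before creating a TesseractOCR instance. The following is a minimal sketch using only the standard library. It assumes the executable is named `tesseract` and is on the PATH, and the helper `tesseract_status` is hypothetical rather than part of img2table.

```python
import os
import shutil


def tesseract_status(tessdata_dir=None):
    """Report whether a tesseract executable and a tessdata location are visible.

    Hypothetical helper. It assumes the executable is called "tesseract"
    and is discoverable through the PATH environment variable.
    """
    executable = shutil.which("tesseract")
    # Same precedence as documented above: an explicit directory first,
    # then the TESSDATA_PREFIX environment variable.
    data_dir = tessdata_dir or os.environ.get("TESSDATA_PREFIX")
    return {"executable": executable, "tessdata_dir": data_dir}
```

A missing `executable` entry means Tesseract still needs to be installed. A missing `tessdata_dir` is not necessarily an error, since Tesseract may fall back to its default data location.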

PaddleOCR

Available for Python versions <= 3.10

PaddleOCR is an open-source OCR tool based on deep learning models.
On first use, the relevant language models are downloaded.

from img2table.ocr import PaddleOCR

ocr = PaddleOCR(lang="en")

Parameters

lang : str, optional, default "en"

Lang parameter used in Paddle for text extraction. Check the PaddleOCR documentation for available languages

Released in version 0.0.13

Google Vision

Authentication to GCP can be done by setting the standard GOOGLE_APPLICATION_CREDENTIALS environment variable.
If this variable is missing, an API key should be provided via the api_key parameter.

from img2table.ocr import VisionOCR

ocr = VisionOCR(api_key="api_key", timeout=15)

Parameters

api_key : str, optional, default None

Google Vision API key

timeout : int, optional, default 15

API requests timeout, in seconds
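The authentication rule described above (environment variable first, API key otherwise) can be made explicit when building the constructor arguments. This is a sketch only: `vision_ocr_kwargs` is a hypothetical helper, and its `environ` argument exists purely to make the logic easy to test.

```python
import os


def vision_ocr_kwargs(api_key=None, timeout=15, environ=None):
    """Build keyword arguments for VisionOCR following the rule above.

    Hypothetical helper, not part of img2table.
    """
    environ = os.environ if environ is None else environ
    kwargs = {"timeout": timeout}
    if environ.get("GOOGLE_APPLICATION_CREDENTIALS"):
        # Standard GCP credentials are configured, so no API key is needed.
        return kwargs
    if not api_key:
        raise ValueError(
            "Set GOOGLE_APPLICATION_CREDENTIALS or pass an api_key"
        )
    kwargs["api_key"] = api_key
    return kwargs
```

The result can then be unpacked into the constructor, for example `VisionOCR(**vision_ocr_kwargs(api_key="api_key"))`.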

AWS Textract

When using AWS Textract, only the DetectDocumentText API is called.

Authentication to AWS can be done by passing credentials to the TextractOCR class.
If credentials are not provided, authentication is done using environment variables or configuration files. Check boto3 documentation for more details.

from img2table.ocr import TextractOCR

ocr = TextractOCR(aws_access_key_id="***",
                  aws_secret_access_key="***",
                  aws_session_token="***",
                  region="eu-west-1")

Parameters

aws_access_key_id : str, optional, default None

AWS access key id

aws_secret_access_key : str, optional, default None

AWS secret access key

aws_session_token : str, optional, default None

AWS temporary session token

region : str, optional, default None

AWS server region

Azure Cognitive Services

from img2table.ocr import AzureOCR

ocr = AzureOCR(endpoint="abc.azure.com",
               subscription_key="***")

Parameters

endpoint : str, optional, default None

Azure Cognitive Services endpoint. If None, inferred from the COMPUTER_VISION_ENDPOINT environment variable.

subscription_key : str, optional, default None

Azure Cognitive Services subscription key. If None, inferred from the COMPUTER_VISION_SUBSCRIPTION_KEY environment variable.

Table extraction

Multiple tables can be extracted at once from a PDF page or an image using the extract_tables method of a document.

from img2table.ocr import TesseractOCR
from img2table.document import Image

# Instantiation of OCR
ocr = TesseractOCR(n_threads=1, lang="eng")

# Instantiation of document, either an image or a PDF
doc = Image(src, dpi=200)

# Table extraction
extracted_tables = doc.extract_tables(ocr=ocr,
                                      implicit_rows=True,
                                      borderless_tables=False,
                                      min_confidence=50)

Parameters

ocr : OCRInstance, optional, default None

OCR instance used to parse document text. If None, cell content will not be extracted

implicit_rows : bool, optional, default True

Boolean indicating if implicit rows should be identified - check related example

borderless_tables : bool, optional, default False

Boolean indicating if borderless tables are extracted. This requires an OCR instance to be provided to the method. Feature in alpha version

min_confidence : int, optional, default 50

Minimum confidence level from OCR in order to process text, from 0 (worst) to 99 (best)

Borderless table extraction released in version 0.0.14

NB: the implemented method for extraction of borderless tables heavily relies on OCR quality. In order to achieve decent results, it is recommended to use PaddleOCR or one of the supported commercial solutions.
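To illustrate what a confidence threshold such as min_confidence does, here is a minimal sketch in plain Python. It is not the library's implementation: the (text, confidence) pairs, the per-word filtering and the inclusive comparison are all assumptions made for the example.

```python
def filter_by_confidence(words, min_confidence=50):
    """Keep (text, confidence) pairs whose confidence meets the threshold.

    Illustration only. The inclusive comparison is an assumption, not
    documented img2table behaviour.
    """
    return [(text, conf) for text, conf in words if conf >= min_confidence]


# Made-up OCR output: one word falls below the threshold and is dropped.
words = [("Total", 96), ("l0ss", 31), ("2023", 50)]
kept = filter_by_confidence(words, min_confidence=50)
```

Raising the threshold trades recall for precision: fewer misread words reach the table, but faint or low-resolution text may be dropped entirely.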

Method return

The ExtractedTable class is used to model extracted tables from documents.

Attributes

bbox : BBox

Table bounding box

title : str

Extracted title of the table

content : OrderedDict

Dict with row indexes as keys and lists of TableCell objects as values

df : pd.DataFrame

Pandas DataFrame representation of the table
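When the DataFrame is not needed, the content attribute can be walked directly. The sketch below uses a stand-in class instead of the real TableCell, and it assumes each cell exposes its text through a `value` attribute, which this page does not document. `StandInCell` and `rows_as_lists` are hypothetical names.

```python
from collections import OrderedDict
from dataclasses import dataclass


@dataclass
class StandInCell:
    """Hypothetical stand-in for TableCell; only `value` is assumed."""
    value: str


def rows_as_lists(content):
    """Turn {row_index: [cell, ...]} into a list of rows of cell values."""
    return [
        [cell.value for cell in cells]
        for _, cells in sorted(content.items())
    ]


# Same shape as the `content` attribute described above.
content = OrderedDict({
    0: [StandInCell("Name"), StandInCell("Qty")],
    1: [StandInCell("Bolt"), StandInCell("12")],
})
rows = rows_as_lists(content)
```

Sorting by row index keeps the rows in table order even if the mapping was built out of order.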

Images

The extract_tables method of the Image class returns a list of ExtractedTable objects.

output = [ExtractedTable(...), ExtractedTable(...), ...]

PDF

The extract_tables method of the PDF class returns an OrderedDict object with page indexes as keys and lists of ExtractedTable objects as values.

output = {
    0: [ExtractedTable(...), ...],
    1: [],
    ...
    last_page: [ExtractedTable(...), ...]
}
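A mapping of this shape is easy to summarise or flatten with plain Python. The helpers below are hypothetical and work on any dict of page index to list. The sample data uses strings in place of ExtractedTable objects.

```python
def tables_per_page(extracted):
    """Map each page index to the number of tables found on that page."""
    return {page: len(tables) for page, tables in extracted.items()}


def iter_tables(extracted):
    """Yield (page_index, table) pairs in page order."""
    for page in sorted(extracted):
        for table in extracted[page]:
            yield page, table


# Strings stand in for ExtractedTable objects in this sketch.
sample = {0: ["t0a", "t0b"], 1: [], 2: ["t2"]}
counts = tables_per_page(sample)
pairs = list(iter_tables(sample))
```

Pages without tables keep an empty list in the mapping, so they show up with a count of 0 rather than being missing.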

Excel export

Tables extracted from a document can be exported to an xlsx file. The resulting file contains one worksheet per extracted table.
Most method arguments are shared with the extract_tables method.

from img2table.ocr import TesseractOCR
from img2table.document import Image

# Instantiation of OCR
ocr = TesseractOCR(n_threads=1, lang="eng")

# Instantiation of document, either an image or a PDF
doc = Image(src, dpi=200)

# Extraction of tables and creation of an xlsx file containing tables
doc.to_xlsx(dest=dest,
            ocr=ocr,
            implicit_rows=True,
            min_confidence=50)

Parameters

dest : str, pathlib.Path or io.BytesIO, required

Destination for xlsx file

ocr : OCRInstance, optional, default None

OCR instance used to parse document text. If None, cell content will not be extracted

implicit_rows : bool, optional, default True

Boolean indicating if implicit rows should be identified - check related example

min_confidence : int, optional, default 50

Minimum confidence level from OCR in order to process text, from 0 (worst) to 99 (best)

Returns
If an io.BytesIO buffer is passed as the dest argument, it is returned containing the xlsx data
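When an in-memory buffer is used as the destination, it may still need to be written to disk or sent elsewhere later. The following sketch shows one way to persist such a buffer with the standard library. `save_xlsx_buffer` is a hypothetical helper, not part of img2table.

```python
import io
from pathlib import Path


def save_xlsx_buffer(buffer, path):
    """Write the bytes held in an io.BytesIO buffer to `path`.

    Returns the number of bytes written. Hypothetical helper.
    """
    # getvalue() returns the whole content regardless of the current
    # read/write position of the buffer.
    data = buffer.getvalue()
    Path(path).write_bytes(data)
    return len(data)
```

With a buffer returned by to_xlsx, a call such as `save_xlsx_buffer(buffer, "tables.xlsx")` would produce a regular Excel file.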

Examples

Several Jupyter notebooks with examples are available:

Basic usage: generic library usage, including examples with images, PDF and OCRs
Borderless tables: specific examples dedicated to the extraction of borderless tables
Implicit rows: illustrated effect of the parameter implicit_rows of the extract_tables method

Caveats / FYI

For table extraction, results are highly dependent on OCR quality. By design, tables in which no OCR data can be found are not returned.
The library is tailored for documents with a white or light background. Effectiveness cannot be guaranteed on other types of documents.
Borderless table extraction is still in alpha stage and might be inconsistent on complex cases. For example, tables with multi-line cells can cause problems.
Improvements to the algorithm will be released in future versions.


Release history

1.4.2 - Aug 10, 2025
1.4.1 - Feb 9, 2025
1.4.0 - Nov 11, 2024
1.3.1 - Oct 27, 2024
1.3.0 - Sep 1, 2024
1.2.11 - Feb 26, 2024
1.2.10 - Feb 11, 2024
1.2.9 - Feb 11, 2024
1.2.8 - Jan 2, 2024
1.2.7 - Dec 31, 2023
1.2.6 - Dec 16, 2023
1.2.5 - Dec 3, 2023
1.2.4 - Nov 22, 2023
1.2.3 - Oct 18, 2023
1.2.2 - Oct 16, 2023
1.2.1 - Sep 27, 2023
1.2.0 - Sep 19, 2023
1.0.11 - Aug 17, 2023
1.0.10 - Aug 10, 2023
1.0.9 - Aug 3, 2023
1.0.8 - Jul 31, 2023
1.0.7 - Jul 14, 2023
1.0.6 - Jul 10, 2023
1.0.5 - Jun 21, 2023
1.0.4 - Jun 12, 2023
1.0.3 - Jun 12, 2023
1.0.2 - Jun 5, 2023
1.0.1 - Jun 1, 2023
1.0.0 - May 28, 2023
0.1.4 - May 22, 2023
0.1.3 - May 9, 2023
0.1.2 - May 5, 2023
0.1.1 - Apr 25, 2023
0.1.0 - Apr 20, 2023
0.0.24 - Apr 11, 2023
0.0.23 - Apr 8, 2023
0.0.22 - Apr 7, 2023
0.0.21 - Apr 5, 2023
0.0.20 - Mar 28, 2023
0.0.19 - Mar 12, 2023
0.0.18 - Mar 6, 2023 (this version)
0.0.17 - Feb 27, 2023
0.0.16 - Feb 26, 2023
0.0.15 - Feb 22, 2023
0.0.14 - Feb 15, 2023
0.0.13 - Feb 13, 2023
0.0.12 - Feb 10, 2023
0.0.11 - Feb 10, 2023

Download files

Source Distribution

img2table-0.0.18.tar.gz (1.7 MB view details)

Uploaded Mar 6, 2023 Source

Built Distribution

img2table-0.0.18-py3-none-any.whl (62.1 kB view details)

Uploaded Mar 6, 2023 Python 3

File details

Details for the file img2table-0.0.18.tar.gz.

File metadata

Download URL: img2table-0.0.18.tar.gz
Upload date: Mar 6, 2023
Size: 1.7 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.1 CPython/3.11.2

File hashes

Hashes for img2table-0.0.18.tar.gz
Algorithm	Hash digest
SHA256	`6d8c0a3872f3e471fda6b3fe5ce8ad5d5454b5f5365facf3a7dd07a5916faae3`
MD5	`6f9463885f418d71130cd64fda3a615b`
BLAKE2b-256	`52cb39ad64b68f245b50bc2be0995e0605c70d7a4cf6d922317a9ba9b6e8f689`

File details

Details for the file img2table-0.0.18-py3-none-any.whl.

File metadata

Download URL: img2table-0.0.18-py3-none-any.whl
Upload date: Mar 6, 2023
Size: 62.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.1 CPython/3.11.2

File hashes

Hashes for img2table-0.0.18-py3-none-any.whl
Algorithm	Hash digest
SHA256	`dfd3c331e3cdc5adbbbb202fc32ab4811c4241974e915861e87297029aa6774e`
MD5	`f69dd5602fe4c78c222bb5d385bb91b2`
BLAKE2b-256	`b7af715ea10a4423544894a63d803ef94e5e794c4cceddd3ec6fc098db223ccb`

img2table 0.0.18
