img2table

img2table is a simple, easy-to-use Python library for table identification and extraction. It is based on OpenCV image processing and supports most common image file formats as well as PDF files.

It also provides implementations for several OCR services and tools in order to parse table contents.

Installation

The library can be installed via pip.

# Standard installation, supporting Tesseract
pip install img2table

# For usage with Paddle OCR
pip install img2table[paddle]
# For usage with Paddle OCR - GPU (CUDA 9 / CUDA 10)
pip install img2table[paddle-gpu]

# For usage with Google Vision OCR
pip install img2table[gcp]

# For usage with AWS Textract OCR
pip install img2table[aws]

# For usage with Azure Cognitive Services OCR
pip install img2table[azure]

Features

  • Table identification for image and PDF files, including bounding boxes at the table cell level
  • Table content extraction by providing support for OCR services / tools
  • Extraction of table titles
  • Handling of merged cells in tables
  • Handling of implicit rows - see example

Supported file formats

Images

Images are loaded using the opencv-python library, supported formats are listed below.

  • Windows bitmaps - *.bmp, *.dib
  • JPEG files - *.jpeg, *.jpg, *.jpe
  • JPEG 2000 files - *.jp2
  • Portable Network Graphics - *.png
  • WebP - *.webp
  • Portable image format - *.pbm, *.pgm, *.ppm, *.pxm, *.pnm
  • PFM files - *.pfm
  • Sun rasters - *.sr, *.ras
  • TIFF files - *.tiff, *.tif
  • OpenEXR Image files - *.exr
  • Radiance HDR - *.hdr, *.pic
  • Raster and Vector geospatial data supported by GDAL

Source: OpenCV documentation, "Image file reading and writing"

Multi-page images are not supported.
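
A quick extension check before instantiating a document can catch obviously unsupported files early. The sketch below is not part of img2table: `SUPPORTED_EXTENSIONS` and `is_supported_image` are hypothetical names, and the extension set is simply copied from the list above. GDAL-backed formats are left out because their availability depends on how OpenCV was built.

```python
from pathlib import Path

# Extensions copied from the list above; GDAL-backed formats are omitted.
SUPPORTED_EXTENSIONS = {
    ".bmp", ".dib", ".jpeg", ".jpg", ".jpe", ".jp2", ".png", ".webp",
    ".pbm", ".pgm", ".ppm", ".pxm", ".pnm", ".pfm", ".sr", ".ras",
    ".tiff", ".tif", ".exr", ".hdr", ".pic",
}


def is_supported_image(path):
    """Return True if the file extension is in the list above (case-insensitive)."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS
```

The check only looks at the file name, so it cannot detect multi-page images or corrupt files.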


PDF

Searchable and non-searchable PDF files are supported.

Usage

Documents

Images

Images are instantiated as follows:

from img2table.document import Image

image = Image(src, 
              dpi=200,
              detect_rotation=False)

Parameters

src : str, pathlib.Path, bytes or io.BytesIO, required
Image source
dpi : int, optional, default 200
Estimated image dpi, used to adapt OpenCV algorithm parameters
detect_rotation : bool, optional, default False
Detect and correct skew/rotation of the image

Disclaimer
The implemented method to handle skewed/rotated images is approximate and might not work on every image. It is preferable to pass well-oriented images as inputs.
Moreover, when setting the detect_rotation parameter to True, image coordinates and bounding boxes returned by other methods might not correspond to the original image.

PDF

PDF files are instantiated as follows:

from img2table.document import PDF

pdf = PDF(src, dpi=200, pages=[0, 2])

Parameters

src : str, pathlib.Path, bytes or io.BytesIO, required
PDF source
dpi : int, optional, default 200
Dpi used for conversion of PDF pages to images
pages : list, optional, default None
List of PDF page indexes to be processed. If None, all pages are processed
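
The `pages=[0, 2]` example, together with the page-indexed output of extract_tables, suggests that page indexes start at 0. Under that assumption, a small helper can translate the 1-based page numbers shown by most PDF viewers. `to_page_indexes` is a hypothetical name, not a library function.

```python
def to_page_indexes(page_numbers):
    """Convert 1-based page numbers into sorted, de-duplicated 0-based indexes.

    Assumes the `pages` parameter is 0-based, as the example above suggests.
    """
    if any(number < 1 for number in page_numbers):
        raise ValueError("page numbers start at 1")
    return sorted({number - 1 for number in page_numbers})
```

For instance, `PDF(src, pages=to_page_indexes([1, 3]))` would then request the first and third pages.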

OCR

img2table provides an interface for several OCR services and tools in order to parse table content.
If possible (i.e. for searchable PDFs), PDF text will be extracted directly from the file and the OCR service/tool will not be called.

Tesseract

from img2table.ocr import TesseractOCR

ocr = TesseractOCR(n_threads=1, lang="eng")

Parameters

n_threads : int, optional, default 1
Number of concurrent threads used to call Tesseract
lang : str, optional, default "eng"
Lang parameter used in Tesseract for text extraction

Tesseract-OCR must be installed before it can be used. Check its documentation for installation instructions.
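
Because Tesseract is an external program, it can be worth checking that it is reachable before creating a TesseractOCR instance. The sketch below assumes the executable is named `tesseract` and is found through the PATH; `tesseract_available` is a hypothetical helper, and img2table may locate Tesseract differently.

```python
import shutil


def tesseract_available(executable="tesseract"):
    """Return True if the given executable can be found on the PATH."""
    return shutil.which(executable) is not None
```

A script could call this at start-up and print installation guidance when it returns False.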

PaddleOCR

PaddleOCR is an open-source OCR tool based on deep learning models.
On first use, the relevant language models are downloaded.

from img2table.ocr import PaddleOCR

ocr = PaddleOCR(lang="en")

Parameters

lang : str, optional, default "en"
Lang parameter used in Paddle for text extraction, check documentation for available languages

Google Vision

Authentication to GCP can be done by setting the standard GOOGLE_APPLICATION_CREDENTIALS environment variable.
If this variable is missing, an API key should be provided via the api_key parameter.

from img2table.ocr import VisionOCR

ocr = VisionOCR(api_key="api_key", timeout=15)

Parameters

api_key : str, optional, default None
Google Vision API key
timeout : int, optional, default 15
API requests timeout, in seconds
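
The authentication rule above can be sketched as a helper that decides which keyword arguments to pass to VisionOCR. This is an illustration only: `vision_ocr_kwargs` is a hypothetical name, and `GOOGLE_VISION_API_KEY` is an assumed variable chosen for this example, not one that img2table reads.

```python
import os


def vision_ocr_kwargs(env=None, api_key_var="GOOGLE_VISION_API_KEY"):
    """Build keyword arguments for VisionOCR from environment variables.

    `api_key_var` is a hypothetical variable name used only in this sketch.
    """
    env = os.environ if env is None else env
    # Standard credentials are set: no API key needs to be passed.
    if env.get("GOOGLE_APPLICATION_CREDENTIALS"):
        return {}
    api_key = env.get(api_key_var)
    if not api_key:
        raise RuntimeError(
            "Set GOOGLE_APPLICATION_CREDENTIALS or provide an API key"
        )
    return {"api_key": api_key}
```

With img2table installed, the result could be used as `VisionOCR(timeout=15, **vision_ocr_kwargs())`.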

AWS Textract

When using AWS Textract, the DetectDocumentText API is exclusively called.

Authentication to AWS can be done by passing credentials to the TextractOCR class.
If credentials are not provided, authentication is done using environment variables or configuration files. Check boto3 documentation for more details.

from img2table.ocr import TextractOCR

ocr = TextractOCR(aws_access_key_id="***",
                  aws_secret_access_key="***",
                  aws_session_token="***",
                  region="eu-west-1")

Parameters

aws_access_key_id : str, optional, default None
AWS access key id
aws_secret_access_key : str, optional, default None
AWS secret access key
aws_session_token : str, optional, default None
AWS temporary session token
region : str, optional, default None
AWS server region

Azure Cognitive Services

from img2table.ocr import AzureOCR

ocr = AzureOCR(endpoint="abc.azure.com",
               subscription_key="***")

Parameters

endpoint : str, optional, default None
Azure Cognitive Services endpoint. If None, inferred from the COMPUTER_VISION_ENDPOINT environment variable.
subscription_key : str, optional, default None
Azure Cognitive Services subscription key. If None, inferred from the COMPUTER_VISION_SUBSCRIPTION_KEY environment variable.
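
The fallback described for both parameters can be pictured as follows. This is a sketch of the documented behaviour, not the library's own code, and `resolve_azure_settings` is a hypothetical helper. Resolving the values up front makes it possible to fail with a clear message before any request is sent.

```python
import os


def resolve_azure_settings(endpoint=None, subscription_key=None, env=None):
    """Return (endpoint, subscription_key), falling back to environment variables."""
    env = os.environ if env is None else env
    endpoint = endpoint or env.get("COMPUTER_VISION_ENDPOINT")
    subscription_key = subscription_key or env.get("COMPUTER_VISION_SUBSCRIPTION_KEY")
    # Collect the names of any settings that could not be resolved.
    missing = [
        name
        for name, value in (("endpoint", endpoint), ("subscription_key", subscription_key))
        if not value
    ]
    if missing:
        raise ValueError("Missing Azure settings: " + ", ".join(missing))
    return endpoint, subscription_key
```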

Table extraction

Multiple tables can be extracted at once from a PDF page or an image using the extract_tables method of a document.

from img2table.ocr import TesseractOCR
from img2table.document import Image

# Instantiation of OCR
ocr = TesseractOCR(n_threads=1, lang="eng")

# Instantiation of document, either an image or a PDF
doc = Image(src, dpi=200)

# Table extraction
extracted_tables = doc.extract_tables(ocr=ocr,
                                      implicit_rows=True,
                                      min_confidence=50)

Parameters

ocr : OCRInstance, optional, default None
OCR instance used to parse document text. If None, cell content will not be extracted
implicit_rows : bool, optional, default True
Boolean indicating if implicit rows should be identified - check related example
min_confidence : int, optional, default 50
Minimum confidence level from OCR in order to process text, from 0 (worst) to 99 (best)
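
One way to picture min_confidence is as a threshold applied to each recognised word. The sketch below is an illustration only: the (text, confidence) pair layout and the name `filter_words` are hypothetical, and whether img2table uses an inclusive or exclusive comparison is an assumption.

```python
def filter_words(words, min_confidence=50):
    """Keep the text of (text, confidence) pairs at or above the threshold.

    The inclusive comparison is an assumption made for this sketch.
    """
    return [text for text, confidence in words if confidence >= min_confidence]
```

Raising the threshold drops more uncertain text, which leaves more cells empty but reduces misread values.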

Method return

The ExtractedTable class is used to model extracted tables from documents.

Attributes

bbox : BBox
Table bounding box
title : str
Extracted title of the table
content : OrderedDict
Dict with row indexes as keys and lists of TableCell objects as values
df : pd.DataFrame
Pandas DataFrame representation of the table

Images

The extract_tables method of the Image class returns a list of ExtractedTable objects.

output = [ExtractedTable(...), ExtractedTable(...), ...]

PDF

The extract_tables method of the PDF class returns an OrderedDict object with page indexes as keys and lists of ExtractedTable objects as values.

output = {
    0: [ExtractedTable(...), ...],
    1: [],
    ...
    last_page: [ExtractedTable(...), ...]
}
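
Since images return a list and PDFs return a dict keyed by page index, downstream code may want to treat both shapes uniformly. The helper below is a minimal sketch with a hypothetical name, `iter_tables`. It relies only on the two return shapes described above and works with any objects stored in them.

```python
def iter_tables(output):
    """Yield (page_index, table) pairs from either return shape.

    Tables from an image (a plain list) are reported with page_index None.
    """
    if isinstance(output, dict):
        # PDF case: OrderedDict mapping page index -> list of tables.
        for page_index, tables in output.items():
            for table in tables:
                yield page_index, table
    else:
        # Image case: a flat list of tables.
        for table in output:
            yield None, table
```

Each yielded table exposes the attributes listed above, so a loop over `iter_tables(doc.extract_tables(ocr=ocr))` can read `table.title` or `table.df` without caring which document type produced it.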

Excel export

Tables extracted from a document can be exported to an xlsx file. The resulting file contains one worksheet per extracted table.
Most method arguments are shared with the extract_tables method.

from img2table.ocr import TesseractOCR
from img2table.document import Image

# Instantiation of OCR
ocr = TesseractOCR(n_threads=1, lang="eng")

# Instantiation of document, either an image or a PDF
doc = Image(src, dpi=200)

# Extraction of tables and creation of an xlsx file containing tables
doc.to_xlsx(dest=dest,
            ocr=ocr,
            implicit_rows=True,
            min_confidence=50)

Parameters

dest : str, pathlib.Path or io.BytesIO, required
Destination for xlsx file
ocr : OCRInstance, optional, default None
OCR instance used to parse document text. If None, cell content will not be extracted
implicit_rows : bool, optional, default True
Boolean indicating if implicit rows should be identified - check related example
min_confidence : int, optional, default 50
Minimum confidence level from OCR in order to process text, from 0 (worst) to 99 (best)

Returns

If an io.BytesIO buffer is passed as the dest argument, it is returned containing the xlsx data
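
When working with an in-memory buffer, a cheap sanity check is possible because xlsx files are ZIP archives. The sketch below uses only the standard library; `looks_like_xlsx` is a hypothetical helper, and it confirms the container format only, not that the workbook content is valid.

```python
import io
import zipfile


def looks_like_xlsx(buffer):
    """Return True if the buffer holds a ZIP archive, the container used by xlsx."""
    position = buffer.tell()
    try:
        return zipfile.is_zipfile(buffer)
    finally:
        # Restore the read position so the caller can keep using the buffer.
        buffer.seek(position)
```

A buffer that passes the check can then be written to disk with `pathlib.Path(...).write_bytes(buffer.getvalue())` or sent over the network.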

Examples

Several Jupyter notebooks with examples are available:

  • Basic usage: generic library usage, including examples with images, PDF and OCRs
  • Implicit rows: illustrated effect of the parameter implicit_rows of the extract_tables method

Caveats / FYI

  • Table identification only works on tables with borders. Borderless tables are not supported, as they would most likely require neural-network-based methods.
