img2table is a table identification and extraction Python Library for PDF and images, based on OpenCV image processing
img2table
img2table is a simple, easy-to-use Python library for table identification and extraction. It is based on OpenCV image processing and supports most common image file formats as well as PDF files.
Thanks to its design, it provides a practical and lighter alternative to neural-network-based solutions, especially for usage on CPU.
Installation
The library can be installed via pip.
# Standard installation, supporting Tesseract
pip install img2table
# For usage with Paddle OCR (Python <= 3.10 only)
pip install img2table[paddle]
# For usage with Paddle OCR - GPU (CUDA 9 / CUDA 10) (Python <= 3.10 only)
pip install img2table[paddle-gpu]
# For usage with Google Vision OCR
pip install img2table[gcp]
# For usage with AWS Textract OCR
pip install img2table[aws]
# For usage with Azure Cognitive Services OCR
pip install img2table[azure]
Features
- Table identification for images and PDF files, including bounding boxes at the table cell level
- Handling of complex table structures such as merged cells
- Handling of implicit rows - see example
- Table content extraction by providing support for OCR services / tools
- Extracted tables are returned as a simple object, including a Pandas DataFrame representation
- Export extracted tables to an Excel file, preserving their original structure
Supported file formats
Images
Images are loaded using the opencv-python library. Supported formats are listed below.
- Windows bitmaps - *.bmp, *.dib
- JPEG files - *.jpeg, *.jpg, *.jpe
- JPEG 2000 files - *.jp2
- Portable Network Graphics - *.png
- WebP - *.webp
- Portable image format - *.pbm, *.pgm, *.ppm, *.pxm, *.pnm
- PFM files - *.pfm
- Sun rasters - *.sr, *.ras
- TIFF files - *.tiff, *.tif
- OpenEXR Image files - *.exr
- Radiance HDR - *.hdr, *.pic
- Raster and Vector geospatial data supported by GDAL
OpenCV: Image file reading and writing
Multi-page images are not supported.
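Rejecting unsupported inputs before they reach OpenCV can save a confusing error later. The sketch below is only an extension-based heuristic built from the list above. The helper name is_supported_image is hypothetical and not part of img2table, and actual support still depends on the local OpenCV build.

```python
from pathlib import Path

# Extensions copied from the list above. GDAL-backed formats are left out
# because they depend on how OpenCV was built.
SUPPORTED_IMAGE_EXTENSIONS = {
    ".bmp", ".dib", ".jpeg", ".jpg", ".jpe", ".jp2", ".png", ".webp",
    ".pbm", ".pgm", ".ppm", ".pxm", ".pnm", ".pfm", ".sr", ".ras",
    ".tiff", ".tif", ".exr", ".hdr", ".pic",
}


def is_supported_image(path) -> bool:
    """Hypothetical helper: True if the file extension is in the list above."""
    return Path(path).suffix.lower() in SUPPORTED_IMAGE_EXTENSIONS
```

The check is case-insensitive, so scan.PNG and scan.png are treated alike. It does not detect multi-page images, which would still need to be split beforehand.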
PDF
Searchable and non-searchable PDF files are supported.
Usage
Documents
Images
Images are instantiated as follows:
from img2table.document import Image
image = Image(src,
              dpi=200,
              detect_rotation=False)
Parameters
- src : str, pathlib.Path, bytes or io.BytesIO, required
  - Image source
- dpi : int, optional, default 200
  - Estimated image dpi, used to adapt OpenCV algorithm parameters
- detect_rotation : bool, optional, default False
  - Detect and correct skew/rotation of the image
Disclaimer
The implemented method to handle skewed/rotated images is approximate and might not work on every image.
It is preferable to pass well-oriented images as inputs.
Moreover, when setting the detect_rotation parameter to True, image coordinates and bounding boxes returned by other
methods might not correspond to the original image.
PDF
PDF files are instantiated as follows:
from img2table.document import PDF
pdf = PDF(src, dpi=200, pages=[0, 2])
Parameters
- src : str, pathlib.Path, bytes or io.BytesIO, required
  - PDF source
- dpi : int, optional, default 200
  - Dpi used for conversion of PDF pages to images
- pages : list, optional, default None
  - List of PDF page indexes to be processed. If None, all pages are processed
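Page numbers shown in PDF viewers usually start at 1, while the example above (pages=[0, 2]) suggests that the pages parameter expects 0-based indexes. A minimal sketch of a conversion helper follows, under that assumption. The name page_indexes and the "1-3,5" spec syntax are hypothetical and not part of img2table.

```python
def page_indexes(spec: str) -> list[int]:
    """Convert a 1-based page spec such as "1-3,5" into sorted 0-based indexes."""
    indexes = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        # "3" is a single page, "1-3" is an inclusive range.
        first, _, last = part.partition("-")
        start = int(first)
        end = int(last) if last else start
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        indexes.update(range(start - 1, end))
    return sorted(indexes)


# Assumed usage with the constructor shown above:
# pdf = PDF(src, dpi=200, pages=page_indexes("1,3"))  # same as pages=[0, 2]
```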
OCR
img2table provides an interface for several OCR services and tools in order to parse table content.
If possible (i.e. for searchable PDFs), text will be extracted directly from the PDF file and the OCR service/tool will not be called.
Tesseract
from img2table.ocr import TesseractOCR
ocr = TesseractOCR(n_threads=1, lang="eng", tessdata_dir="...")
Parameters
- n_threads : int, optional, default 1
  - Number of concurrent threads used to call Tesseract
- lang : str, optional, default "eng"
  - Lang parameter used in Tesseract for text extraction
- tessdata_dir : str, optional, default None
  - Directory containing Tesseract traineddata files. If None, the TESSDATA_PREFIX env variable is used.
Usage of Tesseract-OCR requires prior installation. Check documentation for instructions.
PaddleOCR
Available for Python versions <= 3.10
PaddleOCR is an open-source OCR based on Deep Learning models.
At first use, the relevant language models will be downloaded.
from img2table.ocr import PaddleOCR
ocr = PaddleOCR(lang="en")
Parameters
- lang : str, optional, default "en"
  - Lang parameter used in Paddle for text extraction, check documentation for available languages
Released in version 0.0.13
Google Vision
Authentication to GCP can be done by setting the standard GOOGLE_APPLICATION_CREDENTIALS environment variable.
If this variable is missing, an API key should be provided via the api_key parameter.
from img2table.ocr import VisionOCR
ocr = VisionOCR(api_key="api_key", timeout=15)
Parameters
- api_key : str, optional, default None
  - Google Vision API key
- timeout : int, optional, default 15
  - API requests timeout, in seconds
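The fallback described above (environment credentials first, API key otherwise) can be made explicit before the OCR object is created. This is a minimal sketch. The helper vision_ocr_kwargs is hypothetical and not part of img2table, and it only mirrors the rule stated in this section.

```python
import os


def vision_ocr_kwargs(api_key=None, timeout=15, env=None):
    """Hypothetical helper: build keyword arguments for VisionOCR."""
    env = os.environ if env is None else env
    if env.get("GOOGLE_APPLICATION_CREDENTIALS"):
        # Standard credentials are configured, so no API key is needed.
        return {"timeout": timeout}
    if api_key is None:
        raise ValueError(
            "Set GOOGLE_APPLICATION_CREDENTIALS or provide an API key"
        )
    return {"api_key": api_key, "timeout": timeout}


# Assumed usage with the class shown above:
# ocr = VisionOCR(**vision_ocr_kwargs(api_key="api_key"))
```

Failing early with a clear message tends to be easier to debug than an authentication error raised in the middle of table extraction.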
AWS Textract
When using AWS Textract, the DetectDocumentText API is exclusively called.
Authentication to AWS can be done by passing credentials to the TextractOCR class.
If credentials are not provided, authentication is done using environment variables or configuration files.
Check boto3 documentation for more details.
from img2table.ocr import TextractOCR
ocr = TextractOCR(aws_access_key_id="***",
                  aws_secret_access_key="***",
                  aws_session_token="***",
                  region="eu-west-1")
Parameters
- aws_access_key_id : str, optional, default None
  - AWS access key id
- aws_secret_access_key : str, optional, default None
  - AWS secret access key
- aws_session_token : str, optional, default None
  - AWS temporary session token
- region : str, optional, default None
  - AWS server region
Azure Cognitive Services
from img2table.ocr import AzureOCR
ocr = AzureOCR(endpoint="abc.azure.com",
               subscription_key="***")
Parameters
- endpoint : str, optional, default None
  - Azure Cognitive Services endpoint. If None, inferred from the COMPUTER_VISION_ENDPOINT environment variable.
- subscription_key : str, optional, default None
  - Azure Cognitive Services subscription key. If None, inferred from the COMPUTER_VISION_SUBSCRIPTION_KEY environment variable.
Table extraction
Multiple tables can be extracted at once from a PDF page or an image using the extract_tables method of a document.
from img2table.ocr import TesseractOCR
from img2table.document import Image
# Instantiation of OCR
ocr = TesseractOCR(n_threads=1, lang="eng")
# Instantiation of document, either an image or a PDF
doc = Image(src, dpi=200)
# Table extraction
extracted_tables = doc.extract_tables(ocr=ocr,
                                      implicit_rows=True,
                                      borderless_tables=False,
                                      min_confidence=50)
Parameters
- ocr : OCRInstance, optional, default None
  - OCR instance used to parse document text. If None, cell content will not be extracted
- implicit_rows : bool, optional, default True
  - Boolean indicating if implicit rows should be identified - check related example
- borderless_tables : bool, optional, default False
  - Boolean indicating if borderless tables should be extracted. This requires an OCR instance to be passed to the method. Feature in alpha version
- min_confidence : int, optional, default 50
  - Minimum OCR confidence level required for text to be processed, from 0 (worst) to 99 (best)
Borderless table extraction released in version 0.0.14
NB: the implemented method for extraction of borderless tables heavily relies on OCR quality. In order to achieve decent results, it is recommended to use PaddleOCR or one of the supported commercial solutions.
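To make the effect of min_confidence concrete, the sketch below filters a few OCR words by their confidence score. It is an illustration only: OcrWord and keep_confident are stand-ins that do not exist in img2table, and treating the threshold as inclusive is an assumption.

```python
from dataclasses import dataclass


@dataclass
class OcrWord:
    """Stand-in for one OCR result; not an img2table class."""
    text: str
    confidence: int  # 0 (worst) to 99 (best)


def keep_confident(words, min_confidence=50):
    """Keep words whose confidence reaches the threshold (assumed inclusive)."""
    return [w for w in words if w.confidence >= min_confidence]


words = [OcrWord("Total", 96), OcrWord("l2.5O", 31), OcrWord("EUR", 50)]
kept = [w.text for w in keep_confident(words, min_confidence=50)]
```

Raising the threshold drops more misread text such as "l2.5O" here, at the cost of leaving more cells empty on low-quality scans.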
Method return
The ExtractedTable class is used to model extracted tables from documents.
Attributes
Images
The extract_tables method of the Image class returns a list of ExtractedTable objects.
output = [ExtractedTable(...), ExtractedTable(...), ...]
PDF
The extract_tables method of the PDF class returns an OrderedDict object with page indexes as keys and lists of ExtractedTable objects as values.
output = {
0: [ExtractedTable(...), ...],
1: [],
...
last_page: [ExtractedTable(...), ...]
}
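Because the PDF result is keyed by page, downstream code often needs to walk it as a flat sequence. A minimal sketch follows. The helper iter_tables is hypothetical and not part of img2table, and plain strings stand in for real ExtractedTable objects.

```python
from collections import OrderedDict


def iter_tables(tables_by_page):
    """Yield (page_index, table_position, table) for every extracted table."""
    for page_index, tables in tables_by_page.items():
        for position, table in enumerate(tables):
            yield page_index, position, table


# Stand-in strings replace real ExtractedTable objects here.
output = OrderedDict([(0, ["table A", "table B"]), (1, []), (2, ["table C"])])
flat = list(iter_tables(output))
```

Pages without tables, such as page 1 here, simply contribute nothing to the flattened sequence.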
Excel export
Tables extracted from a document can be exported to an xlsx file. The resulting file contains one worksheet per extracted table.
Most method arguments are shared with the extract_tables method.
from img2table.ocr import TesseractOCR
from img2table.document import Image
# Instantiation of OCR
ocr = TesseractOCR(n_threads=1, lang="eng")
# Instantiation of document, either an image or a PDF
doc = Image(src, dpi=200)
# Extraction of tables and creation of an xlsx file containing tables
doc.to_xlsx(dest=dest,
            ocr=ocr,
            implicit_rows=True,
            min_confidence=50)
Parameters
- dest : str, pathlib.Path or io.BytesIO, required
  - Destination for xlsx file
- ocr : OCRInstance, optional, default None
  - OCR instance used to parse document text. If None, cell content will not be extracted
- implicit_rows : bool, optional, default True
  - Boolean indicating if implicit rows should be identified - check related example
- min_confidence : int, optional, default 50
  - Minimum OCR confidence level required for text to be processed, from 0 (worst) to 99 (best)
Returns
If an io.BytesIO buffer is passed as the dest argument, it is returned containing the xlsx data.
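An in-memory buffer is convenient when the xlsx data is sent over a network or stored elsewhere, but it can still be written to disk afterwards. The sketch below shows one way to do so. The helper save_xlsx_buffer is hypothetical and not part of img2table.

```python
import io
from pathlib import Path


def save_xlsx_buffer(buffer: io.BytesIO, path) -> int:
    """Hypothetical helper: write the buffer to disk, return bytes written."""
    # getvalue() returns the whole content regardless of the read position.
    data = buffer.getvalue()
    Path(path).write_bytes(data)
    return len(data)


# With img2table, the buffer would come from the method described above:
# buffer = doc.to_xlsx(dest=io.BytesIO(), ocr=ocr)
# save_xlsx_buffer(buffer, "tables.xlsx")
```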
Examples
Several Jupyter notebooks with examples are available:
- Basic usage: generic library usage, including examples with images, PDF and OCRs
- Borderless tables: specific examples dedicated to the extraction of borderless tables
- Implicit rows: illustrated effect of the implicit_rows parameter of the extract_tables method
Caveats / FYI
- For table extraction, results are highly dependent on OCR quality. By design, tables where no OCR data can be found are not returned.
- The library is tailored for usage on documents with a white/light background. Effectiveness cannot be guaranteed on other types of documents.
- Borderless table extraction is still in alpha stage and might be inconsistent on complex cases. As an example, tables with multi-line cells can cause issues. Improvements to the algorithm will be released in future versions.