TabularOCR is a Python library that provides an easy-to-use Optical Character Recognition (OCR) solution for extracting tables from images and PDFs. It offers flexible output options, allowing you to export the extracted data in CSV, XLSX, or other spreadsheet formats.

Project description

TableOCR is a Python library that uses Optical Character Recognition (OCR) to extract tables from images and PDFs. The extracted data can be exported to CSV, XLSX, or other spreadsheet formats.

Features

  • Accurate Table Detection: TableOCR detects and extracts tables from images and PDFs, including documents with complex layouts or low-quality scans. It locates and isolates tables using edge detection, connected component analysis, and deep learning-based object detection.

  • Multiple Input Formats: Supports a wide range of input formats, including PNG, JPG, BMP, TIFF, and PDF files, allowing for flexibility in processing various types of document sources.

  • Customizable Output: Exports the extracted data to CSV, XLSX, or other spreadsheet formats, so the results can feed directly into existing data processing workflows.

  • Batch Processing: Easily process multiple files in a directory or a folder structure, making it ideal for high-volume data extraction tasks, such as digitizing large archives or processing scanned documents at scale.

  • Multi-language Support: Leverages state-of-the-art OCR engines to support a wide range of languages, enabling accurate table extraction from documents in various languages, including English, Spanish, French, German, Chinese, Arabic, and many more.

  • Parallel Processing: Uses multi-threading and parallel processing to reduce extraction time for large datasets or complex documents.

  • Configurable Settings: Provides a range of configuration options to fine-tune the table extraction process, including options for adjusting image pre-processing (e.g., deskewing, denoising, and binarization), OCR engine settings (e.g., language packs, character whitelists), and output formatting (e.g., column delimiters, date formats).

  • Embedded OCR Engines: TableOCR comes bundled with several popular OCR engines, including Tesseract and LSTM-based models, ensuring high accuracy and flexibility in table extraction. Additional OCR engines can be easily integrated, thanks to the modular design of the library.

  • Seamless Integration: TableOCR exposes a small Python API that can be called from existing projects, supporting table extraction and analysis workflows in areas such as data mining, research, and business intelligence.

Installation

TableOCR can be installed from PyPI using pip:

pip install tableocr
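To confirm that the install succeeded, one option is to query the installed distribution's version from the standard library. This is a minimal sketch: the helper name `installed_version` is hypothetical, and the distribution name is taken from the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution_name):
    """Return the installed version string, or None if the distribution is missing."""
    try:
        return version(distribution_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "tableocr" is the name used in the pip command above.
    found = installed_version("tableocr")
    print(found if found is not None else "tableocr is not installed")
```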

Usage

Here’s a simple example of how to use TableOCR to extract tables from an image file:

from tableocr import TableOCR

# Initialize the TableOCR instance
ocr = TableOCR()

# Path to the input image or PDF file
image_path = "path/to/image.png"

# Extract tables from the image
tables = ocr.extract(image_path)

# Export the extracted tables to a CSV file
ocr.to_csv("output.csv", tables)

Usage Examples

  1. Batch Processing:

TableOCR supports batch processing of multiple files in a directory or folder structure. Here’s an example:

from tableocr import TableOCR
import os

# Initialize the TableOCR instance
ocr = TableOCR()

# Directory containing input files
input_dir = "path/to/input/directory"

# Iterate over the entries in the directory
for filename in sorted(os.listdir(input_dir)):
    file_path = os.path.join(input_dir, filename)

    # os.listdir also returns subdirectories; skip anything that is not a file
    if not os.path.isfile(file_path):
        continue

    tables = ocr.extract(file_path)

    # Export tables to individual CSV files
    output_file = f"output_{filename}.csv"
    ocr.to_csv(output_file, tables)
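For larger directories, the same loop can be run across a thread pool. The sketch below is not part of TableOCR: `find_inputs` and `process_all` are hypothetical helpers, and the extraction step is passed in as a plain callable so the pattern does not depend on any particular API. It walks a folder structure recursively, keeps only the input formats listed under Features, and records a failure per file instead of aborting the whole batch.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Suffixes for the formats listed under "Multiple Input Formats".
# ".jpeg" and ".tif" are common aliases added here as an assumption.
SUPPORTED_SUFFIXES = {".png", ".jpg", ".jpeg", ".bmp", ".tif", ".tiff", ".pdf"}


def find_inputs(input_dir):
    """Return supported files under input_dir (searched recursively), sorted by path."""
    return sorted(
        path
        for path in Path(input_dir).rglob("*")
        if path.is_file() and path.suffix.lower() in SUPPORTED_SUFFIXES
    )


def process_all(input_dir, extract_one, max_workers=4):
    """Call extract_one(path) for each supported file.

    Returns a dict mapping each path to its result, or to the exception
    raised for that file, so one bad input does not stop the batch.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(extract_one, path): path for path in find_inputs(input_dir)}
        for future, path in futures.items():
            try:
                results[path] = future.result()
            except Exception as exc:
                results[path] = exc
    return results
```

Passing `ocr.extract` as `extract_one` would only be appropriate if a single TableOCR instance is safe to call from several threads, which this page does not state; creating one instance per call inside the callable is the more cautious choice.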

  2. Configuring OCR Settings:

You can fine-tune the OCR engine settings to optimize performance for specific document types or languages:

from tableocr import TableOCR, OCRSettings

# Initialize the TableOCR instance
ocr = TableOCR()

# Configure OCR settings
settings = OCRSettings(language="fra", whitelist="0123456789")
ocr.set_ocr_settings(settings)

# Extract tables using the configured settings
image_path = "path/to/image.png"
tables = ocr.extract(image_path)
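For orientation, the two settings above correspond to standard Tesseract options: `-l` selects the language pack and the `tessedit_char_whitelist` config variable restricts the characters the engine may output. The sketch below only builds the equivalent Tesseract command line; `tesseract_command` is a hypothetical helper, and whether TableOCR passes its settings to Tesseract in this form is an assumption.

```python
def tesseract_command(image_path, language="fra", whitelist="0123456789"):
    """Build a Tesseract CLI invocation that prints recognized text to stdout."""
    return [
        "tesseract",
        image_path,
        "stdout",  # output base; "stdout" writes the text to standard output
        "-l", language,  # language pack, e.g. "fra" for French
        "-c", f"tessedit_char_whitelist={whitelist}",  # allowed characters
    ]


if __name__ == "__main__":
    print(" ".join(tesseract_command("path/to/image.png")))
```

A digits-only whitelist like the one shown suits numeric tables; text columns would need a wider character set or no whitelist at all.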

  3. Customizing Output Formatting:

TableOCR allows you to customize the output format by specifying column delimiters, date formats, and other formatting options:

from tableocr import TableOCR, OutputSettings

# Initialize the TableOCR instance
ocr = TableOCR()

# Configure output settings
output_settings = OutputSettings(delimiter="|", date_format="%Y-%m-%d")
ocr.set_output_settings(output_settings)

# Extract tables and export to CSV with custom settings
image_path = "path/to/image.png"
tables = ocr.extract(image_path)
ocr.to_csv("output.csv", tables)
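To show what a custom delimiter and date format do to the written file, here is a standard-library sketch. It assumes a table is a list of rows, each row a list of cell values; the page does not document the actual structure returned by `extract`, and `format_cell` and `table_to_csv_text` are hypothetical helpers rather than TableOCR functions.

```python
import csv
import io
from datetime import date


def format_cell(value, date_format="%Y-%m-%d"):
    """Render dates with date_format, None as an empty cell, anything else via str()."""
    if isinstance(value, date):  # also matches datetime, a subclass of date
        return value.strftime(date_format)
    return "" if value is None else str(value)


def table_to_csv_text(rows, delimiter="|", date_format="%Y-%m-%d"):
    """Serialize rows to delimited text, quoting cells that contain the delimiter."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    for row in rows:
        writer.writerow([format_cell(cell, date_format) for cell in row])
    return buffer.getvalue()


if __name__ == "__main__":
    rows = [["item", "released"], ["a|b", date(2024, 3, 7)]]
    print(table_to_csv_text(rows), end="")
```

The `csv` module quotes any cell that contains the delimiter, so a value such as `a|b` survives a round trip even when `|` is the column separator.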

For more advanced usage, such as handling PDF files, table structure analysis, and table merging, refer to the documentation.

Contributing

Contributions to TableOCR are welcome! If you encounter any issues or have ideas for improvements, please open an issue or submit a pull request on the GitHub repository.

Credits

TableOCR was created and is maintained by Salim Benhamadi.

License

This project is licensed under the terms of the MIT license.

History

0.1.0 (2024-03-07)

  • First release on PyPI.
