Skip to main content

Extract text from tables in images.

Project description

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
TABLE DETECTION IN IMAGES AND OCR TO CSV

Eric Ihli
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━


Table of Contents
─────────────────

1. Overview
2. Requirements
3. Demo
4. Modules





1 Overview
══════════

This python package contains modules to help with finding and
extracting tabular data from a PDF or image into a CSV format.

Given an image that contains a table…

<file:resources/examples/example-page.png>

Extract the the text into a CSV format…

┌────
│ PRIZE,ODDS 1 IN:,# OF WINNERS*
│ $3,9.09,"282,447"
│ $5,16.66,"154,097"
│ $7,40.01,"64,169"
│ $10,26.67,"96,283"
│ $20,100.00,"25,677"
│ $30,290.83,"8,829"
│ $50,239.66,"10,714"
│ $100,919.66,"2,792"
│ $500,"6,652.07",386
│ "$40,000","855,899.99",3
│ 1,i223,
│ Toa,,
│ ,,
│ ,,"* Based upon 2,567,700"
└────


2 Requirements
══════════════

Along with the python requirements that are listed in setup.py and
that are automatically installed when installing this package through
pip, there are a few external requirements for some of the modules.

I haven’t looked into the minimum required versions of these
dependencies, but I’ll list the versions that I’m using.

• `pdfimages' 20.09.0 of [Poppler]
• `tesseract' 5.0.0 of [Tesseract]
• `mogrify' 7.0.10 of [ImageMagick]


[Poppler] <https://poppler.freedesktop.org/>

[Tesseract] <https://github.com/tesseract-ocr/tesseract>

[ImageMagick] <https://imagemagick.org/index.php>


3 Demo
══════

There is a demo module that will download an image given a URL and try
to extract tables from the image and process the cells into a CSV. You
can try it out with one of the images included in this repo.

1. `pip3 install table_ocr'
2. `python3 -m table_ocr.demo
https://raw.githubusercontent.com/eihli/image-table-ocr/master/resources/test_data/simple.png'

That will run against the following image:

<file:resources/test_data/simple.png>

The following should be printed to your terminal after running the
above commands.

┌────
│ Running `extract_tables.main([/tmp/demo_p9on6m8o/simple.png]).`
│ Extracted the following tables from the image:
│ [('/tmp/demo_p9on6m8o/simple.png', ['/tmp/demo_p9on6m8o/simple/table-000.png'])]
│ Processing tables for /tmp/demo_p9on6m8o/simple.png.
│ Processing table /tmp/demo_p9on6m8o/simple/table-000.png.
│ Extracted 18 cells from /tmp/demo_p9on6m8o/simple/table-000.png
│ Cells:
│ /tmp/demo_p9on6m8o/simple/cells/000-000.png: Cell
│ /tmp/demo_p9on6m8o/simple/cells/000-001.png: Format
│ /tmp/demo_p9on6m8o/simple/cells/000-002.png: Formula
│ ...

│ Here is the entire CSV output:

│ Cell,Format,Formula
│ B4,Percentage,None
│ C4,General,None
│ D4,Accounting,None
│ E4,Currency,"=PMT(B4/12,C4,D4)"
│ F4,Currency,=E4*C4
└────


4 Modules
═════════

The package is split into modules with narrow focuses.

• `pdf_to_images' uses Poppler and ImageMagick to extract images from
a PDF.
• `extract_tables' finds and extracts table-looking things from an
image.
• `extract_cells' extracts and orders cells from a table.
• `ocr_image' uses Tesseract to OCR the text from an image of a cell.
• `ocr_to_csv' converts into a CSV the directory structure that
`ocr_image' outputs.

The outputs of a previous module can be used by a subsequent module so
that they can be chained together to create the entire workflow, as
demonstrated by the following shell script.

┌────
│ #!/bin/sh

│ PDF=$1

│ python -m table_ocr.pdf_to_images $PDF | grep .png > /tmp/pdf-images.txt
│ cat /tmp/pdf-images.txt | xargs -I{} python -m table_ocr.extract_tables {} | grep table > /tmp/extracted-tables.txt
│ cat /tmp/extracted-tables.txt | xargs -I{} python -m table_ocr.extract_cells {} | grep cells > /tmp/extracted-cells.txt
│ cat /tmp/extracted-cells.txt | xargs -I{} python -m table_ocr.ocr_image {}

│ for image in $(cat /tmp/extracted-tables.txt); do
│ dir=$(dirname $image)
│ python -m table_ocr.ocr_to_csv $(find $dir/cells -name "*.txt")
│ done
└────


The package was written in a [literate programming] style. The source
code at
<https://eihli.github.io/image-table-ocr/pdf_table_extraction_and_ocr.html>
is meant to act as the documentation and reference material.


[literate programming]
<https://en.wikipedia.org/wiki/Literate_programming>


Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

table_ocr-0.2.5.tar.gz (22.1 MB view details)

Uploaded Source

Built Distributions

table_ocr-0.2.5-py3.8.egg (33.4 MB view details)

Uploaded Source

table_ocr-0.2.5-py3-none-any.whl (33.4 MB view details)

Uploaded Python 3

File details

Details for the file table_ocr-0.2.5.tar.gz.

File metadata

  • Download URL: table_ocr-0.2.5.tar.gz
  • Upload date:
  • Size: 22.1 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/46.1.3 requests-toolbelt/0.9.1 tqdm/4.41.0 CPython/3.8.6

File hashes

Hashes for table_ocr-0.2.5.tar.gz
Algorithm Hash digest
SHA256 09dcfd4ec1127479caed4c9992a1ba7008cfacc89b44cf42214f569bb88f73dc
MD5 4d7b0cfe53dd0ceac0e50e298c06d3fe
BLAKE2b-256 0c806825837bd2f8c4d49a19f77ed71106f8635205719b2df476dcf544c27f26

See more details on using hashes here.

File details

Details for the file table_ocr-0.2.5-py3.8.egg.

File metadata

  • Download URL: table_ocr-0.2.5-py3.8.egg
  • Upload date:
  • Size: 33.4 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/46.1.3 requests-toolbelt/0.9.1 tqdm/4.41.0 CPython/3.8.6

File hashes

Hashes for table_ocr-0.2.5-py3.8.egg
Algorithm Hash digest
SHA256 7ad40d6567e89493bae9da84cac5ea46d78671722c267c7c47e7d75bf4371220
MD5 ed35bec75140b2d5acf524e760ed9134
BLAKE2b-256 2c4e4d66e9b99638d28fffe020e68d9c280545b784ae2ccba65f1ac9e2b01801

See more details on using hashes here.

File details

Details for the file table_ocr-0.2.5-py3-none-any.whl.

File metadata

  • Download URL: table_ocr-0.2.5-py3-none-any.whl
  • Upload date:
  • Size: 33.4 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/46.1.3 requests-toolbelt/0.9.1 tqdm/4.41.0 CPython/3.8.6

File hashes

Hashes for table_ocr-0.2.5-py3-none-any.whl
Algorithm Hash digest
SHA256 32b94ef262edf96c4c18478254396412188c34ec979fefe9660b59e0cb3d6678
MD5 5be8cf8178fd9c176f1875d742523471
BLAKE2b-256 42a0c389025a6bd08a2ab9ef9f25dce100cff6e219f56c1247c0d261cfda2fe1

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page