table-ocr

Extract text from tables in images.

These details have not been verified by PyPI

Project links

Homepage

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

               ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
                TABLE DETECTION IN IMAGES AND OCR TO CSV

                               Eric Ihli
               ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━


Table of Contents
─────────────────

1. Overview
2. Requirements
3. Demo
4. Modules





1 Overview
══════════

  This Python package contains modules for finding tabular data in a PDF
  or image and extracting it into CSV format.

  Given an image that contains a table…

  <file:resources/examples/example-page.png>

  Extract the text into a CSV format…

  ┌────
  │ PRIZE,ODDS 1 IN:,# OF WINNERS*
  │ $3,9.09,"282,447"
  │ $5,16.66,"154,097"
  │ $7,40.01,"64,169"
  │ $10,26.67,"96,283"
  │ $20,100.00,"25,677"
  │ $30,290.83,"8,829"
  │ $50,239.66,"10,714"
  │ $100,919.66,"2,792"
  │ $500,"6,652.07",386
  │ "$40,000","855,899.99",3
  │ 1,i223,
  │ Toa,,
  │ ,,
  │ ,,"* Based upon 2,567,700"
  └────
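
  The quoted fields in the output above (for example "282,447") are
  ordinary CSV quoting, so the result can be read back with Python's
  standard csv module. The following is only a minimal sketch using a
  few rows copied from the example; the `to_number' helper is
  hypothetical and not part of this package.

```python
import csv
import io

# A few rows copied from the example output above.
csv_text = '''PRIZE,ODDS 1 IN:,# OF WINNERS*
$3,9.09,"282,447"
$500,"6,652.07",386
"$40,000","855,899.99",3
'''

# csv.reader handles the quoted fields that contain commas.
rows = list(csv.reader(io.StringIO(csv_text)))
header, records = rows[0], rows[1:]


def to_number(text):
    """Hypothetical helper: drop '$' and thousands separators, parse as float."""
    return float(text.replace("$", "").replace(",", ""))


# Third column ("# OF WINNERS*") as integers.
winners = [int(to_number(record[2])) for record in records]
```

  OCR noise, such as the stray rows near the end of the example output,
  would still need to be filtered out before a conversion like this.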


2 Requirements
══════════════

  In addition to the Python requirements listed in setup.py, which pip
  installs automatically along with this package, some of the modules
  depend on a few external programs.

  I haven’t looked into the minimum required versions of these
  dependencies, so the versions listed here are simply the ones I’m
  using.

  • `pdfimages' 20.09.0 of [Poppler]
  • `tesseract' 5.0.0 of [Tesseract]
  • `mogrify' 7.0.10 of [ImageMagick]


[Poppler] <https://poppler.freedesktop.org/>

[Tesseract] <https://github.com/tesseract-ocr/tesseract>

[ImageMagick] <https://imagemagick.org/index.php>
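
  Before running the modules, it can help to check that these external
  programs are on your PATH. The following is a small sketch and not
  part of this package: the `missing_tools' function is hypothetical, it
  uses only the executable names listed above, and it does not check
  versions.

```python
import shutil

# Executable names taken from the requirements list above.
EXTERNAL_TOOLS = ("pdfimages", "tesseract", "mogrify")


def missing_tools(names=EXTERNAL_TOOLS):
    """Return the executable names that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


missing = missing_tools()
if missing:
    print("Not found on PATH: " + ", ".join(missing))
else:
    print("All external tools were found.")
```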


3 Demo
══════

  The demo module downloads an image from a given URL, tries to extract
  tables from it, and processes the cells into a CSV. You can try it out
  with one of the images included in this repo.

  1. `pip3 install table_ocr'
  2. `python3 -m table_ocr.demo
     https://raw.githubusercontent.com/eihli/image-table-ocr/master/resources/test_data/simple.png'

  That will run against the following image:

  <file:resources/test_data/simple.png>

  The following should be printed to your terminal after running the
  above commands.

  ┌────
  │ Running `extract_tables.main([/tmp/demo_p9on6m8o/simple.png]).`
  │ Extracted the following tables from the image:
  │ [('/tmp/demo_p9on6m8o/simple.png', ['/tmp/demo_p9on6m8o/simple/table-000.png'])]
  │ Processing tables for /tmp/demo_p9on6m8o/simple.png.
  │ Processing table /tmp/demo_p9on6m8o/simple/table-000.png.
  │ Extracted 18 cells from /tmp/demo_p9on6m8o/simple/table-000.png
  │ Cells:
  │ /tmp/demo_p9on6m8o/simple/cells/000-000.png: Cell
  │ /tmp/demo_p9on6m8o/simple/cells/000-001.png: Format
  │ /tmp/demo_p9on6m8o/simple/cells/000-002.png: Formula
  │ ...
  │ 
  │ Here is the entire CSV output:
  │ 
  │ Cell,Format,Formula
  │ B4,Percentage,None
  │ C4,General,None
  │ D4,Accounting,None
  │ E4,Currency,"=PMT(B4/12,C4,D4)"
  │ F4,Currency,=E4*C4
  └────
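
  In the output above, the cell file names (000-000.png, 000-001.png,
  000-002.png) appear to encode a row and a column index. Assuming that
  naming, the step from per-cell text to CSV can be sketched as follows.
  This is an illustration only, not the package's implementation:
  `cells_to_csv' and the row-column file name pattern are assumptions.

```python
import csv
import io
import re
from pathlib import Path

# Assumed file name pattern: "<row>-<col>", e.g. "000-002" (extension ignored).
CELL_NAME = re.compile(r"^(\d+)-(\d+)$")


def cells_to_csv(cell_text):
    """Build CSV text from a {cell file path: OCR text} mapping."""
    grid = {}
    for path, text in cell_text.items():
        match = CELL_NAME.match(Path(path).stem)
        if match is None:
            continue  # skip files that do not look like cell files
        row, col = int(match.group(1)), int(match.group(2))
        grid.setdefault(row, {})[col] = text

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in sorted(grid):
        cols = grid[row]
        # Fill any missing column with an empty string.
        writer.writerow([cols.get(c, "") for c in range(max(cols) + 1)])
    return out.getvalue()


# Text values taken from the demo output; the relative paths are made up.
example = {
    "cells/000-000.png": "Cell",
    "cells/000-001.png": "Format",
    "cells/000-002.png": "Formula",
    "cells/001-000.png": "E4",
    "cells/001-001.png": "Currency",
    "cells/001-002.png": "=PMT(B4/12,C4,D4)",
}
demo_csv = cells_to_csv(example)
```

  The csv writer quotes the formula field because it contains commas,
  which matches the quoting seen in the demo output.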


4 Modules
═════════

  The package is split into modules with narrow focuses.

  • `pdf_to_images' uses Poppler and ImageMagick to extract images from
    a PDF.
  • `extract_tables' finds and extracts table-looking things from an
    image.
  • `extract_cells' extracts and orders cells from a table.
  • `ocr_image' uses Tesseract to OCR the text from an image of a cell.
  • `ocr_to_csv' converts the directory structure that `ocr_image'
    outputs into a CSV.

  The outputs of a previous module can be used by a subsequent module so
  that they can be chained together to create the entire workflow, as
  demonstrated by the following shell script.

  ┌────
  │ #!/bin/sh
  │ # Usage: pass the path of the PDF to process as the first argument.
  │ 
  │ PDF=$1
  │ 
  │ # grep filters each stage's output down to the paths the next stage needs.
  │ python -m table_ocr.pdf_to_images "$PDF" | grep .png > /tmp/pdf-images.txt
  │ cat /tmp/pdf-images.txt | xargs -I{} python -m table_ocr.extract_tables {}  | grep table > /tmp/extracted-tables.txt
  │ cat /tmp/extracted-tables.txt | xargs -I{} python -m table_ocr.extract_cells {} | grep cells > /tmp/extracted-cells.txt
  │ cat /tmp/extracted-cells.txt | xargs -I{} python -m table_ocr.ocr_image {}
  │ 
  │ # Convert the OCR text files under each table's cells directory into CSV.
  │ for image in $(cat /tmp/extracted-tables.txt); do
  │     dir=$(dirname "$image")
  │     python -m table_ocr.ocr_to_csv $(find "$dir/cells" -name "*.txt")
  │ done
  └────
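
  The same chaining can be sketched in Python by calling each module
  through its `python -m' entry point, as the shell script does. This is
  a rough sketch under assumptions: the helper names (`module_command',
  `run_module', `pdf_to_cells') are hypothetical, and it assumes, as the
  grep calls in the script imply, that each module prints one path per
  line.

```python
import subprocess
import sys


def module_command(module, *args):
    """Build the command line for `python -m table_ocr.<module> <args>`."""
    return [sys.executable, "-m", f"table_ocr.{module}", *args]


def run_module(module, *args, keep=""):
    """Run one module and return its stdout lines that contain `keep`."""
    result = subprocess.run(
        module_command(module, *args),
        capture_output=True,
        text=True,
        check=True,  # raise if the module exits with an error
    )
    return [line for line in result.stdout.splitlines() if keep in line]


def pdf_to_cells(pdf_path):
    """Mirror the first four pipeline steps; return the table image paths."""
    images = run_module("pdf_to_images", pdf_path, keep=".png")
    tables = [
        table
        for image in images
        for table in run_module("extract_tables", image, keep="table")
    ]
    cells = [
        cell
        for table in tables
        for cell in run_module("extract_cells", table, keep="cells")
    ]
    for cell in cells:
        run_module("ocr_image", cell)
    return tables
```

  Calling `pdf_to_cells' with the path to a PDF would run the pipeline
  up to the OCR step; the final `ocr_to_csv' step is left out here and
  would follow the loop in the shell script.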


  The package was written in a [literate programming] style. The source
  code at
  <https://eihli.github.io/image-table-ocr/pdf_table_extraction_and_ocr.html>
  is meant to act as the documentation and reference material.


[literate programming]
<https://en.wikipedia.org/wiki/Literate_programming>

Project details

Release history

• 0.2.5, Dec 28, 2020 (this version)
• 0.2.4, Oct 19, 2020
• 0.2.3, Oct 19, 2020

Download files

Source Distribution

table_ocr-0.2.5.tar.gz (22.1 MB), uploaded Dec 28, 2020, Source

Built Distributions

table_ocr-0.2.5-py3.8.egg (33.4 MB), uploaded Dec 28, 2020, Egg

table_ocr-0.2.5-py3-none-any.whl (33.4 MB), uploaded Dec 28, 2020, Python 3

File details

Details for the file table_ocr-0.2.5.tar.gz.

File metadata

Download URL: table_ocr-0.2.5.tar.gz
Upload date: Dec 28, 2020
Size: 22.1 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/46.1.3 requests-toolbelt/0.9.1 tqdm/4.41.0 CPython/3.8.6

File hashes

Hashes for table_ocr-0.2.5.tar.gz
Algorithm	Hash digest
SHA256	`09dcfd4ec1127479caed4c9992a1ba7008cfacc89b44cf42214f569bb88f73dc`
MD5	`4d7b0cfe53dd0ceac0e50e298c06d3fe`
BLAKE2b-256	`0c806825837bd2f8c4d49a19f77ed71106f8635205719b2df476dcf544c27f26`

File details

Details for the file table_ocr-0.2.5-py3.8.egg.

File metadata

Download URL: table_ocr-0.2.5-py3.8.egg
Upload date: Dec 28, 2020
Size: 33.4 MB
Tags: Egg
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/46.1.3 requests-toolbelt/0.9.1 tqdm/4.41.0 CPython/3.8.6

File hashes

Hashes for table_ocr-0.2.5-py3.8.egg
Algorithm	Hash digest
SHA256	`7ad40d6567e89493bae9da84cac5ea46d78671722c267c7c47e7d75bf4371220`
MD5	`ed35bec75140b2d5acf524e760ed9134`
BLAKE2b-256	`2c4e4d66e9b99638d28fffe020e68d9c280545b784ae2ccba65f1ac9e2b01801`

File details

Details for the file table_ocr-0.2.5-py3-none-any.whl.

File metadata

Download URL: table_ocr-0.2.5-py3-none-any.whl
Upload date: Dec 28, 2020
Size: 33.4 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/46.1.3 requests-toolbelt/0.9.1 tqdm/4.41.0 CPython/3.8.6

File hashes

Hashes for table_ocr-0.2.5-py3-none-any.whl
Algorithm	Hash digest
SHA256	`32b94ef262edf96c4c18478254396412188c34ec979fefe9660b59e0cb3d6678`
MD5	`5be8cf8178fd9c176f1875d742523471`
BLAKE2b-256	`42a0c389025a6bd08a2ab9ef9f25dce100cff6e219f56c1247c0d261cfda2fe1`
