
Document Extraction


Purpose:

Converts unstructured OCR documents into structured key-value pairs.

Required packages:

  • Wand
  • pytesseract
  • Tesseract
  • Ghostscript
  • ImageMagick
  • OpenCV
  • scikit-learn
  • Keras
  • TensorFlow

Usage

In the main function of GETO2.0.py, replace the absolute path of the PDF with the path to your input file.

Key technologies used:

  • Deep learning
  • Ensemble learning

Description of machine learning architectures:

  • DoT-Net: DoT-Net is a CNN architecture that classifies and segments the text elements in a document.
  • RFClassifier: RFClassifier is an ensemble deep learning architecture used to detect table-of-contents (TOC) pages within a document.

Flow diagram of the framework

[Figure: flow diagram of the framework]

CODE FILES:

  • GETO2.0.py is the interface for the framework.
  • Segmentation.py is the module for DoT-Net. It is used in GETO2.0.py.
  • TOCclassifier.py is the module that detects the TOC in the document. It is used in GETO2.0.py.
  • TESSARACT.py extracts text entities from the pages detected as TOC. It is used in TOCclassifier.py.
  • BlockParsing.py extracts text from the blocks of text detected in Segmentation.py. It is used in Segmentation.py.

CODE FLOW:

[Figure: code flow diagram]

Detailed description of the code:

GETO2.0.py:

GETO2.0.py is the main interface of the framework. Each page of the input PDF file is converted to an image using the Wand library. The converted image is then checked for a TOC using TOCclassifier. Only the first N pages are checked for a TOC.
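The page routing described above can be illustrated with a short sketch. It is not taken from GETO2.0.py: the function name `route_pages`, the `is_toc` callable standing in for TOCclassifier, and the default of 10 for N are all assumptions made for illustration. In the framework the pages would be images produced with Wand; plain strings are used here so the example runs on its own.

```python
def route_pages(pages, is_toc, first_n=10):
    """Split pages into TOC and non-TOC lists.

    Only the first `first_n` pages are passed to the classifier.
    Later pages are treated as non-TOC without being checked.
    `is_toc` is a hypothetical stand-in for the TOC classifier.
    """
    toc_pages, other_pages = [], []
    for index, page in enumerate(pages):
        if index < first_n and is_toc(page):
            toc_pages.append(page)
        else:
            other_pages.append(page)
    return toc_pages, other_pages


# Toy usage: strings stand in for page images.
pages = ["cover", "toc-1", "toc-2", "chapter", "toc-like appendix"]
toc, rest = route_pages(pages, lambda p: p.startswith("toc"), first_n=3)
print(toc)   # ['toc-1', 'toc-2']
print(rest)  # ['cover', 'chapter', 'toc-like appendix']
```

The last page looks like a TOC to the toy classifier, but it falls outside the first N pages and is therefore routed to the non-TOC branch.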

  • Pages detected as TOC.
    • TOCclassifier.py: TOCclassifier checks the pages for a TOC. If a page is classified as TOC, TESSARACT.py is used to extract the TOC text, which is appended to a list.
      • TESSARACT.py: TESSARACT.py uses pytesseract, a Python wrapper for Tesseract (an engine for extracting text from images), to extract the text from the TOC.
  • Pages detected as non-TOC.
    • Note: Pages after the first N are also treated as non-TOC.
    • Segmentation.py: Segmentation performs multiple tasks.
      • It segments the pages using image morphology methods and contour functions to find the connected components (blocks).
      • A sliding window is passed over these connected components to generate 100 * 100 tiles (DoT-Net takes 100 * 100 tiles as its input).
      • Data duplication, a form of augmentation, is performed on blocks smaller than 100 * 100 (heading blocks in particular tend to be smaller than 100 * 100) to avoid losing data.
      • These 100 * 100 tiles are then classified using DoT-Net.
      • After patch classification, majority voting is used to predict the label of the block.
      • If the block label is text, BlockParsing.py is used to extract the text from the block.
      • Note: DoT-Net can detect other classes such as Table, Image, Mathematical Expressions, and Line Drawings, but this project focuses only on Text.
      • BlockParsing.py uses pytesseract to extract the text.
      • The extracted text is appended to a list.
  • Text from the TOC and from the rest of the PDF document is extracted and appended to the respective lists.
    • Once both lists are filled, fuzzy matching and regular expression matching are used to create the JSON files.
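The tiling, duplication, voting, and matching steps in the list above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the framework's code. The function names, the stride of 50, the wrap-around way of duplicating small blocks, the 0.8 similarity threshold, and the use of `difflib` as the fuzzy matcher are all hypothetical. The source also does not say what is matched against what, so pairing TOC entries with extracted block text is an assumption as well.

```python
from collections import Counter
from difflib import SequenceMatcher

TILE = 100  # tile edge length stated in the text


def tile_origins(width, height, stride=50):
    """Top-left corners of TILE x TILE windows covering a block.

    The stride is an assumption; the source does not state it.
    """
    def starts(extent):
        if extent <= TILE:
            return [0]
        last = extent - TILE
        points = list(range(0, last, stride))
        points.append(last)  # make sure the far edge is covered
        return points

    return [(x, y) for y in starts(height) for x in starts(width)]


def pad_by_duplication(block):
    """Repeat a small block's rows and columns until it is TILE x TILE.

    `block` is a list of pixel rows. Wrap-around repetition is one
    possible reading of "data duplication" (an assumption).
    """
    rows, cols = len(block), len(block[0])
    return [[block[y % rows][x % cols] for x in range(TILE)]
            for y in range(TILE)]


def majority_vote(tile_labels):
    """Label predicted for the most tiles; ties go to the label seen first."""
    return Counter(tile_labels).most_common(1)[0][0]


def match_toc_entries(toc_entries, block_texts, threshold=0.8):
    """Pair each TOC entry with its most similar extracted block text.

    Entries with no candidate at or above `threshold` map to None.
    """
    matches = {}
    for entry in toc_entries:
        scored = [
            (SequenceMatcher(None, entry.lower(), text.lower()).ratio(), text)
            for text in block_texts
        ]
        score, best = max(scored, default=(0.0, None))
        matches[entry] = best if score >= threshold else None
    return matches


# Toy usage: per-tile labels for one block, then OCR text with typical errors.
print(majority_vote(["text", "table", "text"]))
print(match_toc_entries(["1 Introduction"], ["1 lntroduction", "Body text"]))
```

A similarity ratio is used instead of exact comparison because OCR output often confuses characters such as "I" and "l" or "o" and "0", so a heading may not match its TOC entry character for character.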
