Jupyter widget for applying NLP to PDF documents

Project description

PDF Digitizer (It has a back button!!)

A Jupyter-based tool to help parse out structured text from a PDF document and explore the contents.

Requirements

Installation

It is highly recommended that you install this into a clean environment.

Note: layoutparser puts an upper bound on numpy (1.19.3), so if you want to use the Parse Layout button, it's best to install this in an empty environment.

Conda (recommended)

conda create -n ipypdf python pip jupyterlab tesseract -c conda-forge
conda activate ipypdf
python -m pip install ipypdf

No Conda (not tested)

Install Tesseract

python -m venv envs/ipypdf
cd envs/ipypdf/Scripts
activate.bat
python -m pip install jupyterlab 
python -m pip install ipypdf

Development

see DEVELOPMENT.md

Common Issues

  • AutoTools widget keeps saying layoutparser is not installed
    • This is usually a problem with pywin32.
    • Try conda install pywin32
    • Also make sure that numpy is <1.19.3
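
The checks above can be run from inside the environment with a short diagnostic script. This is a minimal sketch and not part of ipypdf: the helper names are hypothetical, win32api is assumed to be the module that pywin32 provides, and the 1.19.3 bound is taken from the note in the Installation section (adjust whether the bound is inclusive to match what layoutparser actually requires).

```python
from importlib import metadata, util


def version_tuple(version):
    """Turn '1.19.3' into (1, 19, 3); non-numeric suffixes are ignored."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def within_bound(installed, upper="1.19.3"):
    """True if the installed version does not exceed the assumed upper bound."""
    return version_tuple(installed) <= version_tuple(upper)


def diagnose(packages=("layoutparser", "numpy", "win32api")):
    """Report which modules can be found, plus the numpy version check."""
    report = {name: util.find_spec(name) is not None for name in packages}
    if report.get("numpy"):
        report["numpy_ok"] = within_bound(metadata.version("numpy"))
    return report


if __name__ == "__main__":
    for key, value in diagnose().items():
        print(f"{key}: {value}")
```

If layoutparser or win32api is reported as False, or numpy_ok is False, that points at which of the fixes above to try first.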

Usage

ipypdf is built for jupyter lab but should also work in jupyter notebooks.

  1. Launch jupyter lab with jupyter lab
  2. In a notebook cell, create and display the app:
from ipypdf import App
app = App("path/to/your/pdfs", bulk_render=False)
app

Features

Auto-Parser

layoutparser is used to determine the location of textblocks, images, and section headers. There is not currently a way to automatically determine the hierarchical position of these items.

(demo video of the Auto-Parser)

Note: this is 4x speed

Also: this video is now outdated. The AutoParse button now attempts to sort all of the nodes and to deduce the first level of hierarchical structure.

Table Parsing

image

Cytoscape

Folders, PDF Documents, and Sections have a tab labeled Cytoscape. This runs a TF-IDF similarity calculation over all nodes beneath the selected item. For example, if you select the root node, all defined nodes are included in the calculation. However, only nodes with a link to another node are drawn (this is for speed and may change in the future).
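
To make the idea concrete, here is a minimal pure-Python sketch of a TF-IDF similarity pass that keeps only linked nodes. It is an illustration under assumptions, not ipypdf's actual implementation: the function names, the whitespace tokenizer, the smoothed IDF formula, and the 0.2 link threshold are all hypothetical choices.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Build one sparse TF-IDF vector (a dict of term -> weight) per text."""
    docs = [Counter(text.lower().split()) for text in texts]
    n_docs = len(docs)
    # Number of documents each term appears in.
    doc_freq = Counter(term for doc in docs for term in doc)
    vectors = []
    for doc in docs:
        total = sum(doc.values())
        vectors.append({
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in doc.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_edges(texts, threshold=0.2):
    """Return (i, j, score) for every pair scoring at or above the threshold."""
    vectors = tfidf_vectors(texts)
    edges = []
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            score = cosine(vectors[i], vectors[j])
            if score >= threshold:
                edges.append((i, j, score))
    return edges


def linked_nodes(edges):
    """Indices of nodes that have at least one link and would be drawn."""
    return {index for i, j, _ in edges for index in (i, j)}
```

Calling similarity_edges on the text of every node beneath the selected item and then linked_nodes on the result gives the set of nodes to draw; nodes that share no terms with any other node drop out.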

The color of each node denotes the pdf document it originated from.

image

Selecting a node in the graph will highlight the node in the DocTree. Clicking the node in the DocTree will render the first page of the node.

image

Digitizing Utilities

I recommend turning off Show Boxes, as it changes pages every time you add a node (working on a better solution).

Each node has a specific set of tools available to use. Here are the tools provided when a Section node is selected. Starting from the left:

  • Add Section Node adds a sub-node of type Section and selects it
  • Add Text Node adds a sub-node of type Text and selects it
  • Add Image Node ...
  • Delete Node deletes the selected node and all of its children

image

Content Selector

Content is extracted from the rendered image. Text is extracted using Optical Character Recognition (OCR). Image nodes do not perform any image analysis; they only record the coordinates and page number so that the image can be retrieved later if need be.

When a Section node is selected, the selection tool will attempt to parse text from the portion of the page selected by the user. This text will overwrite the label assigned to the node.

When a Text node is selected, the selection tool will attempt to parse text from the selected area and append it to the node's content. This is because text blocks are not always perfectly rectangular, and often span multiple pages.

When an Image node is selected, the coordinates of the box are appended to the node's content.
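
The three rules above can be summarized in a small model. This is only a sketch of the described behavior: the Node class, the apply_selection function, and the shape of the stored image entry are hypothetical and are not ipypdf's internal API. The OCR step is assumed to have already produced ocr_text.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str                      # "section", "text", or "image"
    label: str = ""
    content: list = field(default_factory=list)


def apply_selection(node, page, box, ocr_text=""):
    """Update a node from a user selection on the rendered page.

    page is the page number, box is (x0, y0, x1, y1), and ocr_text is the
    text recognized inside the box (unused for image nodes).
    """
    if node.kind == "section":
        # Section: the parsed text overwrites the node's label.
        node.label = ocr_text
    elif node.kind == "text":
        # Text: append, since a block may need several selections
        # (non-rectangular shapes, or text that spans pages).
        node.content.append(ocr_text)
    elif node.kind == "image":
        # Image: no analysis, only record where the image lives.
        node.content.append({"page": page, "box": box})
    else:
        raise ValueError(f"unknown node kind: {node.kind!r}")
    return node
```

Selecting twice on a text node yields two content entries, while selecting twice on a section node keeps only the most recent label.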

Save Button

This generates a JSON file for each document in the directory, with instructions for regenerating any nodes you have created when you open the tool again. Alternatively, you can load the JSON into another script to extract the document structure if all you want is the text and the hierarchy.
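
The layout of the saved files is not described here, so a reasonable first step in such a script is to print the structure before relying on any particular key. The sketch below uses only the standard json module and makes no assumption about key names; outline and outline_file are hypothetical helper names, and the path you pass in is whatever file the Save button produced for your document.

```python
import json


def outline(value, name="root", depth=0):
    """Yield one indented line per JSON value, without assuming any keys."""
    indent = "  " * depth
    if isinstance(value, dict):
        yield f"{indent}{name}: object"
        for key, child in value.items():
            yield from outline(child, key, depth + 1)
    elif isinstance(value, list):
        yield f"{indent}{name}: array[{len(value)}]"
        for index, child in enumerate(value):
            yield from outline(child, f"[{index}]", depth + 1)
    else:
        # Truncate long strings so OCR text does not flood the output.
        text = str(value)
        preview = text if len(text) <= 40 else text[:37] + "..."
        yield f"{indent}{name}: {preview}"


def outline_file(path):
    """Load a saved JSON file and return its outline as a list of lines."""
    with open(path, encoding="utf-8") as handle:
        return list(outline(json.load(handle)))
```

Once the outline shows which keys hold labels, text, and child nodes, a short recursive function over those keys is enough to pull out the hierarchy.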
