Skip to main content

Jupyter widget for applying nlp to pdf documents

Project description

iPyPDF

A Jupyter-based tool to help parse out structured text from a PDF document and explore the contents.

Installation

Windows Installer

https://drive.google.com/drive/folders/1wmQisECMor04dgv9ZXFc07zq6zcHuija?usp=sharing

This will make a start-menu shortcut called "iPyPDF" which will open up the notebook for parsing documents. Unfortunately the conda installer has some trouble with Tesseract environment variables so you'll need to add this cell to the notebook to get anything to work.

import os
import sys
PREFIX = sys.prefix
os.environ["TESSDATA_PREFIX"] = f"{PREFIX}/share/tessdata"

From Source

  1. Clone this repo
  2. Install Anaconda or Miniconda if you do not already have it
  3. Install mamba conda install mamba
    • Solving the environment is impossibly slow without mamba
  4. Create the environment and install ipypdf from source
mamba env create -f environment.yml -p env/ipypdf
conda activate env/ipypdf
pip install -e .

Note: you can replace "mamba" with "conda" if you don't have mamba installed. It will just take longer to solve the environment.

From pip

  1. Create a conda environment with Tesseract and Jupyterlab
conda create -n ipypdf jupyterlab tesseract -c conda-forge`
conda activate ipypdf
pip install ipypdf
  1. Get a spacy model (the previous method accomplishes this automatically in the environment.yml file)
    1. python -m spacy download en_core_web_sm
    2. Or conda install spacy-model-en_core_web_sm -c conda-forge

Without Conda

You may run into some issues with Tesseract environment variables. It is still recommended that you use a fresh environment, virtualenv for example, to avoid numpy conflicts.

  1. Install Tesseract
  2. Create and activate a fresh environment (highly recommended)
  3. Install ipypdf and jupyterlab
pip install jupyterlab
pip install ipypdf
python -m spacy download en_core_web_sm

Usage

ipypdf is built for jupyter lab but should also work in jupyter notebooks.

  1. Launch jupyter lab with jupyter lab
from ipypdf import App
app = App("path/to/your/pdfs", bulk_render=False)
app

see notebooks for additional info

Development

see DEVELOPMENT.md

Common Issues

  • AutoTools widget keeps saying layoutparser is not installed
    • This is usually a problem with pywin32.
    • Try conda install pywin32
    • Also make sure that numpy is <1.19.3

Features

Within the GUI are 3 panels, Table of Contents, PDF viewer, and Tools. In this section we are going over all of the various options available in the tools panel.

Auto-Tools

This tab contains tools which will iterate through each page of the pdf.

  • Text Only: Runs each page through Tesseract to obtain plain text.
  • Parse Layout: Uses layoutparser to label portions of the document as either (title, text, image, or table). The sections are then assembled together using a few simple rules in order to appoximate a shallow content hierarchy.
    • Title and Text blocks are cropped out and sent through Tesseract to obtain the text.
    • Tables are processed using a rule-based table parsing scheme described here.
    • Image blocks have no additional processing.

img

Notice that section 3 is missing. The process is not perfect. In this case, a section title was mislabled by layoutparser as standard text. Mistakes like this are fairly common. To correct them, you can edit the table of contents using the arrow keys (the cursor must be hovering over the table of contents).

Table Parsing

image

Cytoscape

Folders, PDF Documents, and Sections have a tab labeled Cytoscape. This runs a tfidf similarity calculation over all nodes beneath the selected item. I.e. if you select the root node, then all defined nodes will be included in the calculation. However, only those with a link to another node will be drawn (this is for speed, may change this in the future).

The color of each node denotes the pdf document it originated from.

image

Selecting a node in the graph will highlight the node in the DocTree. Clicking the node in the DocTree will render the first page of the node. image

Spacy

Extracts named entities from the selected branch of the document tree. I.e., the raw text is compiled from a depth first search on whichever node is selected in the table of contents. Then, spacy.nlp(text).ents returns the named entities found within the section.

image

Digitizing Utilities

I recommend turning off Show Boxes as this changes pages every time you add a node (working on a better solution)

Each node has a specific set of tools available to use. Here are the tools provided when a Section node is selected. Starting from the left:

  • Add Section Node adds a sub-node of type Section and selects it
  • Add Text Node adds a sub-node of type Text and selects it
  • Add Image Node ...
  • Delete Node Delete the selected node and all of its children

image

Content Selector

Content is extracted from the rendered image. Text is extracted using Optical Character Recognition (OCR). Images don't do any image analysis, they just denote coordinates and page number so that they can be retreived later if need be.

When a Section node is selected, the selection tool will attempt to parse text from the portion of the page selected by the user. This text will overwrite the label assigned to the node.

When a Text node is selected, the selection tool will attempt to parse text from the selected area and append it to the node's content. This is because text blocks are not always perfectly rectangular, and often span multiple pages.

When an Image node is selected, the coordinates of the box are appended to the node's content.

Save Button

This will generate json files for each document. When the tool is initialized, these are used to reconstruct the table of contents. You can also use the json file directly.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ipypdf-0.1.5.tar.gz (40.5 kB view details)

Uploaded Source

Built Distributions

ipypdf-0.1.5-py3-none-any.whl (41.0 kB view details)

Uploaded Python 3

ipypdf-0.1.5-py2.py3-none-any.whl (41.2 kB view details)

Uploaded Python 2 Python 3

File details

Details for the file ipypdf-0.1.5.tar.gz.

File metadata

  • Download URL: ipypdf-0.1.5.tar.gz
  • Upload date:
  • Size: 40.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.0 CPython/3.8.13

File hashes

Hashes for ipypdf-0.1.5.tar.gz
Algorithm Hash digest
SHA256 72023af2b062fba144555e907f33a3b6e25f1ef8fb5e3766261d848779e22ec4
MD5 e529f965dc5d09f00e938253523f9545
BLAKE2b-256 22f84fe8b6a3d3113be633646752dabc328594d2f7b5841d3e30e2b0f8fda75e

See more details on using hashes here.

File details

Details for the file ipypdf-0.1.5-py3-none-any.whl.

File metadata

  • Download URL: ipypdf-0.1.5-py3-none-any.whl
  • Upload date:
  • Size: 41.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.8.13

File hashes

Hashes for ipypdf-0.1.5-py3-none-any.whl
Algorithm Hash digest
SHA256 dc95386cad44adb6f656c1dc4513ba22d9fe4ac53a770d6ddf92766300d2a775
MD5 134dafcf453e477594320063e4b6d862
BLAKE2b-256 8714d916219cfcde24fffa012df5680aa474cd02c296380b4fdc9b1cb9ece333

See more details on using hashes here.

File details

Details for the file ipypdf-0.1.5-py2.py3-none-any.whl.

File metadata

  • Download URL: ipypdf-0.1.5-py2.py3-none-any.whl
  • Upload date:
  • Size: 41.2 kB
  • Tags: Python 2, Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.0 CPython/3.8.13

File hashes

Hashes for ipypdf-0.1.5-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 e384597be23a72542abf26638814a17052f4df14d18fa90d48befe0d1ddf2fae
MD5 1bd4c740429c531d52d0702cf8289caf
BLAKE2b-256 d8b25f99985b0eec95e9434f4afaa23fb07578eff18a6ef8b21a3ea758bbe6ba

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page