Jupyter widget for applying NLP to PDF documents
PDF Digitizer (It has a back button!!)
A Jupyter-based tool to help parse out structured text from a PDF document and explore the contents.
Requirements
- Tesseract
- jupyterlab (notebook also works)
Installation
It is highly recommended that you install this into a clean environment.
Note: layoutparser puts an upper bound on numpy (1.19.3), so if you want to use the Parse Layout button, it's best to install this in an empty environment.
Conda (recommended)
conda create -n ipypdf python pip jupyterlab tesseract -c conda-forge
conda activate ipypdf
python -m pip install ipypdf
No Conda (not tested)
Install Tesseract
python -m venv envs/ipypdf
cd envs/ipypdf/Scripts
activate.bat
python -m pip install jupyterlab
python -m pip install ipypdf
Development
see DEVELOPMENT.md
Common Issues
- AutoTools widget keeps saying layoutparser is not installed
  - This is usually a problem with pywin32.
  - Try conda install pywin32
  - Also make sure that numpy is <1.19.3
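The checks above can be run from Python before opening the widget. This is a minimal sketch using only the standard library; the helper names (version_tuple, installed_version, check_environment) are hypothetical and not part of ipypdf, and the numpy comparison is assumed to be inclusive of 1.19.3, so tighten it if the bound turns out to be strict.

```python
import importlib.util
from importlib import metadata


def version_tuple(version):
    """Turn '1.19.3' into (1, 19, 3); stop at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_environment(numpy_bound=(1, 19, 3)):
    """Collect the facts relevant to the 'layoutparser is not installed' issue."""
    numpy_version = installed_version("numpy")
    return {
        "layoutparser_importable": importlib.util.find_spec("layoutparser") is not None,
        "pywin32_installed": installed_version("pywin32") is not None,
        "numpy_version": numpy_version,
        # Assumption: the bound is treated as inclusive (<= 1.19.3).
        "numpy_within_bound": numpy_version is not None
        and version_tuple(numpy_version) <= numpy_bound,
    }


print(check_environment())
```

If layoutparser_importable is False, or numpy_within_bound is False, a fresh environment is probably the quickest fix.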
Usage
ipypdf is built for JupyterLab but should also work in Jupyter notebooks.
- Launch JupyterLab with
jupyter lab
- Then run the following in a notebook cell:
from ipypdf import App
app = App("path/to/your/pdfs", bulk_render=False)
app
Features
Auto-Parser
layoutparser is used to determine the location of text blocks, images, and section headers. There is not currently a way to automatically determine the hierarchical position of these items.
Note: the demo video plays at 4x speed.
Also: the video is now out of date. The AutoParse button now attempts to sort all of the nodes and to deduce the first level of hierarchical structure.
Table Parsing
Cytoscape
Folders, PDF Documents, and Sections have a tab labeled Cytoscape. This runs a TF-IDF similarity calculation over all nodes beneath the selected item. For example, if you select the root node, all defined nodes are included in the calculation. However, only nodes with a link to another node are drawn (this is for speed and may change in the future).
The color of each node denotes the pdf document it originated from.
Selecting a node in the graph will highlight the node in the DocTree. Clicking the node in the DocTree will render the first page of the node.
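To make the similarity step more concrete, here is a rough, dependency-free sketch of a TF-IDF cosine similarity over node texts that keeps only the linked pairs. It is not ipypdf's actual implementation: the function names, the smoothed IDF formula, and the 0.2 threshold are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(texts):
    """Return one L2-normalized TF-IDF vector (term -> weight) per text."""
    token_lists = [tokenize(text) for text in texts]
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        # Smoothed IDF (an assumption): log((1 + N) / (1 + df)) + 1
        vec = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(weight * weight for weight in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def cosine(a, b):
    """Dot product of two already-normalized sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(weight * b.get(term, 0.0) for term, weight in a.items())


def similarity_edges(node_texts, threshold=0.2):
    """Return (node_a, node_b, score) for every pair scoring at or above threshold."""
    names = list(node_texts)
    vectors = tfidf_vectors([node_texts[name] for name in names])
    edges = []
    for i, j in combinations(range(len(names)), 2):
        score = cosine(vectors[i], vectors[j])
        if score >= threshold:
            edges.append((names[i], names[j], score))
    return edges


nodes = {
    "intro": "neural network training",
    "methods": "neural network training",
    "appendix": "tax law",
}
for a, b, score in similarity_edges(nodes):
    print(a, b, round(score, 3))
```

Nodes that end up in no edge (here, "appendix") would be the ones left out of the drawing.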
Digitizing Utilities
I recommend turning off Show Boxes, as this changes pages every time you add a node (working on a better solution).
Each node has a specific set of tools available to use. Here are the tools provided when a Section node is selected.
Starting from the left:
- Add Section Node: adds a sub-node of type Section and selects it
- Add Text Node: adds a sub-node of type Text and selects it
- Add Image Node: ...
- Delete Node: deletes the selected node and all of its children
Content Selector
Content is extracted from the rendered image. Text is extracted using Optical Character Recognition (OCR). Image nodes involve no image analysis; they just record the coordinates and page number so that the image can be retrieved later if need be.
When a Section node is selected, the selection tool will attempt to parse text from the portion of the page selected by the user. This text will overwrite the label assigned to the node.
When a Text node is selected, the selection tool will attempt to parse text from the selected area and append it to the node's content. This is because text blocks are not always perfectly rectangular, and often span multiple pages.
When an Image node is selected, the coordinates of the box are appended to the node's content.
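The three behaviors described above can be summarized in a small model. This is only an illustrative sketch: the Node class, the apply_selection function, and the shape of the stored coordinates are hypothetical and do not reflect ipypdf's internal classes, and the OCR result is passed in as a plain string rather than computed.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a DocTree node."""
    kind: str  # "section", "text", or "image"
    label: str = ""
    content: list = field(default_factory=list)


def apply_selection(node, page, box, ocr_text=""):
    """Apply one box selection to a node, following the rules described above.

    page is the page number, box is (x0, y0, x1, y1), and ocr_text is the
    text recognized inside the box (assumed to come from an OCR step).
    """
    if node.kind == "section":
        # Section: the parsed text overwrites the node's label.
        node.label = ocr_text.strip()
    elif node.kind == "text":
        # Text: each selection is appended, so a block can span several boxes or pages.
        node.content.append(ocr_text.strip())
    elif node.kind == "image":
        # Image: no analysis, only the location is recorded.
        node.content.append({"page": page, "box": box})
    else:
        raise ValueError(f"unknown node kind: {node.kind}")
    return node


paragraph = Node("text")
apply_selection(paragraph, page=3, box=(50, 600, 500, 780), ocr_text="First half of the paragraph")
apply_selection(paragraph, page=4, box=(50, 40, 500, 120), ocr_text="and the second half.")
print(paragraph.content)
```

Appending rather than overwriting is what lets a single Text node collect a paragraph that continues onto the next page.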
Save Button
This will generate JSON files for each document in the directory, with instructions for regenerating any nodes you have created when you open the tool again. Alternatively, if all you want is the text and the hierarchy, you can load the JSON into another script to extract the document structure.
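As a rough idea of what such a script could look like, the sketch below walks a nested JSON tree and prints an indented outline. The layout of the saved files is not documented here, so the keys used ("type", "label", "children") and the sample data are purely assumptions; inspect a real saved file and adjust the key names before relying on this.

```python
import json

# Hypothetical sample; a real file saved by the tool may use different keys.
SAMPLE = """
{
  "type": "pdf",
  "label": "report.pdf",
  "children": [
    {
      "type": "section",
      "label": "Introduction",
      "children": [
        {"type": "text", "label": "", "children": []}
      ]
    }
  ]
}
"""


def outline(node, depth=0):
    """Return one indented line per node, depth-first."""
    line = "  " * depth + f"[{node.get('type', '?')}] {node.get('label', '')}"
    lines = [line.rstrip()]
    for child in node.get("children", []):
        lines.extend(outline(child, depth + 1))
    return lines


tree = json.loads(SAMPLE)
print("\n".join(outline(tree)))
```

To use it on a saved file, replace json.loads(SAMPLE) with json.load on an opened file handle.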