Python library for document processing

These details have not been verified by PyPI

Inkwell

Quickstart on Colab

Overview

Inkwell is a modular Python library for extracting information from PDF documents with state-of-the-art Vision Language Models. It uses layout understanding models to improve the accuracy of the Vision Language Models.

Inkwell uses the following models, with more integrations in the works:

Layout Detection: Faster RCNN, LayoutLMv3, Paddle
Table Detection: Table Transformer
Table Data Extraction: Phi3.5-Vision, Qwen2 VL 2B, Table Transformer, OpenAI GPT4o Mini
OCR: Tesseract, PaddleOCR, Phi3.5-Vision, Qwen2 VL 2B

Installation

pip install "py-inkwell[inference]"

In addition, install detectron2

pip install git+https://github.com/facebookresearch/detectron2.git

Install Tesseract

For Ubuntu -

sudo apt install tesseract-ocr
sudo apt install libtesseract-dev

For macOS -

brew install tesseract

For GPUs, install flash attention and vllm for faster inference.

pip install flash-attn --no-build-isolation
pip install vllm
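
After installation, it can help to confirm which of the pieces above are actually visible from your Python environment. The following is a minimal sketch using only the standard library; the function name `check_environment` is hypothetical, and the import names `detectron2`, `flash_attn` and `vllm` are assumed to be the module names those packages install.

```python
import importlib.util
import shutil


def check_environment(
    modules=("inkwell", "detectron2", "flash_attn", "vllm"),
    binaries=("tesseract",),
):
    """Report which dependencies are visible from this interpreter."""
    report = {}
    for name in modules:
        # find_spec returns None when a top-level module cannot be located
        report[name] = importlib.util.find_spec(name) is not None
    for name in binaries:
        # shutil.which returns None when the executable is not on PATH
        report[name] = shutil.which(name) is not None
    return report


if __name__ == "__main__":
    for name, available in check_environment().items():
        print(f"{name}: {'found' if available else 'missing'}")
```

A missing `flash_attn` or `vllm` is expected on machines without a GPU, since those two are only needed for faster inference.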

Basic Usage

Parse Pages

from inkwell.pipeline import Pipeline

pipeline = Pipeline()
document = pipeline.process("/path/to/file.pdf")

Extract Page Elements

pages = document.pages

Every Page has the following fragment objects -

Figures
Tables
Text

Figures

Each figure fragment's content has the following attributes -

bbox - The bounding box of the figure
text - The text in the figure, extracted using OCR
image - The cropped image of the figure

figures = page.figure_fragments()

for figure in figures:
    figure_image = figure.content.image 
    figure_bbox = figure.content.bbox 
    figure_text = figure.content.text

Tables

Each table fragment's content has the following attributes -

data - The data in the table, extracted using Table Extractor
bbox - The bounding box of the table
image - The cropped image of the table

tables = page.table_fragments()

for table in tables:
    table_data = table.content.data
    table_bbox = table.content.bbox
    table_image = table.content.image

Text

Each text fragment's content has the following attributes -

text - The text in the text block
bbox - The bounding box of the text block
image - The image of the text block

text_blocks = page.text_fragments()

for text_block in text_blocks:
    text_block_text = text_block.content.text
    text_block_bbox = text_block.content.bbox
    text_block_image = text_block.content.image
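
The three fragment accessors described above can be combined into a small per-page summary. This is a sketch rather than part of Inkwell: `summarize_page`, `make_fragment` and `demo_page` are hypothetical names, and the stand-in objects only mimic the attributes documented above so that the snippet runs without the library installed. It assumes each accessor returns an iterable of fragments.

```python
from types import SimpleNamespace


def summarize_page(page):
    """Count the fragments on a page and join the text of its text blocks."""
    figures = list(page.figure_fragments())
    tables = list(page.table_fragments())
    text_blocks = list(page.text_fragments())
    return {
        "n_figures": len(figures),
        "n_tables": len(tables),
        "n_text_blocks": len(text_blocks),
        "text": "\n".join(block.content.text for block in text_blocks),
    }


def make_fragment(**content):
    # Hypothetical stand-in for an Inkwell fragment, for illustration only
    return SimpleNamespace(content=SimpleNamespace(**content))


# Hypothetical stand-in page exposing the same three accessor methods
demo_page = SimpleNamespace(
    figure_fragments=lambda: [make_fragment(bbox=None, text="Fig. 1", image=None)],
    table_fragments=lambda: [],
    text_fragments=lambda: [
        make_fragment(text="First paragraph.", bbox=None, image=None),
        make_fragment(text="Second paragraph.", bbox=None, image=None),
    ],
)

summary = summarize_page(demo_page)
print(summary["n_figures"], summary["n_tables"], summary["n_text_blocks"])
```

With a real document, `demo_page` would be replaced by one of the entries in `document.pages`.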

Complete Example

The example below processes a PDF and extracts its text, tables and figures separately.

from inkwell.pipeline import Pipeline

pipeline = Pipeline()
document = pipeline.process("/path/to/file.pdf")
pages = document.pages

for page in pages:

    figures = page.figure_fragments()
    tables = page.table_fragments()
    text_blocks = page.text_fragments()

    # Check the content of the figure fragments
    for figure in figures:
        figure_image = figure.content.image
        figure_text = figure.content.text
    
    # Check the content of the table fragments
    for table in tables:
        table_image = table.content.image
        table_data = table.content.data

    # Check the content of the text blocks
    for text_block in text_blocks:
        text_block_image = text_block.content.image
        text_block_text = text_block.content.text
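
To keep the extracted content after the loop finishes, the fragments can be flattened into plain records and written out as JSON. The sketch below is one possible approach, not an Inkwell feature: `document_to_records` and the stand-in objects are hypothetical, page numbers are assumed to follow the order of `document.pages`, and `default=str` is used because the type of `table.content.data` is not specified here.

```python
import json
from types import SimpleNamespace


def document_to_records(pages):
    """Flatten pages into a list of plain dictionaries, one per fragment."""
    records = []
    # Assumption: pages are yielded in reading order, so we number them from 1
    for page_number, page in enumerate(pages, start=1):
        for figure in page.figure_fragments():
            records.append(
                {"page": page_number, "kind": "figure", "value": figure.content.text}
            )
        for table in page.table_fragments():
            records.append(
                {"page": page_number, "kind": "table", "value": table.content.data}
            )
        for block in page.text_fragments():
            records.append(
                {"page": page_number, "kind": "text", "value": block.content.text}
            )
    return records


def make_fragment(**content):
    # Hypothetical stand-in for an Inkwell fragment, for illustration only
    return SimpleNamespace(content=SimpleNamespace(**content))


# Hypothetical stand-in pages so the sketch runs without Inkwell installed
demo_pages = [
    SimpleNamespace(
        figure_fragments=lambda: [],
        table_fragments=lambda: [make_fragment(data={"col": [1, 2]})],
        text_fragments=lambda: [make_fragment(text="Intro")],
    ),
    SimpleNamespace(
        figure_fragments=lambda: [make_fragment(text="Chart label")],
        table_fragments=lambda: [],
        text_fragments=lambda: [],
    ),
]

records = document_to_records(demo_pages)
# default=str falls back to a string for values json cannot encode directly
print(json.dumps(records, indent=2, default=str))
```

Images are left out of the records on purpose, since they are not JSON-serializable; they could be saved to disk separately.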

Using Qwen2/Phi3.5/OpenAI Vision Models

We have defined a default config class here. You can add vision-language models to the config to use them instead of the default models.

from inkwell.pipeline import DefaultPipelineConfig, Pipeline
from inkwell.ocr import OCRType
from inkwell.table_extractor import TableExtractorType

# using Qwen2 2B Vision OCR and Table Extractor
config = DefaultPipelineConfig(
    ocr_detector=OCRType.QWEN2_2B_VISION,
    table_extractor=TableExtractorType.QWEN2_2B_VISION
) 

# using Phi3.5 Vision OCR and Table Extractor
config = DefaultPipelineConfig(
    ocr_detector=OCRType.PHI3_VISION,
    table_extractor=TableExtractorType.PHI3_VISION
) 

# using OpenAI GPT4o Mini OCR and Table Extractor (Requires API Key)
config = DefaultPipelineConfig(
    ocr_detector=OCRType.OPENAI_GPT4O_MINI,
    table_extractor=TableExtractorType.OPENAI_GPT4O_MINI
) 

pipeline = Pipeline(config=config)
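
Since the OpenAI option requires an API key, it may be worth failing early with a clear message before the pipeline is built. The helper below is a hypothetical sketch; the variable name `OPENAI_API_KEY` is the one the OpenAI Python SDK reads by default, and it is an assumption that Inkwell picks the key up the same way.

```python
import os


def require_env(name):
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Set the {name} environment variable before building the pipeline."
        )
    return value


try:
    # Assumption: the key is supplied through OPENAI_API_KEY
    require_env("OPENAI_API_KEY")
    print("API key found")
except RuntimeError as err:
    print(err)
```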

Advanced Customizations

You can add custom detectors and other components to the pipeline yourself by following the instructions in the Custom Components notebook.

Acknowledgements

We derived inspiration from several open-source libraries in our implementation, like Layout Parser and Deepdoctection. We would like to thank the contributors to these libraries for their work.
