Python library for document processing
Inkwell
Quickstart on Colab
Overview
Inkwell is a modular Python library for extracting information from PDF documents with state-of-the-art Vision Language Models. It uses layout understanding models to improve the accuracy of the Vision Language Models.
Inkwell uses the following models, with more integrations in the works:
- Layout Detection: Faster RCNN, LayoutLMv3, Paddle
- Table Detection: Table Transformer
- Table Data Extraction: Phi3.5-Vision, Qwen2 VL 2B, Table Transformer, OpenAI GPT4o Mini
- OCR: Tesseract, PaddleOCR, Phi3.5-Vision, Qwen2 VL 2B
Installation
pip install py-inkwell[inference]
In addition, install detectron2
pip install git+https://github.com/facebookresearch/detectron2.git
Install Tesseract
For Ubuntu -
sudo apt install tesseract-ocr
sudo apt install libtesseract-dev
For macOS -
brew install tesseract
For GPUs, install flash attention and vllm for faster inference.
pip install flash-attn --no-build-isolation
pip install vllm
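Before running the pipeline, it can help to confirm that the optional pieces above are visible to the current interpreter. The following is a minimal sketch, not part of Inkwell: the helper name `check_optional_dependencies` is hypothetical, and the import names `detectron2`, `flash_attn`, and `vllm` are assumed to be the usual ones for those packages.

```python
import importlib.util
import shutil


def check_optional_dependencies():
    """Report which optional parts of the setup this interpreter can see."""
    # Import names are assumed to match the usual ones for these packages.
    modules = ["detectron2", "flash_attn", "vllm"]
    status = {name: importlib.util.find_spec(name) is not None for name in modules}
    # Tesseract is a system binary, so look for it on PATH instead.
    status["tesseract"] = shutil.which("tesseract") is not None
    return status


if __name__ == "__main__":
    for name, found in check_optional_dependencies().items():
        print(f"{name}: {'found' if found else 'missing'}")
```

A missing `flash_attn` or `vllm` entry is expected on CPU-only machines, since those two are only suggested for GPU inference.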
Basic Usage
Parse Pages
from inkwell.pipeline import Pipeline
pipeline = Pipeline()
document = pipeline.process("/path/to/file.pdf")
Extract Page Elements
pages = document.pages
Every Page has the following fragment objects -
- Figures
- Tables
- Text
Figures
Each figure fragment's content has the following attributes -
- bbox - The bounding box of the figure
- text - The text in the figure, extracted using OCR
- image - The cropped image of the figure
# page is one element of document.pages
figures = page.figure_fragments()
for figure in figures:
    figure_image = figure.content.image
    figure_bbox = figure.content.bbox
    figure_text = figure.content.text
Tables
Each table fragment's content has the following attributes -
- data - The data in the table, extracted using Table Extractor
- bbox - The bounding box of the table
- image - The cropped image of the table
tables = page.table_fragments()
for table in tables:
    table_data = table.content.data
    table_bbox = table.content.bbox
    table_image = table.content.image
Text
Each text fragment's content has the following attributes -
- text - The text in the text block
- bbox - The bounding box of the text block
- image - The image of the text block
text_blocks = page.text_fragments()
for text_block in text_blocks:
    text_block_text = text_block.content.text
    text_block_bbox = text_block.content.bbox
    text_block_image = text_block.content.image
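A common next step is to merge the text blocks of a page into one string. The sketch below relies only on the `text_fragments()` method and the `content.text` attribute described above, and assumes that `text` is a plain string. The helper `page_text` and the stand-in page built with `SimpleNamespace` are hypothetical and exist only so the example runs without a PDF.

```python
from types import SimpleNamespace


def page_text(page, separator="\n\n"):
    """Join the text of every text fragment on a page, skipping empty blocks."""
    texts = []
    for text_block in page.text_fragments():
        text = text_block.content.text
        if text and text.strip():
            texts.append(text.strip())
    return separator.join(texts)


# Stand-in objects for illustration only; a real page comes from document.pages.
def _fake_fragment(text):
    return SimpleNamespace(content=SimpleNamespace(text=text, bbox=None, image=None))


fake_page = SimpleNamespace(
    text_fragments=lambda: [
        _fake_fragment("Title"),
        _fake_fragment("   "),
        _fake_fragment("Body text."),
    ]
)

print(page_text(fake_page))
```

With a real document, the same helper would be called as `page_text(page)` for each `page` in `document.pages`.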
Complete Example
The example below processes a PDF and extracts its text, tables, and figures separately.
from inkwell.pipeline import Pipeline
pipeline = Pipeline()
document = pipeline.process("/path/to/file.pdf")
pages = document.pages
for page in pages:
    figures = page.figure_fragments()
    tables = page.table_fragments()
    text_blocks = page.text_fragments()
    # Check the content of the image fragments
    for figure in figures:
        figure_image = figure.content.image
        figure_text = figure.content.text
    # Check the content of the table fragments
    for table in tables:
        table_image = table.content.image
        table_data = table.content.data
    # Check the content of the text blocks
    for text_block in text_blocks:
        text_block_image = text_block.content.image
        text_block_text = text_block.content.text
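To check quickly what the pipeline found, the fragments can be counted per page. This is a rough sketch that uses only `document.pages` and the three `*_fragments()` methods shown above. The function `summarize_document` and the stand-in document are hypothetical and not part of Inkwell.

```python
from types import SimpleNamespace


def summarize_document(document):
    """Return one dict per page with the number of fragments of each kind."""
    summary = []
    for page_number, page in enumerate(document.pages, start=1):
        summary.append({
            "page": page_number,
            "figures": len(list(page.figure_fragments())),
            "tables": len(list(page.table_fragments())),
            "text_blocks": len(list(page.text_fragments())),
        })
    return summary


# Stand-in document for illustration only; a real one comes from pipeline.process().
def _fake_page(n_figures, n_tables, n_text):
    return SimpleNamespace(
        figure_fragments=lambda: [object()] * n_figures,
        table_fragments=lambda: [object()] * n_tables,
        text_fragments=lambda: [object()] * n_text,
    )


fake_document = SimpleNamespace(pages=[_fake_page(1, 0, 2), _fake_page(0, 3, 1)])

for row in summarize_document(fake_document):
    print(row)
```

Pages with zero fragments of every kind are worth inspecting by hand, since they may point to a layout detection miss rather than an empty page.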
Using Qwen2/Phi3.5/OpenAI Vision Models
We have defined a default config class here. You can add vision-language models to the config to use them instead of the default models.
from inkwell.pipeline import DefaultPipelineConfig, Pipeline
from inkwell.ocr import OCRType
from inkwell.table_extractor import TableExtractorType
# using Qwen2 2B Vision OCR and Table Extractor
config = DefaultPipelineConfig(
    ocr_detector=OCRType.QWEN2_2B_VISION,
    table_extractor=TableExtractorType.QWEN2_2B_VISION
)
# using Phi3.5 Vision OCR and Table Extractor
config = DefaultPipelineConfig(
    ocr_detector=OCRType.PHI3_VISION,
    table_extractor=TableExtractorType.PHI3_VISION
)
# using OpenAI GPT4o Mini OCR and Table Extractor (Requires API Key)
config = DefaultPipelineConfig(
    ocr_detector=OCRType.OPENAI_GPT4O_MINI,
    table_extractor=TableExtractorType.OPENAI_GPT4O_MINI
)
pipeline = Pipeline(config=config)
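The three configurations above are alternatives, so a script usually has to pick one. The sketch below shows one possible way to choose by environment. It is an illustration under assumptions, not Inkwell's own logic: the helper `choose_backend_name` is hypothetical, the preference for a local model on GPU is a design choice made for this example, and reading the key from `OPENAI_API_KEY` is assumed rather than confirmed by the source. The returned strings are the member names used in the snippet above.

```python
import os


def choose_backend_name(has_gpu, env=None):
    """Pick a member name shared by OCRType and TableExtractorType, or None.

    None means "keep the default models", i.e. call Pipeline() without a config.
    """
    env = os.environ if env is None else env
    if has_gpu:
        # Assumption for this sketch: prefer a local vision model when a GPU is present.
        return "QWEN2_2B_VISION"
    if env.get("OPENAI_API_KEY"):
        # Assumption: the API key is supplied through this environment variable.
        return "OPENAI_GPT4O_MINI"
    return None


print(choose_backend_name(has_gpu=False, env={}))
```

With Inkwell installed, a non-None name could then be resolved with `getattr(OCRType, name)` and `getattr(TableExtractorType, name)` and passed to `DefaultPipelineConfig` as shown above.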
Advanced Customizations
You can add custom detectors and other components to the pipeline yourself by following the instructions in the Custom Components notebook.
Acknowledgements
We derived inspiration from several open-source libraries in our implementation, like Layout Parser and Deepdoctection. We would like to thank the contributors to these libraries for their work.