Python library for document processing
Inkwell
Inkwell is a modular Python library for extracting information from documents. It is designed to be flexible and easy to extend, with a focus on document layout detection, OCR, and table detection.
You can easily swap out components of the pipeline, and add your own components, using custom models or a cloud-based API.
Installation
pip install py-inkwell
In addition, install detectron2 and transformers from source:
pip install git+https://github.com/facebookresearch/detectron2.git
pip install git+https://github.com/huggingface/transformers.git
Install Tesseract for your operating system:
Ubuntu
sudo apt install tesseract-ocr
sudo apt install libtesseract-dev
macOS
brew install tesseract
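After installing, it can help to confirm that the Tesseract binary is reachable from Python before running the pipeline. The following is a minimal sketch using only the standard library; the helper name `tesseract_version` is hypothetical and not part of Inkwell.

```python
import shutil
import subprocess


def tesseract_version():
    """Return the first line of `tesseract --version`, or None if unavailable."""
    exe = shutil.which("tesseract")  # look the binary up on PATH
    if exe is None:
        return None
    result = subprocess.run(
        [exe, "--version"], capture_output=True, text=True, check=False
    )
    # Some Tesseract builds print the version banner to stderr instead of stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


if __name__ == "__main__":
    version = tesseract_version()
    print(version if version else "Tesseract was not found on PATH")
```

If the function returns None, revisit the install step for your operating system.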
Basic Usage
from inkwell.pipeline import Pipeline

pipeline = Pipeline()
document = pipeline.process("/path/to/file.pdf")

for page in document.pages:
    figures = page.image_fragments()
    tables = page.table_fragments()
    text_blocks = page.text_fragments()

    # Check the content of the image fragments
    for figure in figures:
        figure_image = figure.content.image
        print(f"Text in figure:\n{figure.content.text}")

    # Check the content of the table fragments
    for table in tables:
        table_image = table.content.image
        print(f"Table detected: {table.content.data}")

    # Check the content of the text blocks
    for text_block in text_blocks:
        text_block_image = text_block.content.image
        print(f"Text block detected: {text_block.content.text}")
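Instead of printing fragments as they are found, you may want to collect them for later use. The sketch below gathers the text of every text block per page. It relies only on the attributes used in the example above (`document.pages`, `page.text_fragments()`, `content.text`); the function name `collect_text` is hypothetical and not part of Inkwell.

```python
def collect_text(document):
    """Map each page index to the list of text strings found on that page."""
    text_by_page = {}
    for index, page in enumerate(document.pages):
        # text_fragments() and content.text are the accessors shown in Basic Usage.
        text_by_page[index] = [
            block.content.text for block in page.text_fragments()
        ]
    return text_by_page
```

Called as `collect_text(pipeline.process("/path/to/file.pdf"))`, it should return a dictionary keyed by zero-based page index.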
Models/Frameworks currently available
Default models: Models are selected through a config class. The pipeline uses the default CPU config, which we recommend for best results. If you want to use the default GPU pipeline instead, instantiate it with the GPU config class:
from inkwell.pipeline import DefaultGPUPipelineConfig, Pipeline
config = DefaultGPUPipelineConfig()
pipeline = Pipeline(config=config)
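If the same script has to run on machines with and without a GPU, the choice can be made at runtime. This is a sketch under assumptions: it uses PyTorch's `torch.cuda.is_available()` to detect a CUDA device (PyTorch is assumed to be present because detectron2 depends on it), and the helper names `gpu_available` and `build_pipeline` are hypothetical, not part of Inkwell.

```python
def gpu_available() -> bool:
    """Return True if PyTorch is installed and reports a usable CUDA device."""
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()


def build_pipeline():
    """Build a GPU pipeline when a CUDA device is detected, else the default one."""
    # Imported lazily so this module can be loaded without Inkwell installed.
    from inkwell.pipeline import DefaultGPUPipelineConfig, Pipeline

    if gpu_available():
        return Pipeline(config=DefaultGPUPipelineConfig())
    return Pipeline()  # falls back to the default CPU config
```

Calling `build_pipeline()` then replaces the explicit `Pipeline(...)` construction shown above.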
If you want to change the default models, you can replace them with models listed below by passing them in the config during pipeline initialization:
Layout Detection
- Faster RCNN
- LayoutLMv3
Table Detection
- Table Transformer
Table Extraction
- Table Transformer
- Phi 3.5-Vision
- Qwen2 VL 2B
OCR
- Tesseract
- Phi 3.5-Vision
- Qwen2 VL 2B
- OpenAI GPT-4o (requires an API key)
from inkwell.pipeline import PipelineConfig, Pipeline
from inkwell.layout_detector import LayoutDetectorType
from inkwell.ocr import OCRType
from inkwell.table_detector import TableDetectorType, TableExtractorType
config = PipelineConfig(
    layout_detector=LayoutDetectorType.FASTER_RCNN,
    table_extractor=TableExtractorType.PHI3_VISION,
)
pipeline = Pipeline(config=config)
Advanced Customizations
You can add custom detectors and other components to the pipeline yourself. To do so, follow the instructions in the Custom Components notebook.
Acknowledgements
We drew inspiration from several open-source libraries in our implementation, such as Layout Parser and Deepdoctection. We would like to thank the contributors to these libraries for their work.