Python library for document processing

Inkwell

Quickstart on Colab

Overview

Inkwell is a modular Python library for extracting information from documents. It is designed to be flexible and easy to extend: you can swap out any component of the pipeline or add your own, backed by a custom model, an open-source model, or a cloud-based API.

We have implemented several open-source models and frameworks (listed below) and we are working on adding more state-of-the-art models.

  • Layout Detection: Faster RCNN, LayoutLMv3, Paddle
  • Table Detection: Table Transformer
  • Table Data Extraction: Phi3.5-Vision, Qwen2 VL 2B, Table Transformer, OpenAI GPT-4o mini
  • OCR: Tesseract, PaddleOCR, Phi3.5-Vision, Qwen2 VL 2B

Installation

pip install py-inkwell

In addition, install detectron2:

pip install git+https://github.com/facebookresearch/detectron2.git

Install Tesseract for your operating system:

Ubuntu

sudo apt install tesseract-ocr
sudo apt install libtesseract-dev

macOS

brew install tesseract

If you want to run the Vision Language Models on a GPU, also install flash attention:

pip install flash-attn --no-build-isolation
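
Since the installation involves several separate pieces, a quick check of what is visible from Python can save some debugging. The following is a minimal sketch using only the standard library. It assumes the import names `inkwell`, `detectron2`, and `flash_attn` and a `tesseract` binary on the PATH; the function name `check_environment` is made up for this example and is not part of Inkwell.

```python
import importlib.util
import shutil


def check_environment():
    """Report whether each dependency from the install steps can be found."""

    def has_module(name):
        # find_spec returns None when a top-level module is not importable.
        return importlib.util.find_spec(name) is not None

    return {
        "inkwell": has_module("inkwell"),
        "detectron2": has_module("detectron2"),
        # flash_attn is only needed for running the Vision Language Models on GPU.
        "flash_attn": has_module("flash_attn"),
        # Tesseract is a system binary, so look for it on the PATH instead.
        "tesseract": shutil.which("tesseract") is not None,
    }


if __name__ == "__main__":
    for name, found in check_environment().items():
        print(f"{name}: {'found' if found else 'missing'}")
```

A missing `flash_attn` entry is expected on CPU-only machines and should not prevent the default pipeline from running.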

Basic Usage

from inkwell.pipeline import Pipeline

pipeline = Pipeline()
document = pipeline.process("/path/to/file.pdf")

for page in document.pages:

    figures = page.image_fragments()
    tables = page.table_fragments()
    text_blocks = page.text_fragments()

    # Check the content of the image fragments
    for figure in figures:
        figure_image = figure.content.image
        print(f"Text in figure:\n{figure.content.text}")
    
    # Check the content of the table fragments
    for table in tables:
        table_image = table.content.image
        print(f"Table detected: {table.content.data}")

    # Check the content of the text blocks
    for text_block in text_blocks:
        text_block_image = text_block.content.image
        print(f"Text block detected: {text_block.content.text}")
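
If you would rather collect the results than print them, the loop above can be wrapped in a small helper. This is only a sketch: it relies solely on the accessors shown in the example (`pages`, `image_fragments()`, `table_fragments()`, `text_fragments()`, `content.text`, `content.data`), and the helper names are hypothetical. The stand-in document built with `SimpleNamespace` exists only so the sketch runs without Inkwell installed, and the shape of the table data in it is an assumption.

```python
from types import SimpleNamespace


def collect_page_content(page):
    """Gather figure text, table data, and block text from one page."""
    return {
        "figure_text": [f.content.text for f in page.image_fragments()],
        "tables": [t.content.data for t in page.table_fragments()],
        "text_blocks": [b.content.text for b in page.text_fragments()],
    }


def collect_document_content(document):
    """Return one dictionary per page, in page order."""
    return [collect_page_content(page) for page in document.pages]


# Hypothetical stand-in for a processed document (not Inkwell's real classes).
def _fragment(**fields):
    return SimpleNamespace(content=SimpleNamespace(image=None, **fields))


demo_page = SimpleNamespace(
    image_fragments=lambda: [_fragment(text="Figure 1 caption")],
    table_fragments=lambda: [_fragment(data=[["a", "b"], ["1", "2"]])],
    text_fragments=lambda: [_fragment(text="Hello"), _fragment(text="World")],
)
demo_document = SimpleNamespace(pages=[demo_page])

print(collect_document_content(demo_document))
```

With a real document, you would pass the object returned by `pipeline.process(...)` to `collect_document_content` in place of `demo_document`.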

Models/Frameworks currently available

Default models: We have defined a config class for the default models, and the pipeline uses the default CPU config unless you specify otherwise. If you want to use the default GPU pipeline, instantiate it with the GPU config class:

from inkwell.pipeline import DefaultGPUPipelineConfig, Pipeline
config = DefaultGPUPipelineConfig()
pipeline = Pipeline(config=config)
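
The choice between the two defaults can be put behind a single flag. The sketch below uses only the names shown above (`Pipeline` and `DefaultGPUPipelineConfig`); the function `build_pipeline` and its `use_gpu` parameter are hypothetical and not part of Inkwell. The imports are deferred into the function so that the module loads even where Inkwell is not installed.

```python
def build_pipeline(use_gpu=False):
    """Create a pipeline with the default CPU or default GPU configuration."""
    # Imported lazily so that defining this helper does not require Inkwell.
    from inkwell.pipeline import Pipeline

    if use_gpu:
        from inkwell.pipeline import DefaultGPUPipelineConfig

        # Explicitly request the default GPU models.
        return Pipeline(config=DefaultGPUPipelineConfig())

    # With no config argument, the pipeline falls back to its CPU defaults.
    return Pipeline()
```

Calling `build_pipeline(use_gpu=True)` on a machine without a suitable GPU setup will most likely fail when the models are loaded, so the flag is best set from your own deployment configuration.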

Changing the configuration

If you want to change the default models, you can replace them with any of the supported models by passing your choices in the config during pipeline initialization:

from inkwell.pipeline import PipelineConfig, Pipeline
from inkwell.layout_detector import LayoutDetectorType
from inkwell.ocr import OCRType
from inkwell.table_detector import TableDetectorType, TableExtractorType

config = PipelineConfig(
    layout_detector=LayoutDetectorType.FASTER_RCNN,
    table_extractor=TableExtractorType.PHI3_VISION,
)

pipeline = Pipeline(config=config)

Advanced Customizations

You can add custom detectors and other components to the pipeline yourself by following the instructions in the Custom Components notebook.

Acknowledgements

We derived inspiration from several open-source libraries in our implementation, like Layout Parser and Deepdoctection. We would like to thank the contributors to these libraries for their work.

