Python library for document processing

Inkwell

Inkwell is a modular Python library for extracting information from documents. It is designed to be flexible and easy to extend, with a focus on document layout detection, OCR, and table detection.

You can swap out components of the pipeline or add your own, backed by custom models or a cloud-based API.

Installation

pip install py-inkwell

In addition, install detectron2 and transformers from source:

pip install git+https://github.com/facebookresearch/detectron2.git
pip install git+https://github.com/huggingface/transformers.git

Install Tesseract for your operating system:

Ubuntu

sudo apt install tesseract-ocr
sudo apt install libtesseract-dev

macOS

brew install tesseract
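Once the steps above are done, a quick check can confirm that the prerequisites are visible from Python. The following is a minimal sketch using only the standard library; the function name check_environment is made up for illustration and is not part of Inkwell.

```python
import importlib.util
import shutil


def check_environment() -> dict:
    """Report whether each prerequisite named above can be found."""
    return {
        # Tesseract is a command-line binary, so look for it on PATH.
        "tesseract": shutil.which("tesseract") is not None,
        # find_spec locates a top-level module without importing it.
        "detectron2": importlib.util.find_spec("detectron2") is not None,
        "transformers": importlib.util.find_spec("transformers") is not None,
    }


if __name__ == "__main__":
    for name, found in check_environment().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

This only checks that the binary and modules can be located; it does not verify versions or that the models load.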

Basic Usage

from inkwell.pipeline import Pipeline

pipeline = Pipeline()

document = pipeline.process("/path/to/file.pdf")

for page in document.pages:
    figures = page.image_fragments()
    tables = page.table_fragments()
    text_blocks = page.text_fragments()

    # Check the content of the image fragments
    for figure in figures:
        figure_image = figure.content.image
        print(f"Text in figure:\n{figure.content.text}")
    
    # Check the content of the table fragments
    for table in tables:
        table_image = table.content.image
        print(f"Table detected: {table.content.data}")

    # Check the content of the text blocks
    for text_block in text_blocks:
        text_block_image = text_block.content.image
        print(f"Text block detected: {text_block.content.text}")
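The loop above can be folded into a helper that gathers each page's fragments into plain Python values, for example to dump them as JSON later. This is a sketch that relies only on the attributes used in the example above (pages, image_fragments(), table_fragments(), text_fragments(), content.text, content.data). The summarize helper and the stand-in objects are hypothetical; the stand-ins exist only so the snippet runs without Inkwell installed, and the type of content.data is an assumption.

```python
from types import SimpleNamespace


def summarize(document) -> list[dict]:
    """Collect per-page fragment contents into plain values."""
    summary = []
    for page_number, page in enumerate(document.pages, start=1):
        summary.append({
            "page": page_number,
            "figure_text": [f.content.text for f in page.image_fragments()],
            "tables": [t.content.data for t in page.table_fragments()],
            "text": [b.content.text for b in page.text_fragments()],
        })
    return summary


def _fragment(**content):
    # Stand-in for a fragment, used only so the sketch runs without Inkwell.
    return SimpleNamespace(content=SimpleNamespace(**content))


# Stand-in document with one page; the dict used for table data is an assumption.
fake_page = SimpleNamespace(
    image_fragments=lambda: [_fragment(text="Figure 1", image=None)],
    table_fragments=lambda: [_fragment(data={"col": [1, 2]}, image=None)],
    text_fragments=lambda: [_fragment(text="Hello", image=None)],
)
fake_document = SimpleNamespace(pages=[fake_page])

print(summarize(fake_document))
```

With a real pipeline, the same helper would be called as summarize(pipeline.process("/path/to/file.pdf")).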

Models/Frameworks currently available

Default models: The defaults are defined in a config class, and the pipeline uses the default CPU config for best results. To use the default GPU pipeline instead, instantiate it with the GPU config class:

from inkwell.pipeline import DefaultGPUPipelineConfig, Pipeline
config = DefaultGPUPipelineConfig()
pipeline = Pipeline(config=config)
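If the same script has to run on machines with and without a GPU, the choice between the two configs can be made at runtime. The sketch below assumes PyTorch is how CUDA availability is detected, which the text above does not state. The helpers cuda_available and build_pipeline are hypothetical names, and the Inkwell import is deferred so that the check itself runs without Inkwell installed.

```python
def cuda_available() -> bool:
    """Return True only if PyTorch is importable and reports a CUDA device."""
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()


def build_pipeline():
    """Pick the GPU config when CUDA is usable, else the default pipeline."""
    # Imported here so cuda_available() can be used without Inkwell installed.
    from inkwell.pipeline import DefaultGPUPipelineConfig, Pipeline

    if cuda_available():
        return Pipeline(config=DefaultGPUPipelineConfig())
    return Pipeline()


if __name__ == "__main__":
    print(f"CUDA available: {cuda_available()}")
```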

To change the default models, pass any of the models listed below in the config when initializing the pipeline:

Layout Detection

  • Faster RCNN
  • LayoutLMv3

Table Detection

  • Table Transformer

Table Extraction

  • Table Transformer
  • Phi3.5-Vision
  • Qwen2 VL 2B

OCR

  • Tesseract
  • Phi3.5-Vision
  • Qwen2 VL 2B
  • OpenAI GPT-4o (requires an API key)

For example, to override the layout detector and the table extractor:

from inkwell.pipeline import PipelineConfig, Pipeline
from inkwell.layout_detector import LayoutDetectorType
from inkwell.ocr import OCRType
from inkwell.table_detector import TableDetectorType, TableExtractorType

config = PipelineConfig(
    layout_detector=LayoutDetectorType.FASTER_RCNN,
    table_extractor=TableExtractorType.PHI3_VISION,
)

pipeline = Pipeline(config=config)
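The GPT-4o OCR option listed above requires an API key. Checking for the key before building the pipeline gives a clear error early. This sketch assumes the key is supplied through the OPENAI_API_KEY environment variable, which is the OpenAI SDK convention; how Inkwell itself reads the key is not stated here. The helper require_api_key is a hypothetical name.

```python
import os


def require_api_key(var_name: str = "OPENAI_API_KEY") -> str:
    """Return the API key from the environment, or raise if it is missing."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"Set {var_name} before selecting a hosted OCR model."
        )
    return key
```

Calling require_api_key() once at startup, before constructing the pipeline, is enough; the returned value does not need to be printed or logged.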

Advanced Customizations

You can add custom detectors and other components to the pipeline yourself by following the instructions in the Custom Components notebook.
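The notebook is the authoritative guide to Inkwell's extension points. Purely to illustrate the general idea of a swappable component, here is a self-contained sketch of the pattern. Every name in it (Box, LayoutDetector, FullPageDetector, run_layout) is hypothetical and is not Inkwell's interface.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Box:
    """A labelled rectangle on a page (hypothetical type)."""
    label: str
    x1: float
    y1: float
    x2: float
    y2: float


class LayoutDetector(Protocol):
    """Anything with a matching detect() method counts as a detector."""

    def detect(self, page_image) -> list[Box]: ...


class FullPageDetector:
    """Toy detector: reports the whole page as a single text block."""

    def detect(self, page_image) -> list[Box]:
        # A (width, height) tuple stands in for a real image here.
        width, height = page_image
        return [Box("text", 0.0, 0.0, float(width), float(height))]


def run_layout(detector: LayoutDetector, page_image) -> list[Box]:
    # The caller depends only on the detect() method, so detectors are swappable.
    return detector.detect(page_image)


boxes = run_layout(FullPageDetector(), (612, 792))
print(boxes)
```

Because run_layout depends only on the detect() method, a model-backed detector or a cloud API client could be passed in without changing the calling code.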

Acknowledgements

Our implementation draws inspiration from several open-source libraries, such as Layout Parser and Deepdoctection. We would like to thank the contributors to these libraries for their work.
