Repository for Document AI

A Document AI Package

deepdoctection is a Python library that orchestrates document extraction and document layout analysis tasks using deep learning models. It does not implement models itself but enables you to build pipelines using well-established libraries for object detection, OCR and selected NLP tasks, and it provides an integrated framework for fine-tuning, evaluating and running models. For more specific text processing tasks, use one of the many other great NLP libraries.

deepdoctection focuses on applications and is made for those who want to solve real-world problems related to document extraction from PDFs or scans in various image formats.

Overview

deepdoctection provides model wrappers of supported libraries for various tasks to be integrated into pipelines. Its core function does not depend on any specific deep learning library. Selected models for the following tasks are currently supported:

  • Document layout analysis including table recognition in Tensorflow with Tensorpack, or PyTorch with Detectron2,
  • OCR with support of Tesseract, DocTr (Tensorflow and PyTorch implementations available) and a wrapper to an API for a commercial solution,
  • Text mining for native PDFs with pdfplumber,
  • Language detection with fastText,
  • [new!] Document and token classification with LayoutLM provided by the Transformers library. (Yes, you can use LayoutLM with any of the provided OCR or pdfplumber tools straight away!)

On top of that, deepdoctection provides methods for pre-processing model inputs, such as cropping or resizing, and for post-processing results, such as validating duplicate outputs, relating words to detected layout segments or ordering words into contiguous text. The output is in JSON format, which you can customize further yourself.
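To make two of the post-processing steps named above more concrete, here is a minimal sketch of relating words to layout segments and ordering them into contiguous text. It is an illustration only and not deepdoctection's implementation: the `Word` and `Segment` classes, the center-containment rule and the `line_tolerance` parameter are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


@dataclass
class Word:  # hypothetical container, not a deepdoctection class
    text: str
    box: Box


@dataclass
class Segment:  # hypothetical container, not a deepdoctection class
    label: str
    box: Box
    words: List[Word] = field(default_factory=list)


def center(box: Box) -> Tuple[float, float]:
    """Return the center point of a box."""
    x0, y0, x1, y1 = box
    return (x0 + x1) / 2, (y0 + y1) / 2


def contains(box: Box, point: Tuple[float, float]) -> bool:
    """Check whether a point lies inside a box (borders included)."""
    x0, y0, x1, y1 = box
    px, py = point
    return x0 <= px <= x1 and y0 <= py <= y1


def assign_words(segments: List[Segment], words: List[Word]) -> None:
    """Attach each word to the first segment that contains its center."""
    for word in words:
        for segment in segments:
            if contains(segment.box, center(word.box)):
                segment.words.append(word)
                break


def segment_text(segment: Segment, line_tolerance: float = 5.0) -> str:
    """Join a segment's words top to bottom, then left to right."""
    # Words whose vertical centers round to the same bucket share a line.
    ordered = sorted(
        segment.words,
        key=lambda w: (round(center(w.box)[1] / line_tolerance), w.box[0]),
    )
    return " ".join(w.text for w in ordered)
```

A real pipeline has to handle overlapping segments, multi-column layouts and skewed scans, which this sketch deliberately ignores.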

Check the demo of a document layout analysis pipeline with OCR on Hugging Face Spaces or have a look at the introduction notebook for an easy start.

Models

deepdoctection or its supporting libraries provide pre-trained models that are in most cases available at the Hugging Face Model Hub or that will be downloaded automatically once requested. For instance, you can find pre-trained object detection models from the Tensorpack or Detectron2 framework for coarse layout analysis, table cell detection and table recognition.

Datasets and training scripts

Training is a substantial part of getting pipelines ready for a specific domain, be it document layout analysis, document classification or NER. deepdoctection provides training scripts for models; these are based on trainers developed by the library that hosts the model code. Moreover, deepdoctection hosts code for some well-established datasets like Publaynet, which makes it easy to experiment. It also contains mappings from widely used data formats like COCO, and it has a dataset framework (akin to datasets) so that setting up training on a custom dataset becomes very easy. Check this notebook to see how you can easily train a model on a different domain.
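As an illustration of what a mapping from a data format like COCO involves, the following sketch groups COCO annotations by image and converts the boxes from `[x, y, width, height]` to corner coordinates. The function name `coco_to_per_image` and the output layout are assumptions made for this example; deepdoctection ships its own mappers, which are not reproduced here.

```python
from typing import Any, Dict, List


def coco_to_per_image(coco: Dict[str, Any]) -> Dict[str, List[Dict[str, Any]]]:
    """Group COCO annotations by image file name with corner-format boxes."""
    categories = {c["id"]: c["name"] for c in coco["categories"]}
    file_names = {img["id"]: img["file_name"] for img in coco["images"]}

    # Start with an empty list per image so unannotated images are kept.
    per_image: Dict[str, List[Dict[str, Any]]] = {
        name: [] for name in file_names.values()
    }
    for ann in coco["annotations"]:
        x, y, width, height = ann["bbox"]  # COCO stores [x, y, width, height]
        per_image[file_names[ann["image_id"]]].append(
            {
                "label": categories[ann["category_id"]],
                "box": [x, y, x + width, y + height],
            }
        )
    return per_image
```

The input is expected to be the dictionary obtained by loading a COCO annotation file with `json.load`.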

Evaluation

deepdoctection comes equipped with a framework that allows you to evaluate predictions of a single or multiple models in a pipeline against some ground truth. Check here how it is done.
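To give a rough idea of what evaluating predictions against ground truth means for layout detection, here is a simplified sketch that matches predicted boxes to ground-truth boxes by intersection over union (IoU) and reports precision and recall. The greedy matching rule, the function names and the 0.5 threshold are assumptions for illustration; this is not the metric code used by deepdoctection's evaluation framework.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(
    predictions: List[Box], ground_truth: List[Box], threshold: float = 0.5
) -> Tuple[float, float]:
    """Greedily match each prediction to its best unmatched ground-truth box."""
    unmatched = list(ground_truth)
    true_positives = 0
    for pred in predictions:
        best = max(unmatched, key=lambda gt: iou(pred, gt), default=None)
        if best is not None and iou(pred, best) >= threshold:
            unmatched.remove(best)  # each ground-truth box is matched once
            true_positives += 1
    precision = true_positives / len(predictions) if predictions else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall
```

Standard detection benchmarks additionally take confidence scores and categories into account, which this sketch leaves out.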

Inference

Once a pipeline is set up, it takes only a few lines of code to instantiate it, and a for loop then processes all pages through the pipeline.

import deepdoctection as dd
from IPython.display import HTML
from matplotlib import pyplot as plt

analyzer = dd.get_dd_analyzer()  # instantiate the built-in analyzer similar to the Hugging Face space demo

df = analyzer.analyze(path="/path/to/your/doc.pdf")  # set up the pipeline for this document
df.reset_state()  # trigger some initialization

doc = iter(df)
page = next(doc)  # process the first page

image = page.viz()  # visualize the parsed page
plt.figure(figsize=(25, 17))
plt.axis('off')
plt.imshow(image)

HTML(page.tables[0].html)

print(page.get_text())

This excerpt shows how to instantiate the built-in deepdoctection analyzer as deployed on the Hugging Face space and how to get parsed results from a PDF document page by page.
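The excerpt fetches only the first page with `next`. To process every page, the same dataflow can be consumed in a loop. The helper below is a small sketch of that idea; the name `collect_page_texts` is an assumption for this example, and it relies only on each page offering the `get_text()` method used in the excerpt.

```python
from typing import Iterable, List


def collect_page_texts(pages: Iterable) -> List[str]:
    """Return the text of every page the dataflow yields, in order."""
    return [page.get_text() for page in pages]


# With deepdoctection installed, the dataflow from the excerpt can be passed in:
#   df = analyzer.analyze(path="/path/to/your/doc.pdf")
#   df.reset_state()
#   texts = collect_page_texts(df)
```

Because pages are produced one at a time, long documents are processed lazily and only the extracted text is kept in memory.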

Documentation

Extensive documentation is available, containing tutorials, design concepts and the API. We want to present things as comprehensively and understandably as possible. However, we are aware that there are still many areas where significant improvements can be made in terms of clarity, grammar and correctness. We welcome every hint and comment that improves the quality of the documentation.

Requirements

Everything listed below the deepdoctection layer in the requirements overview is a necessary requirement and has to be installed separately.

  • Linux or macOS. Windows is not supported.
  • Python >= 3.8
  • PyTorch >= 1.8 or Tensorflow >= 2.8 and CUDA. If you want to run the models provided by Tensorpack, a GPU is required. With PyTorch, you can run on a CPU only.
  • deepdoctection uses Python wrappers for Poppler to convert PDF documents into images.
  • With respect to the Deep Learning framework, you must decide between Tensorflow and PyTorch.
  • Tesseract OCR engine will be used through a Python wrapper. The core engine has to be installed separately.
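Before installing, it can help to check which of these prerequisites are already present. The following sketch uses only the Python standard library; the function name `check_environment` is an assumption, and it looks for the `pdftoppm` executable as an indicator of a Poppler installation and for the `tesseract` executable on the `PATH`. It does not verify library versions or CUDA.

```python
import importlib.util
import shutil
import sys
from typing import Dict


def check_environment() -> Dict[str, bool]:
    """Report which prerequisites appear to be available on this machine."""
    return {
        "python>=3.8": sys.version_info >= (3, 8),
        # Only one of the two deep learning frameworks is needed.
        "pytorch": importlib.util.find_spec("torch") is not None,
        "tensorflow": importlib.util.find_spec("tensorflow") is not None,
        # pdftoppm is one of the command line tools shipped with Poppler.
        "poppler (pdftoppm)": shutil.which("pdftoppm") is not None,
        "tesseract": shutil.which("tesseract") is not None,
    }


if __name__ == "__main__":
    for name, available in check_environment().items():
        print(f"{name}: {'found' if available else 'missing'}")
```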

Installation

We recommend using a virtual environment. You can install the package via pip or from source. Bug fixes and enhancements are deployed to PyPI every 4 to 6 weeks.

Install with pip from PyPI

Depending on which Deep Learning library you have available, use the following installation option:

For Tensorflow, run

pip install deepdoctection[tf]

For PyTorch, first install Detectron2 separately, as it is not distributed via PyPI. Check the instructions here. Then run

pip install deepdoctection[pt]

This will install deepdoctection with all dependencies listed above the deepdoctection layer. Use this setting if you want to get started or want to explore all features.

If you want more control over your installation and are looking for fewer dependencies, install deepdoctection with the basic setup only.

pip install deepdoctection

This will ignore all model libraries (layers above the deepdoctection layer in the diagram), and you will be responsible for installing them yourself. Note that you will not be able to run any pipeline with this setup.

For further information, please consult the full installation instructions.

Installation from source

Download the repository or clone via

git clone https://github.com/deepdoctection/deepdoctection.git

To get started with Tensorflow, run:

cd deepdoctection
pip install ".[tf]"

Installing the full PyTorch setup from source will also install Detectron2 for you:

cd deepdoctection
pip install ".[source-pt]"

Credits

We thank all libraries that provide high-quality code and pre-trained models. Without them, it would have been impossible to develop this framework.

Problems

We try hard to eliminate bugs. We also know that the code is not free of issues. We welcome all issues relevant to this repo and try to address them as quickly as possible.

If you like deepdoctection ...

...you can easily support the project by making it more visible. Leaving a star or a recommendation will help.

License

Distributed under the Apache 2.0 License. Check LICENSE for additional information.
