Repository for Document AI

A Document AI Package

deepdoctection is a Python package that enables document analysis pipelines to be built using deep learning models.

Extracting information from documents is difficult. Documents often have a complex visual structure and the information they contain is not tagged. deepdoctection is a toolbox that is intended to facilitate entry into this topic.

Parse your document by detecting layout structures like tables with full table semantics (cells, rows, columns), get text in reading order with OCR, detect language and do many other things.

The focus is on application. deepdoctection is made for data scientists who are tasked with supporting departments in process optimization, and for analysts who have to investigate large sets of documents.

For further text processing tasks, use one of the many other great NLP libraries.

Characteristics

  1. Use an off-the-shelf analyzer for restructuring your PDF or scanned documents:

    • Layout recognition with deep neural networks from well-known open source libraries (Cascade-RCNN from Tensorpack or Detectron2) trained on large public datasets. Tensorflow or PyTorch models are available.
    • Table extraction with full table semantics (rows, columns, multi-line cell spans), again with the help of Cascade-RCNN
    • OCR or text mining with Tesseract, DocTr, pdfplumber or others
    • Reading order
    • Language detection with fastText
    • Parsed output available as JSON object for further NLP tasks, labeling or reviewing

Off the shelf actually means off the shelf. The results will look okay, but useful outputs for downstream tasks will only come out when the models are adapted to the actual documents you deal with. Therefore:

  2. Fine-tune a pre-trained DNN on your own labeled dataset. Use generally acknowledged metrics for evaluating training improvements. Training scripts are available.

  3. Compose your document analyzer by choosing a model and plugging it into your own pipeline. For example, you can use pdfplumber if you have native PDF documents. Or you can benchmark OCR results with AWS Textract (account needed and paid service).

  4. Wrap DNNs from open source projects into the deepdoctection API and enrich your pipeline easily with SOTA models.

  5. All models are available on the Hugging Face Model Hub. You can find more details in the respective model cards.

Check this notebook for an easy start, as well as the full documentation.
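As a rough illustration of the off-the-shelf analyzer described above, the following sketch parses a document and collects the resulting page objects. It assumes the entry points `get_dd_analyzer`, `analyze(path=...)` and `reset_state()`; these names may differ between releases, so treat them as assumptions and check the documentation of your installed version. The function name `analyze_document` is made up for this example.

```python
def analyze_document(path):
    """Run the default analyzer on a PDF file or an image directory.

    Returns a list with one parsed page object per page.
    """
    # Imported lazily so that this function can be defined even when
    # deepdoctection is not installed yet.
    # Assumption: the analyzer factory lives in deepdoctection.analyzer.
    from deepdoctection.analyzer import get_dd_analyzer

    # Build the default pipeline (layout, tables, OCR, reading order).
    # The first call may download model weights.
    analyzer = get_dd_analyzer()

    # analyze() returns a lazy stream of pages, not a list.
    df = analyzer.analyze(path=path)

    # Dataflow convention (assumed here): reset the stream before iterating.
    df.reset_state()

    return [page for page in df]
```

Calling `analyze_document("path/to/your/document.pdf")` triggers the actual processing. How to read text or tables from the returned pages depends on the installed release and is best taken from the notebook mentioned above.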

Requirements

Platform and Python

Before you start, please ensure your installation fulfills the following requirements:

  • Linux or macOS
  • Python >= 3.8
  • PyTorch >= 1.8 and torchvision, or Tensorflow >= 2.4.1 and CUDA

Windows is not supported.

With PyTorch you can run on a CPU only. Tensorflow requires a GPU.
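The platform requirements above can be checked before installing. The following is a minimal sketch using only the standard library; the function names are made up for this example, and it only tests whether a deep learning library can be found, not its version or whether a GPU is usable.

```python
import importlib.util
import platform
import sys

MIN_PYTHON = (3, 8)
SUPPORTED_SYSTEMS = {"Linux", "Darwin"}  # "Darwin" is what macOS reports


def check_platform(system=None, python_version=None):
    """Return a list of problems; an empty list means the platform is fine."""
    system = system or platform.system()
    python_version = tuple(python_version or sys.version_info[:2])
    problems = []
    if system not in SUPPORTED_SYSTEMS:
        problems.append(f"unsupported operating system: {system}")
    if python_version < MIN_PYTHON:
        found = ".".join(str(part) for part in python_version)
        problems.append(f"Python >= 3.8 is required, found {found}")
    return problems


def available_frameworks():
    """Return the supported deep learning libraries that can be imported."""
    candidates = {"PyTorch": "torch", "Tensorflow": "tensorflow"}
    return [
        name
        for name, module in candidates.items()
        if importlib.util.find_spec(module) is not None
    ]


if __name__ == "__main__":
    for problem in check_platform():
        print("Problem:", problem)
    print("Deep learning libraries found:", available_frameworks() or "none")
```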

Other

deepdoctection uses Python wrappers for Poppler, to convert PDF documents into images, and for the Tesseract OCR engine. To run the notebooks, both Poppler and Tesseract must be installed as well.
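Whether both tools are reachable can be verified with a short script. This sketch assumes that Poppler provides the executables `pdftoppm` or `pdftocairo` and that Tesseract provides `tesseract`; the executable names on your system may differ, and the function name `find_tools` is made up for this example.

```python
import shutil

# Assumption: these are the executable names installed by the two tools.
REQUIRED_TOOLS = {
    "Poppler": ("pdftoppm", "pdftocairo"),
    "Tesseract": ("tesseract",),
}


def find_tools(which=shutil.which):
    """Map each tool to the path of its first executable found, or None."""
    found = {}
    for tool, executables in REQUIRED_TOOLS.items():
        paths = (which(executable) for executable in executables)
        found[tool] = next((path for path in paths if path), None)
    return found


if __name__ == "__main__":
    for tool, path in find_tools().items():
        print(f"{tool}: {path or 'not found on PATH'}")
```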

Installation

We recommend using a virtual environment. You can install the package via pip or from source. Bug fixes and enhancements are deployed to PyPI every 4 to 6 weeks.

Install with pip from PyPI

Dataflow is not available on PyPI and must be installed separately.

pip install "dataflow @ git+https://github.com/tensorpack/dataflow.git"

Depending on which Deep Learning library is available, use the following installation option:

For Tensorflow, run

pip install "deepdoctection[tf]"

For PyTorch, first install Detectron2 separately, as it is not available on PyPI either. Check the instructions here. Then run

pip install "deepdoctection[pt]"

This will install the basic setup which is needed to run the first two notebooks and do some inference with pipelines.

Some libraries (e.g. DocTr, pdfplumber, fastText, ...) are not included in the requirements in order to keep the dependencies as small as possible. If you want to use them, install them individually with pip. Alternatively, consult the full installation instructions.
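To see which of these optional libraries are still missing from your environment, a sketch like the following can help. The import names `doctr`, `pdfplumber` and `fasttext` are assumptions based on the usual PyPI packages, and the function name is made up for this example.

```python
import importlib.util

# Assumption: display name -> module name used for importing.
OPTIONAL_LIBRARIES = {
    "DocTr": "doctr",
    "pdfplumber": "pdfplumber",
    "fastText": "fasttext",
}


def missing_optional_libraries(is_installed=None):
    """Return the display names of optional libraries that cannot be imported."""
    if is_installed is None:
        def is_installed(module):
            return importlib.util.find_spec(module) is not None
    return [
        name
        for name, module in OPTIONAL_LIBRARIES.items()
        if not is_installed(module)
    ]


if __name__ == "__main__":
    missing = missing_optional_libraries()
    print("Missing optional libraries:", ", ".join(missing) or "none")
```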

Installation from source

Download the repository or clone via

git clone https://github.com/deepdoctection/deepdoctection.git

To get started with Tensorflow, run:

cd deepdoctection
pip install ".[source-tf]"

or with PyTorch:

cd deepdoctection
pip install ".[source-pt]"

This will install the basic dependencies to get started with the first notebooks. To also install all external libraries that can be used for inference (e.g. DocTr, pdfplumber, fastText, ...), run

cd deepdoctection
pip install ".[source-all-tf]"

or

cd deepdoctection
pip install ".[source-all-pt]"

Again, for other installation options check this site.
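The installation options listed in this section can be summarized in a small helper that returns the argument for `pip install`. This is only a sketch of the six extras named above; the function `pip_target` is made up for this example and is not part of deepdoctection.

```python
def pip_target(framework, from_source=False, all_extensions=False):
    """Return the argument to pass to `pip install` for the chosen setup.

    framework: "tf" for Tensorflow or "pt" for PyTorch.
    from_source: True when installing from inside the cloned repository.
    all_extensions: True to include all external inference libraries.
    """
    if framework not in ("tf", "pt"):
        raise ValueError("framework must be 'tf' (Tensorflow) or 'pt' (PyTorch)")
    if not from_source:
        if all_extensions:
            # Only the source install lists an all-extensions extra above.
            raise ValueError("all extensions are only listed for source installs")
        return f"deepdoctection[{framework}]"
    extra = f"source-all-{framework}" if all_extensions else f"source-{framework}"
    return f".[{extra}]"


if __name__ == "__main__":
    print("pip install", f'"{pip_target("pt", from_source=True)}"')
```

The source targets only work from inside the cloned deepdoctection directory. For installs from PyPI, Dataflow (and Detectron2 for PyTorch) still have to be installed separately as described above.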

Credits

Many utilities, concepts and models are inspired by or taken from Tensorpack, Detectron2 and Transformers. We make heavy use of Dataflow for loading and streaming data.

Problems

We try hard to eliminate bugs. We also know that the code is not free of issues. We welcome all issues relevant to this repo and try to address them as quickly as possible.

Citing deepdoctection

If you use deepdoctection in your research or in your project, please cite:

@misc{jmdeepdoctection,
  title={deepdoctection},
  author={Meyer, Dr. Janis and others},
  howpublished={\url{https://github.com/deepdoctection/deepdoctection}},
  year={2021}
}

License

Distributed under the Apache 2.0 License. Check LICENSE for additional information.

