Repository for Document AI

A Document AI Package

deepdoctection is a Python package that enables document analysis pipelines to be built using deep learning models.

Extracting information from documents is difficult. Documents often have a complex visual structure, and the information they contain is not tagged. deepdoctection is a toolbox intended to ease entry into this topic.

The focus is on application. deepdoctection is made for data scientists who are tasked with supporting departments in process optimization, for analysts who have to investigate large sets of documents, and perhaps also for researchers who would like to see how well their new model fits into an extraction pipeline.

It currently focuses on raw text extraction. For further text processing tasks, use one of the many other great NLP libraries.

Characteristics

  1. Use an off-the-shelf analyzer for restructuring your PDF or scanned documents:

    • layout recognition with deep neural networks (Mask-RCNN and more) trained on large public datasets
    • table extraction with full table semantics (rows, columns, multi-line cell spans), again with DNNs
    • OCR and word assignment to detected layout components
    • reading order

    Off the shelf really does mean off the shelf: the results will look reasonable, but outputs that are useful for downstream tasks will only come once the models are adapted to the documents you actually deal with. Therefore:

  2. Fine-tune pre-trained DNNs on your own labeled dataset. Use generally acknowledged metrics for evaluating training improvements.

  3. Compose your document analyzer by choosing a model and plugging it into your own pipeline.

  4. Wrap DNNs from open source projects into the deepdoctection API and enrich your pipeline easily with SOTA models.

  5. All models are now available at the Hugging Face Model Hub. You can find more details in the respective model cards.

Check this notebook for an easy start, as well as the full documentation.
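
To make the "word assignment" and "reading order" steps from the list above more concrete, here is a minimal, library-independent sketch. It is not deepdoctection's implementation: the `Layout` class, the `coverage`, `assign_words` and `reading_order` helpers, the 0.5 threshold and the sample boxes are all assumptions made for illustration. The idea is to attach each OCR word to the layout component that covers most of its bounding box, and to order components top to bottom, which only works for single-column pages.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# (x_min, y_min, x_max, y_max) in pixels; a hypothetical box convention.
Box = Tuple[float, float, float, float]


@dataclass
class Layout:
    """A detected layout component, e.g. a title or a text block (illustrative type)."""
    category: str
    box: Box
    words: List[str] = field(default_factory=list)


def coverage(word_box: Box, layout_box: Box) -> float:
    """Fraction of the word box area that lies inside the layout box."""
    x0 = max(word_box[0], layout_box[0])
    y0 = max(word_box[1], layout_box[1])
    x1 = min(word_box[2], layout_box[2])
    y1 = min(word_box[3], layout_box[3])
    inter = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    area = (word_box[2] - word_box[0]) * (word_box[3] - word_box[1])
    return inter / area if area > 0 else 0.0


def assign_words(words, layouts, threshold=0.5):
    """Attach each OCR word to the best-covering layout; return the unassigned words."""
    orphans = []
    for text, box in words:
        best = max(layouts, key=lambda lay: coverage(box, lay.box), default=None)
        if best is not None and coverage(box, best.box) >= threshold:
            best.words.append(text)
        else:
            orphans.append(text)
    return orphans


def reading_order(layouts):
    """Naive single-column order: top to bottom, then left to right."""
    return sorted(layouts, key=lambda lay: (lay.box[1], lay.box[0]))


# Made-up detections: the text block is listed before the title on purpose.
layouts = [
    Layout("text", (50, 200, 500, 400)),
    Layout("title", (50, 50, 500, 100)),
]
words = [
    ("Annual", (60, 60, 150, 90)),
    ("Report", (160, 60, 260, 90)),
    ("Revenue", (60, 210, 160, 240)),
    ("7", (520, 700, 530, 715)),  # a page number outside every layout
]
orphans = assign_words(words, layouts)
ordered = reading_order(layouts)
for lay in ordered:
    print(lay.category, lay.words)
```

Real pages with several columns, tables or rotated text need considerably more than this sort key, which is one reason adapting the models and pipeline to your own documents matters.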

Requirements

  • Linux or macOS
  • Python >= 3.8
  • PyTorch >= 1.8 and torchvision, or TensorFlow >= 2.4.1 and CUDA

You can run on PyTorch with a CPU only. For TensorFlow, a GPU is required.
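
The requirements above can be checked roughly before installing. The following is a minimal sketch using only the standard library; `check_requirements` is a hypothetical helper, not part of deepdoctection. It only tests the operating system, the Python version and whether the framework modules can be found. It does not verify the PyTorch or TensorFlow version, nor the presence of CUDA.

```python
import importlib.util
import platform
import sys


def check_requirements(system=None, python_version=None, has_module=None):
    """Return a list of problems; an empty list means the basic requirements are met."""
    system = system or platform.system()
    python_version = python_version or sys.version_info[:2]
    if has_module is None:
        def has_module(name):
            # find_spec returns None when a top-level module is not installed
            return importlib.util.find_spec(name) is not None

    problems = []
    if system not in ("Linux", "Darwin"):  # macOS reports itself as "Darwin"
        problems.append(f"unsupported OS: {system}")
    if tuple(python_version) < (3, 8):
        problems.append("Python >= 3.8 is required")
    has_pt = has_module("torch") and has_module("torchvision")
    has_tf = has_module("tensorflow")
    if not (has_pt or has_tf):
        problems.append("neither PyTorch + torchvision nor TensorFlow found")
    return problems


if __name__ == "__main__":
    issues = check_requirements()
    print("OK" if not issues else "\n".join(issues))
```

The optional arguments exist only so that the function can be exercised with made-up values; called without arguments it inspects the running interpreter.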

deepdoctection uses Tensorpack as its training framework and also uses Tensorpack's vision models for layout analysis. For PyTorch, Detectron2 is used. All models have been trained with TensorFlow and converted into artefacts that Detectron2 can consume. Prediction results in PyTorch are therefore slightly worse.

If you do not work on Linux, one easy way to fulfill the requirements is to use the Docker image. A Dockerfile is provided; please follow the official instructions on how to use it.

Depending on the pipeline you want to use, you will be notified if further installations are necessary, e.g.

Installation

We recommend using a virtual environment. You can install the package via pip or from source.

Install with pip

Dataflow is not available on PyPI and must be installed separately from its Git repository.

pip install "dataflow @ git+https://github.com/tensorpack/dataflow.git"

Depending on which Deep Learning library is available, use the following installation option:

For TensorFlow, run

pip install deepdoctection[tf]

For PyTorch, first install Detectron2 separately (check the instruction here). Then run

pip install deepdoctection[pt]
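
The choice between the two extras can also be scripted. The sketch below is an illustration under stated assumptions: `available_extras` and `pip_commands` are hypothetical helpers, not part of deepdoctection or pip. It detects which deep learning library is importable and builds the same pip commands as listed above; it only prints them and does not install anything. For the `pt` extra, Detectron2 still has to be installed beforehand.

```python
import importlib.util
import shlex

# Same requirement string as in the pip command for Dataflow above.
DATAFLOW_SPEC = "dataflow @ git+https://github.com/tensorpack/dataflow.git"

# Maps each deepdoctection extra to the module that signals its framework.
FRAMEWORK_MODULES = {"tf": "tensorflow", "pt": "torch"}


def available_extras(has_module=None):
    """Return the extras ("tf", "pt") whose framework can be imported."""
    if has_module is None:
        def has_module(name):
            return importlib.util.find_spec(name) is not None
    return [extra for extra, module in FRAMEWORK_MODULES.items() if has_module(module)]


def pip_commands(extra):
    """Build the two pip invocations for the chosen extra as argument lists."""
    if extra not in FRAMEWORK_MODULES:
        raise ValueError(f"unknown extra: {extra!r}")
    return [
        ["pip", "install", DATAFLOW_SPEC],
        ["pip", "install", f"deepdoctection[{extra}]"],
    ]


if __name__ == "__main__":
    extras = available_extras()
    if not extras:
        print("Install TensorFlow or PyTorch first.")
    for extra in extras:
        for cmd in pip_commands(extra):
            # shlex.join quotes arguments so the line can be pasted into a shell
            print(shlex.join(cmd))
```

Printing quoted commands rather than running them keeps the choice with you when both libraries are present, and the quoting avoids shells that would otherwise interpret the square brackets.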

Installation from source

Download the repository or clone via

git clone https://github.com/deepdoctection/deepdoctection.git

There is a Makefile that guides you through the installation process. To get started, try:

cd deepdoctection
make clean
make venv
source venv/bin/activate

For TensorFlow, run

make install-dd-tf

If you want to use the PyTorch framework, run:

make install-dd-pt

For more installation options, check this site.

Credits

Many utilities, concepts and some models are inspired by or taken from Tensorpack. We make heavy use of Dataflow for loading and streaming data.

Problems

We try hard to eliminate bugs. We also know that the code is not free of issues. We welcome all issues relevant to this repo and try to address them as quickly as possible.

Citing deepdoctection

If you use deepdoctection in your research or in your project, please cite:

@misc{jmdeepdoctection,
  title={deepdoctection},
  author={Meyer, Dr. Janis and others},
  howpublished={\url{https://github.com/deepdoctection/deepdoctection}},
  year={2021}
}

License

Distributed under the Apache 2.0 License. Check LICENSE for additional information.
