deepdoctection·PyPI

Repository for Document AI

Development Status
- 4 - Beta
License
- OSI Approved :: Apache Software License
Natural Language
- English
Operating System
- POSIX :: Linux
Programming Language
- Python :: 3.8
Topic
- Scientific/Engineering :: Artificial Intelligence

Project description

A Document AI Package

deepdoctection is a Python package that enables document analysis pipelines to be built using deep learning models.

Extracting information from documents is difficult. Documents often have a complex visual structure, and the information they contain is not tagged. deepdoctection is a toolbox intended to make it easier to get started with this topic.

Parse your document by detecting layout structures like tables with full table semantics (cells, rows, columns), get text in reading order with OCR, detect language and do many other things.

The focus is on application. deepdoctection is made for data scientists who are tasked with supporting departments in process optimization and for analysts who have to investigate large sets of documents.

For further text processing tasks, use one of the many other great NLP libraries.

Characteristics

Use an off-the-shelf analyzer for restructuring your PDF or scanned documents:
- Layout recognition with deep neural networks from well-known open source libraries (Cascade-RCNN from Tensorpack or Detectron2) trained on large public datasets. Tensorflow and PyTorch models are available.
- Table extraction with full table semantics (rows, columns, multi-line cell spans), again with the help of Cascade-RCNN
- OCR or text mining with Tesseract, DocTr, pdfplumber or other tools
- Reading order
- Language detection with fastText
- Parsed output available as a JSON object for further NLP tasks, labeling or reviewing

Off the shelf really does mean off the shelf: the results will look okay, but outputs that are useful for downstream tasks will only appear once the models are adapted to the documents you actually deal with. Therefore:

- Fine-tune a pre-trained DNN on your own labeled dataset. Use generally acknowledged metrics to evaluate training improvements. Training scripts are available.
- Compose your document analyzer by choosing a model and plugging it into your own pipeline. For example, you can use pdfplumber if you have native PDF documents, or you can benchmark OCR results against AWS Textract (a paid service that requires an account).
- Wrap DNNs from open source projects into the deepdoctection API and enrich your pipeline easily with SOTA models.
- All models are now available at the Hugging Face Model Hub. You can find more details in the respective model cards.
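
The idea of composing an analyzer from interchangeable components can be sketched in plain Python. The sketch below is not the deepdoctection API: `Page`, `run_pipeline` and the two toy components are hypothetical names, used only to illustrate how each step enriches a shared page object and how one component can be swapped for another.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


# Hypothetical container for everything the pipeline learns about one page.
@dataclass
class Page:
    raw_text: str
    annotations: Dict[str, object] = field(default_factory=dict)


# A component is any callable that takes a page and returns the enriched page.
Component = Callable[[Page], Page]


def word_splitter(page: Page) -> Page:
    """Toy stand-in for a text extraction step: split the text into words."""
    page.annotations["words"] = page.raw_text.split()
    return page


def word_counter(page: Page) -> Page:
    """Toy downstream step that relies on the output of the previous step."""
    page.annotations["word_count"] = len(page.annotations.get("words", []))
    return page


def run_pipeline(page: Page, components: List[Component]) -> Page:
    """Apply the components in order; swapping a model only changes this list."""
    for component in components:
        page = component(page)
    return page


result = run_pipeline(Page("Invoice total 42 EUR"), [word_splitter, word_counter])
print(result.annotations)
```

In a real setup, each toy function would be replaced by a wrapper around an actual model or tool, such as a layout detector or an OCR engine.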

Check this notebook for an easy start, as well as the full documentation.

Requirements

Platform and Python

Before you start, please ensure your installation fulfills the following requirements:

- Linux or macOS
- Python >= 3.8
- PyTorch >= 1.8 and torchvision, or Tensorflow >= 2.4.1 and CUDA

Windows is not supported.

You can run on PyTorch with a CPU only. For Tensorflow a GPU is required.
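
The basic requirements can be checked with a few lines of standard-library Python before installing anything. This is a minimal sketch, not part of deepdoctection: the function names are made up for illustration, and the check only covers the operating system, the Python version and whether `torch` or `tensorflow` can be imported. It does not verify framework versions or CUDA.

```python
import importlib.util
import platform
import sys
from typing import List, Tuple


def check_platform(system: str, python_version: Tuple[int, int]) -> List[str]:
    """Return a list of problems; an empty list means the basics are met."""
    problems = []
    if system not in ("Linux", "Darwin"):  # "Darwin" is what macOS reports
        problems.append(f"unsupported operating system: {system}")
    if python_version < (3, 8):
        problems.append("Python >= 3.8 is required")
    return problems


def installed_frameworks() -> List[str]:
    """List which of the two supported deep learning libraries are importable."""
    return [
        name
        for name in ("torch", "tensorflow")
        if importlib.util.find_spec(name) is not None
    ]


if __name__ == "__main__":
    issues = check_platform(platform.system(), sys.version_info[:2])
    if not installed_frameworks():
        issues.append("neither PyTorch nor Tensorflow is installed")
    print(issues or "environment looks fine")
```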

Other

deepdoctection uses Python wrappers for Poppler to convert PDF documents into images and for calling the Tesseract OCR engine. Both Poppler and Tesseract must be installed before you run the notebooks for the first time.
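
A quick way to see whether both tools are reachable is to look for their executables on the `PATH`. The sketch below assumes the executable names `pdftoppm` (a command-line utility shipped with Poppler) and `tesseract`. The helper `missing_tools` is a hypothetical name and not part of deepdoctection.

```python
import shutil
from typing import Callable, Iterable, List, Optional


def missing_tools(
    names: Iterable[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the executables from `names` that cannot be found on PATH."""
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    # Assumed executable names: `pdftoppm` for Poppler, `tesseract` for Tesseract.
    missing = missing_tools(["pdftoppm", "tesseract"])
    print(f"missing: {missing}" if missing else "Poppler and Tesseract found")
```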

Installation

We recommend using a virtual environment. You can install the package via pip or from source. Bug fixes and enhancements are deployed to PyPI every 4 to 6 weeks.

Install with pip from PyPI

Dataflow is not available on PyPI and must be installed separately.

pip install "dataflow @ git+https://github.com/tensorpack/dataflow.git"

Depending on which Deep Learning library is available, use the following installation option:

For Tensorflow, run

pip install "deepdoctection[tf]"

For PyTorch, first install Detectron2 separately, as it is not on PyPI either. Check the Detectron2 installation instructions. Then run

pip install "deepdoctection[pt]"

This will install the basic setup which is needed to run the first two notebooks and do some inference with pipelines.

Some libraries (e.g. DocTr, pdfplumber, fastText, ...) are not added to the requirements in order to keep the dependencies as small as possible. If you want to use them, install each of them with pip yourself. Alternatively, consult the full installation instructions.
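
To see which of these optional libraries are already present in the current environment, something along the following lines may help. It is only a sketch: `optional_status` is a hypothetical helper, and the import names `doctr`, `pdfplumber` and `fasttext` are assumed to be the module names under which the packages install.

```python
import importlib.util
from typing import Dict, Iterable

# Assumed import names of the optional libraries mentioned above.
OPTIONAL_MODULES = ("doctr", "pdfplumber", "fasttext")


def optional_status(modules: Iterable[str] = OPTIONAL_MODULES) -> Dict[str, bool]:
    """Map each module name to True if it can be imported, else False."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    for name, available in optional_status().items():
        print(f"{name}: {'installed' if available else 'not installed'}")
```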

Installation from source

Download the repository or clone via

git clone https://github.com/deepdoctection/deepdoctection.git

To get started with Tensorflow, run:

cd deepdoctection
pip install ".[source-tf]"

or with PyTorch:

cd deepdoctection
pip install ".[source-pt]"

This will install the basic dependencies to get started with the first notebooks. To get all package extensions, run the following for Tensorflow:

cd deepdoctection
pip install ".[source-all-tf]"

or for PyTorch:

cd deepdoctection
pip install ".[source-all-pt]"

Either command installs all available external libraries that can be used for inference (e.g. DocTr, pdfplumber, fastText, ...).

For other installation options, check the full installation instructions.

Credits

Many utilities, concepts and models are inspired by or taken from Tensorpack, Detectron2 and Transformers. We make heavy use of Dataflow for loading and streaming data.

Problems

We try hard to eliminate bugs. We also know that the code is not free of issues. We welcome all issues relevant to this repo and try to address them as quickly as possible.

Citing deepdoctection

If you use deepdoctection in your research or in your project, please cite:

@misc{jmdeepdoctection,
  title={deepdoctection},
  author={Meyer, Dr. Janis and others},
  howpublished={\url{https://github.com/deepdoctection/deepdoctection}},
  year={2021}
}

License

Distributed under the Apache 2.0 License. Check LICENSE for additional information.

Release history

Version	Release date
0.43.5	Jul 11, 2025
0.43.4	Jul 8, 2025
0.43.3	Jun 30, 2025
0.43.2	Jun 28, 2025
0.43.1	Jun 23, 2025
0.43	Jun 17, 2025
0.42.1	Jun 16, 2025
0.42.0	Apr 10, 2025
0.41.0	Mar 31, 2025
0.40.0	Mar 27, 2025
0.39.7	Mar 25, 2025
0.39.6	Mar 18, 2025
0.39.5	Mar 16, 2025
0.39.4	Mar 11, 2025
0.39.3	Mar 10, 2025
0.39.2	Feb 23, 2025
0.39.1	Feb 14, 2025
0.39	Feb 8, 2025
0.38	Jan 13, 2025
0.37.3	Dec 11, 2024
0.37.2	Dec 10, 2024
0.37.1	Dec 4, 2024
0.37	Dec 3, 2024
0.36	Nov 25, 2024
0.35	Nov 10, 2024
0.34	Oct 1, 2024
0.33	Jul 15, 2024
0.32	May 12, 2024
0.31	Apr 9, 2024
0.30	Feb 20, 2024
0.29	Dec 22, 2023
0.28	Nov 13, 2023
0.27	Oct 4, 2023
0.26	Jul 29, 2023
0.25	Jun 30, 2023
0.24	Jun 9, 2023
0.23	May 7, 2023
0.22	Mar 23, 2023
0.21	Feb 13, 2023
0.20	Dec 16, 2022
0.19	Nov 17, 2022
0.18	Nov 15, 2022
0.17	Sep 29, 2022
0.16	Aug 21, 2022
0.15 (this version)	Jun 22, 2022
0.14	Jun 22, 2022
0.13	May 17, 2022
0.12	Apr 4, 2022
0.11	Mar 11, 2022
0.10	Mar 11, 2022

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

deepdoctection-0.15.tar.gz (259.6 kB view details)

Uploaded Jun 22, 2022 Source

Built Distribution

deepdoctection-0.15-py3-none-any.whl (407.6 kB view details)

Uploaded Jun 22, 2022 Python 3

File details

Details for the file deepdoctection-0.15.tar.gz.

File metadata

Download URL: deepdoctection-0.15.tar.gz
Upload date: Jun 22, 2022
Size: 259.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.1 CPython/3.8.10

File hashes

Hashes for deepdoctection-0.15.tar.gz
Algorithm	Hash digest
SHA256	`2776547827bee144c14ee7f55da4c1f2e137a06908f8ec33d854e456f65cd939`
MD5	`83fd6e7d15f0a8768727374c7d2e13cd`
BLAKE2b-256	`d1f40683d350979317f5c48da9f1d21ffabdf0c5e49b9435c160b89c4662be0b`

See more details on using hashes here.
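
A downloaded file can be compared against the SHA256 digest listed above using Python's standard `hashlib` module. The following is a minimal sketch: the helper names and the assumption that the archive sits in the current working directory are illustrative only.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: Path, expected_hex: str) -> bool:
    """Compare the file digest with an expected value, ignoring case and spaces."""
    return sha256_of(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    # Assumed download location; the digest is the SHA256 value listed above.
    archive = Path("deepdoctection-0.15.tar.gz")
    expected = "2776547827bee144c14ee7f55da4c1f2e137a06908f8ec33d854e456f65cd939"
    if archive.exists():
        print("OK" if matches(archive, expected) else "MISMATCH")
    else:
        print(f"{archive} not found")
```

Reading in chunks keeps memory use constant regardless of the file size.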

File details

Details for the file deepdoctection-0.15-py3-none-any.whl.

File metadata

Download URL: deepdoctection-0.15-py3-none-any.whl
Upload date: Jun 22, 2022
Size: 407.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.1 CPython/3.8.10

File hashes

Hashes for deepdoctection-0.15-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ea6c78ad915b4be4dfbb7c5d35e5b07f1db99428e952512707192b3b2d3b4261`
MD5	`b61933804a789295eaebbbc1882835aa`
BLAKE2b-256	`cb7555bda4da83f5e91aa78175abfb29f699dccc81d9f876ee68367d6c087222`

See more details on using hashes here.
