Extract valuable text information from your documents

DocTR: Document Text Recognition

Optical Character Recognition made seamless & accessible to anyone, powered by TensorFlow 2

What you can expect from this repository:

  • efficient ways to parse textual information (localize and identify each word) from your documents
  • guidance on how to integrate this in your current architecture

Quick Tour

Getting your pretrained model

End-to-end OCR in DocTR uses a two-stage approach: text detection (localizing words), then text recognition (identifying all characters in each word). You can therefore choose the architecture for each stage independently from the list of available implementations.

from doctr.models import ocr_predictor

model = ocr_predictor(det_arch='db_resnet50', reco_arch='crnn_vgg16_bn', pretrained=True)
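
To make the two-stage flow concrete, here is a toy sketch in plain Python. It is not DocTR's implementation: `two_stage_ocr`, `crop`, `toy_detector` and `toy_recognizer` are hypothetical stand-ins, and the `(xmin, ymin, xmax, ymax)` box convention is an assumption made for illustration only.

```python
from typing import Callable, List, Tuple

# Assumed convention for this sketch: a box is (xmin, ymin, xmax, ymax) in pixels.
Box = Tuple[int, int, int, int]
# Toy grayscale image stored as nested lists (rows of pixel values).
Image = List[List[int]]


def crop(image: Image, box: Box) -> Image:
    """Cut the region covered by `box` out of `image`."""
    xmin, ymin, xmax, ymax = box
    return [row[xmin:xmax] for row in image[ymin:ymax]]


def two_stage_ocr(
    image: Image,
    detect: Callable[[Image], List[Box]],
    recognize: Callable[[Image], str],
) -> List[Tuple[Box, str]]:
    """Stage 1 localizes words; stage 2 reads each cropped word."""
    boxes = detect(image)
    return [(box, recognize(crop(image, box))) for box in boxes]


# Hypothetical stand-ins for real detection and recognition models.
def toy_detector(image: Image) -> List[Box]:
    # Pretends to find two words side by side on the first row.
    return [(0, 0, 2, 1), (2, 0, 4, 1)]


def toy_recognizer(word_crop: Image) -> str:
    # "Reads" a crop by reporting its width in pixels.
    return f"word_of_width_{len(word_crop[0])}"


page = [[0, 1, 2, 3], [4, 5, 6, 7]]
words = two_stage_ocr(page, toy_detector, toy_recognizer)
```

Because the two stages only communicate through boxes and crops, either one can be swapped without touching the other, which is what the `det_arch` and `reco_arch` arguments above let you do.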

Reading files

Documents can be read from PDFs, images, or web pages:

from doctr.documents import DocumentFile
# PDF
pdf_doc = DocumentFile.from_pdf("path/to/your/doc.pdf").as_images()
# Image
single_img_doc = DocumentFile.from_images("path/to/your/img.jpg")
# Webpage
webpage_doc = DocumentFile.from_url("https://www.yoursite.com").as_images()
# Multiple page images
multi_img_doc = DocumentFile.from_images(["path/to/page1.jpg", "path/to/page2.jpg"])
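
If your inputs mix PDFs and images, a small dispatcher can pick the right reader. The sketch below is only a suggestion: `loader_name`, `load_pages` and the `IMAGE_SUFFIXES` set are hypothetical helpers that are not part of DocTR, and the list of image extensions is an assumption. Only the `DocumentFile` calls shown above are used.

```python
from pathlib import Path

# Assumed set of image extensions for this sketch; adjust to your data.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def loader_name(path: str) -> str:
    """Return the name of the DocumentFile constructor suited to `path`."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        return "from_pdf"
    if suffix in IMAGE_SUFFIXES:
        return "from_images"
    raise ValueError(f"Unsupported file type: {suffix!r}")


def load_pages(path: str):
    """Read a PDF or an image file into page images (hypothetical helper)."""
    # Imported lazily so that loader_name can be used without DocTR installed.
    from doctr.documents import DocumentFile

    if loader_name(path) == "from_pdf":
        return DocumentFile.from_pdf(path).as_images()
    return DocumentFile.from_images(path)
```

With this in place, `load_pages("path/to/your/doc.pdf")` and `load_pages("path/to/your/img.jpg")` would both return something you can pass to the predictor.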

Putting it together

Let's use the default pretrained model for an example:

from doctr.documents import DocumentFile
from doctr.models import ocr_predictor

model = ocr_predictor(pretrained=True)
# PDF
doc = DocumentFile.from_pdf("path/to/your/doc.pdf").as_images()
# Analyze
result = model(doc)

To make sense of your model's predictions, you can visualize them as follows:

result.show(doc)

or export them to JSON format (to get a better understanding of our document model, check our documentation):

json_output = result.export()
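
As a rough illustration of how the export might be consumed, the sketch below walks a nested dictionary and collects the recognized words. It assumes the export is a dictionary nested as pages, blocks, lines and words, with `value` and `confidence` keys on each word; please check the documentation for the actual document model. `iter_words` and `sample_export` are hypothetical names, and the sample data is hand-written, not real model output.

```python
import json


def iter_words(export: dict):
    """Yield every word entry from a nested export (assumed layout)."""
    for page in export.get("pages", []):
        for block in page.get("blocks", []):
            for line in block.get("lines", []):
                for word in line.get("words", []):
                    yield word


# Hand-written sample mimicking the assumed layout (not real predictions).
sample_export = {
    "pages": [
        {
            "blocks": [
                {
                    "lines": [
                        {
                            "words": [
                                {"value": "Hello", "confidence": 0.98,
                                 "geometry": [[0.1, 0.1], [0.3, 0.2]]},
                                {"value": "wor1d", "confidence": 0.41,
                                 "geometry": [[0.35, 0.1], [0.6, 0.2]]},
                            ]
                        }
                    ]
                }
            ]
        }
    ]
}

# Keep only words the model is reasonably sure about (threshold is arbitrary).
confident = [w["value"] for w in iter_words(sample_export) if w["confidence"] >= 0.5]

# Serialize the export to a JSON string for storage or transport.
serialized = json.dumps(sample_export, indent=2)
```

The same `iter_words` helper could be applied to `json_output` from above, provided its layout matches the assumption stated here.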

Installation

Python 3.6 (or higher) and pip are required to install DocTR.

You can install the latest release of the package from PyPI as follows:

pip install python-doctr

Or you can install it from source:

git clone https://github.com/mindee/doctr.git
pip install -e doctr/.

Model architectures

Credit where it's due: this repository implements, among others, architectures from published research papers.

Text Detection

Text Recognition

More goodies

Documentation

The full package documentation is available here for detailed specifications.

Demo app

A minimal demo app is provided for you to play with the text detection model!

You will need an extra dependency (Streamlit) for the app to run:

pip install -r demo/requirements.txt

You can then launch the app in your default browser with:

streamlit run demo/app.py

Docker container

If you plan to deploy in a containerized environment, you can use the provided Dockerfile to build a Docker image:

docker build . -t <YOUR_IMAGE_TAG>

Example script

An example script is provided for a simple document analysis of a PDF or image file:

python scripts/analyze.py path/to/your/doc.pdf

All script arguments can be checked using python scripts/analyze.py --help

Minimal API integration

Looking to integrate DocTR into your API? Here is a template to get you started with a fully working API.

Manual setup

Specific dependencies are required to run the API template, which you can install as follows:

pip install -r api/requirements.txt

You can now run your API locally:

uvicorn --reload --workers 1 --host 0.0.0.0 --port=8050 --app-dir api/ app.main:app

Docker setup

If you prefer, you can run the same server in a Docker container:

PORT=8050 docker-compose up -d --build

Contributing

If you scrolled down to this section, you most likely appreciate open source. Do you feel like extending the range of our supported characters? Or perhaps submitting a paper implementation? Or contributing in any other way?

You're in luck: we compiled a short guide (cf. CONTRIBUTING) to help you do so!

License

Distributed under the Apache 2.0 License. See LICENSE for more information.
