Skip to main content

Document Text Recognition (docTR): deep Learning for high-performance OCR on documents.

Project description

License Build Status codecov CodeFactor Codacy Badge Doc Status Pypi Hugging Face Spaces Open In Colab

Optical Character Recognition made seamless & accessible to anyone, powered by TensorFlow 2 & PyTorch

What you can expect from this repository:

  • efficient ways to parse textual information (localize and identify each word) from your documents
  • guidance on how to integrate this in your current architecture

OCR_example

Quick Tour

Getting your pretrained model

End-to-End OCR is achieved in docTR using a two-stage approach: text detection (localizing words), then text recognition (identify all characters in the word). As such, you can select the architecture used for text detection, and the one for text recognition from the list of available implementations.

from doctr.models import ocr_predictor

model = ocr_predictor(det_arch='db_resnet50', reco_arch='crnn_vgg16_bn', pretrained=True)

Reading files

Documents can be interpreted from PDF or images:

from doctr.io import DocumentFile
# PDF
pdf_doc = DocumentFile.from_pdf("path/to/your/doc.pdf")
# Image
single_img_doc = DocumentFile.from_images("path/to/your/img.jpg")
# Webpage
webpage_doc = DocumentFile.from_url("https://www.yoursite.com")
# Multiple page images
multi_img_doc = DocumentFile.from_images(["path/to/page1.jpg", "path/to/page2.jpg"])

Putting it together

Let's use the default pretrained model for an example:

from doctr.io import DocumentFile
from doctr.models import ocr_predictor

model = ocr_predictor(pretrained=True)
# PDF
doc = DocumentFile.from_pdf("path/to/your/doc.pdf")
# Analyze
result = model(doc)

Dealing with rotated documents

Should you use docTR on documents that include rotated pages, or pages with multiple box orientations, you have multiple options to handle it:

  • If you only use straight document pages with straight words (horizontal, same reading direction), consider passing assume_straight_boxes=True to the ocr_predictor. It will directly fit straight boxes on your page and return straight boxes, which makes it the fastest option.

  • If you want the predictor to output straight boxes (no matter the orientation of your pages, the final localizations will be converted to straight boxes), you need to pass export_as_straight_boxes=True in the predictor. Otherwise, if assume_straight_pages=False, it will return rotated bounding boxes (potentially with an angle of 0°).

If both options are set to False, the predictor will always fit and return rotated boxes.

To interpret your model's predictions, you can visualize them interactively as follows:

result.show(doc)

Visualization sample

Or even rebuild the original document from its predictions:

import matplotlib.pyplot as plt

synthetic_pages = result.synthesize()
plt.imshow(synthetic_pages[0]); plt.axis('off'); plt.show()

Synthesis sample

The ocr_predictor returns a Document object with a nested structure (with Page, Block, Line, Word, Artefact). To get a better understanding of our document model, check our documentation:

You can also export them as a nested dict, more appropriate for JSON format:

json_output = result.export()

For examples & further details about the export format, please refer to this section of the documentation

Installation

Prerequisites

Python 3.6 (or higher) and pip are required to install docTR.

Since we use weasyprint, you will need extra dependencies if you are not running Linux.

For MacOS users, you can install them as follows:

brew install cairo pango gdk-pixbuf libffi

For Windows users, those dependencies are included in GTK. You can find the latest installer over here.

Latest release

You can then install the latest release of the package using pypi as follows:

pip install python-doctr

:warning: Please note that the basic installation is not standalone, as it does not provide a deep learning framework, which is required for the package to run.

We try to keep framework-specific dependencies to a minimum. You can install framework-specific builds as follows:

# for TensorFlow
pip install "python-doctr[tf]"
# for PyTorch
pip install "python-doctr[torch]"

Developer mode

Alternatively, you can install it from source, which will require you to install Git. First clone the project repository:

git clone https://github.com/mindee/doctr.git
pip install -e doctr/.

Again, if you prefer to avoid the risk of missing dependencies, you can install the TensorFlow or the PyTorch build:

# for TensorFlow
pip install -e doctr/.[tf]
# for PyTorch
pip install -e doctr/.[torch]

Models architectures

Credits where it's due: this repository is implementing, among others, architectures from published research papers.

Text Detection

Text Recognition

More goodies

Documentation

The full package documentation is available here for detailed specifications.

Demo app

A minimal demo app is provided for you to play with our end-to-end OCR models!

Demo app

Live demo

Courtesy of :hugs: HuggingFace :hugs:, docTR has now a fully deployed version available on Spaces! Check it out Hugging Face Spaces

Running it locally

If you prefer to use it locally, there is an extra dependency (Streamlit) that is required:

pip install -r demo/requirements.txt

Then run your app in your default browser with:

streamlit run demo/app.py

TensorFlow.js

Instead of having your demo actually running Python, you would prefer to run everything in your web browser? Check out our TensorFlow.js demo to get started!

TFJS demo

Docker container

If you wish to deploy containerized environments, you can use the provided Dockerfile to build a docker image:

docker build . -t <YOUR_IMAGE_TAG>

Example script

An example script is provided for a simple documentation analysis of a PDF or image file:

python scripts/analyze.py path/to/your/doc.pdf

All script arguments can be checked using python scripts/analyze.py --help

Minimal API integration

Looking to integrate docTR into your API? Here is a template to get you started with a fully working API using the wonderful FastAPI framework.

Deploy your API locally

Specific dependencies are required to run the API template, which you can install as follows:

pip install -r api/requirements.txt

You can now run your API locally:

uvicorn --reload --workers 1 --host 0.0.0.0 --port=8002 --app-dir api/ app.main:app

Alternatively, you can run the same server on a docker container if you prefer using:

PORT=8002 docker-compose up -d --build

What you have deployed

Your API should now be running locally on your port 8002. Access your automatically-built documentation at http://localhost:8002/redoc and enjoy your three functional routes ("/detection", "/recognition", "/ocr"). Here is an example with Python to send a request to the OCR route:

import requests
with open('/path/to/your/doc.jpg', 'rb') as f:
    data = f.read()
response = requests.post("http://localhost:8002/ocr", files={'file': data}).json()

Example notebooks

Looking for more illustrations of docTR features? You might want to check the Jupyter notebooks designed to give you a broader overview.

Citation

If you wish to cite this project, feel free to use this BibTeX reference:

@misc{doctr2021,
    title={docTR: Document Text Recognition},
    author={Mindee},
    year={2021},
    publisher = {GitHub},
    howpublished = {\url{https://github.com/mindee/doctr}}
}

Contributing

If you scrolled down to this section, you most likely appreciate open source. Do you feel like extending the range of our supported characters? Or perhaps submitting a paper implementation? Or contributing in any other way?

You're in luck, we compiled a short guide (cf. CONTRIBUTING) for you to easily do so!

License

Distributed under the Apache 2.0 License. See LICENSE for more information.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

python-doctr-0.5.1.tar.gz (128.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

python_doctr-0.5.1-py3-none-any.whl (205.2 kB view details)

Uploaded Python 3

File details

Details for the file python-doctr-0.5.1.tar.gz.

File metadata

  • Download URL: python-doctr-0.5.1.tar.gz
  • Upload date:
  • Size: 128.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/34.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.9 tqdm/4.62.3 importlib-metadata/4.11.1 keyring/23.5.0 rfc3986/1.5.0 colorama/0.4.4 CPython/3.8.11

File hashes

Hashes for python-doctr-0.5.1.tar.gz
Algorithm Hash digest
SHA256 b1d2b22da6a578d010622190ddc061832547a7e5395d5005388d4a7b9f49ee4a
MD5 9dd4a264313aa7ce091075d70da595ab
BLAKE2b-256 261d554b96bc2b79a11e81cc5d72df18ca1951e8a94fec7bac8158799e6487e0

See more details on using hashes here.

File details

Details for the file python_doctr-0.5.1-py3-none-any.whl.

File metadata

  • Download URL: python_doctr-0.5.1-py3-none-any.whl
  • Upload date:
  • Size: 205.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/34.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.9 tqdm/4.62.3 importlib-metadata/4.11.1 keyring/23.5.0 rfc3986/1.5.0 colorama/0.4.4 CPython/3.8.11

File hashes

Hashes for python_doctr-0.5.1-py3-none-any.whl
Algorithm Hash digest
SHA256 c6cbffbd1ad4ec48985a954065e9dc8460bf5228e714ac7c0336dbbc190056b5
MD5 3eca8bbe605b05754373b0bf0ac1e186
BLAKE2b-256 8af8609f7e82afac6d5c84d02125f9a03b06d2240ed962e4a188b4c00c846420

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page