DocQuery: An easy way to extract information from documents

NOTE: DocQuery is not actively maintained anymore. We still welcome contributions and discussions among the community!

DocQuery: Document Query Engine Powered by Large Language Models

DocQuery is a library and command-line tool that makes it easy to analyze semi-structured and unstructured documents (PDFs, scanned images, etc.) using large language models (LLMs). You simply point DocQuery at one or more documents and specify a question you want to ask. DocQuery is created by the team at Impira.

Quickstart (CLI)

To install docquery, you can simply run pip install docquery. This will install the command line tool as well as the library. If you want to run OCR on images, then you must also install the tesseract library:

  • Mac OS X (using Homebrew):

    brew install tesseract
    
  • Ubuntu:

    apt install tesseract-ocr
    

docquery scan lets you ask one or more questions of a single document or a directory of files. For example, you can find the invoice number in https://templates.invoicehome.com/invoice-template-us-neat-750px.png with:

docquery scan "What is the invoice number?" https://templates.invoicehome.com/invoice-template-us-neat-750px.png

If you have a folder of documents on your machine, you can run something like

docquery scan "What is the effective date?" /path/to/contracts/folder

to determine the effective date of every document in the folder.

Quickstart (Library)

DocQuery can also be used as a library. It contains two basic abstractions: (1) a DocumentQuestionAnswering pipeline that makes it simple to ask questions of documents and (2) a Document abstraction that can parse various types of documents to feed into the pipeline.

>>> from docquery import document, pipeline
>>> p = pipeline('document-question-answering')
>>> doc = document.load_document("/path/to/document.pdf")
>>> for q in ["What is the invoice number?", "What is the invoice total?"]:
...     print(q, p(question=q, **doc.context))
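
The same loop extends to a folder of files, much like pointing docquery scan at a directory. The following is a minimal sketch, not part of DocQuery itself: iter_documents and ask_folder are hypothetical helper names, and the list of file suffixes is an assumption. The only DocQuery calls used are the ones shown in the quickstart above.

```python
from pathlib import Path

# Assumed set of file types to pick up; adjust to what your install can parse.
SUFFIXES = {".pdf", ".png", ".jpg", ".jpeg"}


def iter_documents(folder, suffixes=SUFFIXES):
    """Yield supported files directly under `folder`, in sorted order."""
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.suffix.lower() in suffixes:
            yield path


def ask_folder(folder, questions):
    """Ask every question of every document in `folder` (hypothetical helper)."""
    # Imported lazily so the file-listing helper works without docquery installed.
    from docquery import document, pipeline

    p = pipeline("document-question-answering")
    for path in iter_documents(folder):
        doc = document.load_document(str(path))
        for q in questions:
            # Same call as in the quickstart above.
            print(path.name, q, p(question=q, **doc.context))
```

Calling ask_folder("/path/to/contracts/folder", ["What is the effective date?"]) would then print one line per document and question.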

Use cases

DocQuery excels at a number of use cases involving structured, semi-structured, or unstructured documents. You can ask questions about invoices, contracts, forms, emails, letters, receipts, and many more. You can also classify documents. We will continue evolving the model, offering more modeling options, and expanding the set of supported documents. We welcome feedback, requests, and of course contributions to help achieve this vision.

How it works

Under the hood, docquery uses a pre-trained zero-shot language model, based on LayoutLM, that has been fine-tuned for a question-answering task. The model is trained on a combination of SQuAD2.0 and DocVQA, which makes it particularly well suited for complex visual question answering tasks on a wide variety of documents. The underlying model is also published on HuggingFace as impira/layoutlm-document-qa, which you can access directly.
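
If you want to try the published model without DocQuery, one option is the document-question-answering pipeline in the transformers library. The sketch below assumes transformers, a backend such as torch, Pillow, and pytesseract are installed. ask and top_answer are hypothetical helper names. Depending on the transformers version, the pipeline may return a single dict or a list of dicts, so the helper accepts both.

```python
def top_answer(result):
    """Return the highest-scoring answer dict, or None if there are no answers.

    Accepts either a single answer dict or a list of answer dicts, each
    carrying at least a "score" and an "answer" key.
    """
    candidates = [result] if isinstance(result, dict) else list(result)
    if not candidates:
        return None
    return max(candidates, key=lambda r: r["score"])


def ask(image, question, model="impira/layoutlm-document-qa"):
    """Run one question against one image (path, URL, or PIL image)."""
    # Imported lazily: this needs transformers plus its OCR dependencies.
    from transformers import pipeline

    qa = pipeline("document-question-answering", model=model)
    return top_answer(qa(image=image, question=question))
```

For example, ask("https://templates.invoicehome.com/invoice-template-us-neat-750px.png", "What is the invoice number?") would download the model on first use and return the best-scoring answer.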

Limitations

DocQuery is intended to have a small install footprint and be simple to work with. As a result, it has some limitations:

  • Models must be pre-trained. Although DocQuery uses a zero-shot model that can adapt based on the question you provide, it does not learn from your data.
  • Support for images and PDFs. Currently DocQuery supports images and PDFs, with or without embedded text. It does not support Word documents, emails, spreadsheets, etc.
  • Scalar text outputs. DocQuery only produces text outputs (answers). It does not support richer scalar types (i.e. it treats numbers and dates as strings) or tables.
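
Because answers come back as strings, any typing has to happen after the fact. The following is a minimal sketch of such post-processing, assuming US-style amounts (comma as thousands separator) and a short, assumed list of date layouts. parse_amount and parse_date are hypothetical helpers, not part of DocQuery.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation

# Assumed date layouts; extend for the documents you actually see.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def parse_amount(text):
    """Parse an answer such as '$1,234.50' into a Decimal, or None on failure."""
    cleaned = text.strip().replace(",", "").lstrip("$€£")
    try:
        value = Decimal(cleaned)
    except InvalidOperation:
        return None
    # Reject 'NaN' and 'Infinity', which Decimal would otherwise accept.
    return value if value.is_finite() else None


def parse_date(text, formats=DATE_FORMATS):
    """Try each known layout in turn; return a date, or None if none match."""
    for fmt in formats:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None
```

Returning None rather than raising keeps the original string answer usable as a fallback when the text does not match any expected layout.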

Advanced features

Using Donut 🍩

If you'd like to test docquery with Donut, you must install the required extras:

pip install docquery[donut]

You can then run

docquery scan "What is the effective date?" /path/to/contracts/folder --checkpoint 'naver-clova-ix/donut-base-finetuned-docvqa'

Classifying documents

To classify documents, add the --classify argument to scan. You can specify any image classification model on Hugging Face's hub. By default, the classification pipeline uses Donut (which requires the donut extras installed as described above):

# Classify documents
docquery scan --classify /path/to/contracts/folder --checkpoint 'naver-clova-ix/donut-base-finetuned-docvqa'

# Classify documents and ask a question too
docquery scan --classify "What is the effective date?" /path/to/contracts/folder --checkpoint 'naver-clova-ix/donut-base-finetuned-docvqa'
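
If you drive the CLI from a script, the same invocations can be assembled in Python. This is a small sketch under the assumption that the docquery executable is on your PATH. scan_command and run_scan are hypothetical helpers, and only the flags shown above (--classify and --checkpoint) are used.

```python
import shlex
import subprocess


def scan_command(path, questions=(), classify=False, checkpoint=None):
    """Build the argv list for `docquery scan`, in the order used above."""
    cmd = ["docquery", "scan"]
    if classify:
        cmd.append("--classify")
    cmd.extend(questions)
    cmd.append(str(path))
    if checkpoint:
        cmd.extend(["--checkpoint", checkpoint])
    return cmd


def run_scan(path, **options):
    """Echo the command, then run it; raises CalledProcessError on failure."""
    cmd = scan_command(path, **options)
    print(shlex.join(cmd))
    return subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with questions that contain spaces or quotes.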

Scraping webpages

DocQuery can read files over HTTP/HTTPS out of the box. To read HTML pages as well, install the [web] extension. The extension uses the webdriver-manager library, which can install a Chrome driver on your system automatically, but you'll need to make sure Chrome itself is installed globally.

# Find the top post on hacker news
docquery scan "What is the #1 post's title?" https://news.ycombinator.com

Where to go from here

DocQuery is a Swiss Army knife for working with documents and experiencing the power of modern machine learning. You can use it just about anywhere, including behind a firewall on sensitive data, and test it with a wide variety of documents. Our hope is that DocQuery enables many creative use cases for document understanding by making it simple and easy to ask questions of your documents.

When you run DocQuery for the first time, it will download some files (e.g. the models and some library code from HuggingFace). However, nothing leaves your computer -- the OCR is done locally, models run locally, etc. This comes with the benefit of security and privacy; however, it comes at the cost of runtime performance and some accuracy.
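
Once the first run has cached the models, you may want to make sure later runs never reach out to the network. A minimal sketch, assuming DocQuery loads its models through the Hugging Face libraries: those libraries honor the HF_HUB_OFFLINE and TRANSFORMERS_OFFLINE environment variables, which have to be set before the libraries are imported. force_offline is a hypothetical helper name.

```python
import os


def force_offline(env=os.environ):
    """Ask the Hugging Face libraries to use only locally cached files."""
    env["HF_HUB_OFFLINE"] = "1"
    env["TRANSFORMERS_OFFLINE"] = "1"
    return env


# Call force_offline() at the top of your script, before importing docquery
# or transformers, so the settings are in place when the models are loaded.
```

With these set, a missing cached file should surface as an error rather than trigger a download.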

If you find yourself wondering how to achieve higher accuracy, work with more file types, teach the model with your own data, have a human-in-the-loop workflow, or query the data you're extracting, then do not fear -- you are running into the challenges that every organization does while putting document AI into production. The Impira platform is designed to solve these problems in an easy and intuitive way. Impira comes with a QA model that is additionally trained on proprietary datasets and can achieve 95%+ accuracy out-of-the-box for most use cases. It also has an intuitive UI that enables subject matter experts to label and improve the models, as well as an API that makes integration a breeze. Please sign up for the product or reach out to us for more details.

Status

DocQuery is a new project. Although the underlying models are running in production, we've just recently released our code in open source and are actively working with the OSS community to upstream some of the changes we've made (e.g. the model and pipeline). DocQuery is rapidly changing, and we are likely to make breaking API changes. If you would like to run it in production, then we suggest pinning a version or commit hash. Either way, please get in touch with us at oss@impira.com with any questions or feedback.

Acknowledgements

DocQuery would not be possible without the contributions of many open source projects:

and many others!

License

This project is licensed under the MIT license.

It contains code that is copied and adapted from transformers (https://github.com/huggingface/transformers), which is Apache 2.0 licensed. Files containing this code have been marked as such in their comments.

