Awesome document classification - Implementation of major techniques

Project description

Document Classification: All in one place

This package supports classifying documents with the popular available methods. It also provides a single interface for OCR that covers both open-source engines, such as Tesseract and PaddleOCR, and commercial services, such as Google OCR.
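To illustrate what a single OCR interface over several engines can look like, here is a minimal sketch. It is not the package's actual API: `OCRResult`, `OCRBackend`, `EchoBackend`, and `run_ocr` are hypothetical names, and the stand-in backend only decodes bytes instead of calling a real engine.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Protocol


@dataclass
class OCRResult:
    """Text recognized from one image (hypothetical container)."""
    text: str
    engine: str


class OCRBackend(Protocol):
    """Any object with a matching `recognize` method counts as a backend."""

    def recognize(self, image_bytes: bytes) -> OCRResult: ...


class EchoBackend:
    """Stand-in backend for illustration: decodes the bytes as UTF-8.

    A real backend would call Tesseract, PaddleOCR, or Google OCR here.
    """

    def recognize(self, image_bytes: bytes) -> OCRResult:
        return OCRResult(text=image_bytes.decode("utf-8"), engine="echo")


# Registry mapping an engine name to a factory for its backend (assumed design).
_BACKENDS: Dict[str, Callable[[], OCRBackend]] = {"echo": EchoBackend}


def run_ocr(image_bytes: bytes, engine: str = "echo") -> OCRResult:
    """Dispatch to the chosen engine through one common entry point."""
    try:
        backend = _BACKENDS[engine]()
    except KeyError:
        raise ValueError(f"unknown OCR engine: {engine!r}") from None
    return backend.recognize(image_bytes)


result = run_ocr(b"invoice 42", engine="echo")
print(result.engine, result.text)
```

With a design like this, adding an engine means registering one more factory, and calling code does not change when the engine is swapped.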

PyPI: document-classification

Features

  • OCR
    • Tesseract
    • Google OCR
  • Classification
    • Fasttext (train, evaluate, predict)
    • Language Models like BERT (train, evaluate, predict)
    • Language + Layout Models like LayoutLM (train, evaluate, predict)
    • LLM (evaluate, predict)
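The train, evaluate, predict cycle named for the classifiers above can be sketched with a toy model. This is only an illustration of the workflow, assuming a simple word-count scorer: `KeywordClassifier` is a hypothetical class, not part of this package, and it is far simpler than fastText, BERT, or LayoutLM.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


class KeywordClassifier:
    """Toy classifier: scores a document by how often its words
    appeared under each label during training. Not fastText or BERT."""

    def __init__(self) -> None:
        self.word_counts: Dict[str, Counter] = defaultdict(Counter)

    def train(self, samples: List[Tuple[str, str]]) -> None:
        # Count word occurrences per label.
        for text, label in samples:
            self.word_counts[label].update(text.lower().split())

    def predict(self, text: str) -> str:
        if not self.word_counts:
            raise RuntimeError("call train() before predict()")
        words = text.lower().split()
        # Pick the label whose training word counts overlap most with the text.
        return max(
            self.word_counts,
            key=lambda label: sum(self.word_counts[label][w] for w in words),
        )

    def evaluate(self, samples: List[Tuple[str, str]]) -> float:
        # Accuracy: fraction of samples whose prediction matches the label.
        correct = sum(self.predict(text) == label for text, label in samples)
        return correct / len(samples)


train_set = [
    ("invoice total amount due", "invoice"),
    ("payment invoice number", "invoice"),
    ("dear sir thank you regards", "letter"),
    ("regards yours sincerely", "letter"),
]
clf = KeywordClassifier()
clf.train(train_set)
print(clf.predict("amount due on this invoice"))
print(clf.evaluate([("thank you and regards", "letter")]))
```

Note that the LLM entry in the list has no train step, so a model of that kind would only expose the evaluate and predict halves of this cycle.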

Installation

Install with a single command:

pip install -U document-classification

Or, if you use Poetry (like me):

poetry add document-classification
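To confirm that the install worked, one option is to ask Python for the installed distribution's version. This sketch uses only the standard library; the helper name `installed_version` is made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str = "document-classification") -> Optional[str]:
    """Return the installed version string, or None if it is not installed."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


print(installed_version() or "document-classification is not installed")
```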

Usage

Please check the examples directory for examples on how to use the package.

Contributing

Your contributions are welcome! If you have great examples or find neat patterns, clone the repo and add another example. The goal is to find great patterns and cool examples to highlight.

If you encounter any issues or want to provide feedback, you can create an issue in this repository. You can also reach out to me on Twitter at @amittimalsina14.

Check the todo.md file for the list of features that are coming next with their due dates.

What's coming next?

I am going to first add tests and refactor the code to make it more readable, usable, and maintainable. Then I will release documentation and more examples.

Download files

Source Distribution

document_classification-0.0.2a0.tar.gz (32.1 kB view details)

Uploaded Source

Built Distribution

document_classification-0.0.2a0-py3-none-any.whl (65.9 kB view details)

Uploaded Python 3

File details

Details for the file document_classification-0.0.2a0.tar.gz.

File hashes

Hashes for document_classification-0.0.2a0.tar.gz
Algorithm Hash digest
SHA256 bd27210dae80348ec22c3def98975446a9b784ef5bc05b920327e6a00dfb3b4f
MD5 e73657d94c27602944fdbc7fddb85537
BLAKE2b-256 9b8db6df4fc18c1fd9892ad514313a688c6fddb2f8e813e9d4f393106fc71e80
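The SHA256 digest listed above can be checked against a downloaded copy of the source distribution. The following is a minimal sketch using the standard library; the helper names `sha256_of` and `matches` and the local file path are assumptions for illustration.

```python
import hashlib
import hmac
from pathlib import Path

# SHA256 digest listed for document_classification-0.0.2a0.tar.gz.
EXPECTED_SHA256 = "bd27210dae80348ec22c3def98975446a9b784ef5bc05b920327e6a00dfb3b4f"


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large archives are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: Path, expected_hex: str) -> bool:
    """Compare the file's digest with the expected hex string, ignoring case."""
    return hmac.compare_digest(sha256_of(path), expected_hex.lower())


if __name__ == "__main__":
    # Assumed location: the archive sits in the current working directory.
    archive = Path("document_classification-0.0.2a0.tar.gz")
    if archive.exists():
        print("OK" if matches(archive, EXPECTED_SHA256) else "MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```

pip can perform the same check at install time through its hash-checking mode, where each requirement line carries a `--hash=sha256:...` option.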

Provenance

The following attestation bundles were made for document_classification-0.0.2a0.tar.gz:

Publisher: python-publish.yml on amit-timalsina/document_classification

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file document_classification-0.0.2a0-py3-none-any.whl.

File hashes

Hashes for document_classification-0.0.2a0-py3-none-any.whl
Algorithm Hash digest
SHA256 1bfdac8901cd5ca0fabc2a53444d87874512a8ac79c6f9fc7375b7ba87c6554c
MD5 ce2e1634e0067c13ecc0e6ba769862f8
BLAKE2b-256 4ed7bf2c6808fe9850170014c683926a3da29a6d6fc025648fecdfbbbca85414

Provenance

The following attestation bundles were made for document_classification-0.0.2a0-py3-none-any.whl:

Publisher: python-publish.yml on amit-timalsina/document_classification

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
