Awesome document classification - Implementation of major techniques
Document Classification: All in one place
This package classifies documents using the most popular available methods. It also provides a single interface for OCR that covers both open source models, such as Tesseract and PaddleOCR, and commercial services, such as Google OCR.
PYPI: document-classification
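To make the idea of a single OCR interface concrete, here is a minimal sketch of the adapter pattern such a layer could follow. It is not this package's actual API: `OCREngine`, `EchoEngine`, `ENGINES`, and `run_ocr` are hypothetical names invented for illustration, and the stand-in backend returns placeholder text instead of calling a real OCR model.

```python
from typing import Callable, Dict, Protocol


class OCREngine(Protocol):
    """Hypothetical interface: any backend that turns an image path into text."""

    def extract_text(self, image_path: str) -> str: ...


class EchoEngine:
    """Stand-in backend for illustration only.

    A real backend would call Tesseract, PaddleOCR, or a commercial OCR service here.
    """

    def extract_text(self, image_path: str) -> str:
        return f"text from {image_path}"


# Registry mapping a backend name to a factory (the names are illustrative).
ENGINES: Dict[str, Callable[[], OCREngine]] = {"echo": EchoEngine}


def run_ocr(image_path: str, engine: str = "echo") -> str:
    """Dispatch to the chosen backend through the shared interface."""
    try:
        factory = ENGINES[engine]
    except KeyError:
        raise ValueError(f"unknown OCR engine: {engine!r}") from None
    return factory().extract_text(image_path)
```

With this shape, callers depend only on `run_ocr`, so switching backends means registering another class rather than changing calling code.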
Features
- OCR
  - Tesseract
  - Google OCR
- Classification
  - Fasttext (train, evaluate, predict)
  - Language Models like BERT (train, evaluate, predict)
  - Language + Layout Models like LayoutLM (train, evaluate, predict)
  - LLM (evaluate, predict)
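Every classifier in the list above has an evaluate step. As a rough illustration of what evaluating a document classifier typically involves, the sketch below computes accuracy and per-label precision and recall from true and predicted labels. The `evaluate` function and its return shape are assumptions made for this example, not the package's own implementation.

```python
from collections import Counter
from typing import Dict, Sequence


def evaluate(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, object]:
    """Return accuracy plus per-label precision and recall (illustrative helper)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one example is required")

    correct = Counter()    # documents whose predicted label matches the true label
    predicted = Counter()  # how often each label was predicted
    actual = Counter()     # how often each label is the true label
    for truth, guess in zip(y_true, y_pred):
        actual[truth] += 1
        predicted[guess] += 1
        if truth == guess:
            correct[truth] += 1

    per_label = {}
    for label in sorted(set(actual) | set(predicted)):
        precision = correct[label] / predicted[label] if predicted[label] else 0.0
        recall = correct[label] / actual[label] if actual[label] else 0.0
        per_label[label] = {"precision": precision, "recall": recall}

    return {"accuracy": sum(correct.values()) / len(y_true), "per_label": per_label}
```

For example, `evaluate(["invoice", "invoice", "receipt"], ["invoice", "receipt", "receipt"])` reports how often each label was predicted correctly, regardless of which model produced the predictions.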
Installation
Install with a single command:
pip install -U document-classification
Or, if you use Poetry (like me):
poetry add document-classification
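After installing, you may want to confirm which version ended up in your environment. The sketch below uses only the standard library's `importlib.metadata`; the helper name `installed_version` is made up for this example, and the distribution name is the one shown on PyPI.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str = "document-classification") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    print(installed_version() or "document-classification is not installed")
```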
Usage
Please check the examples directory for examples on how to use the package.
Contributing
Your contributions are welcome! If you have great examples or find neat patterns, clone the repo and add another example. The goal is to find great patterns and cool examples to highlight.
If you encounter any issues or want to provide feedback, you can create an issue in this repository. You can also reach out to me on Twitter at @amittimalsina14.
Check the todo.md file for the list of features that are coming next with their due dates.
What's coming next?
I am going to first add tests and refactor the code to make it more readable, usable, and maintainable. Then I will release documentation and more examples.
File details
Details for the file document_classification-0.0.2a0.tar.gz.
File metadata
- Download URL: document_classification-0.0.2a0.tar.gz
- Upload date:
- Size: 32.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/5.1.1 CPython/3.12.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | bd27210dae80348ec22c3def98975446a9b784ef5bc05b920327e6a00dfb3b4f |
| MD5 | e73657d94c27602944fdbc7fddb85537 |
| BLAKE2b-256 | 9b8db6df4fc18c1fd9892ad514313a688c6fddb2f8e813e9d4f393106fc71e80 |
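If you download the source distribution by hand, you can compare it against the SHA256 digest listed above. This is a minimal sketch using the standard library's `hashlib`; the helper names `sha256_of` and `matches_digest` and the local file path in the comment are assumptions for illustration.

```python
import hashlib
from pathlib import Path
from typing import Union

# SHA256 digest published above for document_classification-0.0.2a0.tar.gz.
EXPECTED_SHA256 = "bd27210dae80348ec22c3def98975446a9b784ef5bc05b920327e6a00dfb3b4f"


def sha256_of(path: Union[str, Path], chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large downloads are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path: Union[str, Path], expected_hex: str) -> bool:
    """Return True when the file's SHA256 digest equals the expected hex string."""
    return sha256_of(path) == expected_hex.strip().lower()


# Example, assuming the archive sits in the current directory:
# matches_digest("document_classification-0.0.2a0.tar.gz", EXPECTED_SHA256)
```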
Provenance
The following attestation bundles were made for document_classification-0.0.2a0.tar.gz:
Publisher: python-publish.yml on amit-timalsina/document_classification
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: document_classification-0.0.2a0.tar.gz
  - Subject digest: bd27210dae80348ec22c3def98975446a9b784ef5bc05b920327e6a00dfb3b4f
- Sigstore transparency entry: 153965794
- Sigstore integration time:
- Permalink: amit-timalsina/document_classification@19c886509be0ecf463b9868f4a0fa60a78d2c1c3
- Branch / Tag: refs/tags/0.0.2-alpha
- Owner: https://github.com/amit-timalsina
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@19c886509be0ecf463b9868f4a0fa60a78d2c1c3
- Trigger Event: release
File details
Details for the file document_classification-0.0.2a0-py3-none-any.whl.
File metadata
- Download URL: document_classification-0.0.2a0-py3-none-any.whl
- Upload date:
- Size: 65.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/5.1.1 CPython/3.12.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1bfdac8901cd5ca0fabc2a53444d87874512a8ac79c6f9fc7375b7ba87c6554c |
| MD5 | ce2e1634e0067c13ecc0e6ba769862f8 |
| BLAKE2b-256 | 4ed7bf2c6808fe9850170014c683926a3da29a6d6fc025648fecdfbbbca85414 |
Provenance
The following attestation bundles were made for document_classification-0.0.2a0-py3-none-any.whl:
Publisher: python-publish.yml on amit-timalsina/document_classification
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: document_classification-0.0.2a0-py3-none-any.whl
  - Subject digest: 1bfdac8901cd5ca0fabc2a53444d87874512a8ac79c6f9fc7375b7ba87c6554c
- Sigstore transparency entry: 153965795
- Sigstore integration time:
- Permalink: amit-timalsina/document_classification@19c886509be0ecf463b9868f4a0fa60a78d2c1c3
- Branch / Tag: refs/tags/0.0.2-alpha
- Owner: https://github.com/amit-timalsina
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@19c886509be0ecf463b9868f4a0fa60a78d2c1c3
- Trigger Event: release