A tool to classify images

Project description

DocumentsClassifier is a Python library for classifying documents based on their image and text content.

This library is designed to help you process and organize large sets of documents, making it useful for various applications such as image-based document classification and clustering.

Usage

Prepare a folder containing your document images.
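Before classifying, it can help to confirm that the folder actually contains images. The sketch below is not part of DocumentsClassifier; the helper name `list_images` and the set of accepted extensions are assumptions made for illustration, since the formats the library accepts are not documented here.

```python
from pathlib import Path

# Assumption: these extensions are only a guess at common image formats;
# the formats DocumentsClassifier accepts are not stated in this README.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}


def list_images(folder):
    """Return the image files found directly inside `folder`, sorted by path."""
    folder = Path(folder)
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )
```

If `list_images` returns an empty list, the folder path or the file extensions are worth checking before running the classifier.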

import DocumentsClassifier as DC

# declare folder path
images_path = 'path/to/your/documents/folder'

# classify the images in that folder
DC.classify(images_path)
 >> Clusterd successfully

After running the code, your images will be sorted into subfolders inside the root folder you declared.
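To inspect the result, the subfolders can be listed with the standard library. This is a minimal sketch that assumes each class ends up as one subfolder directly under the folder you passed in; `summarize_clusters` is a hypothetical helper, not a function of the library.

```python
from pathlib import Path


def summarize_clusters(root):
    """Map each subfolder name under `root` to the number of files directly in it."""
    root = Path(root)
    summary = {}
    for sub in sorted(p for p in root.iterdir() if p.is_dir()):
        summary[sub.name] = sum(1 for f in sub.iterdir() if f.is_file())
    return summary


# Same placeholder path as in the usage example; nothing runs if it does not exist.
images_path = Path("path/to/your/documents/folder")
if images_path.is_dir():
    for name, count in summarize_clusters(images_path).items():
        print(f"{name}: {count} file(s)")
```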

Some limitations

This package needs to load several pretrained models from Hugging Face:

  1. DefSent: model for word embeddings.
  2. Classify Model: our pretrained BEiT for classifying images.
  3. Image Extractor: our pretrained BEiT for extracting features.

We also use PaddleOCR to extract text, so the first run may be slow because the pretrained models have to be downloaded.
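To see how much slower the first run is than later ones, the call can be timed. The sketch below uses only the standard library; `timed` is a hypothetical helper, and the commented usage line assumes `DC.classify` takes the folder path as shown in the Usage section.

```python
import time


def timed(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# Example (not executed here):
# _, seconds = timed(DC.classify, images_path)
# print(f"classification took {seconds:.1f} s")
```

Running it twice on the same folder should show the second call finishing faster once the models are cached locally, assuming the download dominates the first run.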

The project is still under development, so thank you for your patience.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contact & Contributing

If you have any questions or suggestions, please contact us at hungdtse171849@fpt.edu.vn or phuongtnse161960@fpt.edu.vn.

Download files

Source Distribution

Documents-Classifier-0.0.1.tar.gz (9.0 kB)

Built Distribution

Documents_Classifier-0.0.1-py3-none-any.whl (8.8 kB)
