
A document classification library using ModernBERT

ModernBERT Document Classification

A document classification library that uses Docling for text extraction and ModernBERT for classification.

Features

  • Advanced Extraction: Uses docling and langchain to parse complex PDFs, images, and text files with OCR and layout analysis.
  • ModernBERT: Fine-tuned on your data for classification.
  • End-to-End Pipeline: From raw files to trained model and inference.
  • Metrics & Visualization: Automatically generates confusion matrices and classification reports (precision, recall, F1).
  • CLI Support: Easy-to-use command-line interface.
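
For readers unfamiliar with the metrics listed above, the following is a minimal standard-library sketch of how a confusion matrix and per-class precision, recall, and F1 are derived from true and predicted labels. It is an illustration only, not the library's reporting code; the function names `confusion_counts` and `per_class_report` are hypothetical.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count (true_label, predicted_label) pairs: a sparse confusion matrix."""
    return Counter(zip(y_true, y_pred))

def per_class_report(y_true, y_pred):
    """Return {label: (precision, recall, f1)} computed from the confusion counts."""
    counts = confusion_counts(y_true, y_pred)
    labels = sorted(set(y_true) | set(y_pred))
    report = {}
    for label in labels:
        tp = counts[(label, label)]  # correctly predicted as this class
        predicted = sum(n for (t, p), n in counts.items() if p == label)
        actual = sum(n for (t, p), n in counts.items() if t == label)
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = (precision, recall, f1)
    return report

# Toy labels, not real results
y_true = ["invoice", "invoice", "resume", "resume"]
y_pred = ["invoice", "resume", "resume", "resume"]
report = per_class_report(y_true, y_pred)
for label, (p, r, f1) in report.items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

In this toy example one invoice is misclassified as a resume, so invoice recall drops while resume precision drops.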

Installation

pip install .

Note: Requires torch and docling, plus the system packages tesseract-ocr and poppler-utils for PDF and image processing.

Usage

1. Data Preparation

Organize your data in a folder (for example data_root, the default in the config) in which each subfolder represents a class.

data_root/
  ├── invoice/
  │   ├── file1.pdf
  │   └── file2.png
  └── resume/
      ├── file3.txt
      └── file4.pdf
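
The layout above implies that each file's label is the name of its parent folder. As a rough sketch of that convention (not the library's own loader; `list_labeled_files` is a hypothetical helper), the pairs could be collected like this:

```python
from pathlib import Path

def list_labeled_files(data_root):
    """Return sorted (file_path, label) pairs, using each subfolder name as the label."""
    pairs = []
    for class_dir in sorted(Path(data_root).iterdir()):
        if not class_dir.is_dir():
            continue  # ignore stray files at the top level
        for file_path in sorted(class_dir.iterdir()):
            if file_path.is_file():
                pairs.append((str(file_path), class_dir.name))
    return pairs
```

Calling `list_labeled_files("data_root")` on the tree shown above would yield two pairs labeled `invoice` and two labeled `resume`, which is a quick way to check class balance before training.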

2. Training (One-Step)

Training is driven by a configuration file. If dataset.csv does not exist, the library automatically extracts text from the files in data_dir (defined in configs/train_config.yaml) before training.

Note: You can still run modernbert-cls extract manually if preferred.

Create a configuration file (e.g., configs/train_config.yaml):

data_path: "dataset.csv"
model_id: "answerdotai/ModernBERT-base"
results_dir: "results"

training_args:
  train_batch_size: 4
  eval_batch_size: 4
  learning_rate: 2e-5
  num_train_epochs: 3
  save_strategy: "epoch"
  eval_strategy: "epoch"
  logging_strategy: "steps"
  logging_steps: 10
  save_total_limit: 2
  metric_for_best_model: "f1"
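
To get a feel for what these values mean in practice, here is a small arithmetic sketch. It assumes train_batch_size is the number of examples per optimizer step on a single device with no gradient accumulation, and that logging_steps counts optimizer steps; the README does not confirm either point. The figure of 100 training examples is hypothetical, as is the helper `training_schedule`.

```python
import math

def training_schedule(num_examples, train_batch_size, num_train_epochs, logging_steps):
    """Estimate steps per epoch, total steps, and number of log events."""
    steps_per_epoch = math.ceil(num_examples / train_batch_size)
    total_steps = steps_per_epoch * num_train_epochs
    log_events = total_steps // logging_steps
    return steps_per_epoch, total_steps, log_events

# Values from the config above, with an assumed 100 training examples
steps_per_epoch, total_steps, log_events = training_schedule(
    num_examples=100, train_batch_size=4, num_train_epochs=3, logging_steps=10
)
print(steps_per_epoch, total_steps, log_events)  # 25 75 7
```

Under these assumptions, save_strategy and eval_strategy set to "epoch" would trigger three evaluations and three checkpoints, of which save_total_limit: 2 keeps at most two.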

Run training:

python3 -m modernbert_doc_cls.cli train --config configs/train_config.yaml

The model and metrics will be saved in results/.

3. Inference

CLI Usage

Predict the class of a new document.

python3 -m modernbert_doc_cls.cli predict \
  --model_path results/model \
  --encoder_path results/label_encoder.pkl \
  --file path/to/document.pdf

Python API Usage

You can also use the library directly in your Python scripts:

from modernbert_doc_cls.inference import DocumentClassifier

# Initialize the classifier
classifier = DocumentClassifier(
    model_path="results/model",
    encoder_path="results/label_encoder.pkl"
)

# Predict
label, confidence = classifier.predict("data_root/invoice/file1.pdf")
print(f"Predicted: {label} ({confidence:.4f})")
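
Building on the call shown above, the following sketch classifies every file in a folder and flags low-confidence results for manual review. It relies only on predict(path) returning a (label, confidence) pair, as in the example above. The helper `classify_folder`, the `min_confidence` threshold, and the `needs_review` field are hypothetical, and treating confidence as a probability between 0 and 1 is an assumption.

```python
from pathlib import Path

def classify_folder(predict, folder, min_confidence=0.5):
    """Run predict(path) -> (label, confidence) on each file and flag weak predictions."""
    results = []
    for file_path in sorted(Path(folder).iterdir()):
        if not file_path.is_file():
            continue  # skip subfolders
        label, confidence = predict(str(file_path))
        results.append({
            "file": file_path.name,
            "label": label,
            "confidence": confidence,
            # Assumed convention: confidence is a probability in [0, 1]
            "needs_review": confidence < min_confidence,
        })
    return results
```

With the classifier from the example above, this could be called as `classify_folder(classifier.predict, "incoming/")`, where `incoming/` is a placeholder folder name. Passing the predict function as an argument keeps the helper independent of how the classifier is constructed.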

License

MIT License
