A document classification library using ModernBERT

ModernBERT Document Classification

A document classification library that uses Docling for text extraction and a fine-tuned ModernBERT model for classification.

Features

Advanced Extraction: Uses docling and langchain to parse complex PDFs, images, and text files with OCR and layout analysis.
ModernBERT: Fine-tuned on your data for classification.
End-to-End Pipeline: From raw files to trained model and inference.
Metrics & Visualization: Automatically generates confusion matrices and classification reports (precision, recall, F1).
CLI Support: Easy-to-use command-line interface.

Installation

To install from a local checkout of the source, run this from the repository root:

pip install .

Note: Requires torch and docling, plus the system packages tesseract-ocr and poppler-utils (for PDF and image processing).
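Before running extraction, it can help to confirm that the system binaries are on your PATH. The following is a minimal sketch, assuming tesseract-ocr provides the `tesseract` executable and poppler-utils provides `pdftoppm`. The helper `missing_binaries` is hypothetical and not part of this library.

```python
import shutil

# Executables assumed to be installed by tesseract-ocr and poppler-utils.
REQUIRED_BINARIES = ("tesseract", "pdftoppm")


def missing_binaries(names=REQUIRED_BINARIES):
    """Return the names that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_binaries()
    if missing:
        print("Missing system dependencies:", ", ".join(missing))
    else:
        print("All system dependencies found.")
```

An empty list means both executables were found; anything else names the package that still needs installing.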

Usage

1. Data Preparation

Organize your data in a folder (e.g., data_root, the default in the config) in which each subfolder represents a class and contains that class's documents.

data_root/
  ├── invoice/
  │   ├── file1.pdf
  │   └── file2.png
  └── resume/
      ├── file3.txt
      └── file4.pdf
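To check that a folder follows this layout, the file/label pairs can be listed with a few lines of Python. This is only a sketch of the folder-per-class convention described above; `list_labeled_files` is a hypothetical helper, not the library's own extraction code.

```python
from pathlib import Path


def list_labeled_files(data_root):
    """Return sorted (file_path, label) pairs, one per file in each class subfolder."""
    pairs = []
    for class_dir in sorted(Path(data_root).iterdir()):
        if not class_dir.is_dir():
            continue  # ignore stray files at the top level
        for file_path in sorted(class_dir.iterdir()):
            if file_path.is_file():
                # The subfolder name is used as the class label.
                pairs.append((str(file_path), class_dir.name))
    return pairs
```

Calling `list_labeled_files("data_root")` on the tree above would pair each file under invoice/ with the label `invoice` and each file under resume/ with the label `resume`.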

2. Training (One-Step)

The train command handles extraction and training in one step. If dataset.csv does not exist, the library automatically extracts text from the files in data_dir (defined in configs/train_config.yaml) before training.

Note: You can still run modernbert-cls extract manually if preferred.

Create a configuration file (e.g., configs/train_config.yaml):

data_path: "dataset.csv"
model_id: "answerdotai/ModernBERT-base"
results_dir: "results"

training_args:
  train_batch_size: 4
  eval_batch_size: 4
  learning_rate: 2e-5
  num_train_epochs: 3
  save_strategy: "epoch"
  eval_strategy: "epoch"
  logging_strategy: "steps"
  logging_steps: 10
  save_total_limit: 2
  metric_for_best_model: "f1"
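To get a feel for how these settings interact, the number of optimizer steps can be estimated with simple arithmetic. This sketch assumes a single device and no gradient accumulation, and the dataset size of 100 training samples is a made-up figure for illustration. The helper name is hypothetical.

```python
import math


def estimate_training_steps(num_samples, train_batch_size, num_train_epochs):
    """Estimate total optimizer steps (single device, no gradient accumulation)."""
    steps_per_epoch = math.ceil(num_samples / train_batch_size)
    return steps_per_epoch * num_train_epochs


# Hypothetical dataset of 100 training samples with the config values above.
total_steps = estimate_training_steps(
    num_samples=100, train_batch_size=4, num_train_epochs=3
)
log_events = total_steps // 10  # logging_steps: 10
print(total_steps, log_events)  # 75 7
```

With a batch size of 4, each epoch takes 25 steps, so three epochs take 75 steps and produce roughly seven logging events.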

Run training:

python3 -m modernbert_doc_cls.cli train --config configs/train_config.yaml

The trained model and metrics are saved in results/ (the results_dir set in the config).
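For readers unfamiliar with the reported metrics, the sketch below shows how precision, recall, and F1 follow from a confusion matrix. It is an illustration of the standard definitions, not the library's reporting code, and the counts are invented.

```python
def per_class_metrics(confusion, labels):
    """Compute precision, recall, and F1 per class.

    confusion[i][j] counts samples whose true class is labels[i]
    and whose predicted class is labels[j].
    """
    metrics = {}
    for k, label in enumerate(labels):
        tp = confusion[k][k]
        predicted_k = sum(row[k] for row in confusion)  # column sum
        actual_k = sum(confusion[k])  # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


# Made-up counts: 8 invoices classified correctly and 2 mistaken for resumes;
# 9 resumes classified correctly and 1 mistaken for an invoice.
example = per_class_metrics([[8, 2], [1, 9]], ["invoice", "resume"])
```

Here the invoice class has a recall of 8/10 and a precision of 8/9, because one resume was also predicted as an invoice.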

3. Inference

CLI Usage

Predict the class of a new document.

python3 -m modernbert_doc_cls.cli predict \
  --model_path results/model \
  --encoder_path results/label_encoder.pkl \
  --file path/to/document.pdf

Python API Usage

You can also use the library directly in your Python scripts:

from modernbert_doc_cls.inference import DocumentClassifier

# Initialize the classifier
classifier = DocumentClassifier(
    model_path="results/model",
    encoder_path="results/label_encoder.pkl"
)

# Predict
label, confidence = classifier.predict("data_root/class2/invoice.pdf")
print(f"Predicted: {label} ({confidence:.4f})")
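The same predict call can be applied to a whole folder. The sketch below assumes only what the example above shows, namely that predict takes a file path and returns a (label, confidence) pair. The helper `classify_folder` and the 0.5 review threshold are hypothetical choices for illustration.

```python
from pathlib import Path


def classify_folder(classifier, folder, min_confidence=0.5):
    """Classify every file in `folder` and flag low-confidence results.

    `classifier` is any object with predict(path) -> (label, confidence),
    such as the DocumentClassifier shown above.
    """
    results = []
    for file_path in sorted(Path(folder).iterdir()):
        if not file_path.is_file():
            continue  # skip subfolders
        label, confidence = classifier.predict(str(file_path))
        results.append({
            "file": file_path.name,
            "label": label,
            "confidence": float(confidence),
            # Mark predictions below the threshold for manual review.
            "needs_review": confidence < min_confidence,
        })
    return results
```

Passing the classifier in as an argument keeps the model loaded once for the whole batch and makes the helper easy to exercise with a stand-in object.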

License

MIT License

Project details

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Release history

This version

0.1.0

Jan 6, 2026

Download files

Source Distribution

modernbert_doc_cls-0.1.0.tar.gz (10.6 kB view details)

Uploaded Jan 6, 2026 Source

Built Distribution

modernbert_doc_cls-0.1.0-py3-none-any.whl (11.1 kB view details)

Uploaded Jan 6, 2026 Python 3

File details

Details for the file modernbert_doc_cls-0.1.0.tar.gz.

File metadata

Download URL: modernbert_doc_cls-0.1.0.tar.gz
Upload date: Jan 6, 2026
Size: 10.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for modernbert_doc_cls-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`1abc8ab84d51c99f52379af40349f4ed27ca3e19f799280ee2411979b40ebb89`
MD5	`3ca978c6b4d5ce6788b8fb4e98a3b0b2`
BLAKE2b-256	`45b60a9828ba39eb695f4eeab8e4518912ad51d8e8ba2ab6e3587af068631518`

File details

Details for the file modernbert_doc_cls-0.1.0-py3-none-any.whl.

File metadata

Download URL: modernbert_doc_cls-0.1.0-py3-none-any.whl
Upload date: Jan 6, 2026
Size: 11.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for modernbert_doc_cls-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`dca2df53246d50d470e81b20184f9cef459f107896a05c204a80bd383348fb44`
MD5	`9515b6dbde687c9f9d8b47ca6f865b5d`
BLAKE2b-256	`6eaaee2bf190ee5c8c222ba1a274dc2d104308a2a5438a304497f5cbbac4e383`

modernbert-doc-cls 0.1.0
