A document classification library using ModernBERT
ModernBERT Document Classification
A robust document classification library leveraging ModernBERT and Docling for state-of-the-art text extraction and classification.
Features
- Advanced Extraction: Uses docling and langchain to parse complex PDFs, images, and text files with OCR and layout analysis.
- ModernBERT: Fine-tuned on your data for high-performance classification.
- End-to-End Pipeline: From raw files to trained model and inference.
- Metrics & Visualization: Automatically generates confusion matrices and classification reports (precision, recall, F1).
- CLI Support: Easy-to-use command-line interface.
Installation
pip install .
Note: Requires torch and docling, plus the system packages tesseract-ocr and poppler-utils (for PDF and image processing).
Usage
1. Data Preparation
Organize your data in a folder (e.g., data_root, the default in the config) where each subfolder represents a class.
data_root/
├── invoice/
│   ├── file1.pdf
│   └── file2.png
└── resume/
    ├── file3.txt
    └── file4.pdf
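To make the folder convention concrete, here is a minimal sketch of how a folder-per-class layout maps to (file, label) pairs. It is not the library's own loader: the function name `list_labeled_files` and the `SUPPORTED_SUFFIXES` set are hypothetical and chosen only for illustration.

```python
from pathlib import Path

# Assumption for illustration only: the library may accept other file types.
SUPPORTED_SUFFIXES = {".pdf", ".png", ".jpg", ".jpeg", ".txt"}


def list_labeled_files(root):
    """Return sorted (file_path, label) pairs; each subfolder name is the label."""
    root = Path(root)
    pairs = []
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for file_path in sorted(class_dir.iterdir()):
            # Skip nested folders and files with unsupported extensions.
            if file_path.is_file() and file_path.suffix.lower() in SUPPORTED_SUFFIXES:
                pairs.append((file_path, class_dir.name))
    return pairs
```

Calling `list_labeled_files("data_root")` on the tree above would pair file1.pdf and file2.png with the label invoice, and file3.txt and file4.pdf with the label resume.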
2. Training (One-Step)
Run the training command. If dataset.csv does not exist, the library will automatically extract text from files in data_dir (defined in configs/train_config.yaml) before training.
python3 -m modernbert_doc_cls.cli train --config configs/train_config.yaml
Note: You can still run modernbert-cls extract manually if preferred.
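The "extract only if the dataset is missing" behavior described above can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: `ensure_dataset` is a hypothetical helper, `extract_text` is a caller-supplied stand-in for the Docling-based extraction, and the CSV column names `text` and `label` are guesses.

```python
import csv
from pathlib import Path


def ensure_dataset(csv_path, data_dir, extract_text):
    """Build csv_path from data_dir if it is missing; return True if it was built."""
    csv_path = Path(csv_path)
    if csv_path.exists():
        return False  # Reuse the existing dataset and skip extraction.

    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        # Column names are an assumption, not the library's documented schema.
        writer = csv.DictWriter(handle, fieldnames=["text", "label"])
        writer.writeheader()
        for class_dir in sorted(p for p in Path(data_dir).iterdir() if p.is_dir()):
            for file_path in sorted(p for p in class_dir.iterdir() if p.is_file()):
                # The subfolder name becomes the class label for every file in it.
                writer.writerow(
                    {"text": extract_text(file_path), "label": class_dir.name}
                )
    return True
```

Because the existence check comes first, a second training run reuses the extracted text instead of repeating OCR, which is typically the slowest step.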
Create a configuration file (e.g., configs/train_config.yaml):
data_path: "dataset.csv"
model_id: "answerdotai/ModernBERT-base"
results_dir: "results"
training_args:
  train_batch_size: 4
  eval_batch_size: 4
  learning_rate: 2e-5
  num_train_epochs: 3
  save_strategy: "epoch"
  eval_strategy: "epoch"
  logging_strategy: "steps"
  logging_steps: 10
  save_total_limit: 2
  metric_for_best_model: "f1"
Run training:
python3 -m modernbert_doc_cls.cli train --config configs/train_config.yaml
The model and metrics will be saved in results/.
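Since the configuration selects the best checkpoint by f1, it may help to see how precision, recall, and F1 follow from a confusion matrix. The sketch below uses plain Python and hypothetical function names. It reports a macro average, which is an assumption: the averaging mode the library uses for its f1 metric is not stated here.

```python
def per_class_metrics(confusion, labels):
    """Compute precision, recall, and F1 per class.

    confusion[i][j] is the number of samples whose true class is labels[i]
    and whose predicted class is labels[j].
    """
    metrics = {}
    n = len(labels)
    for k, name in enumerate(labels):
        tp = confusion[k][k]
        fp = sum(confusion[i][k] for i in range(n)) - tp  # predicted k, truly other
        fn = sum(confusion[k]) - tp  # truly k, predicted other
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[name] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


def macro_f1(metrics):
    """Unweighted mean of the per-class F1 scores."""
    return sum(m["f1"] for m in metrics.values()) / len(metrics)
```

For example, with labels `["invoice", "resume"]` and the matrix `[[8, 2], [1, 9]]`, the invoice class has 8 true positives, 1 false positive, and 2 false negatives, so its F1 is 16/19.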
3. Inference
CLI Usage
Predict the class of a new document.
python3 -m modernbert_doc_cls.cli predict \
    --model_path results/model \
    --encoder_path results/label_encoder.pkl \
    --file path/to/document.pdf
Python API Usage
You can also use the library directly in your Python scripts:
from modernbert_doc_cls.inference import DocumentClassifier
# Initialize the classifier
classifier = DocumentClassifier(
    model_path="results/model",
    encoder_path="results/label_encoder.pkl"
)
# Predict
label, confidence = classifier.predict("data_root/class2/invoice.pdf")
print(f"Predicted: {label} ({confidence:.4f})")
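The single-file call above extends naturally to a whole folder. The following is a minimal sketch, assuming only what the example shows: that `predict` takes a file path and returns a `(label, confidence)` pair. The helper `classify_folder` and its `min_confidence` threshold are hypothetical and not part of the library.

```python
from pathlib import Path


def classify_folder(classifier, folder, min_confidence=0.5):
    """Run classifier.predict on every file under folder.

    Returns one dict per file; predictions below min_confidence are
    flagged with needs_review=True so they can be checked by hand.
    """
    results = []
    for path in sorted(Path(folder).rglob("*")):
        if not path.is_file():
            continue
        label, confidence = classifier.predict(str(path))
        results.append(
            {
                "file": str(path),
                "label": label,
                "confidence": confidence,
                "needs_review": confidence < min_confidence,
            }
        )
    return results
```

Passing the `classifier` created above, for instance `classify_folder(classifier, "incoming_docs")`, would return one record per file. Because the helper only relies on a `predict` method, it can be exercised with a stub object before a trained model is available.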
License
MIT License