micm-nlp
NLP research toolkit for tokenization, pretraining, fine-tuning, and PEFT across encoder-only, decoder-only, and encoder-decoder architectures. Built on top of HuggingFace transformers, peft, and datasets.
About
micm-nlp is a config-driven research toolkit for multilingual NLP work. It wraps the HuggingFace stack with a small set of high-level building blocks — CONFIG, TOKENIZER, DATASET, MODEL, and a unified TRAINER — that compose into reproducible training, fine-tuning, and evaluation pipelines.
The toolkit was used in the Cross-Prompt Encoder for Low-Performing Languages paper (Findings of IJCNLP–AACL 2025; ACL Anthology) and in A Comparison of Different Tokenization Methods for the Georgian Language (ICNLSP 2024; ACL Anthology).
This v0.1.0 release ships two examples that exercise a single use case end-to-end: preprocessing and decoder-only PEFT fine-tuning (XPE) on an FTP-reframed multilingual dataset hosted on the HuggingFace Hub. The toolkit's API surface is broader than these two examples demonstrate.
Additional examples covering encoder-only text classification, encoder-decoder seq2seq, and MLM pretraining will land in subsequent releases. Contributions and issue reports are welcome.
Install
From PyPI:
pip install micm-nlp
From source (development):
git clone https://github.com/bmikaberidze/micm-nlp.git
cd micm-nlp
pip install -e ".[dev]"
Docker (recommended for reproducibility on GPU machines):
docker build -t micm-nlp .
docker run --gpus all -it --rm -v $(pwd):/app -w /app micm-nlp bash
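If you keep credentials in a .env file (see the next step), Docker's standard --env-file flag forwards them into the container:

```bash
# Same invocation as above, plus the credentials file
docker run --gpus all -it --rm --env-file .env -v $(pwd):/app -w /app micm-nlp bash
```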
You will also want a .env file for HuggingFace and Weights & Biases credentials:
cp .env.example .env
# Then add WANDB_API_KEY and (if needed) HF_TOKEN.
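For reference, the resulting .env needs only the two keys named above. A minimal sketch, with placeholder values (keep this file out of version control):

```bash
# .env (placeholders; never commit real tokens)
WANDB_API_KEY=<your-wandb-api-key>
HF_TOKEN=<your-hf-access-token>   # only needed for gated or private Hub repos
```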
Quickstart
import micm_nlp
from micm_nlp.config import CONFIG
from micm_nlp.pipeline import run
micm_nlp.init() # Rich pretty-printing + traceback formatting
config = CONFIG.from_yaml("examples/configs/xsc_finetune.yml")
model, test_output = run(config)
run(config) chains: load tokenizer → load and preprocess dataset → load model (with PEFT if configured) → train → evaluate. Every stage is configured by YAML; no plumbing code required.
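For orientation, here is a minimal sketch of what such a YAML might contain. The top-level blocks mirror the five stages, but every key name below is an illustrative assumption rather than the toolkit's actual schema; treat examples/configs/xsc_finetune.yml as the authoritative reference.

```yaml
# Illustrative sketch only; key names are assumptions, not micm-nlp's real schema.
tokenizer:
  name: bigscience/bloom-560m    # HF Hub checkpoint whose tokenizer to load
dataset:
  hub_id: <hub-dataset-id>       # FTP-reframed dataset repo on the HF Hub (placeholder)
  split: ar                      # language split, e.g. Arabic in the fine-tuning example
model:
  name: bigscience/bloom-560m
  peft:
    method: xpe                  # Cross-Prompt Encoder
trainer:
  epochs: 3
  learning_rate: 3.0e-4
  report_to: wandb               # TRAINER integrates WandB logging
```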
Package tour
micm_nlp/
├── pipeline.py # Top-level wiring: load_dataset, preprocess_dataset, load_model, run
├── config.py # CONFIG.from_yaml; resolves nested namespaces
├── tokenizers/ # Tokenizer factory (XLM-R, BERT, BLOOM, T5, ...)
├── datasets/ # DATASET class — local + HF Hub + HF saved + CSV/TXT/JSON
├── models/ # MODEL wrapper, PEFT dispatch, XPE module, training callbacks
├── training/ # TRAINER — wraps HF Trainer with custom callbacks + WandB
└── evals/ # Metrics, confusion matrices, plotting helpers
The five-stage flow:
from micm_nlp.config import CONFIG
from micm_nlp.tokenizers.tokenizer import load as load_tokenizer
from micm_nlp.datasets.dataset import DATASET
from micm_nlp.models.model import MODEL
from micm_nlp.training.runner import TRAINER
config = CONFIG.from_yaml("path/to/config.yml")
tokenizer = load_tokenizer(config)
dataset = DATASET(config)
dataset.preprocess(tokenizer)
model = MODEL(config)
trainer = TRAINER(model, dataset, tokenizer)
test_output = trainer.run()
Examples
| Example | Config | Description |
|---|---|---|
| examples/preprocess_dataset.py | examples/configs/xsc_preprocess.yml | Loads FTP-reframed XStoryCloze (English split) directly from the HuggingFace Hub and tokenizes it for BLOOM-560M; saves tokenized output locally. |
| examples/run_model.py | examples/configs/xsc_finetune.yml | Fine-tunes BLOOM-560M with XPE PEFT on the Arabic split of FTP-reframed XStoryCloze, then evaluates. |
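Assuming each script is self-contained and loads its config internally (an assumption; check the script headers for the exact invocation), running the examples from the repository root would look like:

```bash
python examples/preprocess_dataset.py   # tokenize FTP-reframed XStoryCloze for BLOOM-560M
python examples/run_model.py            # XPE PEFT fine-tune on the Arabic split, then evaluate
```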
More examples — encoder-only text classification, encoder-decoder seq2seq, MLM pretraining, additional PEFT methods (LoRA, Prefix, P-Tuning) — are planned for subsequent releases.
Supported architectures
| Architecture | Toolkit support | Demonstrated by example in v0.1.0 |
|---|---|---|
| Decoder-only (BLOOM, AYA) | ✅ | ✅ |
| Encoder-only (BERT, XLM-R) | ✅ | ⏳ planned |
| Encoder-decoder (T5) | ✅ | ⏳ planned |
PEFT methods supported by the toolkit: LoRA, Prefix Tuning, P-Tuning (SPT), Cross-Prompt Encoder (XPE). v0.1.0 examples demonstrate XPE only.
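Method selection presumably lives in the model/PEFT block of the YAML; the snippet below is a guess at what switching methods might look like. Key names are assumptions, and the shipped xsc_finetune.yml shows the real schema.

```yaml
# Illustrative only; key names are assumptions, not micm-nlp's real schema.
peft:
  method: lora       # one of: lora | prefix | ptuning | xpe
  r: 16              # example LoRA rank; method-specific knobs would live here
  lora_alpha: 32
```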
Development
pip install -e ".[dev]"
ruff check src/
ruff format src/
pytest
Contributing
Pull requests are welcome. For non-trivial changes, please open an issue first to discuss the proposed change. A CONTRIBUTORS.md will be added with the first external contribution.
Acknowledgements
micm-nlp was developed at the Muskhelishvili Institute of Computational Mathematics (MICM, Georgian Technical University), in close research collaboration with Teimuraz Saghinadze (MICM), Simon Ostermann (DFKI / CERTAIN), and Philipp Müller (Max Planck Institute for Intelligent Systems), whose joint work on the Cross-Prompt Encoder (XPE) drove much of the toolkit's design and validation.
This work was partially supported by the European Union under Horizon Europe project "GAIN" (GA #101078950) and by the German Federal Ministry of Research, Technology and Space (BMFTR) as part of the project TRAILS (01IW24005).
Citation
If you use micm-nlp in your research, please cite the package and (if relevant to your work) the XPE paper that drove its design:
@software{micm_nlp,
author = {Mikaberidze, Beso},
title = {micm-nlp: NLP research toolkit for multilingual fine-tuning and PEFT},
url = {https://github.com/bmikaberidze/micm-nlp},
version = {0.1.0},
year = {2026},
}
@misc{mikaberidze2025crosspromptencoderlowperforminglanguages,
title = {Cross-Prompt Encoder for Low-Performing Languages},
author = {Beso Mikaberidze and Teimuraz Saghinadze and Simon Ostermann and Philipp M{\"u}ller},
year = {2025},
eprint = {2508.10352},
archivePrefix = {arXiv},
primaryClass = {cs.CL},
url = {https://arxiv.org/abs/2508.10352},
}
Contact
beso.mikaberidze@gmail.com