
micm-nlp

License: MIT

NLP research toolkit for tokenization, pretraining, fine-tuning, and PEFT across encoder-only, decoder-only, and encoder-decoder architectures. Built on top of HuggingFace transformers, peft, and datasets.

About

micm-nlp is a config-driven research toolkit for multilingual NLP work. It wraps the HuggingFace stack with a small set of high-level building blocks — CONFIG, TOKENIZER, DATASET, MODEL, and a unified TRAINER — that compose into reproducible training, fine-tuning, and evaluation pipelines. The toolkit was used in the Cross-Prompt Encoder for Low-Performing Languages paper (Findings of IJCNLP–AACL 2025, ACL Anthology) and in A Comparison of Different Tokenization Methods for the Georgian Language (ICNLSP 2024, ACL Anthology).

This v0.1.0 release ships two examples that exercise a single use case end-to-end: preprocessing and decoder-only PEFT fine-tuning (XPE) on an FTP-reframed multilingual dataset hosted on the HuggingFace Hub. The toolkit's underlying API surface is broader than these two examples demonstrate.

Additional examples covering encoder-only text classification, encoder-decoder seq2seq, and MLM pretraining will land in subsequent releases. Contributions and issue reports are welcome.

Install

From PyPI:

pip install micm-nlp

From source (development):

git clone https://github.com/bmikaberidze/micm-nlp.git
cd micm-nlp
pip install -e ".[dev]"

Docker (recommended for reproducibility on GPU machines):

docker build -t micm-nlp .
docker run --gpus all -it --rm -v $(pwd):/app -w /app micm-nlp bash

You will also want a .env file for HuggingFace and Weights & Biases credentials:

cp .env.example .env
# Then add WANDB_API_KEY and (if needed) HF_TOKEN.
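
If you prefer to load those credentials from Python itself (for example in a notebook) rather than relying on your shell, the standard python-dotenv pattern works; python-dotenv is a suggestion here, not a documented micm-nlp dependency:

import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads key=value pairs from ./.env into the process environment
assert os.environ.get("WANDB_API_KEY"), "WANDB_API_KEY is missing from .env"
hf_token = os.environ.get("HF_TOKEN")  # only needed for gated or private Hub repos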

Quickstart

import micm_nlp
from micm_nlp.config import CONFIG
from micm_nlp.pipeline import run

micm_nlp.init()  # Rich pretty-printing + traceback formatting

config = CONFIG.from_yaml("examples/configs/xsc_finetune.yml")
model, test_output = run(config)

run(config) chains: load tokenizer → load and preprocess dataset → load model (with PEFT if configured) → train → evaluate. Every stage is configured by YAML; no plumbing code required.

Package tour

micm_nlp/
├── pipeline.py     # Top-level wiring: load_dataset, preprocess_dataset, load_model, run
├── config.py       # CONFIG.from_yaml; resolves nested namespaces
├── tokenizers/     # Tokenizer factory (XLM-R, BERT, BLOOM, T5, ...)
├── datasets/       # DATASET class — local + HF Hub + HF saved + CSV/TXT/JSON
├── models/         # MODEL wrapper, PEFT dispatch, XPE module, training callbacks
├── training/       # TRAINER — wraps HF Trainer with custom callbacks + WandB
└── evals/          # Metrics, confusion matrices, plotting helpers

The five-stage flow:

from micm_nlp.config import CONFIG
from micm_nlp.tokenizers.tokenizer import load as load_tokenizer
from micm_nlp.datasets.dataset import DATASET
from micm_nlp.models.model import MODEL
from micm_nlp.training.runner import TRAINER

config = CONFIG.from_yaml("path/to/config.yml")
tokenizer = load_tokenizer(config)
dataset = DATASET(config)
dataset.preprocess(tokenizer)
model = MODEL(config)
trainer = TRAINER(model, dataset, tokenizer)
test_output = trainer.run()
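
TRAINER wraps the HuggingFace Trainer (see training/ in the package tour). For orientation, the raw Trainer plumbing it stands in for at the train and evaluate stage looks roughly like this; the model id and the tiny in-memory dataset below are placeholders for illustration, not micm-nlp defaults:

# Rough raw-HF equivalent of the final train + evaluate stage (illustrative only).
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")

# A toy pre-tokenized dataset standing in for DATASET.preprocess(tokenizer).
texts = ["a tiny placeholder corpus", "just enough to run the loop"]
toy = Dataset.from_dict({"input_ids": tokenizer(texts)["input_ids"]})

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="outputs", num_train_epochs=1, report_to="none"),
    train_dataset=toy,
    eval_dataset=toy,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
metrics = trainer.evaluate()  # micm-nlp's TRAINER adds callbacks, WandB logging, and evals on top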

Examples

Example: examples/preprocess_dataset.py
Config:  examples/configs/xsc_preprocess.yml
Loads FTP-reframed XStoryCloze (English split) directly from the HuggingFace Hub and tokenizes it for BLOOM-560M; saves the tokenized output locally (see the standalone sketch below).

Example: examples/run_model.py
Config:  examples/configs/xsc_finetune.yml
Fine-tunes BLOOM-560M with XPE PEFT on the Arabic split of FTP-reframed XStoryCloze, then evaluates.
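
For a sense of what the preprocessing example does under the hood, here is a rough standalone sketch using the plain datasets and transformers APIs; the Hub repo id, text field name, and output path are placeholders rather than values taken from the actual configs:

# What the preprocessing example boils down to with the plain HF libraries.
# "your-org/xstorycloze-ftp", the "text" field, and the output path are
# placeholders; the real values live in examples/configs/xsc_preprocess.yml.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
raw = load_dataset("your-org/xstorycloze-ftp", "en")     # placeholder repo id / config
tokenized = raw.map(
    lambda ex: tokenizer(ex["text"], truncation=True),   # "text" field is assumed
    batched=True,
)
tokenized.save_to_disk("data/xsc_en_bloom560m")           # illustrative local path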

More examples — encoder-only text classification, encoder-decoder seq2seq, MLM pretraining, additional PEFT methods (LoRA, Prefix, P-Tuning) — are planned for subsequent releases.

Supported architectures

Architecture                  Toolkit support   Demonstrated by example in v0.1.0
Decoder-only (BLOOM, AYA)     ✅                ✅ (XPE fine-tuning example)
Encoder-only (BERT, XLM-R)    ✅                ⏳ planned
Encoder-decoder (T5)          ✅                ⏳ planned

PEFT methods supported by the toolkit: LoRA, Prefix Tuning, P-Tuning (SPT), Cross-Prompt Encoder (XPE). v0.1.0 examples demonstrate XPE only.
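
As a point of reference for the non-XPE methods, wrapping a decoder-only model with LoRA through the underlying peft library looks like this; this is a generic peft sketch with illustrative hyperparameters, not micm-nlp's own config-driven dispatch:

# Generic LoRA setup via the peft library that micm-nlp builds on.
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")
lora = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                 # low-rank dimension (illustrative value)
    lora_alpha=16,
    lora_dropout=0.05,
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # only the LoRA adapters are trainable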

Development

pip install -e ".[dev]"
ruff check src/
ruff format src/
pytest

Contributing

Pull requests are welcome. For non-trivial changes, please open an issue first to discuss the proposed change. A CONTRIBUTORS.md will be added with the first external contribution.

Acknowledgements

micm-nlp was developed at the Muskhelishvili Institute of Computational Mathematics (MICM, Georgian Technical University), in close research collaboration with Teimuraz Saghinadze (MICM), Simon Ostermann (DFKI / CERTAIN), and Philipp Müller (Max Planck Institute for Intelligent Systems), whose joint work on the Cross-Prompt Encoder (XPE) drove much of the toolkit's design and validation.

This work was partially supported by the European Union under Horizon Europe project "GAIN" (GA #101078950) and by the German Federal Ministry of Research, Technology and Space (BMFTR) as part of the project TRAILS (01IW24005).

Citation

If you use micm-nlp in your research, please cite the package and (if relevant to your work) the XPE paper that drove its design:

@software{micm_nlp,
  author = {Mikaberidze, Beso},
  title = {micm-nlp: NLP research toolkit for multilingual fine-tuning and PEFT},
  url = {https://github.com/bmikaberidze/micm-nlp},
  version = {0.1.0},
  year = {2026},
}

@misc{mikaberidze2025crosspromptencoderlowperforminglanguages,
  title         = {Cross-Prompt Encoder for Low-Performing Languages},
  author        = {Beso Mikaberidze and Teimuraz Saghinadze and Simon Ostermann and Philipp Muller},
  year          = {2025},
  eprint        = {2508.10352},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2508.10352},
}

Contact

beso.mikaberidze@gmail.com
