COMBO-NLP - A library for Morphosyntactic Tagging and Dependency Parsing.
Features
- Two training modes: full fine-tuning and LoRA (Low-Rank Adaptation) for parameter-efficient training
- Multi-task learning: morphosyntactic tagging, lemmatization, dependency parsing
- Combined label encoding: UPOS + XPOS + FEATS as single label for efficient classification
- Biaffine attention for dependency parsing
- Character-level seq2seq lemmatization
- CoNLL 2018 metrics: UPOS, XPOS, UFeats, AllTags, Lemmas, UAS, LAS, CLAS, MLAS, BLEX
- Multi-treebank training: optionally combine multiple treebanks for the same language
- Device support: NVIDIA CUDA and Apple MPS
- WandB integration for experiment tracking
- Checkpoints after each epoch with best model selection
- Full pipeline: train → export → upload to HuggingFace Hub in one command
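The combined label encoding from the feature list can be illustrated with a small sketch. Note that the `|||` separator and function names here are assumptions for illustration, not COMBO's actual internal format:

```python
def encode_morpho_label(upos, xpos, feats):
    """Join UPOS, XPOS and FEATS into a single classification label.

    Sketch only: the '|||' separator is an assumption, chosen so it
    cannot collide with the '|' used inside FEATS values.
    """
    return "|||".join([upos, xpos, feats])

def decode_morpho_label(label):
    """Split a combined label back into (upos, xpos, feats)."""
    upos, xpos, feats = label.split("|||")
    return upos, xpos, feats
```

Treating the three columns as one label turns three classifiers into a single softmax over observed tag combinations, which is the efficiency gain the feature list refers to.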
Installation
Install the uv package manager (if not installed):
macOS / Linux
curl -LsSf https://astral.sh/uv/install.sh | sh
Windows
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
Automatic installation
Create a new virtual environment.
uv venv
source .venv/bin/activate
Install COMBO-NLP.
uv pip install combo-nlp
LAMBO segmenter (optional)
A segmenter is only needed when passing raw text strings to COMBO. If you provide pre-tokenized input (list[str] or list[list[str]]), no segmenter is required.
When you initialize COMBO with a language name (e.g. COMBO("Polish")), it automatically loads a LAMBO segmenter. If LAMBO is not installed, an ImportError is raised. LAMBO is hosted on a custom PyPI index and must be installed separately:
uv pip install --index-url https://pypi.clarin-pl.eu/ lambo
Alternatively, add the custom index to your project's pyproject.toml so that lambo resolves automatically:
[[tool.uv.index]]
url = "https://pypi.clarin-pl.eu/"
Source installation
Clone the repository and install in a virtual environment.
git clone https://gitlab.clarin-pl.eu/syntactic-tools/combo-nlp.git
cd combo-nlp
uv venv
source .venv/bin/activate
uv sync
Basic usage
After installing via pip install combo-nlp, use COMBO directly in Python:
from combo import COMBO
# Load by HuggingFace model ID:
nlp = COMBO.from_pretrained("clarin-pl/combo-nlp-xlm-roberta-base-polish-pbd-ud2.17")
result = nlp("Ala ma kota.")
# Or load by language name:
nlp = COMBO("Polish")
result = nlp("Ala ma kota.")
# Access results:
for sentence in result:
    for token in sentence:
        print(token.form, token.upos, token.head, token.deprel, token.lemma)
Pre-tokenized input
from combo import COMBO
nlp = COMBO.from_pretrained("clarin-pl/combo-nlp-xlm-roberta-base-polish-pbd-ud2.17")
# Single sentence:
result = nlp(["Ala", "ma", "kota", "."])
# Multiple sentences:
result = nlp([["Ala", "ma", "kota", "."], ["Pies", "je", "."]])
# To parse multiple raw text sentences, join them into a single string:
sentences = ["Ala ma kota.", "Pies je."]
result = nlp("\n".join(sentences))
Environment Setup
Copy the example environment file and fill in your API keys:
cp .env.example .env
Edit .env with your credentials:
- WANDB_API_KEY: get it from https://wandb.ai/authorize (required for experiment tracking)
- HF_TOKEN: create a token with write access at https://huggingface.co/settings/tokens (required for uploading models)
The .env file is loaded automatically by all scripts. It is git-ignored and should never be committed.
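For reference, automatic .env loading can be approximated with a few lines of standard-library Python. The project most likely uses a dedicated loader such as python-dotenv; this hand-rolled version is only a sketch of the behaviour:

```python
import os

def load_env(path=".env"):
    """Read KEY=VALUE lines from a .env file into os.environ.

    Sketch only: comments and blank lines are skipped, and variables
    already present in the environment take precedence.
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip().strip('"'))
```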
Quick Start
Training
# Train with task-specific overrides:
combo-nlp-train --task-config config/tasks/combo-nlp-herbert-base-cased-polish-pdb-ud2.17.yaml
# Resume training from checkpoint:
combo-nlp-train --task-config config/tasks/combo-nlp-herbert-base-cased-polish-pdb-ud2.17.yaml \
--resume /path/to/checkpoint_latest.pt
Evaluation
All evaluation uses the official CoNLL 2018 evaluation script (conll18_ud_eval.py).
The test file can be a .conllu file (aligned evaluation with gold tokenization) or a .txt file (full-text evaluation with automatic segmentation via LAMBO). The model can be loaded from HuggingFace Hub (--model), a local exported directory (--model-dir), or a training checkpoint (--task-config + --checkpoint).
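As a reminder of what the headline parsing metrics mean, here is a minimal sketch of UAS and LAS for the aligned case, where gold tokenization makes attachment scores reduce to per-token accuracy. The official conll18_ud_eval.py script additionally handles token alignment and multiword tokens, which this omits:

```python
def uas_las(gold_heads, pred_heads, gold_rels, pred_rels):
    """UAS = fraction of tokens with the correct head;
    LAS = fraction with both the correct head and the correct relation."""
    n = len(gold_heads)
    uas = sum(g == p for g, p in zip(gold_heads, pred_heads)) / n
    las = sum(
        gh == ph and gr == pr
        for gh, ph, gr, pr in zip(gold_heads, pred_heads, gold_rels, pred_rels)
    ) / n
    return uas, las
```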
Option 1: Evaluate a model from HuggingFace Hub or local directory
# Evaluate using a HuggingFace model on a CoNLL-U file (aligned evaluation):
combo-nlp-evaluate --model clarin-pl/combo-nlp-polish \
--test-file /path/to/test.conllu
# Evaluate using a HuggingFace model on a plain text file (full-text evaluation):
combo-nlp-evaluate --model clarin-pl/combo-nlp-polish \
--test-file /path/to/test.txt --full-text --language Polish
# Or use a local exported model directory:
combo-nlp-evaluate --model-dir /path/to/model \
--test-file /path/to/test.conllu
# Save predictions to CoNLL-U format while evaluating:
combo-nlp-evaluate --model clarin-pl/combo-nlp-polish \
--test-file /path/to/test.conllu --save-predictions
# Save results to a custom directory:
combo-nlp-evaluate --model clarin-pl/combo-nlp-polish \
--test-file /path/to/test.conllu --output-dir ./results/
Option 2: Evaluate a training checkpoint
# Evaluate with task config (uses best checkpoint and auto-detects test file from UD treebank):
combo-nlp-evaluate --task-config config/tasks/combo-nlp-herbert-base-cased-polish-pdb-ud2.17.yaml
# Evaluate a specific checkpoint on a custom test file:
combo-nlp-evaluate --task-config config/tasks/combo-nlp-herbert-base-cased-polish-pdb-ud2.17.yaml \
--checkpoint /path/to/checkpoint.pt --test-file /path/to/custom_test.conllu
# Full-text evaluation from checkpoint (language auto-detected from task config):
combo-nlp-evaluate --task-config config/tasks/combo-nlp-herbert-base-cased-polish-pdb-ud2.17.yaml \
--full-text
Option 3: Full-text evaluation (end-to-end with segmentation)
Full-text evaluation segments raw text with LAMBO before parsing and measures end-to-end performance including tokenization and segmentation quality.
# From HuggingFace model with a .conllu file (uses adjacent .txt for raw input):
combo-nlp-evaluate --model clarin-pl/combo-nlp-polish \
--test-file /path/to/test.conllu --full-text --language Polish
# From HuggingFace model with a .txt file (auto-resolves matching .conllu for gold scoring):
combo-nlp-evaluate --model clarin-pl/combo-nlp-polish \
--test-file /path/to/test.txt --full-text --language Polish
# From task config (language auto-detected):
combo-nlp-evaluate --task-config config/tasks/combo-nlp-herbert-base-cased-polish-pdb-ud2.17.yaml \
--full-text
Uses the .txt file next to the gold .conllu (standard in UD treebanks) as raw input for LAMBO segmentation. Falls back to # text = metadata if no .txt file exists. Measures both segmentation (Tokens, Sentences, Words) and parsing quality.
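The raw-text fallback described above might look roughly like this. This is a sketch rather than the actual implementation, though the `# text =` prefix is standard CoNLL-U sentence metadata:

```python
from pathlib import Path

def resolve_raw_text(conllu_path):
    """Prefer the .txt file next to the gold .conllu; otherwise
    reconstruct raw text from '# text =' metadata lines (sketch)."""
    conllu = Path(conllu_path)
    txt = conllu.with_suffix(".txt")
    if txt.exists():
        return txt.read_text(encoding="utf-8")
    texts = [
        line.partition("=")[2].strip()
        for line in conllu.read_text(encoding="utf-8").splitlines()
        if line.startswith("# text =")
    ]
    return "\n".join(texts)
```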
Option 4: Compare two CoNLL-U files directly
# Compare gold vs predictions (no model needed):
combo-nlp-evaluate \
--gold-file /path/to/ud-treebanks/UD_Polish-LFG/pl_lfg-ud-test.conllu \
--predictions-file outputs/Polish/results/predictions.conllu
Prediction
# Parse a single sentence using an exported model:
combo-nlp-predict --model-dir /path/to/model \
--text "Ala ma kota ."
# Parse a CoNLL-U file:
combo-nlp-predict --model-dir /path/to/model \
--input input.conllu --output output.conllu
# Interactive mode:
combo-nlp-predict --model-dir /path/to/model
# Or use task-config + checkpoint (during training):
combo-nlp-predict --task-config config/tasks/combo-nlp-herbert-base-cased-polish-pdb-ud2.17.yaml \
--checkpoint data/output/combo-nlp-herbert-base-cased-polish-pdb-ud2.17/checkpoints/checkpoint_best.pt \
--text "Ala ma kota ."
Configuration
Configuration uses a two-level system:
- config/base.yaml: shared defaults (architecture, hyperparameters, paths, training settings)
- config/tasks/<task>.yaml: task-specific overrides (language, treebanks, base model)
Task configs are deep-merged on top of the base config — only specify fields that differ.
config/
├── base.yaml
└── tasks/
├── english.yaml
├── german.yaml
└── ...
Training and the full pipeline require --task-config. Evaluation and prediction can also use --model-dir to load from an exported model directory.
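The deep-merge semantics behave roughly like this (a sketch of the behaviour described above; the real config loader's details may differ):

```python
def deep_merge(base, override):
    """Recursively overlay `override` on `base`: nested dicts are merged
    key by key, while any other value in `override` replaces the base value."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged
```

So a task config that only sets `training.epochs` keeps every other `training` default from base.yaml.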
Model Export
Export a trained model to a local directory (mirroring the HuggingFace repo layout) or push it directly to HuggingFace Hub.
# Export to local directory:
combo-nlp-export --task-config config/tasks/combo-nlp-herbert-base-cased-polish-pdb-ud2.17.yaml
# Export and push to HuggingFace Hub:
combo-nlp-export --task-config config/tasks/combo-nlp-herbert-base-cased-polish-pdb-ud2.17.yaml --push-to-hub
The exported directory contains everything needed to load the model:
data/export/<model_name>/
├── config.json # Model configuration
├── pytorch_model.bin # Weights (optimizer state stripped)
├── README.md # Model card (auto-generated from eval results)
└── encoders/
├── morpho_encoder.json
├── deprel_encoder.json
└── char_vocab.json
The model card (README.md) is generated automatically if results_best.json is found in the training output directory.
Full Pipeline
Run the complete workflow — train, export, and upload to HuggingFace Hub — in a single command:
# Full pipeline:
combo-nlp-pipeline --task-config config/tasks/combo-nlp-test-xlm-roberta-base-polish-pdb-ud2.17.yaml
# Dry run (preview commands without executing):
combo-nlp-pipeline --task-config config/tasks/combo-nlp-test-xlm-roberta-base-polish-pdb-ud2.17.yaml --dry-run
# Resume from a specific step (e.g. training already done):
combo-nlp-pipeline --task-config config/tasks/combo-nlp-test-xlm-roberta-base-polish-pdb-ud2.17.yaml --start-from export
# Run only training (no export/upload):
combo-nlp-pipeline --task-config config/tasks/combo-nlp-test-xlm-roberta-base-polish-pdb-ud2.17.yaml --stop-after train
The pipeline stops immediately if any step fails. Steps:
- Train — fine-tune the model with per-epoch evaluation (aligned + full-text) and best model selection
- Export — package the best model for distribution (strip optimizer state, generate model card)
- Upload — push the exported model to HuggingFace Hub
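The fail-fast chaining, together with the --start-from and --stop-after options shown earlier, can be sketched as follows. This is illustrative only; `run` stands in for invoking each CLI step:

```python
def run_pipeline(steps, run, start_from=None, stop_after=None):
    """Execute steps in order, skipping until `start_from`, stopping after
    `stop_after`, and aborting on the first failure (sketch).

    `run(step)` should invoke the step and return True on success.
    """
    started = start_from is None
    executed = []
    for step in steps:
        if step == start_from:
            started = True
        if not started:
            continue
        if not run(step):
            raise RuntimeError(f"step {step!r} failed")
        executed.append(step)
        if step == stop_after:
            break
    return executed
```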
Adding a New Language
To add a new language, create a task config in config/tasks/ and train. See existing configs (e.g., config/tasks/combo-nlp-herbert-base-cased-polish-pdb-ud2.17.yaml) for reference.
Multi-Treebank Training
When multiple treebanks exist for a language (e.g., Polish has LFG, PDB, PUD, MPDT), the pipeline automatically:
- Combines training data from all specified treebanks
- Combines dev data for validation
- Evaluates on combined test data
This provides more training data and better generalization.
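Combining splits can be as simple as concatenating the per-treebank CoNLL-U files. This is an assumption about how the combination works, not a description of COMBO's internals; it relies on CoNLL-U sentences being separated by blank lines, so concatenation with a separating blank line stays well-formed:

```python
from pathlib import Path

def combine_splits(conllu_paths, out_path):
    """Concatenate several CoNLL-U files into one combined split (sketch)."""
    parts = []
    for path in conllu_paths:
        text = Path(path).read_text(encoding="utf-8").strip()
        if text:
            parts.append(text)
    # A blank line between files keeps sentence boundaries valid CoNLL-U.
    Path(out_path).write_text("\n\n".join(parts) + "\n", encoding="utf-8")
```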
Supported devices:
- NVIDIA CUDA GPUs
- Apple Silicon (MPS)
- CPU (slow, not recommended)
WandB Integration
Training metrics are logged to Weights & Biases:
- train/loss, train/arc_loss, train/rel_loss, train/lemma_loss, etc.: Per-step losses
- train/lr_encoder, train/lr_head: Learning rates
- dev_eval/{metric}: Dev set F1 metrics (aligned evaluation, gold tokenization)
- dev_fulltext/{metric}: Dev set F1 metrics (full-text evaluation, LAMBO segmentation)
- test_eval/..., test_fulltext/...: Same for test set
- train_eval/...: Training subset metrics (aligned)
Metrics include all CoNLL 2018 F1 scores: Tokens, Sentences, Words, UPOS, XPOS, UFeats, AllTags, Lemmas, UAS, LAS, CLAS, MLAS, BLEX. Aligned evaluation uses gold tokenization (Tokens/Sentences/Words always 100%); full-text evaluation measures end-to-end performance including segmentation.
To disable WandB:
wandb:
enabled: false
CLI Commands
After pip install, the following commands are available:
| Command | Description |
|---|---|
| combo-nlp-train | Train a model |
| combo-nlp-evaluate | Evaluate a model |
| combo-nlp-predict | Run predictions |
| combo-nlp-export | Export a trained model |
| combo-nlp-pipeline | Run the full train → export → upload pipeline |
For development without installing, use python scripts/<name>.py instead (e.g. python scripts/train.py).
Testing
Install dev dependencies:
uv sync --extra dev
Run all tests:
PYTHONPATH=src pytest test/ -v
License
Citations