
COMBO-NLP

COMBO-NLP - A library for Morphosyntactic Tagging and Dependency Parsing.

Features

  • Two training modes: full fine-tuning and LoRA (Low-Rank Adaptation) for parameter-efficient training
  • Multi-task learning: morphosyntactic tagging, lemmatization, dependency parsing
  • Combined label encoding: UPOS + XPOS + FEATS as single label for efficient classification
  • Biaffine attention for dependency parsing
  • Character-level seq2seq lemmatization
  • CoNLL 2018 metrics: UPOS, XPOS, UFeats, AllTags, Lemmas, UAS, LAS, CLAS, MLAS, BLEX
  • Multi-treebank training: optionally combine multiple treebanks for the same language
  • Device support: NVIDIA CUDA and Apple MPS
  • WandB integration for experiment tracking
  • Checkpoints after each epoch with best model selection
  • Full pipeline: train → export → upload to HuggingFace Hub in one command
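
The combined label encoding listed above can be sketched as a toy encoder: each unique UPOS+XPOS+FEATS combination becomes one class ID. The separator and class mapping here are illustrative, not the library's actual format:

```python
class MorphoLabelEncoder:
    """Toy encoder: one class ID per unique UPOS+XPOS+FEATS combination."""

    def __init__(self):
        self.label2id = {}

    def encode(self, upos: str, xpos: str, feats: str) -> int:
        # Join the three columns into a single label string,
        # then assign the next free ID on first sight.
        label = f"{upos}|{xpos}|{feats}"
        return self.label2id.setdefault(label, len(self.label2id))

enc = MorphoLabelEncoder()
a = enc.encode("NOUN", "subst:sg:acc:m2", "Animacy=Nhum|Case=Acc")  # new class
b = enc.encode("VERB", "fin:sg:ter:imperf", "Mood=Ind")             # new class
c = enc.encode("NOUN", "subst:sg:acc:m2", "Animacy=Nhum|Case=Acc")  # same combination as `a`
```

One classifier over combined labels replaces three separate heads, at the cost of a larger label space.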

Installation

uv package manager (if not installed)

macOS / Linux

curl -LsSf https://astral.sh/uv/install.sh | sh

Windows

powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"

Automatic installation

Create a new virtual environment.

uv venv
source .venv/bin/activate

Install COMBO-NLP.

uv pip install combo-nlp

LAMBO segmenter (optional)

A segmenter is only needed when passing raw text strings to COMBO. If you provide pre-tokenized input (list[str] or list[list[str]]), no segmenter is required.

When you initialize COMBO with a language name (e.g. COMBO("Polish")), it automatically loads a LAMBO segmenter. If LAMBO is not installed, an ImportError is raised. LAMBO is hosted on a custom PyPI index and must be installed separately:

uv pip install --index-url https://pypi.clarin-pl.eu/ lambo

Alternatively, add the custom index to your project's pyproject.toml so that lambo resolves automatically:

[[tool.uv.index]]
url = "https://pypi.clarin-pl.eu/"

Source installation

Clone the repository and install in a virtual environment.

git clone https://gitlab.clarin-pl.eu/syntactic-tools/combo-nlp.git
cd combo-nlp
uv venv
source .venv/bin/activate
uv sync

Basic usage

After installing via pip install combo-nlp, use COMBO directly in Python:

from combo import COMBO

# Load by HuggingFace model ID:
nlp = COMBO.from_pretrained("clarin-pl/combo-nlp-xlm-roberta-base-polish-pbd-ud2.17")
result = nlp("Ala ma kota.")

# Or load by language name:
nlp = COMBO("Polish")
result = nlp("Ala ma kota.")

# Access results:
for sentence in result:
    for token in sentence:
        print(token.form, token.upos, token.head, token.deprel, token.lemma)

Pre-tokenized input

from combo import COMBO

nlp = COMBO.from_pretrained("clarin-pl/combo-nlp-xlm-roberta-base-polish-pbd-ud2.17")

# Single sentence:
result = nlp(["Ala", "ma", "kota", "."])

# Multiple sentences:
result = nlp([["Ala", "ma", "kota", "."], ["Pies", "je", "."]])

# To parse multiple raw text sentences, join them into a single string:
sentences = ["Ala ma kota.", "Pies je."]
result = nlp("\n".join(sentences))

Environment Setup

Copy the example environment file and fill in your API keys:

cp .env.example .env

Edit .env with your credentials.

The .env file is loaded automatically by all scripts. It is git-ignored and should never be committed.
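The loading convention can be illustrated with a minimal stdlib sketch of the KEY=VALUE format (the project presumably relies on a dotenv package; this is only an approximation, and the variable names in the comment are examples):

```python
import os

def load_env(path: str = ".env") -> None:
    """Minimal .env loader: one KEY=VALUE per line (e.g. WANDB_API_KEY=...).
    Blank lines and '#' comments are ignored; variables already set in the
    environment take precedence over values from the file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())
```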

Quick Start

Training

# Train with task-specific overrides:
combo-nlp-train --task-config config/tasks/combo-nlp-herbert-base-cased-polish-pdb-ud2.17.yaml

# Resume training from checkpoint:
combo-nlp-train --task-config config/tasks/combo-nlp-herbert-base-cased-polish-pdb-ud2.17.yaml \
    --resume /path/to/checkpoint_latest.pt

Evaluation

All evaluation uses the official CoNLL 2018 evaluation script (conll18_ud_eval.py).

The test file can be a .conllu file (aligned evaluation with gold tokenization) or a .txt file (full-text evaluation with automatic segmentation via LAMBO). The model can be loaded from HuggingFace Hub (--model), a local exported directory (--model-dir), or a training checkpoint (--task-config + --checkpoint).

Option 1: Evaluate a model from HuggingFace Hub or local directory

# Evaluate using a HuggingFace model on a CoNLL-U file (aligned evaluation):
combo-nlp-evaluate --model clarin-pl/combo-nlp-polish \
    --test-file /path/to/test.conllu

# Evaluate using a HuggingFace model on a plain text file (full-text evaluation):
combo-nlp-evaluate --model clarin-pl/combo-nlp-polish \
    --test-file /path/to/test.txt --full-text --language Polish

# Or use a local exported model directory:
combo-nlp-evaluate --model-dir /path/to/model \
    --test-file /path/to/test.conllu

# Save predictions to CoNLL-U format while evaluating:
combo-nlp-evaluate --model clarin-pl/combo-nlp-polish \
    --test-file /path/to/test.conllu --save-predictions

# Save results to a custom directory:
combo-nlp-evaluate --model clarin-pl/combo-nlp-polish \
    --test-file /path/to/test.conllu --output-dir ./results/

Option 2: Evaluate a training checkpoint

# Evaluate with task config (uses best checkpoint and auto-detects test file from UD treebank):
combo-nlp-evaluate --task-config config/tasks/combo-nlp-herbert-base-cased-polish-pdb-ud2.17.yaml

# Evaluate a specific checkpoint on a custom test file:
combo-nlp-evaluate --task-config config/tasks/combo-nlp-herbert-base-cased-polish-pdb-ud2.17.yaml \
    --checkpoint /path/to/checkpoint.pt --test-file /path/to/custom_test.conllu

# Full-text evaluation from checkpoint (language auto-detected from task config):
combo-nlp-evaluate --task-config config/tasks/combo-nlp-herbert-base-cased-polish-pdb-ud2.17.yaml \
    --full-text

Option 3: Full-text evaluation (end-to-end with segmentation)

Full-text evaluation segments raw text with LAMBO before parsing and measures end-to-end performance including tokenization and segmentation quality.

# From HuggingFace model with a .conllu file (uses adjacent .txt for raw input):
combo-nlp-evaluate --model clarin-pl/combo-nlp-polish \
    --test-file /path/to/test.conllu --full-text --language Polish

# From HuggingFace model with a .txt file (auto-resolves matching .conllu for gold scoring):
combo-nlp-evaluate --model clarin-pl/combo-nlp-polish \
    --test-file /path/to/test.txt --full-text --language Polish

# From task config (language auto-detected):
combo-nlp-evaluate --task-config config/tasks/combo-nlp-herbert-base-cased-polish-pdb-ud2.17.yaml \
    --full-text

Uses the .txt file next to the gold .conllu (standard in UD treebanks) as raw input for LAMBO segmentation. Falls back to # text = metadata if no .txt file exists. Measures both segmentation (Tokens, Sentences, Words) and parsing quality.
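The raw-text resolution described above can be sketched as follows; the tool's exact logic is not shown here, so treat this as an approximation:

```python
from pathlib import Path

def resolve_raw_text(conllu_path: str) -> str:
    """Prefer the .txt file next to the gold .conllu; otherwise rebuild
    raw text from '# text = ...' sentence metadata inside the .conllu."""
    txt = Path(conllu_path).with_suffix(".txt")
    if txt.exists():
        return txt.read_text(encoding="utf-8")
    sentences = [
        line[len("# text = "):]
        for line in Path(conllu_path).read_text(encoding="utf-8").splitlines()
        if line.startswith("# text = ")
    ]
    return "\n".join(sentences)
```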

Option 4: Compare two CoNLL-U files directly

# Compare gold vs predictions (no model needed):
combo-nlp-evaluate \
    --gold-file /path/to/ud-treebanks/UD_Polish-LFG/pl_lfg-ud-test.conllu \
    --predictions-file outputs/Polish/results/predictions.conllu

Prediction

# Parse a single sentence using an exported model:
combo-nlp-predict --model-dir /path/to/model \
    --text "Ala ma kota ."

# Parse a CoNLL-U file:
combo-nlp-predict --model-dir /path/to/model \
    --input input.conllu --output output.conllu

# Interactive mode:
combo-nlp-predict --model-dir /path/to/model

# Or use task-config + checkpoint (during training):
combo-nlp-predict --task-config config/tasks/combo-nlp-herbert-base-cased-polish-pdb-ud2.17.yaml \
    --checkpoint data/output/combo-nlp-herbert-base-cased-polish-pdb-ud2.17/checkpoints/checkpoint_best.pt \
    --text "Ala ma kota ."

Configuration

Configuration uses a two-level system:

  • config/base.yaml — shared defaults (architecture, hyperparameters, paths, training settings)
  • config/tasks/<task>.yaml — task-specific overrides (language, treebanks, base model)

Task configs are deep-merged on top of the base config — only specify fields that differ.

config/
├── base.yaml
└── tasks/
    ├── english.yaml
    ├── german.yaml
    └── ...
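
The deep-merge rule can be sketched as follows (the keys below are illustrative, not the library's actual schema):

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively overlay `override` onto `base`:
    nested dicts are merged key by key, all other values are replaced."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

base = {"training": {"epochs": 30, "lr": 2e-5}, "device": "cuda"}
task = {"training": {"epochs": 10}, "language": "Polish"}
merged = deep_merge(base, task)
# `training.lr` survives from base; `training.epochs` is overridden; `language` is added.
```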

Training and the full pipeline require --task-config. Evaluation and prediction can also use --model-dir to load from an exported model directory.

Model Export

Export a trained model to a local directory (simulating HuggingFace repo structure) or push directly to HuggingFace Hub.

# Export to local directory:
combo-nlp-export --task-config config/tasks/combo-nlp-herbert-base-cased-polish-pdb-ud2.17.yaml

# Export and push to HuggingFace Hub:
combo-nlp-export --task-config config/tasks/combo-nlp-herbert-base-cased-polish-pdb-ud2.17.yaml --push-to-hub

The exported directory contains everything needed to load the model:

data/export/<model_name>/
├── config.json          # Model configuration
├── pytorch_model.bin    # Weights (optimizer state stripped)
├── README.md            # Model card (auto-generated from eval results)
└── encoders/
    ├── morpho_encoder.json
    ├── deprel_encoder.json
    └── char_vocab.json

The model card (README.md) is generated automatically if results_best.json is found in the training output directory.

Full Pipeline

Run the complete workflow — train, export, and upload to HuggingFace Hub — in a single command:

# Full pipeline:
combo-nlp-pipeline --task-config config/tasks/combo-nlp-test-xlm-roberta-base-polish-pdb-ud2.17.yaml

# Dry run (preview commands without executing):
combo-nlp-pipeline --task-config config/tasks/combo-nlp-test-xlm-roberta-base-polish-pdb-ud2.17.yaml --dry-run

# Resume from a specific step (e.g. training already done):
combo-nlp-pipeline --task-config config/tasks/combo-nlp-test-xlm-roberta-base-polish-pdb-ud2.17.yaml --start-from export

# Run only training (no export/upload):
combo-nlp-pipeline --task-config config/tasks/combo-nlp-test-xlm-roberta-base-polish-pdb-ud2.17.yaml --stop-after train

The pipeline stops immediately if any step fails. Steps:

  1. Train — fine-tune the model with per-epoch evaluation (aligned + full-text) and best model selection
  2. Export — package the best model for distribution (strip optimizer state, generate model card)
  3. Upload — push the exported model to HuggingFace Hub
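
The stop-on-failure, --start-from, and --stop-after behavior can be sketched generically (the real CLI's internals are not shown here; the step commands are placeholders):

```python
import subprocess

def run_pipeline(steps, start_from=None, stop_after=None):
    """Run named (name, command) steps in order. Skip steps before
    `start_from`, stop at the first non-zero exit code, and halt
    after `stop_after`. Returns the failed step's name, or None."""
    started = start_from is None
    for name, cmd in steps:
        started = started or name == start_from
        if not started:
            continue
        if subprocess.run(cmd).returncode != 0:
            return name  # the failed step; nothing after it runs
        if name == stop_after:
            break
    return None
```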

Adding a New Language

To add a new language, create a task config in config/tasks/ and train. See existing configs (e.g., config/tasks/combo-nlp-herbert-base-cased-polish-pdb-ud2.17.yaml) for reference.

Multi-Treebank Training

When multiple treebanks exist for a language (e.g., Polish has LFG, PDB, PUD, MPDT), the pipeline automatically:

  1. Combines training data from all specified treebanks
  2. Combines dev data for validation
  3. Evaluates on combined test data

This provides more training data and better generalization.
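
Because CoNLL-U separates sentences with blank lines, combining treebank splits reduces to careful concatenation. A sketch of the idea (file handling only; the pipeline's actual implementation may differ):

```python
from pathlib import Path

def combine_conllu(paths, out_path):
    """Concatenate CoNLL-U files, normalizing to exactly one
    blank line between consecutive sentences."""
    parts = [Path(p).read_text(encoding="utf-8").strip() for p in paths]
    Path(out_path).write_text("\n\n".join(parts) + "\n\n", encoding="utf-8")
```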

Device Support

Supported devices:

  • NVIDIA CUDA GPUs
  • Apple Silicon (MPS)
  • CPU (slow, not recommended)

WandB Integration

Training metrics are logged to Weights & Biases:

  • train/loss, train/arc_loss, train/rel_loss, train/lemma_loss, etc.: Per-step losses
  • train/lr_encoder, train/lr_head: Learning rates
  • dev_eval/{metric}: Dev set F1 metrics (aligned evaluation, gold tokenization)
  • dev_fulltext/{metric}: Dev set F1 metrics (full-text evaluation, LAMBO segmentation)
  • test_eval/..., test_fulltext/...: Same for test set
  • train_eval/...: Training subset metrics (aligned)

Metrics include all CoNLL 2018 F1 scores: Tokens, Sentences, Words, UPOS, XPOS, UFeats, AllTags, Lemmas, UAS, LAS, CLAS, MLAS, BLEX. Aligned evaluation uses gold tokenization (Tokens/Sentences/Words always 100%); full-text evaluation measures end-to-end performance including segmentation.
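
The F1 definition behind these scores can be sketched as follows; counts are over aligned words as in conll18_ud_eval.py, and the numbers below are made up for illustration:

```python
def f1(gold_total: int, system_total: int, correct: int) -> float:
    """CoNLL 2018-style F1: harmonic mean of precision and recall."""
    precision = correct / system_total if system_total else 0.0
    recall = correct / gold_total if gold_total else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# With gold tokenization, system_total == gold_total, so F1 reduces to accuracy.
aligned = f1(100, 100, 95)
# With automatic segmentation the totals can differ, and F1 penalizes both
# missed gold words (recall) and spurious system words (precision).
fulltext = f1(100, 98, 90)
```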

To disable WandB:

wandb:
  enabled: false

CLI Commands

After pip install, the following commands are available:

Command             Description
combo-nlp-train     Train a model
combo-nlp-evaluate  Evaluate a model
combo-nlp-predict   Run predictions
combo-nlp-export    Export a trained model
combo-nlp-pipeline  Run the full train → export → upload pipeline

For development without installing, use python scripts/<name>.py instead (e.g. python scripts/train.py).

Testing

Install dev dependencies:

uv sync --extra dev

Run all tests:

PYTHONPATH=src pytest test/ -v

License

Citations
