
COMBO-NLP

COMBO-NLP - A library for Morphosyntactic Tagging and Dependency Parsing.

Features

  • Two training modes: full fine-tuning and LoRA (Low-Rank Adaptation) for parameter-efficient training
  • Multi-task learning: morphosyntactic tagging, lemmatization, dependency parsing
  • Combined label encoding: UPOS + XPOS + FEATS as single label for efficient classification
  • Biaffine attention for dependency parsing
  • Character-level seq2seq lemmatization
  • CoNLL 2018 metrics: UPOS, XPOS, UFeats, AllTags, Lemmas, UAS, LAS, CLAS, MLAS, BLEX
  • Multi-treebank training: optionally combine multiple treebanks for the same language
  • Device support: NVIDIA CUDA and Apple MPS
  • WandB integration for experiment tracking
  • Checkpoints after each epoch with best model selection
  • Full pipeline: train → export → upload to HuggingFace Hub in one command
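
The combined label encoding listed above can be sketched as a toy encoder: each unique UPOS+XPOS+FEATS combination becomes one class ID. The separator and class mapping here are illustrative, not the library's actual format:

```python
class MorphoLabelEncoder:
    """Toy encoder: one class ID per unique UPOS+XPOS+FEATS combination."""

    def __init__(self):
        self.label2id = {}

    def encode(self, upos: str, xpos: str, feats: str) -> int:
        # Join the three columns into a single label string,
        # then assign the next free ID on first sight.
        label = f"{upos}|{xpos}|{feats}"
        return self.label2id.setdefault(label, len(self.label2id))

enc = MorphoLabelEncoder()
a = enc.encode("NOUN", "subst:sg:acc:m2", "Animacy=Nhum|Case=Acc")  # new class
b = enc.encode("VERB", "fin:sg:ter:imperf", "Mood=Ind")             # new class
c = enc.encode("NOUN", "subst:sg:acc:m2", "Animacy=Nhum|Case=Acc")  # same combination as `a`
```

One classifier over combined labels replaces three separate heads, at the cost of a larger label space.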

Installation

uv package manager (if not installed)

macOS / Linux

curl -LsSf https://astral.sh/uv/install.sh | sh

Windows

powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"

Automatic installation

Create a new virtual environment.

uv venv
source .venv/bin/activate

Install COMBO-NLP.

uv pip install combo-nlp

LAMBO segmenter (optional)

A segmenter is only needed when passing raw text strings to COMBO. If you provide pre-tokenized input (list[str] or list[list[str]]), no segmenter is required.

When you initialize COMBO with a language name (e.g. COMBO("Polish")), it automatically loads a LAMBO segmenter. If LAMBO is not installed, an ImportError is raised. LAMBO is hosted on a custom PyPI index and must be installed separately:

uv pip install --index-url https://pypi.clarin-pl.eu/ lambo

Alternatively, add the custom index to your project's pyproject.toml so that lambo resolves automatically:

[[tool.uv.index]]
url = "https://pypi.clarin-pl.eu/"

Source installation

Clone the repository and install in a virtual environment.

git clone https://gitlab.clarin-pl.eu/syntactic-tools/combo-nlp.git
cd combo-nlp
uv venv
source .venv/bin/activate
uv sync

Basic usage

After installing via pip install combo-nlp, use COMBO directly in Python:

from combo import COMBO

# Load by HuggingFace model ID:
nlp = COMBO.from_pretrained("clarin-pl/combo-nlp-xlm-roberta-base-polish-pbd-ud2.17")
result = nlp("Ala ma kota.")

# Or load by language name:
nlp = COMBO("Polish")
result = nlp("Ala ma kota.")

# Access results:
for sentence in result:
    for token in sentence:
        print(token.form, token.upos, token.head, token.deprel, token.lemma)

Pre-tokenized input

from combo import COMBO

nlp = COMBO.from_pretrained("clarin-pl/combo-nlp-xlm-roberta-base-polish-pbd-ud2.17")

# Single sentence:
result = nlp(["Ala", "ma", "kota", "."])

# Multiple sentences:
result = nlp([["Ala", "ma", "kota", "."], ["Pies", "je", "."]])

# To parse multiple raw text sentences, join them into a single string:
sentences = ["Ala ma kota.", "Pies je."]
result = nlp("\n".join(sentences))

Environment Setup

Copy the example environment file and fill in your API keys:

cp .env.example .env

Edit .env with your credentials.

The .env file is loaded automatically by all scripts. It is git-ignored and should never be committed.
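The loading convention can be illustrated with a minimal stdlib sketch of the KEY=VALUE format (the project presumably relies on a dotenv package; this is only an approximation, and the variable names in the comment are examples):

```python
import os

def load_env(path: str = ".env") -> None:
    """Minimal .env loader: one KEY=VALUE per line (e.g. WANDB_API_KEY=...).
    Blank lines and '#' comments are ignored; variables already set in the
    environment take precedence over values from the file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())
```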

Quick Start

Training

# Train with task-specific overrides:
combo-nlp-train --task-config config/tasks/combo-nlp-herbert-base-cased-polish-pdb-ud2.17.yaml

# Resume training from checkpoint:
combo-nlp-train --task-config config/tasks/combo-nlp-herbert-base-cased-polish-pdb-ud2.17.yaml \
    --resume /path/to/checkpoint_latest.pt

Evaluation

All evaluation uses the official CoNLL 2018 evaluation script (conll18_ud_eval.py).

The test file can be a .conllu file (aligned evaluation with gold tokenization) or a .txt file (full-text evaluation with automatic segmentation via LAMBO). The model can be loaded from HuggingFace Hub (--model), a local exported directory (--model-dir), or a training checkpoint (--task-config + --checkpoint).

Option 1: Evaluate a model from HuggingFace Hub or local directory

# Evaluate using a HuggingFace model on a CoNLL-U file (aligned evaluation):
combo-nlp-evaluate --model clarin-pl/combo-nlp-polish \
    --test-file /path/to/test.conllu

# Evaluate using a HuggingFace model on a plain text file (full-text evaluation):
combo-nlp-evaluate --model clarin-pl/combo-nlp-polish \
    --test-file /path/to/test.txt --full-text --language Polish

# Or use a local exported model directory:
combo-nlp-evaluate --model-dir /path/to/model \
    --test-file /path/to/test.conllu

# Save predictions to CoNLL-U format while evaluating:
combo-nlp-evaluate --model clarin-pl/combo-nlp-polish \
    --test-file /path/to/test.conllu --save-predictions

# Save results to a custom directory:
combo-nlp-evaluate --model clarin-pl/combo-nlp-polish \
    --test-file /path/to/test.conllu --output-dir ./results/

Option 2: Evaluate a training checkpoint

# Evaluate with task config (uses best checkpoint and auto-detects test file from UD treebank):
combo-nlp-evaluate --task-config config/tasks/combo-nlp-herbert-base-cased-polish-pdb-ud2.17.yaml

# Evaluate a specific checkpoint on a custom test file:
combo-nlp-evaluate --task-config config/tasks/combo-nlp-herbert-base-cased-polish-pdb-ud2.17.yaml \
    --checkpoint /path/to/checkpoint.pt --test-file /path/to/custom_test.conllu

# Full-text evaluation from checkpoint (language auto-detected from task config):
combo-nlp-evaluate --task-config config/tasks/combo-nlp-herbert-base-cased-polish-pdb-ud2.17.yaml \
    --full-text

Option 3: Full-text evaluation (end-to-end with segmentation)

Full-text evaluation segments raw text with LAMBO before parsing and measures end-to-end performance including tokenization and segmentation quality.

# From HuggingFace model with a .conllu file (uses adjacent .txt for raw input):
combo-nlp-evaluate --model clarin-pl/combo-nlp-polish \
    --test-file /path/to/test.conllu --full-text --language Polish

# From HuggingFace model with a .txt file (auto-resolves matching .conllu for gold scoring):
combo-nlp-evaluate --model clarin-pl/combo-nlp-polish \
    --test-file /path/to/test.txt --full-text --language Polish

# From task config (language auto-detected):
combo-nlp-evaluate --task-config config/tasks/combo-nlp-herbert-base-cased-polish-pdb-ud2.17.yaml \
    --full-text

Uses the .txt file next to the gold .conllu (standard in UD treebanks) as raw input for LAMBO segmentation. Falls back to # text = metadata if no .txt file exists. Measures both segmentation (Tokens, Sentences, Words) and parsing quality.
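The raw-text resolution described above can be sketched as follows; the tool's exact logic is not shown here, so treat this as an approximation:

```python
from pathlib import Path

def resolve_raw_text(conllu_path: str) -> str:
    """Prefer the .txt file next to the gold .conllu; otherwise rebuild
    raw text from '# text = ...' sentence metadata inside the .conllu."""
    txt = Path(conllu_path).with_suffix(".txt")
    if txt.exists():
        return txt.read_text(encoding="utf-8")
    sentences = [
        line[len("# text = "):]
        for line in Path(conllu_path).read_text(encoding="utf-8").splitlines()
        if line.startswith("# text = ")
    ]
    return "\n".join(sentences)
```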

Option 4: Compare two CoNLL-U files directly

# Compare gold vs predictions (no model needed):
combo-nlp-evaluate \
    --gold-file /path/to/ud-treebanks/UD_Polish-LFG/pl_lfg-ud-test.conllu \
    --predictions-file outputs/Polish/results/predictions.conllu

Prediction

# Parse a single sentence using an exported model:
combo-nlp-predict --model-dir /path/to/model \
    --text "Ala ma kota ."

# Parse a CoNLL-U file:
combo-nlp-predict --model-dir /path/to/model \
    --input input.conllu --output output.conllu

# Interactive mode:
combo-nlp-predict --model-dir /path/to/model

# Or use task-config + checkpoint (during training):
combo-nlp-predict --task-config config/tasks/combo-nlp-herbert-base-cased-polish-pdb-ud2.17.yaml \
    --checkpoint data/output/combo-nlp-herbert-base-cased-polish-pdb-ud2.17/checkpoints/checkpoint_best.pt \
    --text "Ala ma kota ."

Configuration

Configuration uses a two-level system:

  • config/base.yaml — shared defaults (architecture, hyperparameters, paths, training settings)
  • config/tasks/<task>.yaml — task-specific overrides (language, treebanks, base model)

Task configs are deep-merged on top of the base config — only specify fields that differ.

config/
├── base.yaml
└── tasks/
    ├── english.yaml
    ├── german.yaml
    └── ...
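
The deep-merge rule can be sketched as follows (the keys below are illustrative, not the library's actual schema):

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively overlay `override` onto `base`:
    nested dicts are merged key by key, all other values are replaced."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

base = {"training": {"epochs": 30, "lr": 2e-5}, "device": "cuda"}
task = {"training": {"epochs": 10}, "language": "Polish"}
merged = deep_merge(base, task)
# `training.lr` survives from base; `training.epochs` is overridden; `language` is added.
```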

Training and the full pipeline require --task-config. Evaluation and prediction can also use --model-dir to load from an exported model directory.

Model Export

Export a trained model to a local directory (simulating HuggingFace repo structure) or push directly to HuggingFace Hub.

# Export to local directory:
combo-nlp-export --task-config config/tasks/combo-nlp-herbert-base-cased-polish-pdb-ud2.17.yaml

# Export and push to HuggingFace Hub:
combo-nlp-export --task-config config/tasks/combo-nlp-herbert-base-cased-polish-pdb-ud2.17.yaml --push-to-hub

The exported directory contains everything needed to load the model:

data/export/<model_name>/
├── config.json          # Model configuration
├── pytorch_model.bin    # Weights (optimizer state stripped)
├── README.md            # Model card (auto-generated from eval results)
└── encoders/
    ├── morpho_encoder.json
    ├── deprel_encoder.json
    └── char_vocab.json

The model card (README.md) is generated automatically if results_best.json is found in the training output directory.

Full Pipeline

Run the complete workflow — train, export, and upload to HuggingFace Hub — in a single command:

# Full pipeline:
combo-nlp-pipeline --task-config config/tasks/combo-nlp-test-xlm-roberta-base-polish-pdb-ud2.17.yaml

# Dry run (preview commands without executing):
combo-nlp-pipeline --task-config config/tasks/combo-nlp-test-xlm-roberta-base-polish-pdb-ud2.17.yaml --dry-run

# Resume from a specific step (e.g. training already done):
combo-nlp-pipeline --task-config config/tasks/combo-nlp-test-xlm-roberta-base-polish-pdb-ud2.17.yaml --start-from export

# Run only training (no export/upload):
combo-nlp-pipeline --task-config config/tasks/combo-nlp-test-xlm-roberta-base-polish-pdb-ud2.17.yaml --stop-after train

The pipeline stops immediately if any step fails. Steps:

  1. Train — fine-tune the model with per-epoch evaluation (aligned + full-text) and best model selection
  2. Export — package the best model for distribution (strip optimizer state, generate model card)
  3. Upload — push the exported model to HuggingFace Hub
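
The stop-on-failure, --start-from, and --stop-after behavior can be sketched generically (the real CLI's internals are not shown here; the step commands are placeholders):

```python
import subprocess

def run_pipeline(steps, start_from=None, stop_after=None):
    """Run named (name, command) steps in order. Skip steps before
    `start_from`, stop at the first non-zero exit code, and halt
    after `stop_after`. Returns the failed step's name, or None."""
    started = start_from is None
    for name, cmd in steps:
        started = started or name == start_from
        if not started:
            continue
        if subprocess.run(cmd).returncode != 0:
            return name  # the failed step; nothing after it runs
        if name == stop_after:
            break
    return None
```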

Adding a New Language

To add a new language, create a task config in config/tasks/ and train. See existing configs (e.g., config/tasks/combo-nlp-herbert-base-cased-polish-pdb-ud2.17.yaml) for reference.

Multi-Treebank Training

When multiple treebanks exist for a language (e.g., Polish has LFG, PDB, PUD, MPDT), the pipeline automatically:

  1. Combines training data from all specified treebanks
  2. Combines dev data for validation
  3. Evaluates on combined test data

This provides more training data and better generalization.
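
Because CoNLL-U separates sentences with blank lines, combining treebank splits reduces to careful concatenation. A sketch of the idea (file handling only; the pipeline's actual implementation may differ):

```python
from pathlib import Path

def combine_conllu(paths, out_path):
    """Concatenate CoNLL-U files, normalizing to exactly one
    blank line between consecutive sentences."""
    parts = [Path(p).read_text(encoding="utf-8").strip() for p in paths]
    Path(out_path).write_text("\n\n".join(parts) + "\n\n", encoding="utf-8")
```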

Device Support

Supported devices:

  • NVIDIA CUDA GPUs
  • Apple Silicon (MPS)
  • CPU (slow, not recommended)

WandB Integration

Training metrics are logged to Weights & Biases:

  • train/loss, train/arc_loss, train/rel_loss, train/lemma_loss, etc.: Per-step losses
  • train/lr_encoder, train/lr_head: Learning rates
  • dev_eval/{metric}: Dev set F1 metrics (aligned evaluation, gold tokenization)
  • dev_fulltext/{metric}: Dev set F1 metrics (full-text evaluation, LAMBO segmentation)
  • test_eval/..., test_fulltext/...: Same for test set
  • train_eval/...: Training subset metrics (aligned)

Metrics include all CoNLL 2018 F1 scores: Tokens, Sentences, Words, UPOS, XPOS, UFeats, AllTags, Lemmas, UAS, LAS, CLAS, MLAS, BLEX. Aligned evaluation uses gold tokenization (Tokens/Sentences/Words always 100%); full-text evaluation measures end-to-end performance including segmentation.
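
The F1 definition behind these scores can be sketched as follows; counts are over aligned words as in conll18_ud_eval.py, and the numbers below are made up for illustration:

```python
def f1(gold_total: int, system_total: int, correct: int) -> float:
    """CoNLL 2018-style F1: harmonic mean of precision and recall."""
    precision = correct / system_total if system_total else 0.0
    recall = correct / gold_total if gold_total else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# With gold tokenization, system_total == gold_total, so F1 reduces to accuracy.
aligned = f1(100, 100, 95)
# With automatic segmentation the totals can differ, and F1 penalizes both
# missed gold words (recall) and spurious system words (precision).
fulltext = f1(100, 98, 90)
```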

To disable WandB:

wandb:
  enabled: false

CLI Commands

After pip install, the following commands are available:

Command             Description
combo-nlp-train     Train a model
combo-nlp-evaluate  Evaluate a model
combo-nlp-predict   Run predictions
combo-nlp-export    Export a trained model
combo-nlp-pipeline  Run the full train → export → upload pipeline

For development without installing, use python scripts/<name>.py instead (e.g. python scripts/train.py).

Testing

Install dev dependencies:

uv sync --extra dev

Run all tests:

PYTHONPATH=src pytest test/ -v

License

Citations
