
WorkRB: Work Research Benchmark

Easy benchmarking of AI progress in the work domain


Installation | Features | Usage Guide | Contributing | Citing

WorkRB (~pronounced worker bee) is an open-source evaluation toolbox for benchmarking AI models in the work research domain. It provides a standardized framework that is easy to use and community-driven, scaling evaluation over a wide range of tasks, ontologies, and models.

Features

  • 🐝 One Buzzing Work Toolkit — Easily download & access ontologies, datasets, and baselines in a single toolkit
  • 🧪 Extensive tasks — Evaluate models on job–skill matching, normalization, extraction, and similarity
  • 🌍 Dynamic Multilinguality — Evaluate over languages driven by multilingual ontologies
  • 🧠 Ready-to-go Baselines — Leverage provided baseline models for comparison
  • 🧩 Extensible design — Add your custom tasks and models with simple interfaces

Example Usage

import workrb

# 1. Initialize a model
model = workrb.models.BiEncoderModel("all-MiniLM-L6-v2")

# 2. Select (multilingual) tasks to evaluate
tasks = [
    workrb.tasks.ESCOJob2SkillRanking(split="val", languages=["en"]),
    workrb.tasks.ESCOSkillNormRanking(split="val", languages=["de", "fr"])
]

# 3. Run benchmark & view results
results = workrb.evaluate(
    model,
    tasks,
    output_folder="results/my_model",
)
print(results)

Installation

Install WorkRB via pip:

pip install workrb

Requirements: Python 3.10+, see pyproject.toml for all dependencies.

Usage Guide

This section covers common usage patterns: Custom Tasks & Models, Checkpointing & Resuming, and Results & Metric Aggregation.

Custom Tasks & Models

Add your custom task or model by (1) inheriting from a predefined base class and implementing the abstract methods, and (2) adding it to the registry:

  • Custom Tasks: Inherit from a task base class such as RankingTask or MultilabelClassificationTask, implement the abstract methods, and register via @register_task().
  • Custom Models: Inherit from ModelInterface, implement the abstract methods, and register via @register_model().
from workrb.tasks.abstract.ranking_base import RankingTask
from workrb.models.base import ModelInterface
from workrb.registry import register_task, register_model

@register_task()
class MyCustomTask(RankingTask):
    name: str = "MyCustomTask"
    ...


@register_model()
class MyCustomModel(ModelInterface):
    name: str = "MyCustomModel"
    ...

# Use your custom model and task:
model_results = workrb.evaluate(MyCustomModel(), [MyCustomTask()])

Feel free to make a PR to add your models & tasks to the official package! See the CONTRIBUTING guidelines for detailed examples and instructions.

Checkpointing & Resuming

WorkRB automatically saves a result checkpoint after each task completes in a given language.

Automatic Resuming - Simply rerun with the same output_folder:

# Run 1: gets interrupted before finishing all tasks
tasks = [
    workrb.tasks.ESCOJob2SkillRanking(
        split="val", 
        languages=["en"],
    )
]

results = workrb.evaluate(model, tasks, output_folder="results/my_model")

# Run 2: Automatically resumes from checkpoint
results = workrb.evaluate(model, tasks, output_folder="results/my_model")
# ✓ Skips completed tasks, continues from where it left off
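
Conceptually, resuming is just "skip any (task, language) pair already recorded in the checkpoint". Here is a generic, illustrative sketch of that pattern in plain Python (not the actual WorkRB internals; `evaluate_with_checkpoint` and `run_one` are hypothetical names):

```python
# Illustrative checkpoint-and-resume pattern (NOT WorkRB internals):
# completed (task, language) pairs are persisted after each unit of work,
# and a rerun against the same checkpoint file skips anything recorded.
import json
from pathlib import Path

def evaluate_with_checkpoint(pairs, run_one, checkpoint_path):
    """Evaluate (task, language) pairs, skipping ones already checkpointed."""
    path = Path(checkpoint_path)
    done = set(map(tuple, json.loads(path.read_text()))) if path.exists() else set()
    results = {}
    for task, lang in pairs:
        if (task, lang) in done:
            continue  # already evaluated in a previous run
        results[(task, lang)] = run_one(task, lang)
        done.add((task, lang))
        path.write_text(json.dumps(sorted(done)))  # persist after each unit
    return results
```

Rerunning with the same checkpoint file performs no new work for completed pairs, which is why rerunning `workrb.evaluate` with the same output_folder is safe.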

Extending Benchmarks - Want to extend your results with additional tasks or languages? Add the new tasks or languages when resuming:

# Resume from previous & extend with new task and languages
tasks_extended = [
    workrb.tasks.ESCOJob2SkillRanking(  # extend with de, fr
        split="val",
        languages=["en", "de", "fr"],
    ),
    workrb.tasks.ESCOSkillNormRanking(  # add a new task
        split="val",
        languages=["en"],
    ),
]
results = workrb.evaluate(model, tasks_extended, output_folder="results/my_model")
# ✓ Reuses English results, only evaluates new languages/tasks

You cannot reduce scope when resuming; this is by design, to avoid ambiguity. Every task finished in the checkpoint must also be included when you resume. To start fresh in the same output folder, use force_restart=True:

results = workrb.evaluate(model, tasks, output_folder="results/my_model", force_restart=True)

Results & Metric Aggregation

Results are automatically saved to your output_folder:

results/my_model/
├── checkpoint.json       # Incremental checkpoint (for resuming)
├── results.json          # Final results dump
└── config.yaml           # Final benchmark configuration dump

To load & parse results from a run:

results = workrb.load_results("results/my_model/results.json")
print(results)

Metrics: The main benchmark metrics mean_benchmark/<metric>/mean are computed in 4 macro-averaging steps:

  1. Macro-average languages per task (e.g. ESCOJob2SkillRanking): mean_per_task/<task_name>/<metric>/mean
  2. Macro-average tasks per task group (e.g. Job2SkillRanking): mean_per_task_group/<group>/<metric>/mean
  3. Macro-average task groups per task type (e.g. RankingTask, ClassificationTask): mean_per_task_type/<type>/<metric>/mean
  4. Macro-average over task types: mean_benchmark/<metric>/mean

Per-language performance is also available under mean_per_language/<lang>/<metric>/mean. Each aggregation also provides a 95% confidence interval (replace mean with ci_margin).

# Benchmark returns a detailed Pydantic model
results: BenchmarkResults = workrb.evaluate(...)

# Calculate aggregated metrics
summary: dict[str, float] = results.get_summary_metrics()

# Show all results
print(summary)
print(results) # Equivalent: internally runs get_summary_metrics()

# Access metric via tag
lang_result = summary["mean_per_language/en/f1_macro/mean"]
lang_result_ci = summary["mean_per_language/en/f1_macro/ci_margin"]
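
To make the macro-averaging concrete, here is a minimal sketch of one aggregation step with a 95% confidence-interval margin, using hypothetical per-language scores and plain Python (not the actual WorkRB implementation):

```python
# Minimal sketch of macro-averaging with a 95% CI margin (hypothetical
# numbers; illustrative only, not the WorkRB implementation).
import statistics

def macro_mean_and_ci(values: list[float]) -> tuple[float, float]:
    """Return (mean, 95% CI margin) over a list of per-unit scores."""
    mean = statistics.fmean(values)
    if len(values) < 2:
        return mean, 0.0  # no spread estimate from a single value
    sem = statistics.stdev(values) / len(values) ** 0.5  # standard error
    return mean, 1.96 * sem  # normal-approximation 95% interval

# Step 1: macro-average languages within one task (hypothetical f1 scores)
per_language_f1 = {"en": 0.81, "de": 0.74, "fr": 0.76}
task_mean, task_ci = macro_mean_and_ci(list(per_language_f1.values()))
```

The same mean-and-margin computation is then repeated over tasks, task groups, and task types to reach the top-level mean_benchmark metrics.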

Supported tasks & models

Tasks

| Task Name | Label Type | Dataset Size (English) | Languages |
| --- | --- | --- | --- |
| **Ranking** | | | |
| Job to Skills WorkBench | multi_label | 3039 queries x 13939 targets | 28 |
| Job Title Similarity | multi_label | 105 queries x 2619 targets | 11 |
| Job Normalization | single_label | 15463 queries x 2942 targets | 28 |
| Skill to Job WorkBench | multi_label | 13492 queries x 3039 targets | 28 |
| Skill Extraction House | multi_label | 262 queries x 13891 targets | 28 |
| Skill Extraction Tech | multi_label | 338 queries x 13891 targets | 28 |
| Skill Extraction SkillSkape | multi_label | 1191 queries x 13891 targets | 28 |
| Skill Similarity SkillMatch-1K | single_label | 900 queries x 2648 targets | 1 |
| Skill Normalization ESCO | multi_label | 72008 queries x 13939 targets | 28 |
| **Classification** | | | |
| Job-Skill Classification | multi_label | 3039 samples, 13939 classes | 28 |

Models

| Model Name | Description | Adaptive Targets |
| --- | --- | --- |
| BiEncoderModel | BiEncoder model using sentence-transformers for ranking and classification tasks. | |
| JobBERTModel | Job-Normalization BiEncoder from Techwolf: https://huggingface.co/TechWolf/JobBERT-v2 | |
| ConTeXTMatchModel | ConTeXT-Skill-Extraction-base from Techwolf: https://huggingface.co/TechWolf/ConTeXT-Skill-Extraction-base | |
| CurriculumMatchModel | CurriculumMatch bi-encoder from Aleksandruz: https://huggingface.co/Aleksandruz/skillmatch-mpnet-curriculum-retriever | |
| RndESCOClassificationModel | Random baseline for multi-label classification with a random prediction head for ESCO. | |

Contributing

Want to contribute new tasks, models, or metrics? Read our CONTRIBUTING.md guide for all details.

Development environment

# Clone repository
git clone https://github.com/techwolf-ai/workrb.git && cd workrb

# Create and install a virtual environment
uv sync --all-extras

# Activate the virtual environment
source .venv/bin/activate

# Install the pre-commit hooks
pre-commit install --install-hooks

# Run tests (excludes model benchmarking by default)
uv run poe test

# Run model benchmark tests only, checks reproducibility of original results
uv run poe test-benchmark

Development details
  • This project follows the Conventional Commits standard to automate Semantic Versioning and Keep A Changelog with Commitizen.
  • Run poe from within the development environment to print a list of Poe the Poet tasks available to run on this project.
  • Run uv add {package} from within the development environment to install a run time dependency and add it to pyproject.toml and uv.lock. Add --dev to install a development dependency.
  • Run uv sync --upgrade from within the development environment to upgrade all dependencies to the latest versions allowed by pyproject.toml. Add --only-dev to upgrade the development dependencies only.
  • Run cz bump to bump the package's version, update the CHANGELOG.md, and create a git tag. Then push the changes and the git tag with git push origin main --tags.

Citation

WorkRB builds upon the unifying WorkBench benchmark; please consider citing:
@misc{delange2025unifiedworkembeddings,
      title={Unified Work Embeddings: Contrastive Learning of a Bidirectional Multi-task Ranker}, 
      author={Matthias De Lange and Jens-Joris Decorte and Jeroen Van Hautte},
      year={2025},
      eprint={2511.07969},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2511.07969}, 
}
WorkRB has a community paper coming up (work in progress)!

License

Apache 2.0 License - see LICENSE for details.
