
A collection of Ukrainian language datasets

Project description


ua_datasets


UA-datasets provides ready-to-use Ukrainian NLP benchmark datasets with a single, lightweight Python API.

Fast access to Question Answering, News Classification, and POS Tagging corpora — with automatic download, caching, and consistent iteration.

Why use this library?

  • Unified API: All datasets expose len(ds), indexing, iteration, and simple frequency helpers.
  • Robust downloads: Automatic retries, integrity guards, and filename fallbacks for legacy splits.
  • Zero heavy deps: Pure Python + standard library (core loaders) for quick startup.
  • Repro friendly: Validation split for UA-SQuAD; classification CSV parsing with resilience to minor format drift.
  • Tooling ready: Works seamlessly with ruff, mypy, pytest, and uv-based workflows.
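The unified API means every loader behaves like a plain Python sequence (len, indexing, iteration). A minimal sketch with a toy stand-in class (the class and its rows are invented for illustration, not part of the library) shows how a label-frequency helper falls out of plain iteration:

```python
from collections import Counter

# Toy stand-in mimicking the unified API (len, indexing, iteration);
# the real ua_datasets loaders follow the same sequence protocol.
class ToyNewsDataset:
    def __init__(self, rows):
        self._rows = rows  # each row: (title, text, target, tags)

    def __len__(self):
        return len(self._rows)

    def __getitem__(self, idx):
        return self._rows[idx]

rows = [
    ("t1", "text a", "politics", []),
    ("t2", "text b", "sport", []),
    ("t3", "text c", "politics", []),
]
ds = ToyNewsDataset(rows)

# Label frequencies via plain iteration -- the kind of "frequency helper"
# the library exposes can be approximated in one line:
freq = Counter(target for _, _, target, _ in ds)
print(len(ds), freq.most_common(1))  # 3 [('politics', 2)]
```

Because the protocol is just `__len__`/`__getitem__`, the same pattern works unchanged on any of the three loaders.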

Maintained by the FIdo.ai research group (National University of Kyiv-Mohyla Academy).

Minimal Example

# Assumes `uv` workspace already synced with `uv sync` and project installed.

from pathlib import Path
from ua_datasets.question_answering import UaSquadDataset
from ua_datasets.text_classification import NewsClassificationDataset
from ua_datasets.token_classification import MovaInstitutePOSDataset

# Question Answering (first HF-style example dict)
qa = UaSquadDataset(root=Path("./data/ua_squad"), split="train", download=True)
print("QA examples:", len(qa))
example = qa[0]
print(example.keys())  # id, title, context, question, answers, is_impossible
print(example["question"], "->", example["answers"]["text"])  # list of accepted answers

# News Classification
news = NewsClassificationDataset(root=Path("./data/ua_news"), split="train", download=True)
title, text, target, tags = news[0]
print("Label count:", len(news.labels), "First label:", target)

# Part-of-Speech Tagging
pos = MovaInstitutePOSDataset(root=Path("./data/mova_pos"), download=True)
tokens, tags = pos[0]
print(tokens[:8], tags[:8])

For development commands see the Installation section below.

Installation

Choose one of the following methods.

1. Using uv (recommended)

Add to an existing project:

uv add ua-datasets

2. Using pip (PyPI)

# install
pip install ua_datasets
# upgrade
pip install -U ua_datasets

3. From source (editable install)

git clone https://github.com/fido-ai/ua-datasets.git
cd ua-datasets
pip install -e .[dev]  # if you later define optional dev extras

Or with uv (editable semantics via local path):

git clone https://github.com/fido-ai/ua-datasets.git
cd ua-datasets
uv sync --dev

Latest Updates

Date        Highlights
25-10-2025  Added validation split for UA-SQuAD and updated package code.
05-07-2022  Added HuggingFace API for UA-SQuAD (Q&A) and UA-News (Text Classification).

Available Datasets

Task                 | Dataset            | Import Class              | Splits          | Notes
Question Answering   | UA-SQuAD           | UaSquadDataset            | train, val      | SQuAD v2-style examples (is_impossible, multiple answers); iteration yields dicts
Text Classification  | UA-News            | NewsClassificationDataset | train, test     | CSV (title, text, target[, tags]); optional tag parsing
Token Classification | Mova Institute POS | MovaInstitutePOSDataset   | (single corpus) | CoNLL-U-like POS tagging; yields (tokens, tags) per sentence
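The UA-SQuAD row above implies SQuAD v2-style handling of unanswerable questions. A hedged sketch on hand-built example dicts (field names follow the notes above; the data itself is invented) shows one way to extract the first accepted answer:

```python
# Hypothetical SQuAD v2-style example dicts, shaped as described above
# (is_impossible flag, answers dict with parallel text/answer_start lists).
examples = [
    {"id": "q1", "question": "Хто автор?", "is_impossible": False,
     "answers": {"text": ["Шевченко"], "answer_start": [10]}},
    {"id": "q2", "question": "Де це?", "is_impossible": True,
     "answers": {"text": [], "answer_start": []}},
]

def first_answer(ex):
    """Return the first accepted answer, or None for unanswerable items."""
    if ex["is_impossible"] or not ex["answers"]["text"]:
        return None
    return ex["answers"]["text"][0]

print([first_answer(ex) for ex in examples])  # ['Шевченко', None]
```

Checking `is_impossible` before indexing into `answers["text"]` avoids IndexError on the unanswerable items that v2-style corpora include.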

Contribution

If you would like to contribute (update any part of the library or add your own dataset), please open a GitHub Issue. Thanks in advance for your contribution!

Citation

@software{ua_datasets_2021,
  author = {Ivanyuk-Skulskiy, Bogdan and Zaliznyi, Anton and Reshetar, Oleksand and Protsyk, Oleksiy and Romanchuk, Bohdan and Shpihanovych, Vladyslav},
  month = oct,
  title = {ua_datasets: a collection of Ukrainian language datasets},
  url = {https://github.com/fido-ai/ua-datasets},
  version = {1.0.0},
  year = {2021}
}

Download files

Download the file for your platform.

Source Distribution

ua_datasets-1.0.1.tar.gz (17.6 kB)


Built Distribution


ua_datasets-1.0.1-py3-none-any.whl (18.7 kB)


File details

Details for the file ua_datasets-1.0.1.tar.gz.

File metadata

  • Download URL: ua_datasets-1.0.1.tar.gz
  • Upload date:
  • Size: 17.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for ua_datasets-1.0.1.tar.gz

Algorithm    Hash digest
SHA256       dfe2d5339c7274bf68300ac5bcd4ecc091869c46ac3832afc1a5bc749e73d696
MD5          85251deb09ed7dcc1668d727764cead5
BLAKE2b-256  1f0411f0e0b9a0eee4cb3bb94449c22f843cdbd32ecd264dd4a39cf21b86ca83
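To check a downloaded file against the SHA256 digest above, a stdlib-only sketch (the local path is an assumption; point it at your actual download):

```python
import hashlib

def sha256_of(path, chunk=1 << 16):
    """Stream a file through SHA256 and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

# Digest taken from the hash table above.
EXPECTED = "dfe2d5339c7274bf68300ac5bcd4ecc091869c46ac3832afc1a5bc749e73d696"
# Uncomment once the sdist is downloaded locally:
# assert sha256_of("ua_datasets-1.0.1.tar.gz") == EXPECTED
```

Streaming in chunks keeps memory flat even for large archives; `pip` performs an equivalent check automatically when hashes are pinned in a requirements file.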


File details

Details for the file ua_datasets-1.0.1-py3-none-any.whl.

File metadata

  • Download URL: ua_datasets-1.0.1-py3-none-any.whl
  • Upload date:
  • Size: 18.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for ua_datasets-1.0.1-py3-none-any.whl

Algorithm    Hash digest
SHA256       956b4ed600765345d6d18b8f6501fc8f7f27ee7dcfcc1027b8918d0ff92fd61f
MD5          67e298dab8ad78dca8cd0501d5602e91
BLAKE2b-256  2dc1d604f96a96038ca078ead633de87d97e27ecf869b9a46e2dbfe8b652911d

