
A collection of Ukrainian language datasets

Project description


ua_datasets


UA-datasets provides ready-to-use Ukrainian NLP benchmark datasets with a single, lightweight Python API.

Fast access to Question Answering, News Classification, and POS Tagging corpora — with automatic download, caching, and consistent iteration.

Why use this library?

  • Unified API: All datasets expose len(ds), indexing, iteration, and simple frequency helpers.
  • Robust downloads: Automatic retries, integrity guards, and filename fallbacks for legacy splits.
  • Zero heavy deps: Pure Python + standard library (core loaders) for quick startup.
  • Repro friendly: Validation split for UA-SQuAD; classification CSV parsing with resilience to minor format drift.
  • Tooling ready: Works seamlessly with ruff, mypy, pytest, and uv-based workflows.
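The unified API means every loader behaves like a plain Python sequence (len, indexing, iteration). A minimal sketch with a toy stand-in class (the class and its rows are invented for illustration, not part of the library) shows how a label-frequency helper falls out of plain iteration:

```python
from collections import Counter

# Toy stand-in mimicking the unified API (len, indexing, iteration);
# the real ua_datasets loaders follow the same sequence protocol.
class ToyNewsDataset:
    def __init__(self, rows):
        self._rows = rows  # each row: (title, text, target, tags)

    def __len__(self):
        return len(self._rows)

    def __getitem__(self, idx):
        return self._rows[idx]

rows = [
    ("t1", "text a", "politics", []),
    ("t2", "text b", "sport", []),
    ("t3", "text c", "politics", []),
]
ds = ToyNewsDataset(rows)

# Label frequencies via plain iteration -- the kind of "frequency helper"
# the library exposes can be approximated in one line:
freq = Counter(target for _, _, target, _ in ds)
print(len(ds), freq.most_common(1))  # 3 [('politics', 2)]
```

Because the protocol is just `__len__`/`__getitem__`, the same pattern works unchanged on any of the three loaders.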

Maintained by the FIdo.ai research group (National University of Kyiv-Mohyla Academy).

Minimal Example

# Assumes `uv` workspace already synced with `uv sync` and project installed.

from pathlib import Path
from ua_datasets.question_answering import UaSquadDataset
from ua_datasets.text_classification import NewsClassificationDataset
from ua_datasets.token_classification import MovaInstitutePOSDataset

# Question Answering (first HF-style example dict)
qa = UaSquadDataset(root=Path("./data/ua_squad"), split="train", download=True)
print("QA examples:", len(qa))
example = qa[0]
print(example.keys())  # id, title, context, question, answers, is_impossible
print(example["question"], "->", example["answers"]["text"])  # list of accepted answers

# News Classification
news = NewsClassificationDataset(root=Path("./data/ua_news"), split="train", download=True)
title, text, target, tags = news[0]
print("Label count:", len(news.labels), "First label:", target)

# Part-of-Speech Tagging
pos = MovaInstitutePOSDataset(root=Path("./data/mova_pos"), download=True)
tokens, tags = pos[0]
print(tokens[:8], tags[:8])

For development commands see the Installation section below.

Installation

Choose one of the following methods.

1. Using uv (recommended)

Add to an existing project:

uv add ua-datasets

2. Using pip (PyPI)

# install
pip install ua_datasets
# upgrade
pip install -U ua_datasets

3. From source (editable install)

git clone https://github.com/fido-ai/ua-datasets.git
cd ua-datasets
pip install -e .[dev]  # if you later define optional dev extras

Or with uv (editable semantics via local path):

git clone https://github.com/fido-ai/ua-datasets.git
cd ua-datasets
uv sync --dev

Latest Updates

Date        Highlights
25-10-2025  Added validation split for UA-SQuAD and updated package code.
05-07-2022  Added HuggingFace API for UA-SQuAD (Q&A) and UA-News (Text Classification).

Available Datasets

Task                 | Dataset            | Import Class              | Splits          | Notes
Question Answering   | UA-SQuAD           | UaSquadDataset            | train, val      | SQuAD v2-style examples (is_impossible, multiple answers); iteration yields dicts
Text Classification  | UA-News            | NewsClassificationDataset | train, test     | CSV (title, text, target[, tags]); optional tag parsing
Token Classification | Mova Institute POS | MovaInstitutePOSDataset   | (single corpus) | CoNLL-U-like POS tagging; yields (tokens, tags) per sentence
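The UA-SQuAD row above implies SQuAD v2-style handling of unanswerable questions. A hedged sketch on hand-built example dicts (field names follow the notes above; the data itself is invented) shows one way to extract the first accepted answer:

```python
# Hypothetical SQuAD v2-style example dicts, shaped as described above
# (is_impossible flag, answers dict with parallel text/answer_start lists).
examples = [
    {"id": "q1", "question": "Хто автор?", "is_impossible": False,
     "answers": {"text": ["Шевченко"], "answer_start": [10]}},
    {"id": "q2", "question": "Де це?", "is_impossible": True,
     "answers": {"text": [], "answer_start": []}},
]

def first_answer(ex):
    """Return the first accepted answer, or None for unanswerable items."""
    if ex["is_impossible"] or not ex["answers"]["text"]:
        return None
    return ex["answers"]["text"][0]

print([first_answer(ex) for ex in examples])  # ['Шевченко', None]
```

Checking `is_impossible` before indexing into `answers["text"]` avoids IndexError on the unanswerable items that v2-style corpora include.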

Contribution

If you would like to contribute (update any part of the library or add your own dataset), please open a GitHub Issue. Thanks in advance for your contribution!

Citation

@software{ua_datasets_2021,
  author = {Ivanyuk-Skulskiy, Bogdan and Zaliznyi, Anton and Reshetar, Oleksand and Protsyk, Oleksiy and Romanchuk, Bohdan and Shpihanovych, Vladyslav},
  month = oct,
  title = {ua_datasets: a collection of Ukrainian language datasets},
  url = {https://github.com/fido-ai/ua-datasets},
  version = {1.0.0},
  year = {2021}
}

Download files

Download the file for your platform.

Source Distribution

ua_datasets-1.0.1.tar.gz (17.6 kB)


Built Distribution


ua_datasets-1.0.1-py3-none-any.whl (18.7 kB)


File details

Details for the file ua_datasets-1.0.1.tar.gz.

File metadata

  • Download URL: ua_datasets-1.0.1.tar.gz
  • Upload date:
  • Size: 17.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for ua_datasets-1.0.1.tar.gz

Algorithm    Hash digest
SHA256       dfe2d5339c7274bf68300ac5bcd4ecc091869c46ac3832afc1a5bc749e73d696
MD5          85251deb09ed7dcc1668d727764cead5
BLAKE2b-256  1f0411f0e0b9a0eee4cb3bb94449c22f843cdbd32ecd264dd4a39cf21b86ca83
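To check a downloaded file against the SHA256 digest above, a stdlib-only sketch (the local path is an assumption; point it at your actual download):

```python
import hashlib

def sha256_of(path, chunk=1 << 16):
    """Stream a file through SHA256 and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

# Digest taken from the hash table above.
EXPECTED = "dfe2d5339c7274bf68300ac5bcd4ecc091869c46ac3832afc1a5bc749e73d696"
# Uncomment once the sdist is downloaded locally:
# assert sha256_of("ua_datasets-1.0.1.tar.gz") == EXPECTED
```

Streaming in chunks keeps memory flat even for large archives; `pip` performs an equivalent check automatically when hashes are pinned in a requirements file.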


File details

Details for the file ua_datasets-1.0.1-py3-none-any.whl.

File metadata

  • Download URL: ua_datasets-1.0.1-py3-none-any.whl
  • Upload date:
  • Size: 18.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for ua_datasets-1.0.1-py3-none-any.whl

Algorithm    Hash digest
SHA256       956b4ed600765345d6d18b8f6501fc8f7f27ee7dcfcc1027b8918d0ff92fd61f
MD5          67e298dab8ad78dca8cd0501d5602e91
BLAKE2b-256  2dc1d604f96a96038ca078ead633de87d97e27ecf869b9a46e2dbfe8b652911d

