A collection of Ukrainian language datasets
ua_datasets
UA-datasets provides ready-to-use Ukrainian NLP benchmark datasets with a single, lightweight Python API.
Fast access to Question Answering, News Classification, and POS Tagging corpora — with automatic download, caching, and consistent iteration.
Why use this library?
- Unified API: All datasets expose `len(ds)`, indexing, iteration, and simple frequency helpers.
- Robust downloads: Automatic retries, integrity guards, and filename fallbacks for legacy splits.
- Zero heavy deps: Pure Python + standard library (core loaders) for quick startup.
- Repro friendly: Validation split for UA-SQuAD; classification CSV parsing with resilience to minor format drift.
- Tooling ready: Works seamlessly with ruff, mypy, pytest, and uv-based workflows.
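The unified surface described above can be sketched as a plain-Python protocol. The class below is purely illustrative (it is not the library's actual base class, and `ToyTaggedDataset` and `tag_frequencies` are hypothetical names); it shows the contract the bullets describe: `len`, indexing, iteration, and a simple frequency helper.

```python
from collections import Counter
from typing import Iterator, List, Tuple

class ToyTaggedDataset:
    """Illustrative stand-in for the shared dataset contract (not library code)."""

    def __init__(self, samples: List[Tuple[List[str], List[str]]]) -> None:
        self._samples = samples  # (tokens, tags) pairs, as a POS loader might yield

    def __len__(self) -> int:
        return len(self._samples)

    def __getitem__(self, idx: int) -> Tuple[List[str], List[str]]:
        return self._samples[idx]

    def __iter__(self) -> Iterator[Tuple[List[str], List[str]]]:
        return iter(self._samples)

    def tag_frequencies(self) -> Counter:
        # A "simple frequency helper" in the spirit of the bullet above.
        return Counter(tag for _, tags in self._samples for tag in tags)

ds = ToyTaggedDataset([(["Привіт", "світ"], ["INTJ", "NOUN"]),
                       (["Добрий", "день"], ["ADJ", "NOUN"])])
print(len(ds))               # 2
print(ds.tag_frequencies())  # Counter({'NOUN': 2, 'INTJ': 1, 'ADJ': 1})
```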
Maintained by the FIdo.ai research group (National University of Kyiv-Mohyla Academy).
Minimal Example
```python
# Assumes the `uv` workspace is already synced (`uv sync`) and the project installed.
from pathlib import Path

from ua_datasets.question_answering import UaSquadDataset
from ua_datasets.text_classification import NewsClassificationDataset
from ua_datasets.token_classification import MovaInstitutePOSDataset

# Question Answering (first HF-style example dict)
qa = UaSquadDataset(root=Path("./data/ua_squad"), split="train", download=True)
print("QA examples:", len(qa))
example = qa[0]
print(example.keys())  # id, title, context, question, answers, is_impossible
print(example["question"], "->", example["answers"]["text"])  # list of accepted answers

# News Classification
news = NewsClassificationDataset(root=Path("./data/ua_news"), split="train", download=True)
title, text, target, tags = news[0]
print("Label count:", len(news.labels), "First label:", target)

# Part-of-Speech Tagging
pos = MovaInstitutePOSDataset(root=Path("./data/mova_pos"), download=True)
tokens, tags = pos[0]
print(tokens[:8], tags[:8])
```
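Building on the `(title, text, target, tags)` tuple shape shown above, a common next step is checking class balance. The sketch below uses plain `collections.Counter` over hypothetical rows with that same layout; the row contents are invented for illustration and are not actual UA-News data.

```python
from collections import Counter

# Hypothetical rows mirroring the (title, text, target, tags) layout above.
rows = [
    ("Заголовок 1", "Текст статті...", "politics", ["uk", "news"]),
    ("Заголовок 2", "Інший текст...", "sports", []),
    ("Заголовок 3", "Ще один текст...", "politics", ["analysis"]),
]

# Count how many examples carry each target label.
label_counts = Counter(target for _, _, target, _ in rows)
print(label_counts)  # Counter({'politics': 2, 'sports': 1})
```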
For development commands see the Installation section below.
Installation
Choose one of the following methods.
1. Using uv (recommended)
Add to an existing project:
```shell
uv add ua-datasets
```
2. Using pip (PyPI)
```shell
pip install ua_datasets     # install
pip install -U ua_datasets  # upgrade
```
3. From source (editable install)
```shell
git clone https://github.com/fido-ai/ua-datasets.git
cd ua-datasets
pip install -e .[dev]  # if you later define optional dev extras
```
Or with uv (editable semantics via local path):
```shell
git clone https://github.com/fido-ai/ua-datasets.git
cd ua-datasets
uv sync --dev
```
Latest Updates
| Date | Highlights |
|---|---|
| 25-10-2025 | Added validation split for UA-SQuAD and updated package code. |
| 05-07-2022 | Added HuggingFace API for UA-SQuAD (Q&A) and UA-News (Text Classification). |
Available Datasets
| Task | Dataset | Import Class | Splits | Notes |
|---|---|---|---|---|
| Question Answering | UA-SQuAD | `UaSquadDataset` | train, val | SQuAD v2-style examples (`is_impossible`, multiple answers); iteration yields dicts |
| Text Classification | UA-News | `NewsClassificationDataset` | train, test | CSV (title, text, target[, tags]); optional tag parsing |
| Token Classification | Mova Institute POS | `MovaInstitutePOSDataset` | (single corpus) | CoNLL-U-like POS tagging; yields (tokens, tags) per sentence |
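The Mova Institute row above mentions a CoNLL-U-like format that yields (tokens, tags) per sentence. The parser below is a simplified sketch of how such a format can be read (only the FORM and UPOS columns, blank-line sentence breaks, `#` comment lines); it is not the library's actual loader, and `parse_conllu_like` is a hypothetical name.

```python
from typing import List, Tuple

def parse_conllu_like(text: str) -> List[Tuple[List[str], List[str]]]:
    """Split a CoNLL-U-like string into (tokens, tags) per sentence (simplified)."""
    sentences = []
    tokens, tags = [], []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("#"):   # comment lines carry metadata; skip them
            continue
        if not line:               # a blank line ends the current sentence
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        cols = line.split("\t")
        tokens.append(cols[1])     # FORM column
        tags.append(cols[3])       # UPOS column
    if tokens:                     # flush a trailing sentence without a final blank line
        sentences.append((tokens, tags))
    return sentences

sample = "1\tДобрий\tдобрий\tADJ\n2\tдень\tдень\tNOUN\n\n1\tПривіт\tпривіт\tINTJ\n"
for toks, upos in parse_conllu_like(sample):
    print(toks, upos)
```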
Contribution
If you would like to contribute (update any part of the library or add your own dataset), please reach out by opening a GitHub Issue. Thanks in advance for your contribution!
Citation
```bibtex
@software{ua_datasets_2021,
  author  = {Ivanyuk-Skulskiy, Bogdan and Zaliznyi, Anton and Reshetar, Oleksand and Protsyk, Oleksiy and Romanchuk, Bohdan and Shpihanovych, Vladyslav},
  month   = oct,
  title   = {ua_datasets: a collection of Ukrainian language datasets},
  url     = {https://github.com/fido-ai/ua-datasets},
  version = {1.0.0},
  year    = {2021}
}
```
File details
Details for the file ua_datasets-1.0.1.tar.gz.
File metadata
- Download URL: ua_datasets-1.0.1.tar.gz
- Upload date:
- Size: 17.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `dfe2d5339c7274bf68300ac5bcd4ecc091869c46ac3832afc1a5bc749e73d696` |
| MD5 | `85251deb09ed7dcc1668d727764cead5` |
| BLAKE2b-256 | `1f0411f0e0b9a0eee4cb3bb94449c22f843cdbd32ecd264dd4a39cf21b86ca83` |
File details
Details for the file ua_datasets-1.0.1-py3-none-any.whl.
File metadata
- Download URL: ua_datasets-1.0.1-py3-none-any.whl
- Upload date:
- Size: 18.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `956b4ed600765345d6d18b8f6501fc8f7f27ee7dcfcc1027b8918d0ff92fd61f` |
| MD5 | `67e298dab8ad78dca8cd0501d5602e91` |
| BLAKE2b-256 | `2dc1d604f96a96038ca078ead633de87d97e27ecf869b9a46e2dbfe8b652911d` |