Pygestor
A tool for dataset ingestion and management.
A platform designed to seamlessly acquire, organize, and manage diverse datasets, offering AI researchers a one-line downloader and data-loader for quick access to data, while providing a scalable and easily manageable system for future dataset acquisition.
Quick Start
Install dependencies:
pip install -r requirements.txt
Launch the GUI:
python run-gui.py
The module can be used via terminal commands or the Python APIs (which expose more functionality). For Python API use cases, please refer to this notebook.
Configurations
Edit pygestor/__init__.py
to change the default system settings. In particular, set DATA_DIR
to the desired data storage location, either a local path or a remote path, such as a mounted NFS.
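For instance, the setting might look like the following (the paths shown are placeholders, not defaults shipped with the project):

```python
# pygestor/__init__.py -- illustrative values only.
DATA_DIR = "/mnt/nfs/datasets"   # remote storage, e.g. a mounted NFS
# DATA_DIR = "./data"            # or a local path
```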
Data info and availability
To list supported datasets:
python cli.py -l
To list subsets in a dataset:
python cli.py -l -d <dataset_name>
To list partitions in a subset:
python cli.py -l -d <dataset_name> -s <subset_name>
Dataset management and extension
To download a specific subset:
python cli.py -d <dataset_name> -s <subset_name>
To download specific partitions, use the Python API pygestor.download().
To remove downloaded data files in a subset:
python cli.py -r -d <dataset_name> -s <subset_name>
To support a new dataset, add a new class file to pygestor/datasets that defines how to organize, download, and load data, following the example in pygestor/datasets/wikipedia.py. Then update the metadata by running python cli.py -init -d <new_dataset_name>
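The real interface a dataset class must implement is defined by the existing classes in pygestor/datasets (wikipedia.py is the reference). The skeleton below is only a hypothetical sketch of the three responsibilities named above; the class name, method names, and signatures are assumptions, not pygestor's actual API:

```python
import os

class MyDataset:
    """Hypothetical dataset class sketch; copy the real interface
    from pygestor/datasets/wikipedia.py."""
    name = "my_dataset"

    def get_subsets(self):
        # Organize: map subset names to their partition files.
        return {"train": ["partition_1.parquet", "partition_2.parquet"]}

    def download(self, subset, data_dir):
        # Download: fetch the subset's partitions into the storage tree.
        target = os.path.join(data_dir, self.name, subset)
        os.makedirs(target, exist_ok=True)
        return target

    def load(self, subset, data_dir):
        # Load: read the downloaded partitions into memory (e.g. pandas).
        raise NotImplementedError
```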
Technical Details
Storage
The data is stored in a file storage system and organized into three levels: dataset; subset (distinguished by version, language, class, split, annotation, etc.); and partition (large files split into smaller chunks for memory efficiency), as follows:
dataset_A
├── subset_a
│   ├── partition_1
│   └── partition_2
└── subset_b
    ├── partition_1
    └── partition_2
...
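As a sketch of this layout (hypothetical file names, not pygestor internals), the three-level tree can be built and traversed with pathlib:

```python
from pathlib import Path
import tempfile

root = Path(tempfile.mkdtemp())

# Build the three-level tree: dataset -> subset -> partition.
for subset in ("subset_a", "subset_b"):
    for part in ("partition_1.parquet", "partition_2.parquet"):
        p = root / "dataset_A" / subset / part
        p.parent.mkdir(parents=True, exist_ok=True)
        p.touch()

# Enumerate the partitions of one subset, roughly what a listing reports.
parts = sorted(p.name for p in (root / "dataset_A" / "subset_a").iterdir())
print(parts)  # ['partition_1.parquet', 'partition_2.parquet']
```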
File storage is chosen over other storage types for its cost efficiency, scalability, and ease of management.
Dataset info and storage status are tracked in a metadata file, metadata.json, for efficient reference and updates.
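The exact schema of metadata.json is internal to pygestor; the snippet below is only a hypothetical sketch of the kind of record such a file might keep (all field names are assumptions):

```python
import json

# Hypothetical metadata record: which partitions of which subsets
# are present on disk. The real metadata.json schema may differ.
metadata = {
    "dataset_A": {
        "subset_a": {
            "downloaded": True,
            "partitions": {"partition_1": {"size_bytes": 104857600}},
        }
    }
}

text = json.dumps(metadata, indent=2)
restored = json.loads(text)
print(restored["dataset_A"]["subset_a"]["downloaded"])  # True
```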
Dependencies
- python >= 3.11
- huggingface_hub: Provides native support for datasets hosted on Hugging Face, making it an ideal library for downloading.
- pyarrow: Used to compress and extract parquet files, a data file format designed for efficient data storage and retrieval, compatible with pandas.
- pandas: Used to load the text dataset into memory for downstream data consumers. It provides a handy API for data manipulation and access, as well as chunking and datatype adjustments for memory efficiency.
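To illustrate the chunking and dtype adjustments mentioned above, here is a generic pandas sketch (not pygestor's actual loader) that reads an in-memory CSV in chunks and downcasts column types to bound peak memory:

```python
import io
import pandas as pd

# A small in-memory CSV standing in for a large partition file.
csv = "id,score\n" + "\n".join(f"{i},{i * 0.5}" for i in range(10))

total_rows = 0
# Read in chunks rather than all at once, downcasting dtypes as we go.
for chunk in pd.read_csv(io.StringIO(csv), chunksize=4):
    chunk["id"] = chunk["id"].astype("int32")
    chunk["score"] = chunk["score"].astype("float32")
    total_rows += len(chunk)

print(total_rows)  # 10
```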