
A tool for dataset ingestion and management.


Pygestor

A platform designed to seamlessly acquire, organize, and manage diverse datasets, offering AI researchers a one-line downloader and data-loader for quick access to data, while providing a scalable and easily manageable system for future dataset acquisition.


Key Features

  • Dataset Acquisition:

    • Support for downloading and loading datasets with a simple one-line command.
    • Automatic handling of subsets and partitions for efficient data storage and access.
  • Data Organization:

    • Three-level data organization structure: dataset, subset, and partition.
    • Support for both local and network file systems for data storage.
    • Efficient handling of large files by allowing batched loading.
  • Graphical User Interface:

    • A web GUI for intuitive data management and analysis.
    • Support for viewing schema, metadata and data samples.
    • Ability to download and remove one subset or multiple partitions in one go.
    • Support for data searching and sorting.
    • Ability to generate code snippets for quick access to datasets.

Quick Start

Installation

pip install -r requirements.txt

or

pip install pygestor

The module can be used through a GUI, terminal commands, or Python APIs; the Python APIs expose the most functionality. For Python API use cases, please refer to this notebook.

Configurations

Edit system.conf to change the default system settings. In particular, set data_dir to the desired data storage location, either a local path or a remote path, such as a mounted NFS.
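Before pointing data_dir at a new location, it can help to confirm that the path is usable. The following is a minimal sketch that is independent of Pygestor: the helper name check_data_dir and the min_free_gb threshold are assumptions made for illustration, not part of the package.

```python
import os
import shutil
import tempfile


def check_data_dir(path, min_free_gb=1.0):
    """Return a list of problems found with a candidate data_dir (empty if none)."""
    problems = []
    if not os.path.isdir(path):
        problems.append(f"{path} is not an existing directory")
        return problems
    if not os.access(path, os.W_OK):
        problems.append(f"{path} is not writable by the current user")
    # disk_usage also works on mounted network file systems such as NFS.
    free_gb = shutil.disk_usage(path).free / 1e9
    if free_gb < min_free_gb:
        problems.append(f"only {free_gb:.1f} GB free, wanted {min_free_gb} GB")
    return problems


if __name__ == "__main__":
    # Demonstrated on a temporary directory; substitute the intended path.
    with tempfile.TemporaryDirectory() as tmp:
        print(check_data_dir(tmp, min_free_gb=0.0))
```

An empty list means the directory exists, is writable, and has at least the requested free space.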

Run GUI

python run-gui.py

Data info and availability

To list supported datasets:

python cli.py -l

To list subsets in a dataset:

python cli.py -l -d <dataset_name>

To list partitions in a subset:

python cli.py -l -d <dataset_name> -s <subset_name>

Dataset management and extension

To download a specific subset:

python cli.py -d <dataset_name> -s <subset_name>

To download specific partitions, use the Python API pygestor.download().

To remove downloaded data files in a subset:

python cli.py -r -d <dataset_name> -s <subset_name>

To support a new dataset, add a new class file to pygestor/datasets that defines how to organize, download and load data, following the example in pygestor/datasets/wikipedia.py. Then update the metadata by running:

python cli.py -init -d <new_dataset_name>

Technical Details

Storage

The data is stored in a file storage system and organized into three levels: dataset, subset (distinguished by version, language, class, split, annotation, etc.), and partition (splitting large files into smaller chunks for memory efficiency), as follows:

dataset_A
├── subset_a
│   ├── partition_1
│   └── partition_2
└── subset_b
    ├── partition_1
    └── partition_2
...
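The three-level layout can be inspected with the standard library alone. The sketch below is not Pygestor's own code; the function name list_partitions is an assumption for illustration, and it treats every entry inside a subset directory as a partition.

```python
import tempfile
from pathlib import Path


def list_partitions(data_dir):
    """Map each dataset to its subsets, and each subset to its sorted partition names."""
    tree = {}
    # Only directories count as datasets or subsets; stray files are skipped.
    for dataset in sorted(p for p in Path(data_dir).iterdir() if p.is_dir()):
        subsets = {}
        for subset in sorted(p for p in dataset.iterdir() if p.is_dir()):
            subsets[subset.name] = sorted(p.name for p in subset.iterdir())
        tree[dataset.name] = subsets
    return tree


if __name__ == "__main__":
    # Build the example layout shown above in a temporary directory.
    with tempfile.TemporaryDirectory() as root:
        for subset in ("subset_a", "subset_b"):
            subset_dir = Path(root) / "dataset_A" / subset
            subset_dir.mkdir(parents=True)
            for part in ("partition_1", "partition_2"):
                (subset_dir / part).touch()
        print(list_partitions(root))
```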

File storage is chosen because it is more cost-efficient, more scalable, and easier to manage than other types of storage.

Dataset info and storage status are tracked in a metadata file, metadata.json, for efficient lookup and updates.
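To illustrate the idea of tracking storage status in a JSON file, here is a minimal sketch. The schema (dataset, then subset, then partition, each with a "downloaded" flag) and all function names are hypothetical; the actual layout of metadata.json is not described here and may differ.

```python
import json
import os
import tempfile


def load_metadata(path):
    """Read the metadata file, or start with an empty mapping if it does not exist."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def set_partition_status(meta, dataset, subset, partition, downloaded):
    """Record whether one partition is present on disk (hypothetical schema)."""
    meta.setdefault(dataset, {}).setdefault(subset, {})[partition] = {
        "downloaded": downloaded
    }


def save_metadata(meta, path):
    """Write to a temporary file, then rename, so readers never see a partial file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(meta, f, indent=2, sort_keys=True)
    os.replace(tmp_path, path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        path = os.path.join(root, "metadata.json")
        meta = load_metadata(path)
        set_partition_status(meta, "dataset_A", "subset_a", "partition_1", True)
        save_metadata(meta, path)
        print(load_metadata(path))
```

Keeping the status in one small file means availability can be checked without scanning the storage tree, which matters most when data_dir sits on a network file system.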

Dependencies

  • python >= 3.11
  • huggingface_hub: Provides native support for datasets hosted on Hugging Face, making it an ideal library for downloading.
  • pyarrow: Used to read and write Parquet files, a columnar data file format designed for efficient storage and retrieval and compatible with pandas.
  • pandas: Used to load the text dataset into memory for downstream data consumers. It provides a handy API for data manipulation and access, as well as chunking and datatype adjustments for memory efficiency.

