
A tool for dataset ingestion and management.


Pygestor


A data interface for seamlessly acquiring, organizing, and managing diverse datasets. It offers AI researchers a one-line downloader and data loader for quick access to data, while providing a scalable, easily manageable system for future dataset acquisition.

Key Features

  • Dataset Acquisition & Usage:

    • Download and load datasets with a simple one-line command.
    • Automatic handling of subsets and partitions for efficient data storage and access.
    • Support for batched dataset loading.
    • Add new datasets via URL with minimal effort.
  • Data Organization:

    • Three-level organization structure: dataset, subset, and partition.
    • Support for both local and network file systems for data storage.
    • Efficient handling of large files by storing data in partitions.
  • Web Interface:

    • Web UI for intuitive data management and analysis.
    • View schema, metadata, and data samples.
    • Download or remove a subset, or multiple partitions, in one go.
    • Search and sort data.
    • Generate code snippets for quick access to datasets.
    • Create and delete metadata for new datasets.

Quick Start

Installation

From a source checkout:

pip install -r requirements.txt

or from PyPI:

pip install pygestor

The module can be used through the web UI, terminal commands, or Python APIs (which expose the most functionality). For an introduction to the Python APIs, refer to this notebook.
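As a rough sketch of what one-line loading could look like (the function and parameter names here are assumptions for illustration; the notebook documents the real API):

# hypothetical loader call -- consult the notebook for the actual API
import pygestor

# load a subset in batches to keep memory bounded; batch_size is illustrative
for batch in pygestor.load("wikimedia/wikipedia", "20231101.en", batch_size=1024):
    print(len(batch))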

Configurations

Edit confs/system.conf to change the default system settings. In particular, set data_dir to the desired data storage location, either a local path or a cloud NFS mount.
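For orientation, the data_dir entry might look like the following (the file format and any other keys are assumptions; only the data_dir setting itself is documented above):

# confs/system.conf -- illustrative sketch, not the shipped file
data_dir = /mnt/nfs/datasets    # local path or cloud NFS mount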

Run GUI

python .\run-gui.py

For a CLI usage guide, refer to docs/cli_usage.md.

Download Dataset

Datasets can be downloaded via the web UI or the API. Run the following example script to download the '20231101.en' subset of wikimedia/wikipedia and the first 10 parquet files from wikimedia/wit_base:

python .\examples\download_example.py
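The script's effect can be sketched as follows (call and parameter names are assumptions mirroring the description above, not a copy of the shipped example):

# hypothetical API mirroring examples/download_example.py
import pygestor

# full subset: Wikipedia's 2023-11-01 English dump
pygestor.download("wikimedia/wikipedia", "20231101.en")

# partial download: only the first 10 parquet partitions
pygestor.download("wikimedia/wit_base", n_partitions=10)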

Adding a New Dataset

New datasets can be added using predefined ingestion and processing pipelines. For example, the HuggingFaceParquet pipeline can ingest Parquet datasets from Hugging Face. The web UI is the recommended route: in the "Add New" menu, fill in the dataset name, URL, and pipeline name to retrieve and save the metadata of the new dataset.

If a dataset doesn't fit the predefined pipelines, add a custom pipeline module to pygestor/datasets that defines how to organize, download, and process the data, following the example in pygestor/datasets/wikipedia.py. Ensure the pipeline name matches the desired dataset name (a skeleton is sketched below). Then update the metadata by running

python cli.py -init -d <new_dataset_name>
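A custom pipeline module could look roughly like this (class and method names are inferred from the description above; pygestor/datasets/wikipedia.py is the authoritative reference):

# pygestor/datasets/my_dataset.py -- hypothetical skeleton
class MyDatasetPipeline:
    # must match the dataset name passed to cli.py -init
    name = "my_dataset"

    def get_metadata(self):
        # describe subsets and partitions so metadata.json can be initialized
        raise NotImplementedError

    def download(self, subset, partitions=None):
        # fetch the requested partitions into data_dir
        raise NotImplementedError

    def process(self, raw_path):
        # convert raw files into the stored partition format (e.g. parquet)
        raise NotImplementedError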

Technical Details

Storage

The data is stored in a file storage system and organized into three levels: dataset, subset (distinguished by version, language, class, split, annotation, etc.), and partition (splitting large files into smaller chunks for memory efficiency), as follows:

dataset_A
├── subset_a
│   ├── partition_1
│   └── partition_2
└── subset_b
    ├── partition_1
    └── partition_2
...

File storage is used because it is cost-efficient, scalable, and easier to manage than other storage types.

Dataset info and storage status are tracked in a metadata file, metadata.json, for efficient reference and updates.
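As a loose illustration of the idea (the field names below are assumptions, not the actual schema), the file might record something like:

{
  "wikimedia/wikipedia": {
    "subsets": {
      "20231101.en": {
        "partitions": {
          "part-00000.parquet": {"size_mb": 117.3, "downloaded": true}
        }
      }
    }
  }
}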

Dependencies

  • python >= 3.11
  • huggingface_hub: Provides native support for datasets hosted on Hugging Face, making it an ideal library for downloading.
  • pyarrow: Used to compress and extract parquet files, a file format designed for efficient data storage and retrieval.
  • pandas: Used to structure dataset info in tabular form for downstream consumers. It provides a handy API for data manipulation and access, as well as chunking and datatype adjustments for memory efficiency (see the sketch after this list).
  • nicegui (optional): Used to serve the web UI frontend.
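To make the pyarrow/pandas point concrete, this is one way a stored parquet partition can be read in bounded-memory batches (the path is illustrative; iter_batches and to_pandas are standard pyarrow APIs):

import pyarrow.parquet as pq

# stream a partition in fixed-size record batches instead of loading it whole
pf = pq.ParquetFile("dataset_A/subset_a/partition_1.parquet")
for batch in pf.iter_batches(batch_size=10_000):
    df = batch.to_pandas()  # hand each chunk to pandas consumers
    print(df.shape)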

Dataset Expansion

For a proposed management process to handle future dataset expansions, refer to docs/dataset_expansion.md.

