
GigaDatasets

A Unified and Lightweight Framework for Data Curation, Evaluation and Visualization

| Quick Start | Contributing | License | Citation |

✨ Introduction

GigaDatasets is a unified and lightweight framework for data curation, evaluation, and visualization, designed to make handling massive datasets simple, efficient, and consistent.

Major features
  • 🔍 Unified Workflow: Unify all steps from data curation and packaging to loading, evaluation, and visualization.
  • ⚡ Lightweight and Easy to Use: Install with a single pip command (pip3 install giga-datasets) or from source; load a dataset in one line (dataset = load_dataset(data_path)) and evaluate it in one line (eval_results = FIDEvaluator(datasets)(pred_results)); see the sketch after this list.
  • 🗂️ Multi-format and Multi-structure Data Support: File, LMDB, Pickle, and LeRobot datasets with flexible loading. Unified support for images, videos, 2D/3D boxes, 2D/3D points, and other structured data.
  • 🚀 Efficient Processing: Optimized for speed and memory, suitable for large-scale data processing needs.
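
For intuition, here is a minimal sketch of the one-line load-and-evaluate workflow from the feature list above. The import path of FIDEvaluator and the shape of pred_results are assumptions for illustration; see the getting_started notebooks for canonical usage.

from giga_datasets import load_dataset
# NOTE: the FIDEvaluator import path below is an assumption, not the
# documented API; check the package for its actual location.
from giga_datasets import FIDEvaluator

dataset = load_dataset('./getting_started/giga_data')
pred_results = ...  # placeholder: your model outputs, aligned with the dataset
eval_results = FIDEvaluator(dataset)(pred_results)  # one-line evaluation
print(eval_results)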

⚡ Installation

GigaDatasets can be installed from PyPI. We recommend installing it in a virtual environment (venv or conda, for instance):

pip3 install giga-datasets

Alternatively, you can install directly from source for the latest updates:

conda create -n giga_datasets python=3.11.10
conda activate giga_datasets
git clone https://github.com/open-gigaai/giga-datasets.git
cd giga-datasets
pip3 install -e .
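
Either way, you can sanity-check the installation with a bare import (this assumes nothing beyond the package being importable):

python3 -c "import giga_datasets"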

🚀 Usage

We provide accessible demo data and Jupyter notebooks in the getting_started directory. Utility scripts can be found in the scripts folder.

1. Load a dataset

You can load datasets with the load_dataset function from the giga_datasets library. We provide a demo dataset in the giga_data directory for you to try out. Here is a quick example; the full code is available in getting_started:

from giga_datasets import load_dataset

dataset = load_dataset('./getting_started/giga_data')
data_dict = dataset[0]
print('Dataset size:', len(dataset))
print('First item in dataset:', data_dict)
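
Since the dataset supports len() and integer indexing, as shown above, it also satisfies PyTorch's map-style dataset protocol and can be wrapped directly by a DataLoader. This sketch continues from the snippet above; collate_fn=list simply keeps each batch as a list of raw sample dicts.

from torch.utils.data import DataLoader

# The dataset implements __len__ and __getitem__, so DataLoader can index it.
loader = DataLoader(dataset, batch_size=4, shuffle=True, collate_fn=list)
for batch in loader:
    print('Batch of', len(batch), 'samples')
    break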

The giga_data directory contains the following structure:

giga_data/
├── config.json          # Configuration file describing the dataset
├── labels/              # Directory containing label files
│   ├── config.json      # Additional configuration for labels
│   └── data.pkl         # Serialized label data
└── images/              # Directory containing image files
    ├── config.json      # Additional configuration for images
    ├── data.mdb         # LMDB data file for images
    └── lock.mdb         # LMDB lock file

The config.json file in the giga_data directory contains the following structure:

{
    "_class_name": "Dataset",
    "config_paths": [
        "labels/config.json",
        "images/config.json"
    ]
}

This file specifies:

  • _class_name: Indicates the class type used for the dataset, which is Dataset in this case.
  • config_paths: Lists paths to additional configuration files for specific components of the dataset, such as labels/config.json and images/config.json.
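
Because the top-level config is plain JSON, you can inspect it directly; the snippet below just echoes the structure shown above.

import json

# Read the top-level dataset config and print its fields.
with open('./getting_started/giga_data/config.json') as f:
    config = json.load(f)
print(config['_class_name'])   # 'Dataset'
print(config['config_paths'])  # ['labels/config.json', 'images/config.json']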

2. Package a dataset

For an unstructured dataset, you can use the Writer classes (including PklWriter, FileWriter, and LmdbWriter) to package your data into a structured format. Below is an example of how to package a dataset consisting of images and labels.

The raw_data directory contains the following structure:

raw_data/
├── 0.json               # Annotation file for image 0
├── 0.png                # Image file 0
├── 1.json               # Annotation file for image 1
├── 1.png                # Image file 1
├── ...

You can run the following Python code to package the dataset; the full code is available in getting_started:

import json
import os

from tqdm import tqdm

# Import paths are abbreviated here; see the full script in getting_started.
from giga_datasets import Dataset, LmdbWriter, PklWriter, load_dataset, utils

image_dir = './raw_data'    # unstructured input, laid out as shown above
save_dir = './packed_data'  # destination for the packaged dataset

image_paths = utils.list_dir(image_dir, recursive=True, exts=['.png', '.jpg', '.jpeg'])
label_writer = PklWriter(os.path.join(save_dir, 'labels'))
image_writer = LmdbWriter(os.path.join(save_dir, 'images'))
for idx in tqdm(range(len(image_paths))):
    # Each image has a JSON annotation with the same stem (handles any extension).
    label_path = os.path.splitext(image_paths[idx])[0] + '.json'
    with open(label_path) as f:
        label_dict = json.load(f)
    label_dict['data_index'] = idx
    label_writer.write_dict(label_dict)
    image_writer.write_image(idx, image_paths[idx])
label_writer.write_config()
image_writer.write_config()
label_writer.close()
image_writer.close()
# Combine the packaged components into one dataset and save its top-level config.
label_dataset = load_dataset(os.path.join(save_dir, 'labels'))
image_dataset = load_dataset(os.path.join(save_dir, 'images'))
dataset = Dataset([label_dataset, image_dataset])
dataset.save(save_dir)

We support packaging and reading different data formats. In addition to packaging images, we also provide an example of packaging video data, where we store the video's metadata.

# package video samples in the input directory to the output directory
python getting_started/pack_videos.py --video_dir /path/to/your/raw_videos --save_dir ./giga_videos

# if you want to package videos into lmdb format for better read performance
python getting_started/pack_videos.py --video_dir /path/to/your/raw_videos --save_dir ./giga_videos --pack-lmdb

# if you want to package samples, but not copy the video files and only store the metadata and absolute paths
python getting_started/pack_videos.py --video_dir /path/to/your/raw_videos --save_dir ./giga_videos --only_path
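
Once packaged, the video dataset loads like any other GigaDatasets dataset. This is a minimal sketch; the exact metadata fields in each sample depend on the packing options used above.

from giga_datasets import load_dataset

video_dataset = load_dataset('./giga_videos')
print('Number of videos:', len(video_dataset))
sample = video_dataset[0]  # a dict of the stored video metadata
print('Available fields:', list(sample.keys()))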

3. Add a new field

In model training or inference, a sample is often represented as a dictionary with multiple fields. Our framework is designed to be easily extensible to accommodate new data fields. Below is an example of how to add Canny edge maps as a new field:

python getting_started/add_new_filed.py --data_dir getting_started/giga_data
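
For intuition, here is a rough sketch of what such a script does, following the packaging pattern from section 2. It is not the bundled script: the 'image' field name, the array-valued write_image call, and the use of OpenCV are all assumptions for illustration.

import os

import cv2  # OpenCV, used here to compute Canny edge maps
from giga_datasets import LmdbWriter, load_dataset

data_dir = 'getting_started/giga_data'
dataset = load_dataset(data_dir)

# Write the new field into its own component, mirroring labels/ and images/.
canny_writer = LmdbWriter(os.path.join(data_dir, 'canny'))
for idx in range(len(dataset)):
    image = dataset[idx]['image']         # field name is an assumption
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    canny = cv2.Canny(gray, 100, 200)     # standard OpenCV edge detector
    canny_writer.write_image(idx, canny)  # assumes arrays are accepted here
canny_writer.write_config()
canny_writer.close()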

Additional Usage Examples

Note: More usage examples and feature documentation will be added in future updates—stay tuned!

🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details.

📄 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

📖 Citation

@misc{gigaai2025gigadatasets,
    author = {GigaAI},
    title = {GigaDatasets: A Unified and Lightweight Framework for Data Curation, Evaluation and Visualization},
    year = {2025},
    publisher = {GitHub},
    journal = {GitHub repository},
    howpublished = {\url{https://github.com/open-gigaai/giga-datasets}}
}

