
GigaDatasets

A Unified and Lightweight Framework for Data Curation, Evaluation and Visualization

| Quick Start | Contributing | License | Citation |

✨ Introduction

GigaDatasets is a unified and lightweight framework for data curation, evaluation, and visualization, designed to make handling massive datasets simple, efficient, and consistent.

Major features
  • 🔍 Unified Workflow: Unify all steps from data curation and packaging to loading, evaluation, and visualization.
  • ⚡ Lightweight and Easy to Use: Install with a single pip command (pip3 install giga-datasets) or from source; load a dataset in one line (dataset = load_dataset(data_path)) and evaluate it in one line (eval_results = FIDEvaluator(datasets)(pred_results)); see the sketch after this list.
  • 🗂️ Multi-format and Multi-structure Data Support: File, LMDB, Pickle, and LeRobot datasets with flexible loading. Unified support for images, videos, 2D/3D boxes, 2D/3D points, and other structured data.
  • 🚀 Efficient Processing: Optimized for speed and memory, suitable for large-scale data processing needs.
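
For intuition, here is a minimal sketch of the one-line load-and-evaluate workflow from the feature list above. The import path of FIDEvaluator and the shape of pred_results are assumptions for illustration; see the getting_started notebooks for canonical usage.

from giga_datasets import load_dataset
# NOTE: the FIDEvaluator import path below is an assumption, not the
# documented API; check the package for its actual location.
from giga_datasets import FIDEvaluator

dataset = load_dataset('./getting_started/giga_data')
pred_results = ...  # placeholder: your model outputs, aligned with the dataset
eval_results = FIDEvaluator(dataset)(pred_results)  # one-line evaluation
print(eval_results)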

⚡ Installation

GigaDatasets can be installed from PyPI. We recommend installing it in a virtual environment (venv or conda, for instance):

pip3 install giga-datasets

Alternatively, you can install directly from source for the latest updates:

conda create -n giga_datasets python=3.11.10
conda activate giga_datasets
git clone https://github.com/open-gigaai/giga-datasets.git
cd giga-datasets
pip3 install -e .
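
Either way, you can sanity-check the installation with a bare import (this assumes nothing beyond the package being importable):

python3 -c "import giga_datasets"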

🚀 Usage

We provide accessible demo data and Jupyter notebooks in the getting_started directory. Utility scripts can be found in the scripts folder.

1. Load a dataset

You can load datasets with the load_dataset function from the giga_datasets library. We provide a demo dataset in the giga_data directory for you to try out. Here is a quick example; the full code is available in getting_started:

from giga_datasets import load_dataset

dataset = load_dataset('./getting_started/giga_data')
data_dict = dataset[0]
print('Dataset size:', len(dataset))
print('First item in dataset:', data_dict)
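
Since the dataset supports len() and integer indexing, as shown above, it also satisfies PyTorch's map-style dataset protocol and can be wrapped directly by a DataLoader. This sketch continues from the snippet above; collate_fn=list simply keeps each batch as a list of raw sample dicts.

from torch.utils.data import DataLoader

# The dataset implements __len__ and __getitem__, so DataLoader can index it.
loader = DataLoader(dataset, batch_size=4, shuffle=True, collate_fn=list)
for batch in loader:
    print('Batch of', len(batch), 'samples')
    break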

The giga_data directory contains the following structure:

giga_data/
├── config.json          # Configuration file describing the dataset
├── labels/              # Directory containing label files
│   ├── config.json      # Additional configuration for labels
│   └── data.pkl         # Serialized label data
└── images/              # Directory containing image files
    ├── config.json      # Additional configuration for images
    ├── data.mdb         # LMDB data file for images
    └── lock.mdb         # LMDB lock file

The config.json file in the giga_data directory contains the following structure:

{
    "_class_name": "Dataset",
    "config_paths": [
        "labels/config.json",
        "images/config.json"
    ]
}

This file specifies:

  • _class_name: Indicates the class type used for the dataset, which is Dataset in this case.
  • config_paths: Lists paths to additional configuration files for specific components of the dataset, such as labels/config.json and images/config.json.
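
Because the top-level config is plain JSON, you can inspect it directly; the snippet below just echoes the structure shown above.

import json

# Read the top-level dataset config and print its fields.
with open('./getting_started/giga_data/config.json') as f:
    config = json.load(f)
print(config['_class_name'])   # 'Dataset'
print(config['config_paths'])  # ['labels/config.json', 'images/config.json']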

2. Package a dataset

For an unstructured dataset, you can use the Writer classes (including PklWriter, FileWriter, and LmdbWriter) to package your data into a structured format. Below is an example of how to package a dataset consisting of images and labels.

The raw_data directory contains the following structure:

raw_data/
├── 0.json               # Annotation file for image 0
├── 0.png                # Image file 0
├── 1.json               # Annotation file for image 1
├── 1.png                # Image file 1
├── ...

You can run the following Python code to package the dataset; the full code is available in getting_started:

import json
import os

from tqdm import tqdm

# Import paths are abbreviated here; see the full script in getting_started.
from giga_datasets import Dataset, LmdbWriter, PklWriter, load_dataset, utils

image_dir = './raw_data'    # unstructured input, laid out as shown above
save_dir = './packed_data'  # destination for the packaged dataset

image_paths = utils.list_dir(image_dir, recursive=True, exts=['.png', '.jpg', '.jpeg'])
label_writer = PklWriter(os.path.join(save_dir, 'labels'))
image_writer = LmdbWriter(os.path.join(save_dir, 'images'))
for idx in tqdm(range(len(image_paths))):
    # Each image has a JSON annotation with the same stem (handles any extension).
    label_path = os.path.splitext(image_paths[idx])[0] + '.json'
    with open(label_path) as f:
        label_dict = json.load(f)
    label_dict['data_index'] = idx
    label_writer.write_dict(label_dict)
    image_writer.write_image(idx, image_paths[idx])
label_writer.write_config()
image_writer.write_config()
label_writer.close()
image_writer.close()
# Combine the packaged components into one dataset and save its top-level config.
label_dataset = load_dataset(os.path.join(save_dir, 'labels'))
image_dataset = load_dataset(os.path.join(save_dir, 'images'))
dataset = Dataset([label_dataset, image_dataset])
dataset.save(save_dir)

We support packaging and reading different data formats. In addition to packaging images, we also provide an example of packaging video data, where we store the video's metadata.

# package video samples in the input directory to the output directory
python getting_started/pack_videos.py --video_dir /path/to/your/raw_videos --save_dir ./giga_videos

# if you want to package videos into lmdb format for better read performance
python getting_started/pack_videos.py --video_dir /path/to/your/raw_videos --save_dir ./giga_videos --pack-lmdb

# if you want to package samples, but not copy the video files and only store the metadata and absolute paths
python getting_started/pack_videos.py --video_dir /path/to/your/raw_videos --save_dir ./giga_videos --only_path
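
Once packaged, the video dataset loads like any other GigaDatasets dataset. This is a minimal sketch; the exact metadata fields in each sample depend on the packing options used above.

from giga_datasets import load_dataset

video_dataset = load_dataset('./giga_videos')
print('Number of videos:', len(video_dataset))
sample = video_dataset[0]  # a dict of the stored video metadata
print('Available fields:', list(sample.keys()))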

3. Add a new field

In model training or inference, a sample is often represented as a dictionary with multiple fields. Our framework is designed to be easily extensible to accommodate new data fields. Below is an example of how to add Canny edge maps as a new field:

python getting_started/add_new_filed.py --data_dir getting_started/giga_data
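
For intuition, here is a rough sketch of what such a script does, following the packaging pattern from section 2. It is not the bundled script: the 'image' field name, the array-valued write_image call, and the use of OpenCV are all assumptions for illustration.

import os

import cv2  # OpenCV, used here to compute Canny edge maps
from giga_datasets import LmdbWriter, load_dataset

data_dir = 'getting_started/giga_data'
dataset = load_dataset(data_dir)

# Write the new field into its own component, mirroring labels/ and images/.
canny_writer = LmdbWriter(os.path.join(data_dir, 'canny'))
for idx in range(len(dataset)):
    image = dataset[idx]['image']         # field name is an assumption
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    canny = cv2.Canny(gray, 100, 200)     # standard OpenCV edge detector
    canny_writer.write_image(idx, canny)  # assumes arrays are accepted here
canny_writer.write_config()
canny_writer.close()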

Additional Usage Examples

Note: More usage examples and feature documentation will be added in future updates—stay tuned!

🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details.

📄 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

📖 Citation

@misc{gigaai2025gigadatasets,
    author = {GigaAI},
    title = {GigaDatasets: A Unified and Lightweight Framework for Data Curation, Evaluation and Visualization},
    year = {2025},
    publisher = {GitHub},
    journal = {GitHub repository},
    howpublished = {\url{https://github.com/open-gigaai/giga-datasets}}
}

