Utility package for managing datasets

DatasetHandler

Introduction

DatasetHandler is a CLI tool for downloading datasets and performing any necessary preprocessing. It is invoked as datasets and provides two commands, download and extract, as the help output shows:

$ datasets
Usage: datasets [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  download
  extract

Installation

pip install dataset_handler

Usage

datasets download https://lab.osai.ai/datasets/openttgames/data /data/dataset
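The same command can be scripted from Python. The following is a minimal sketch, assuming the argument order shown above (URL first, then destination directory) and that the datasets executable is on PATH. The helper names build_download_command and download are hypothetical and are not part of the package.

```python
import subprocess
from typing import List


def build_download_command(url: str, destination: str) -> List[str]:
    """Build the argv list for `datasets download URL DESTINATION`.

    Assumption: the CLI takes the URL first and the destination second,
    mirroring the usage example in this README.
    """
    return ["datasets", "download", url, destination]


def download(url: str, destination: str) -> None:
    """Run the CLI and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(build_download_command(url, destination), check=True)
```

Calling download("https://lab.osai.ai/datasets/openttgames/data", "/data/dataset") would then run the command from the usage example. Passing the arguments as a list avoids shell quoting problems.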

Performance

DatasetHandler uses multiprocessing to spread time-consuming tasks across multiple CPU cores. Running these tasks in parallel shortens processing time when handling large datasets.

Key Areas Utilizing Multiprocessing:

  • Downloading Files: The download_multiprocess function employs multiple processes to download files concurrently, reducing the time required to fetch large datasets from the internet.
  • Unarchiving Files: The unarchive_multiprocess function unpacks multiple archive files simultaneously, speeding up the extraction process of downloaded data.
  • Extracting Images from Videos: The extract_multiprocess function processes multiple video files in parallel to extract frames, which is especially useful for large collections of video data.

Parallelizing these tasks reduces the wall-clock time of the data preparation steps.
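The general pattern behind these functions can be illustrated with the unarchiving step. This is a minimal sketch using only the standard library, not the package's actual implementation: the names unpack_one and unpack_all are hypothetical, and the real unarchive_multiprocess function may use a different signature and strategy.

```python
import shutil
from functools import partial
from multiprocessing import Pool
from pathlib import Path
from typing import List, Optional, Sequence


def unpack_one(archive: Path, dest_root: Path) -> Path:
    """Unpack a single archive into its own subdirectory of dest_root."""
    # "data.tar.gz" -> "data"; the subdirectory keeps archives from colliding.
    target = dest_root / archive.name.split(".")[0]
    target.mkdir(parents=True, exist_ok=True)
    # The archive format is inferred from the file extension.
    shutil.unpack_archive(str(archive), str(target))
    return target


def unpack_all(
    archives: Sequence[Path],
    dest_root: Path,
    processes: Optional[int] = None,
) -> List[Path]:
    """Unpack several archives in parallel, one worker process per task.

    processes=None lets Pool default to the number of available CPUs.
    """
    worker = partial(unpack_one, dest_root=dest_root)
    with Pool(processes=processes) as pool:
        return pool.map(worker, archives)
```

On platforms that start worker processes with the spawn method (Windows and macOS), unpack_all should be called from under an if __name__ == "__main__": guard. The worker is defined at module level so that it can be pickled and sent to the pool.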

Contributing

  • We use Poetry to manage dependencies. Please make sure you have Poetry installed:
poetry version
  • Install the dependencies using Poetry:
poetry install --with dev
  • Before you commit and push your changes, run the following checks:
poetry run ruff check
poetry run ruff format
poetry run mypy
poetry run pytest

TODO

  • [documentation] Add badges to README.md.
  • [refactor] Add error handling.
  • [refactor] Add logging.
  • [test] Try to minimize fixtures by using more of pytest-mock.
  • [test] Increase code coverage to 40%.
  • [fix] Triage why coverage data differs between certain versions of Python.
