Utility package for managing dataset
DatasetHandler
Introduction
DatasetHandler is a CLI tool for downloading datasets and performing any necessary preprocessing.
It provides two main commands, download and extract, enabling easy management of datasets.
$ datasets
Usage: datasets [OPTIONS] COMMAND [ARGS]...
Options:
--help Show this message and exit.
Commands:
download
extract
Installation
pip install dataset_handler
Usage
datasets download https://lab.osai.ai/datasets/openttgames/data /data/dataset
Performance
DatasetHandler uses multiprocessing to spread time-consuming tasks across multiple CPU cores, which shortens the time needed to prepare large datasets.
Key Areas Utilizing Multiprocessing:
- Downloading Files: The download_multiprocess function employs multiple processes to download files concurrently, reducing the time required to fetch large datasets from the internet.
- Unarchiving Files: The unarchive_multiprocess function unpacks multiple archive files simultaneously, speeding up the extraction process of downloaded data.
- Extracting Images from Videos: The extract_multiprocess function processes multiple video files in parallel to extract frames, which is especially useful for large collections of video data.
By parallelizing these tasks, DatasetHandler ensures that data preparation steps are performed efficiently, saving valuable time and computational resources.
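The general pattern behind the unarchiving step can be sketched as follows. This is a minimal illustration using only the standard library, not DatasetHandler's actual implementation: the helper name unarchive_parallel and its parameters (processes, start_method) are hypothetical, and the real unarchive_multiprocess function may have a different signature and behavior.

```python
import multiprocessing
import shutil
from pathlib import Path


def unarchive_parallel(archives, dest_root, processes=None, start_method=None):
    """Unpack each archive into its own subdirectory of dest_root, in parallel.

    Hypothetical helper for illustration only; not part of DatasetHandler's API.
    """
    dest_root = Path(dest_root)
    # One (archive, target directory) pair per task. The target directory is
    # named after the archive file name up to its first dot.
    jobs = [(str(a), str(dest_root / Path(a).name.split(".")[0])) for a in archives]
    # start_method=None selects the platform's default start method.
    ctx = multiprocessing.get_context(start_method)
    with ctx.Pool(processes=processes) as pool:
        # shutil.unpack_archive(filename, extract_dir) infers the archive
        # format from the file extension.
        pool.starmap(shutil.unpack_archive, jobs)
    return [Path(target) for _, target in jobs]
```

When calling a helper like this from a script, place the call under an `if __name__ == "__main__":` guard so that worker processes can be started safely on platforms that do not fork. Each archive is an independent task, so the work divides cleanly across processes.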
Contributing
- We use poetry for managing dependencies. Please make sure you have poetry installed:
poetry version
- Install the dependencies using poetry:
poetry install --with dev
- Before you commit and push your changes, please run the following:
poetry run ruff check
poetry run ruff format
poetry run mypy
poetry run pytest
TODO
- [documentation] Add badges to README.md.
- [refactor] Add error handling.
- [refactor] Add logging.
- [test] Try to minimize fixtures by using more of pytest-mock.
- [test] Increase code coverage to 40%.
- [fix] Triage why coverage data is different on certain versions of Python.