Automated workflow for dataset discovery and downloading in ML/DL and data analysis projects.

kaggle_utils (beta 0.1.0)

Utilities for checking, downloading, and extracting Kaggle datasets within ML/DL and data analysis workflows.


Features

  • Check whether a dataset directory exists and list contained files
  • Download and unzip Kaggle datasets using the Kaggle CLI
  • Configurable dataset root directory via environment variable
  • Lightweight, reusable design for notebook and local workflows

Requirements

  • Python 3.10+
  • Kaggle CLI installed and authenticated

Install Kaggle CLI

pip install kaggle

Ensure the kaggle command is available in your PATH.

Configure your Kaggle API token:

  1. Download kaggle.json from your Kaggle account
  2. Place it at:
~/.kaggle/kaggle.json

Ensure correct permissions:

chmod 600 ~/.kaggle/kaggle.json
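The token placement and permission steps above can be checked with a short script. This is a sketch under stated assumptions: token_problems is a hypothetical helper, not part of kaggle_utils, and the permission test is only meaningful on POSIX systems (macOS / Linux).

```python
import stat
from pathlib import Path


def token_problems(token_path: Path) -> list[str]:
    """Return a list of human-readable problems with the Kaggle token file.

    Hypothetical helper for illustration. An empty list means the file
    exists and is not accessible by group or others (for example mode 600).
    """
    if not token_path.is_file():
        return [f"{token_path} does not exist"]

    problems: list[str] = []
    mode = stat.S_IMODE(token_path.stat().st_mode)
    # Any group/other permission bit means the token is too widely accessible.
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        problems.append(
            f"{token_path} has mode {oct(mode)}; run: chmod 600 {token_path}"
        )
    return problems
```

For the default location, call token_problems(Path.home() / ".kaggle" / "kaggle.json") and print whatever it returns.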

Installation

Local (Editable Mode – Recommended for Development)

From the project root:

pip install -e .

Configuration

Dataset root directory defaults to:

datasets

Override it via environment variable:

macOS / Linux:

export KAGGLE_INPUT_DIR="datasets"

Windows (PowerShell; setx persists the variable for new terminal sessions, not the current one):

setx KAGGLE_INPUT_DIR "datasets"

Usage

1️⃣ Check Local Dataset

from kaggle_utils import check

result = check("titanic")

if not result:
    print("Dataset not found or empty.")
else:
    print(result)

Example return:

{
    "datasets/titanic": ["train.csv", "test.csv"]
}
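The return shape shown above (directory path mapped to a list of file names, or an empty dict) can be reproduced with a few lines of pathlib. This is an illustrative sketch, not the library's implementation: list_dataset_files is a hypothetical name, and sorting the file names is an assumption that the example return above does not confirm.

```python
from pathlib import Path


def list_dataset_files(name: str, root: str = "datasets") -> dict[str, list[str]]:
    """Map the dataset directory to its file names, or return {} if missing or empty.

    Hypothetical stand-in for a check()-style function.
    """
    dataset_dir = Path(root) / name
    if not dataset_dir.is_dir():
        return {}
    # Only regular files are listed; subdirectories are ignored (assumption).
    files = sorted(p.name for p in dataset_dir.iterdir() if p.is_file())
    return {dataset_dir.as_posix(): files} if files else {}
```

Because an empty dict is falsy, callers can use the same `if not result:` test shown in the usage example.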

2️⃣ Download Dataset

from kaggle_utils import download

files = download(owner="heptapod", dataset="titanic")

print(files)

If the dataset does not exist or the Kaggle CLI fails, download() raises a RuntimeError.
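One way a wrapper around the Kaggle CLI can turn failures into a RuntimeError is sketched below. It assumes the kaggle datasets download command with its -d, -p, and --unzip options. The function names, the per-dataset destination directory, and the injectable runner parameter are all hypothetical and do not describe how download.py is actually written.

```python
import subprocess
from pathlib import Path


def build_download_command(owner: str, dataset: str, dest: Path) -> list[str]:
    """Build the Kaggle CLI invocation that downloads and unzips owner/dataset into dest."""
    return [
        "kaggle", "datasets", "download",
        "-d", f"{owner}/{dataset}",
        "-p", str(dest),
        "--unzip",
    ]


def run_download(owner: str, dataset: str, root: str = "datasets",
                 runner=subprocess.run) -> list[str]:
    """Run the CLI and return the downloaded file names; raise RuntimeError on failure.

    `runner` defaults to subprocess.run and can be replaced in tests (assumption).
    """
    dest = Path(root) / dataset
    dest.mkdir(parents=True, exist_ok=True)
    cmd = build_download_command(owner, dataset, dest)
    try:
        proc = runner(cmd, capture_output=True, text=True)
    except FileNotFoundError as exc:
        # The kaggle executable is not on PATH.
        raise RuntimeError("kaggle CLI not found on PATH") from exc
    if proc.returncode != 0:
        raise RuntimeError(
            f"kaggle CLI failed (exit code {proc.returncode}): {proc.stderr.strip()}"
        )
    return sorted(p.name for p in dest.iterdir() if p.is_file())
```

Callers would wrap the call in try/except RuntimeError to report the CLI's error message instead of letting the exception end the script.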


Error Handling

  • check() → returns an empty dict {} if the dataset directory is missing or empty
  • download() → raises RuntimeError on failure
  • Kaggle Notebook environments are not supported

Project Structure

kaggle_utils/
  __init__.py
  config.py
  check.py
  download.py

Design Philosophy

  • Separation of concerns
  • Minimal external dependencies
  • Clear failure handling strategy
  • Designed for local ML experimentation

Roadmap

beta 0.2.0

  • download_if_not_exists() orchestration helper
  • Logging support
  • Retry mechanism
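As a rough illustration of what the planned orchestration helper and retry mechanism could look like, the sketch below combines a check step, a download step, and a simple retry loop. Everything here is an assumption about a feature that is not yet released: the signature, the retry count, the linear backoff, and the flattening of the check result into a file list are all hypothetical. The check and download functions are passed in as parameters so the sketch does not depend on the package being installed.

```python
import time
from typing import Callable


def download_if_not_exists(
    name: str,
    owner: str,
    check_fn: Callable[[str], dict[str, list[str]]],
    download_fn: Callable[..., list[str]],
    retries: int = 3,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> list[str]:
    """Return local files if the dataset exists, otherwise download it with retries.

    Hypothetical sketch of the roadmap item; not the released API.
    """
    found = check_fn(name)
    if found:
        # Flatten {"dir": [files]} into a plain list of file names.
        return [f for files in found.values() for f in files]

    last_error: RuntimeError | None = None
    for attempt in range(1, retries + 1):
        try:
            return download_fn(owner=owner, dataset=name)
        except RuntimeError as exc:
            last_error = exc
            if attempt < retries:
                sleep(delay * attempt)  # linear backoff between attempts
    raise RuntimeError(f"download failed after {retries} attempts") from last_error
```

With the package installed, the call might look like download_if_not_exists("titanic", "heptapod", check, download), assuming download() returns a list of file names.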

1.0.0

  • Stable API
  • PyPI publication

License

Distributed under the MIT License. See the LICENSE file for more information.

