Automated workflow for dataset discovery and downloading in ML/DL and data analysis projects.

kaggle_utils (beta 0.1.0)

Utilities for checking, downloading, and extracting Kaggle datasets within ML/DL and data analysis workflows.


Features

  • Check whether a dataset directory exists and list the files it contains
  • Download and unzip Kaggle datasets using the Kaggle CLI
  • Configure the dataset root directory via an environment variable
  • Lightweight, reusable design for notebook and local workflows

Requirements

  • Python 3.10+
  • Kaggle CLI installed and authenticated

Install Kaggle CLI

pip install kaggle

Ensure the kaggle command is available in your PATH.

Configure your Kaggle API token:

  1. Download kaggle.json from your Kaggle account
  2. Place it in:
~/.kaggle/kaggle.json

Restrict the file's permissions so that only your user can read it (macOS / Linux):

chmod 600 ~/.kaggle/kaggle.json
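
The setup steps above can be verified from Python before running any download. The following is a minimal sketch, not part of kaggle_utils: the function name kaggle_cli_ready is hypothetical, and the permission check only applies on POSIX systems.

```python
import os
import shutil
import stat
from pathlib import Path
from typing import List, Optional


def kaggle_cli_ready(token_path: Optional[Path] = None) -> List[str]:
    """Return a list of setup problems; an empty list means nothing was found.

    Hypothetical helper for illustration only, not part of kaggle_utils.
    """
    problems = []

    # The kaggle executable must be discoverable on PATH.
    if shutil.which("kaggle") is None:
        problems.append("kaggle command not found on PATH")

    # Default token location described in the setup steps.
    token = token_path or Path.home() / ".kaggle" / "kaggle.json"
    if not token.is_file():
        problems.append(f"token file missing: {token}")
    elif os.name == "posix":
        # chmod 600 means no permission bits for group or others.
        mode = stat.S_IMODE(token.stat().st_mode)
        if mode & 0o077:
            problems.append(f"token permissions too open: {oct(mode)}")

    return problems
```

Calling kaggle_cli_ready() with no arguments checks the default token location and returns an empty list when no problem was detected.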

Installation

Local (Editable Mode – Recommended for Development)

From the project root:

pip install -e .

Configuration

Dataset root directory defaults to:

datasets

Override it via environment variable:

macOS / Linux:

export KAGGLE_INPUT_DIR="datasets"

Windows (PowerShell; setx stores the value for new sessions only, so open a new terminal afterwards):

setx KAGGLE_INPUT_DIR "datasets"
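
To illustrate how an override like this is typically resolved, here is a minimal sketch of reading KAGGLE_INPUT_DIR with a fallback to the default. How config.py actually does this is not shown in this README, so the function name dataset_root and the handling of an empty value are assumptions.

```python
import os
from pathlib import Path

# Default root stated in the Configuration section.
DEFAULT_ROOT = "datasets"


def dataset_root() -> Path:
    """Resolve the dataset root (hypothetical helper, not the package's API).

    An unset or empty KAGGLE_INPUT_DIR falls back to the default.
    """
    return Path(os.environ.get("KAGGLE_INPUT_DIR") or DEFAULT_ROOT)
```

Reading the variable at call time rather than at import time means a value changed inside a running notebook takes effect on the next call.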

Usage

1️⃣ Check Local Dataset

from kaggle_utils import check

result = check("titanic")

if not result:
    print("Dataset not found or empty.")
else:
    print(result)

Example return value:

{
    "datasets/titanic": ["train.csv", "test.csv"]
}
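
For readers who want to see what a check with this return shape could look like, here is a self-contained sketch. It is an assumption about the behavior, not the package's source: check_sketch is a hypothetical name, and sorting the file names alphabetically is a choice made here for predictable output.

```python
from pathlib import Path


def check_sketch(name: str, root: str = "datasets") -> dict:
    """Map the dataset directory to its file names, or return {}.

    Hypothetical re-implementation for illustration only.
    """
    dataset_dir = Path(root) / name

    # A missing directory and an empty directory both yield a falsy result.
    if not dataset_dir.is_dir():
        return {}
    files = sorted(p.name for p in dataset_dir.iterdir() if p.is_file())
    if not files:
        return {}

    return {dataset_dir.as_posix(): files}
```

Because an empty dict is falsy, the `if not result:` test in the usage example covers both the missing and the empty case.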

2️⃣ Download Dataset

from kaggle_utils import download

files = download(owner="heptapod", dataset="titanic")

print(files)

If the dataset does not exist or the CLI fails, download() raises a RuntimeError.
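
In a notebook it is often convenient to catch this error instead of letting it stop the cell. The wrapper below is a small sketch: safe_download is a hypothetical name, and the download function is passed in as an argument so the snippet runs on its own.

```python
def safe_download(download_fn, owner, dataset):
    """Call download_fn and return its result, or None if it raises RuntimeError.

    Hypothetical wrapper for illustration; download_fn is expected to accept
    owner= and dataset= keyword arguments, as in the usage example.
    """
    try:
        return download_fn(owner=owner, dataset=dataset)
    except RuntimeError as exc:
        # Report the failure and let the caller decide what to do next.
        print(f"Could not download {owner}/{dataset}: {exc}")
        return None
```

With the package installed, this would be called as safe_download(download, "heptapod", "titanic") after `from kaggle_utils import download`.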


Error Handling

  • check() → returns an empty dict {} if the dataset directory is missing or empty
  • download() → raises RuntimeError on failure
  • Kaggle Notebook environments are not supported
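
The failure strategy listed above can be sketched as a thin wrapper around the Kaggle CLI that turns a non-zero exit code into a RuntimeError. This is an illustration under assumptions, not the code in download.py: both function names are hypothetical, and the command uses the Kaggle CLI's `datasets download` subcommand with its `-d`, `-p`, and `--unzip` options.

```python
import subprocess


def build_download_command(owner, dataset, dest):
    """Build the CLI invocation as an argument list (hypothetical helper)."""
    return [
        "kaggle", "datasets", "download",
        "-d", f"{owner}/{dataset}",  # dataset slug in owner/name form
        "-p", str(dest),             # target directory
        "--unzip",                   # extract the archive after download
    ]


def run_cli(cmd):
    """Run a command and raise RuntimeError if it is missing or fails."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True)
    except FileNotFoundError as exc:
        raise RuntimeError(f"command not found: {cmd[0]}") from exc
    if proc.returncode != 0:
        raise RuntimeError(
            f"{cmd[0]} exited with code {proc.returncode}: {proc.stderr.strip()}"
        )
    return proc.stdout
```

Funneling both "command missing" and "command failed" into one exception type keeps the caller's error handling to a single except clause.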

Project Structure

kaggle_utils/
  __init__.py
  config.py
  check.py
  download.py

Design Philosophy

  • Separation of concerns
  • Minimal external dependencies
  • Clear failure handling strategy
  • Designed for local ML experimentation

Roadmap

beta 0.2.0

  • download_if_not_exists() orchestration helper
  • Logging support
  • Retry mechanism
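
Until these land, the planned orchestration and retry behavior can be approximated in user code. The sketch below is only a guess at what such a helper might look like: the name download_if_missing, its parameters, and the fixed delay between attempts are all assumptions, and the check and download functions are passed in so the snippet is self-contained.

```python
import time


def download_if_missing(check_fn, download_fn, owner, dataset,
                        retries=3, delay=1.0):
    """Return the check result, downloading first if nothing is present.

    Hypothetical helper for illustration; not the planned API.
    """
    existing = check_fn(dataset)
    if existing:
        return existing  # already on disk, nothing to download

    last_error = None
    for attempt in range(1, retries + 1):
        try:
            download_fn(owner=owner, dataset=dataset)
            return check_fn(dataset)
        except RuntimeError as exc:
            last_error = exc
            if attempt < retries:
                time.sleep(delay)  # fixed pause before the next attempt

    raise RuntimeError(
        f"download failed after {retries} attempts"
    ) from last_error
```

With the package installed, a call would look like download_if_missing(check, download, "heptapod", "titanic").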

1.0.0

  • Stable API
  • PyPI publication

License

Distributed under the MIT License. See the LICENSE file for more information.

