Automated workflow for dataset discovery and downloading in ML/DL and data analysis projects.
kaggle_utils (beta 0.1.0)
Utilities for checking, downloading, and extracting Kaggle datasets within ML/DL and data analysis workflows.
Features
- Check whether a dataset directory exists and list contained files
- Download and unzip Kaggle datasets using Kaggle CLI
- Configurable dataset root directory via environment variable
- Lightweight, reusable design for notebook and local workflows
Requirements
- Python 3.10+
- Kaggle CLI installed and authenticated
Install Kaggle CLI
pip install kaggle
Ensure the kaggle command is available in your PATH.
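One way to confirm the CLI is reachable from Python is to look it up on PATH with the standard library. This is a minimal sketch; the helper name `find_kaggle_cli` is hypothetical and not part of kaggle_utils.

```python
import shutil
from typing import Optional


def find_kaggle_cli(command: str = "kaggle") -> Optional[str]:
    """Return the full path of the executable, or None if it is not on PATH.

    Hypothetical helper for illustration; not part of kaggle_utils.
    """
    return shutil.which(command)


if __name__ == "__main__":
    path = find_kaggle_cli()
    if path is None:
        print("kaggle CLI not found on PATH; try: pip install kaggle")
    else:
        print(f"kaggle CLI found at {path}")
```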
Configure your Kaggle API token:
- Download kaggle.json from your Kaggle account
- Place it in:
~/.kaggle/kaggle.json
Ensure correct permissions:
chmod 600 ~/.kaggle/kaggle.json
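The token setup above can also be verified from Python. The sketch below reports a missing file or, on POSIX systems, permissions wider than owner-only. The function name `token_problems` is an assumption made for illustration, not an API of kaggle_utils or the Kaggle CLI.

```python
import os
import stat
from pathlib import Path


def token_problems(token_path: Path) -> list:
    """Return a list of human-readable problems with the Kaggle token file.

    Hypothetical helper for illustration; an empty list means no problem was found.
    """
    if not token_path.is_file():
        return [f"{token_path} does not exist"]
    problems = []
    # On POSIX systems, flag any permission bits granted to group or others.
    if os.name == "posix":
        mode = stat.S_IMODE(token_path.stat().st_mode)
        if mode & (stat.S_IRWXG | stat.S_IRWXO):
            problems.append(
                f"{token_path} has mode {oct(mode)}; run: chmod 600 {token_path}"
            )
    return problems


if __name__ == "__main__":
    for problem in token_problems(Path.home() / ".kaggle" / "kaggle.json"):
        print(problem)
```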
Installation
Local (Editable Mode – Recommended for Development)
From the project root:
pip install -e .
Configuration
Dataset root directory defaults to:
datasets
Override it via environment variable:
macOS / Linux:
export KAGGLE_INPUT_DIR="datasets"
Windows (PowerShell; setx persists the variable for new sessions only, so open a new terminal afterwards):
setx KAGGLE_INPUT_DIR "datasets"
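The override described above can be pictured as a single environment lookup with a fallback. This is a sketch only: how config.py actually resolves the value is not shown in this README, and `dataset_root` and `DEFAULT_ROOT` are hypothetical names.

```python
import os
from pathlib import Path

DEFAULT_ROOT = "datasets"  # default stated in this README


def dataset_root(env=None) -> Path:
    """Resolve the dataset root from KAGGLE_INPUT_DIR, falling back to the default.

    Hypothetical helper; `env` is injectable so the lookup can be tested
    without touching the real environment.
    """
    env = os.environ if env is None else env
    # Assumption: an unset or empty variable means "use the default".
    return Path(env.get("KAGGLE_INPUT_DIR") or DEFAULT_ROOT)


if __name__ == "__main__":
    print(dataset_root())
```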
Usage
1️⃣ Check Local Dataset
from kaggle_utils import check
result = check("titanic")
if not result:
    print("Dataset not found or empty.")
else:
    print(result)
Example return:
{
    "datasets/titanic": ["train.csv", "test.csv"]
}
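To make the return shape concrete, here is a standalone sketch that produces a dict of the same form from a directory listing. It is not the package's implementation: `list_dataset` is a hypothetical name, and sorting the file names and returning {} for an empty directory are assumptions.

```python
from pathlib import Path


def list_dataset(root: str, name: str) -> dict:
    """Return {dataset_dir: [file names]}, or {} if the directory is missing or empty.

    Hypothetical stand-in for illustration; file names are sorted for stable output.
    """
    dataset_dir = Path(root) / name
    if not dataset_dir.is_dir():
        return {}
    files = sorted(p.name for p in dataset_dir.iterdir() if p.is_file())
    # An empty result is falsy, so callers can write `if not result:`.
    return {dataset_dir.as_posix(): files} if files else {}


if __name__ == "__main__":
    print(list_dataset("datasets", "titanic"))
```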
2️⃣ Download Dataset
from kaggle_utils import download
files = download(owner="heptapod", dataset="titanic")
print(files)
If the dataset does not exist or the CLI fails, download() raises a RuntimeError.
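The sketch below shows one plausible way to wrap the Kaggle CLI so that any failure surfaces as a RuntimeError. It assumes the `kaggle datasets download` subcommand with its `-d`, `-p`, and `--unzip` options; the function names are hypothetical, and the package's own download.py may work differently.

```python
import subprocess
from pathlib import Path


def build_download_command(owner: str, dataset: str, dest: Path) -> list:
    """Build the CLI invocation for downloading and unzipping owner/dataset into dest."""
    return [
        "kaggle", "datasets", "download",
        "-d", f"{owner}/{dataset}",
        "-p", str(dest),
        "--unzip",
    ]


def run_cli(cmd: list) -> None:
    """Run a command and convert every failure mode into RuntimeError."""
    try:
        subprocess.run(cmd, check=True, capture_output=True, text=True)
    except FileNotFoundError as exc:
        # The executable itself is missing from PATH.
        raise RuntimeError(f"command not found: {cmd[0]}") from exc
    except subprocess.CalledProcessError as exc:
        # The command ran but exited with a non-zero status.
        detail = (exc.stderr or "").strip()
        raise RuntimeError(f"{cmd[0]} exited with {exc.returncode}: {detail}") from exc


if __name__ == "__main__":
    print(build_download_command("heptapod", "titanic", Path("datasets/titanic")))
```

Separating command construction from execution keeps the first half testable without network access or credentials.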
Error Handling
- check() → returns an empty dict {} if the dataset directory is missing or empty
- download() → raises RuntimeError on failure
- Kaggle Notebook environments are not supported
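The two failure contracts above combine naturally: test the falsy result of check(), then let download() raise. The sketch below takes both functions as parameters so that it runs without the package installed; `ensure_dataset` is a hypothetical helper, not part of the current API.

```python
from typing import Callable


def ensure_dataset(
    name: str,
    owner: str,
    check_fn: Callable[[str], dict],
    download_fn: Callable[..., object],
):
    """Return local files if present; otherwise download them.

    Hypothetical helper: check_fn and download_fn are expected to follow the
    contracts listed above (empty dict when missing, RuntimeError on failure).
    """
    found = check_fn(name)
    if found:  # non-empty dict: the dataset is already on disk
        return found
    try:
        return download_fn(owner=owner, dataset=name)
    except RuntimeError as exc:
        # Add context about which dataset failed, then re-raise.
        raise RuntimeError(f"could not fetch {owner}/{name}: {exc}") from exc
```

With the package installed, a call might look like ensure_dataset("titanic", "heptapod", check, download), using the check and download functions imported from kaggle_utils.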
Project Structure
kaggle_utils/
    __init__.py
    config.py
    check.py
    download.py
Design Philosophy
- Separation of concerns
- Minimal external dependencies
- Clear failure handling strategy
- Designed for local ML experimentation
Roadmap
beta 0.2.0
- download_if_not_exists() orchestration helper
- Logging support
- Retry mechanism
1.0.0
- Stable API
- PyPI publication
License
Distributed under the MIT License. See the LICENSE file for more information.
File details
Details for the file kaggle_utils_dataset-0.1.0b0.tar.gz.
File metadata
- Download URL: kaggle_utils_dataset-0.1.0b0.tar.gz
- Upload date:
- Size: 6.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 14902e5b97177448d973c3104cbc841565597f8a9583e46c7ac519e387767158 |
| MD5 | 0da8d10c76da7a9e33a4cf77e80e3a15 |
| BLAKE2b-256 | d311de9584b6fe78352f4a4964a0f1429b04c31affe09537c4100cc70e22dcb4 |
File details
Details for the file kaggle_utils_dataset-0.1.0b0-py3-none-any.whl.
File metadata
- Download URL: kaggle_utils_dataset-0.1.0b0-py3-none-any.whl
- Upload date:
- Size: 7.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6ecf2a84aaea5e4c62b079fe36ce9d5eeb9c098d0331118c0f5c6d215b11368d |
| MD5 | 7664b3c381cb6f7a67f07fb529c5882d |
| BLAKE2b-256 | 19a45e354087d1cbd65b148b1e9f2d9697a5ae31036c593f05e152aa8a4da7be |