dumb-datasets

A lightweight wrapper around HuggingFace datasets.

Features

🔄 Complete wrapper around HuggingFace datasets with extended functionality
🚀 Cached dataset loading with smart retries and error handling
🛠️ Rich helper functions for common dataset operations
📊 Streamlined data processing pipelines with fluent API
🔍 Type validation with Pydantic models
🔌 Extension points via hooks and adapters
📋 Feature definition and inference utilities
⚡ Ultra-fast downloads with HF Transfer enabled by default
🔗 Integrated HuggingFace Hub API for repository interactions

Installation

pip install dumb-datasets

Or with Poetry:

poetry add dumb-datasets

Usage

Loading Datasets

from dumb_datasets import load_dataset, set_api_token

# Optionally set your HuggingFace API token for private datasets
set_api_token("your_hf_token")

# Load a dataset
dataset = load_dataset("squad", split="train")

# Access information about the dataset
info = dataset.info()
print(f"Number of rows: {info['num_rows']}")
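
The feature list mentions cached loading with retries and error handling. How the library implements this is not shown here, so the following is only a minimal standard-library sketch of the general pattern. The names `load_with_retries`, `_CACHE`, and `flaky_loader` are hypothetical and are not part of dumb-datasets.

```python
import time
from typing import Any, Callable, Dict, Tuple

# Hypothetical in-memory cache keyed by (name, split); not the library's real cache.
_CACHE: Dict[Tuple[str, str], Any] = {}


def load_with_retries(
    name: str,
    split: str,
    loader: Callable[[str, str], Any],
    max_attempts: int = 3,
    base_delay: float = 0.0,
) -> Any:
    """Return a cached result, or call `loader` with exponential backoff."""
    key = (name, split)
    if key in _CACHE:
        return _CACHE[key]
    for attempt in range(1, max_attempts + 1):
        try:
            result = loader(name, split)
        except OSError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * 2 ** (attempt - 1))
        else:
            _CACHE[key] = result
            return result
    raise ValueError("max_attempts must be at least 1")


# Usage with a fake loader that fails twice, then succeeds.
calls = {"count": 0}


def flaky_loader(name: str, split: str) -> list:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network failure")
    return [{"dataset": name, "split": split}]


rows = load_with_retries("squad", "train", flaky_loader)
rows_again = load_with_retries("squad", "train", flaky_loader)  # served from cache
```

The second call never reaches the loader because the first successful result is stored under the same `(name, split)` key.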

Fast Downloads with HF Transfer

dumb-datasets enables HF Transfer by default to speed up downloads from the Hub:

from dumb_datasets import load_dataset, enable_hf_transfer, download_file, download_repository

# HF Transfer is enabled by default, but you can control it:
enable_hf_transfer(True)  # Enable explicitly
# enable_hf_transfer(False)  # Disable if needed

# Download a specific file
file_path = download_file(
    repo_id="google/fleurs",
    filename="README.md",
    repo_type="dataset"
)

# Download an entire repository
repo_path = download_repository(
    repo_id="google/fleurs",
    repo_type="dataset"
)
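
For context, `huggingface_hub` switches on its `hf_transfer` backend through the `HF_HUB_ENABLE_HF_TRANSFER` environment variable. Whether `enable_hf_transfer` works this way internally is an assumption; the helper `set_hf_transfer` below is hypothetical and only sketches the idea.

```python
import os

# Environment variable that huggingface_hub checks to enable the hf_transfer backend.
HF_TRANSFER_ENV = "HF_HUB_ENABLE_HF_TRANSFER"


def set_hf_transfer(enabled: bool) -> None:
    """Hypothetical helper: set the flag, ideally before huggingface_hub is imported."""
    os.environ[HF_TRANSFER_ENV] = "1" if enabled else "0"


set_hf_transfer(True)
print(os.environ[HF_TRANSFER_ENV])  # prints 1
```

Setting the variable early matters in this sketch because a library that reads it at import time will not notice later changes.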

Hub API Integration

from dumb_datasets import HubAPI

# Create a Hub API instance
hub = HubAPI(token="your_hf_token")  # Token is optional

# List available datasets
datasets = hub.list_datasets()
for ds in datasets[:5]:
    print(f"Dataset: {ds['id']}")

# Upload a file to a repository
url = hub.upload_file(
    path_or_fileobj="path/to/file.csv",
    path_in_repo="data/file.csv",
    repo_id="your-username/your-repo",
    repo_type="dataset"
)

Using Sessions

Sessions help manage configuration across multiple operations:

from dumb_datasets import Session

# Create a session with your preferences
session = Session(
    cache_dir="/path/to/cache",
    api_token="your_hf_token",
    force_hf_transfer=True  # Enable HF Transfer (default)
)

# Use the session to load datasets and interact with the Hub
dataset = session.get_dataset("squad", split="train")
file_path = session.download_file("google/fleurs", "README.md")
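
To illustrate why a session object is convenient, here is a minimal sketch of the pattern: settings are stored once and merged into every call. `SketchSession` and `fake_loader` are hypothetical stand-ins, and the real `Session` may behave differently.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


@dataclass
class SketchSession:
    """Hypothetical stand-in for Session: stores settings once, reuses them per call."""

    cache_dir: str
    api_token: Optional[str] = None
    force_hf_transfer: bool = True

    def get_dataset(self, name: str, loader: Callable[..., Any], **kwargs: Any) -> Any:
        # Session-level settings are merged with per-call arguments;
        # per-call arguments win on conflict.
        settings: Dict[str, Any] = {"cache_dir": self.cache_dir, "token": self.api_token}
        settings.update(kwargs)
        return loader(name, **settings)


def fake_loader(name: str, **settings: Any) -> Dict[str, Any]:
    # Echoes its inputs so the merged settings are visible.
    return {"name": name, **settings}


session = SketchSession(cache_dir="/tmp/datasets", api_token="your_hf_token")
result = session.get_dataset("squad", fake_loader, split="train")
```

Passing the loader in as an argument keeps the sketch testable without network access.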

Quick Usage

from dumb_datasets import load_dataset, Features, Value

# Load a dataset with automatic caching and error handling
dataset = load_dataset("squad")

# Get dataset info
info = dataset.info()
print(f"Dataset has {info['num_rows']} rows with features: {info['features']}")

# Apply transformations with a fluent API
processed = (dataset
    .filter(lambda x: len(x["question"]) > 10)
    .map_columns(lambda x: x.lower(), ["question", "context"])
    .shuffle(seed=42))

# Define custom features
features = Features({
    "text": Value("string"),
    "label": Value("int64")
})

# Use session for consistent settings
from dumb_datasets import Session
session = Session(cache_dir="/tmp/datasets", api_token="YOUR_HF_TOKEN")
new_dataset = session.get_dataset("glue", name="mnli")
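
The fluent chain above works because each step returns a new dataset-like object. A minimal sketch of that design over plain dictionaries follows. `FluentRows` is a hypothetical class, and the semantics assumed for `map_columns` (apply a function to the named columns only) are inferred from the example rather than confirmed by the library.

```python
import random
from typing import Any, Callable, Dict, Iterable, List

Row = Dict[str, Any]


class FluentRows:
    """Hypothetical row container; each method returns a new instance."""

    def __init__(self, rows: Iterable[Row]) -> None:
        self.rows: List[Row] = list(rows)

    def filter(self, predicate: Callable[[Row], bool]) -> "FluentRows":
        return FluentRows(row for row in self.rows if predicate(row))

    def map_columns(self, fn: Callable[[Any], Any], columns: List[str]) -> "FluentRows":
        # Apply fn to the named columns only; other columns pass through unchanged.
        return FluentRows(
            {key: fn(value) if key in columns else value for key, value in row.items()}
            for row in self.rows
        )

    def shuffle(self, seed: int) -> "FluentRows":
        shuffled = list(self.rows)
        random.Random(seed).shuffle(shuffled)
        return FluentRows(shuffled)


data = FluentRows([
    {"question": "What is HF Transfer?", "context": "A Download Backend"},
    {"question": "Why?", "context": "Short Question"},
])
processed = (data
    .filter(lambda row: len(row["question"]) > 10)
    .map_columns(lambda text: text.lower(), ["question", "context"])
    .shuffle(seed=42))
```

Because no method mutates its receiver, `data` still holds both original rows after the chain runs.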

Advanced Usage

from dumb_datasets import (
    Dataset,
    ClassLabel,
    infer_features_from_dict,
    load_dataset,
    save_dataset_sample,
)

# Infer features from examples
example = {"text": "Hello world", "score": 0.95, "labels": ["positive", "greeting"]}
features = infer_features_from_dict(example)

# Save samples for inspection
dataset = load_dataset("squad", split="train")
save_dataset_sample(dataset, "samples.json", num_examples=5)

# Register an adapter for custom dataset loading
# (my_custom_loader_function is a placeholder for your own loader callable)
from dumb_datasets import register_adapter
register_adapter("my_format", my_custom_loader_function)

# Use hooks for custom processing
from dumb_datasets import add_hook
add_hook("after_load", lambda ds: print(f"Loaded dataset with {len(ds)} examples"))
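
Adapters and hooks are both registry patterns: callables are stored under a name and looked up later. The sketch below shows one way such registries could be wired together. All names ending in `_sketch` are hypothetical, and the assumption that an adapter receives a single source string is made only for illustration.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, Dict, List

# Hypothetical registries; the library's internal storage is not documented here.
_HOOKS: DefaultDict[str, List[Callable[[Any], None]]] = defaultdict(list)
_ADAPTERS: Dict[str, Callable[[str], Any]] = {}


def add_hook_sketch(event: str, callback: Callable[[Any], None]) -> None:
    _HOOKS[event].append(callback)


def register_adapter_sketch(fmt: str, loader: Callable[[str], Any]) -> None:
    _ADAPTERS[fmt] = loader


def load_sketch(fmt: str, source: str) -> Any:
    """Look up the adapter for `fmt`, load, then fire every "after_load" hook."""
    dataset = _ADAPTERS[fmt](source)
    for callback in _HOOKS["after_load"]:
        callback(dataset)
    return dataset


seen_sizes: List[int] = []
register_adapter_sketch("my_format", lambda source: source.split(","))
add_hook_sketch("after_load", lambda ds: seen_sizes.append(len(ds)))
loaded = load_sketch("my_format", "a,b,c")
```

Here the hook records the size of each loaded dataset, mirroring the `print` hook in the example above.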

Distributed Data Generation

The library provides an opinionated API for distributed data generation workflows:

from dumb_datasets import push_intermediate_data, merge_intermediate_data

# === WORKER PROCESS ===
# Push partial data to an "intermediates" branch with date-based organization
url = push_intermediate_data(
    local_path="worker_data.jsonl",  # Local JSONL file to upload
    repo_id="your-username/your-dataset",
    # Optional params with defaults shown:
    prefix="intermediates",  # Folder within the branch
    date_folder=True,       # Create YYYYMMDD subfolder
)
print(f"Uploaded intermediate data: {url}")
# Each worker gets a stable ID and files are named to avoid collisions

# === AGGREGATOR PROCESS ===
# Define custom deduplication function (optional)
def dedup_by_id(row):
    return row.get("id")  # Use id field as deduplication key

# Merge all intermediate data files
result = merge_intermediate_data(
    repo_id="your-username/your-dataset",
    # Optional params with defaults shown:
    prefix="intermediates",
    aggregator_branch="aggregator_output",  # Branch for merged results
    push_to_main=True,       # Also push to main branch
    deduplicate=True,        # Remove duplicate rows
    dedup_key=dedup_by_id,   # Custom key function (default: entire row)
    remember_merged=True,    # Track processed files to avoid reprocessing
)

print(f"Merged {result['files_processed']} files with {result['rows_processed']} rows")
print(f"Output file: {result['output_file']}")

This API standardizes how distributed data generation processes can:

Push partial files from multiple workers to a single branch
Organize uploads by date and worker ID to prevent collisions
Merge and deduplicate data in a separate aggregator process
Track which files have been processed to enable incremental merges
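
The merge, deduplicate, and track steps listed above can be sketched locally with JSONL files, leaving out the Hub upload. This is only an illustration under stated assumptions: `merge_jsonl_sketch` is a hypothetical function, the whole-row default key is modeled as a sorted JSON string, and processed files are tracked by file name in a plain set.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Callable, Dict, Iterable, List, Optional, Set

Row = Dict[str, Any]


def merge_jsonl_sketch(
    paths: Iterable[Path],
    already_merged: Set[str],
    dedup_key: Optional[Callable[[Row], Any]] = None,
) -> Dict[str, Any]:
    """Merge JSONL files, skipping files seen before and rows with a repeated key."""
    seen_keys: Set[Any] = set()
    merged: List[Row] = []
    files_processed = 0
    for path in sorted(paths):
        if path.name in already_merged:
            continue  # incremental merge: this file was handled in an earlier run
        for line in path.read_text(encoding="utf-8").splitlines():
            if not line.strip():
                continue
            row = json.loads(line)
            # Default key is the whole row, serialized with sorted keys.
            key = dedup_key(row) if dedup_key else json.dumps(row, sort_keys=True)
            if key in seen_keys:
                continue
            seen_keys.add(key)
            merged.append(row)
        already_merged.add(path.name)
        files_processed += 1
    return {"files_processed": files_processed, "rows": merged}


with tempfile.TemporaryDirectory() as tmp:
    worker_a = Path(tmp) / "worker_a.jsonl"
    worker_b = Path(tmp) / "worker_b.jsonl"
    worker_a.write_text('{"id": 1, "text": "a"}\n{"id": 2, "text": "b"}\n', encoding="utf-8")
    worker_b.write_text('{"id": 2, "text": "b2"}\n{"id": 3, "text": "c"}\n', encoding="utf-8")
    merged_files: Set[str] = set()
    first = merge_jsonl_sketch([worker_a, worker_b], merged_files, lambda row: row.get("id"))
    second = merge_jsonl_sketch([worker_a, worker_b], merged_files, lambda row: row.get("id"))
```

The first run keeps the earliest row for each `id`, and the second run processes nothing because both file names are already recorded. This sketch does not deduplicate across runs; a real aggregator would also need to persist the seen keys or re-read its previous output.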

Release history

0.0.2 (this version): Mar 25, 2025
0.0.1a0 (pre-release): Mar 24, 2025
