Fast, efficient alternative to Hugging Face load_dataset using DuckDB for querying, sampling and transforming remote datasets

Hypersets

Efficient SQL interface for HuggingFace datasets using DuckDB.

Hypersets is a library to work with massive datasets without downloading them entirely. Query terabytes of data using simple SQL while only downloading what you need.

Hypersets is currently in a pre-alpha stage. Use at your own risk.

✨ Features

🚀 Fast metadata retrieval - Get dataset info without downloading
💾 Memory-only operation - No disk caching unless requested
🎯 Efficient querying - SQL interface with DuckDB optimization
📊 Download tracking - See exactly how much data you're saving
🧠 Smart caching - Avoid repeated API calls
🔄 Multiple formats - Output as pandas DataFrame or HuggingFace Dataset
⚡ Rate limit handling - Built-in exponential backoff for 429 errors
🛡️ Proper error handling - Clear exceptions for common issues

🚦 Validation Status

What has been tested and confirmed so far:

Dataset info retrieval: Fast YAML frontmatter parsing
Efficient querying: DuckDB SQL with HTTP optimization and 429 retry logic
Smart caching: 1000x+ speedup on repeated calls
Download tracking: 99.9% data savings demonstrated on real datasets (about 0.04 GB downloaded from a 59 GB dataset for simple operations)
Multiple formats: pandas DataFrame and HuggingFace Dataset support
Error handling: Proper exceptions and retry logic for production use
Memory efficiency: Handles TB-scale datasets in MBs or GBs of RAM and bandwidth

📦 Installation

pip install hypersets

🎯 Quick Start

import hypersets as hs

# Get dataset info without downloading
info = hs.info("omarkamali/wikipedia-monthly") 
print(f"Dataset size: {info.estimated_total_size_gb:.1f} GB")
print(f"Configs: {len(info.config_names)}")
print(f"Available configs: {info.config_names[:5]}")

# Query with SQL - only downloads what's needed
result = hs.query(
    "SELECT title, LENGTH(text) as text_length FROM dataset LIMIT 10",
    dataset="omarkamali/wikipedia-monthly",
    config="latest.en"
)

# Convert to pandas for analysis
df = result.to_pandas()
print(f"Retrieved {len(df)} articles")

🚀 Core API

Dataset Information

# Get comprehensive dataset metadata
info = hs.info("omarkamali/wikipedia-monthly")
print(f"Total files: {info.total_parquet_files}")
print(f"Size estimate: {info.estimated_total_size_gb:.1f} GB")

# List available configurations
configs = hs.list_configs("omarkamali/wikipedia-monthly")
print(f"Available configs: {configs[:10]}")  # First 10

# Clear cached metadata
hs.clear_cache()
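
The feature list describes metadata retrieval as YAML frontmatter parsing. As a rough illustration of that idea, and not Hypersets' actual implementation, the sketch below cuts the frontmatter block off the top of a dataset card and collects its `config_name` entries. The helper names and the sample card are hypothetical, and a real parser would likely use a YAML library instead of regular expressions.

```python
import re

# Hypothetical dataset card; real cards keep this block at the top of README.md.
SAMPLE_CARD = """---
configs:
- config_name: latest.en
  data_files:
  - split: train
    path: latest/en/*.parquet
- config_name: latest.fr
  data_files:
  - split: train
    path: latest/fr/*.parquet
---
# Dataset card body
"""


def extract_frontmatter(card_text: str) -> str:
    """Return the text between the leading '---' lines, or '' if absent."""
    match = re.match(r"^---\s*\n(.*?)\n---\s*(?:\n|$)", card_text, flags=re.DOTALL)
    return match.group(1) if match else ""


def list_config_names(card_text: str) -> list[str]:
    """Collect every `config_name:` value found in the frontmatter."""
    frontmatter = extract_frontmatter(card_text)
    pattern = r"^\s*-?\s*config_name:\s*(\S+)\s*$"
    return re.findall(pattern, frontmatter, flags=re.MULTILINE)


print(list_config_names(SAMPLE_CARD))  # ['latest.en', 'latest.fr']
```

Only the small frontmatter block has to be read to answer "which configs exist?", which is why this kind of lookup can be fast compared with touching the parquet files themselves.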

SQL Querying

# Basic querying
result = hs.query(
    "SELECT title, url FROM dataset WHERE LENGTH(text) > 10000 LIMIT 100",
    dataset="omarkamali/wikipedia-monthly",
    config="latest.en"
)

# Aggregation queries
count = hs.count(
    dataset="omarkamali/wikipedia-monthly", 
    config="latest.en"
)
print(f"Total articles: {count:,}")

# Advanced analytics
stats = hs.query(
    """
    SELECT 
        COUNT(*) as total_articles,
        AVG(LENGTH(text)) as avg_length,
        MAX(LENGTH(text)) as max_length
    FROM dataset
    """,
    dataset="omarkamali/wikipedia-monthly",
    config="latest.en"
)

Sampling & Exploration

# Random sampling with DuckDB optimization
sample = hs.sample(
    n=1000,
    dataset="omarkamali/wikipedia-monthly",
    config="latest.en",
    columns=["title", "url", "LENGTH(text) as text_length"]
)

# Quick data preview
preview = hs.head(
    n=5,
    dataset="omarkamali/wikipedia-monthly", 
    config="latest.en",
    columns=["title", "url"]
)

# Schema inspection
schema = hs.schema(
    dataset="omarkamali/wikipedia-monthly",
    config="latest.en"
)
print(f"Columns: {[col['name'] for col in schema.columns]}")

Output Formats

result = hs.query("SELECT * FROM dataset LIMIT 100", ...)

# As pandas DataFrame
df = result.to_pandas()
print(df.head())

# As HuggingFace Dataset
hf_dataset = result.to_hf_dataset()
print(hf_dataset.features)

# Query result metadata
print(f"Shape: {result.shape}")
print(f"Columns: {result.columns}")

Download Tracking

# Enable download tracking to see data savings
result = hs.query(
    "SELECT title FROM dataset LIMIT 1000",
    dataset="omarkamali/wikipedia-monthly",
    config="latest.en",
    track_downloads=True
)

# Check savings
if result.download_stats:
    stats = result.download_stats
    print(f"Total dataset: {stats.total_dataset_size_gb:.1f} GB")
    print(f"Downloaded: {stats.estimated_downloaded_gb:.2f} GB")
    print(f"Savings: {stats.savings_percentage:.1f}%")
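
To make the savings figure concrete, here is a minimal sketch of the arithmetic, assuming the percentage is simply the share of the dataset that was not downloaded. The function name is hypothetical; the numbers are the ones quoted under Validation Status (0.04 GB downloaded from a 59 GB dataset).

```python
def savings_percentage(total_gb: float, downloaded_gb: float) -> float:
    """Share of the dataset that was NOT downloaded, as a percentage."""
    if total_gb <= 0:
        raise ValueError("total_gb must be positive")
    return (1.0 - downloaded_gb / total_gb) * 100.0


# Figures quoted in the Validation Status section.
print(f"Savings: {savings_percentage(59.0, 0.04):.1f}%")  # Savings: 99.9%
```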

📁 Examples

Explore our comprehensive examples to see Hypersets in action:

🏃 Quick Demo

python examples/demo.py

Complete feature demonstration - Shows all Hypersets capabilities with real datasets.

📚 Basic Usage

python examples/basic_usage.py

Learn the fundamentals - Dataset info, querying, sampling, caching, and output formats.

🔬 Advanced Queries

python examples/advanced_queries.py

Sophisticated analytics - Text analysis, pattern matching, quality metrics, and performance optimization.

🏗️ Architecture

Hypersets consists of four core components:

Dataset Info Retriever - Discovers parquet files, configs, and schema from YAML frontmatter
DuckDB Mount System - Mounts remote parquet files as virtual tables with HTTP optimization
Query Interface - Clean API with SQL support, download tracking, and multiple output formats
Smart Caching - TTL-based caching of dataset metadata to avoid repeated API calls
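
One plausible way to picture the mount step, offered as an assumption and not as Hypersets' internals, is a DuckDB view named `dataset` defined over a list of remote parquet URLs. The sketch below only builds the SQL text; the helper names and the file path are hypothetical. The URL layout is the standard Hugging Face `resolve` form, and `read_parquet` with a list of files is standard DuckDB.

```python
def parquet_url(repo_id: str, path: str, revision: str = "main") -> str:
    """Build a direct download URL for a file in a Hugging Face dataset repo."""
    return f"https://huggingface.co/datasets/{repo_id}/resolve/{revision}/{path}"


def build_view_sql(urls: list[str], view_name: str = "dataset") -> str:
    """Return a statement exposing the parquet files as a single view."""
    if not urls:
        raise ValueError("at least one parquet URL is required")
    # Escape single quotes so each URL is a valid SQL string literal.
    quoted = ", ".join("'" + url.replace("'", "''") + "'" for url in urls)
    return (
        f"CREATE OR REPLACE VIEW {view_name} AS "
        f"SELECT * FROM read_parquet([{quoted}])"
    )


# Hypothetical file path, for illustration only.
urls = [parquet_url("omarkamali/wikipedia-monthly", "data/part-0.parquet")]
print(build_view_sql(urls))
```

Running the generated statement requires a DuckDB connection with the httpfs extension available. Because parquet is columnar, DuckDB can then fetch only the byte ranges a query needs, which is the effect the "HTTP optimization" wording refers to.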

All components include proper 429 rate limit handling with exponential backoff.
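
For readers unfamiliar with the pattern, exponential backoff can be sketched as follows. This is a generic illustration, not the library's code: `RateLimitError`, `with_backoff`, and all parameter names are hypothetical, and the small random jitter is an added assumption.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for an HTTP 429 response."""


def with_backoff(fn, max_retries=5, base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Call fn(), retrying on RateLimitError with exponentially growing waits."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the caller
            # The wait doubles on each attempt, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% jitter so parallel clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))


# Demo: a call that is rate limited twice before succeeding.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimitError("429 Too Many Requests")
    return "ok"


delays = []  # record the waits instead of actually sleeping
result = with_backoff(flaky, base_delay=0.01, sleep=delays.append)
print(result, len(delays))  # ok 2
```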

🔧 Advanced Configuration

Memory Management

# Configure DuckDB memory limit (default: 4GB)
result = hs.query(
    "SELECT * FROM dataset LIMIT 1000",
    dataset="large/dataset",
    memory_limit="8GB"  # Increase for large datasets
)

# For extremely large datasets
result = hs.query(
    "SELECT * FROM dataset LIMIT 10000", 
    dataset="massive/dataset",
    memory_limit="16GB",  # More memory
    threads=8             # More threads
)

# Memory-efficient column selection
result = hs.query(
    "SELECT id, title FROM dataset LIMIT 100000",  # Only select needed columns
    dataset="large/dataset",
    memory_limit="2GB"  # Can use less memory
)

Memory Limit Guidelines:

Default (4GB): Good for most datasets up to ~50GB
8GB: For large datasets (50-200GB) or complex queries
16GB+: For massive datasets (200GB+) or heavy aggregations
Column selection: Always select only needed columns for better memory efficiency
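
The `memory_limit` and `threads` arguments presumably correspond to DuckDB's settings of the same names. Assuming that mapping, the sketch below shows the `SET` statements such options would translate to; the helper is hypothetical and only builds the SQL text.

```python
import re
from typing import Optional


def duckdb_settings_sql(memory_limit: str = "4GB", threads: Optional[int] = None) -> list[str]:
    """Translate resource options into DuckDB SET statements."""
    # Accept values such as '4GB', '512MB', or '1.5GiB'.
    if not re.fullmatch(r"\d+(\.\d+)?\s?(KB|MB|GB|TB|KiB|MiB|GiB|TiB)", memory_limit):
        raise ValueError(f"unrecognized memory limit: {memory_limit!r}")
    statements = [f"SET memory_limit = '{memory_limit}'"]
    if threads is not None:
        if threads < 1:
            raise ValueError("threads must be at least 1")
        statements.append(f"SET threads = {threads}")
    return statements


print(duckdb_settings_sql("8GB", threads=8))
```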

Custom Caching

# Cache with custom TTL (Time To Live)
info = hs.info("dataset", cache_ttl=3600)  # 1 hour

# Disable caching for fresh data
info = hs.info("dataset", use_cache=False)
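
The behavior of `cache_ttl` and `use_cache` can be pictured with a small TTL cache. This is a minimal sketch under the assumption that entries expire a fixed number of seconds after they are stored; the class and its methods are hypothetical and are not Hypersets' cache.

```python
import time


class TTLCache:
    """Keep computed values for ttl_seconds, then recompute on next access."""

    def __init__(self, ttl_seconds=3600.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}  # key -> (stored_at, value)

    def get_or_compute(self, key, compute, use_cache=True):
        now = self._clock()
        if use_cache:
            entry = self._store.get(key)
            if entry is not None and now - entry[0] < self._ttl:
                return entry[1]  # still fresh: skip the expensive call
        value = compute()
        self._store[key] = (now, value)
        return value

    def clear(self):
        self._store.clear()


# Demo with a fake clock so no real waiting is needed.
fake_now = [0.0]
cache = TTLCache(ttl_seconds=3600, clock=lambda: fake_now[0])
fetches = []


def fetch_info():
    fetches.append(1)  # count how often the "API" is really called
    return {"configs": ["latest.en"]}


cache.get_or_compute("dataset", fetch_info)                   # miss: fetches
cache.get_or_compute("dataset", fetch_info)                   # hit: no fetch
fake_now[0] = 3601.0
cache.get_or_compute("dataset", fetch_info)                   # expired: fetches
cache.get_or_compute("dataset", fetch_info, use_cache=False)  # bypass: fetches
print(len(fetches))  # 3
```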

Authentication

# Use HuggingFace token for private datasets
result = hs.query(
    "SELECT * FROM dataset LIMIT 10",
    dataset="private/dataset",
    token="hf_your_token_here"
)

Performance Tuning

# Optimize for your use case
result = hs.query(
    "SELECT * FROM dataset USING SAMPLE 10000",
    dataset="large/dataset",
    memory_limit="6GB",    # Adequate memory
    threads=4,            # Balanced parallelism  
    track_downloads=True  # Monitor efficiency
)

# For aggregation-heavy workloads
stats = hs.query(
    """
    SELECT 
        category,
        COUNT(*) as count,
        AVG(LENGTH(text)) as avg_length
    FROM dataset 
    GROUP BY category
    """,
    dataset="large/dataset",
    memory_limit="12GB",  # More memory for grouping
    threads=8            # More threads for aggregation
)

Contributing

Fork the repository
Create a feature branch: git checkout -b feature-name
Make changes and add tests
Run tests: pytest tests/
Submit a pull request

License

MIT License - see LICENSE file for details.

Acknowledgments

DuckDB for incredible SQL analytics on remote data
Parquet for being the de facto standard for columnar data storage
HuggingFace for democratizing access to datasets
The open source community for inspiration and feedback

Contributors

Omar Kamali

📝 Citation

If you use Hypersets in your research, please cite:

@misc{hypersets,
    title={Hypersets: Efficient dataset transfer, querying and transformation},
    author={Omar Kamali},
    year={2025},
    url={https://github.com/omarkamali/hypersets},
    note={Project developed under Omneity Labs}
}

🚀 Ready to query terabytes of data efficiently? Start with examples/demo.py to see Hypersets in action!

Release history

0.0.2 (this version): Jul 20, 2025
0.0.1: Jul 20, 2025
