Skip to main content

datasets_sql is an extension package of 🤗 Datasets package that provides support for executing arbitrary SQL queries on datasets.

Project description

datasets-sql

A 🤗 Datasets extension package that provides support for executing arbitrary SQL queries on Dataset objects. It uses DuckDB as a SQL engine and follows its query syntax.

Installation

pip install datasets-sql

Quick Start

from datasets import load_dataset, Dataset
from datasets_sql import query

imdb_dset = load_dataset("imdb", split="train")

# Remove the rows where the `text` field has less than 100 characters
imdb_query_dset1 = query("SELECT text FROM imdb_dset WHERE length(text) > 100")

# Count the number of rows per label
imdb_query_dset2 = query("SELECT label, COUNT(*) as num_rows FROM imdb_dset GROUP BY label")

# Remove duplicated rows
imdb_query_dset3 = query("SELECT DISTINCT text FROM imdb_dset")

order_customer_dset = Dataset.from_dict({
    "order_id": [10001, 10002, 10003],
    "customer_id": [3, 1, 2],
})

customer_dset = Dataset.from_dict({
    "customer_id": [1, 2, 3],
    "name": ["John", "Jane", "Mary"],
})

# Join two tables
join_query_dset = query(
    "SELECT order_id, name FROM order_customer_dset INNER JOIN customer_dset ON order_customer_dset.customer_id = customer_dset.customer_id"
)

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

datasets-sql-0.1.1.tar.gz (9.0 kB view hashes)

Uploaded Source

Built Distribution

datasets_sql-0.1.1-py3-none-any.whl (9.8 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page