datasets_sql is an extension package for 🤗 Datasets that supports executing arbitrary SQL queries on datasets.

Project description

datasets_sql

A 🤗 Datasets extension package that provides support for executing arbitrary SQL queries on HF datasets. It uses DuckDB as a SQL engine and follows its query syntax.

Installation

pip install datasets_sql

Quick Start

from datasets import load_dataset, Dataset
from datasets_sql import query

imdb_dset = load_dataset("imdb", split="train")

# Keep only the rows where the `text` field has more than 1000 characters
imdb_query_dset1 = query("SELECT text FROM imdb_dset WHERE length(text) > 1000")

# Count the number of rows per label
imdb_query_dset2 = query("SELECT label, COUNT(*) as num_rows FROM imdb_dset GROUP BY label")

# Remove rows with duplicate `text` values (only the `text` column is selected)
imdb_query_dset3 = query("SELECT DISTINCT text FROM imdb_dset")

# Get the average length of the `text` field
imdb_query_dset4 = query("SELECT AVG(length(text)) as avg_text_length FROM imdb_dset")

order_customer_dset = Dataset.from_dict({
    "order_id": [10001, 10002, 10003],
    "customer_id": [3, 1, 2],
})

customer_dset = Dataset.from_dict({
    "customer_id": [1, 2, 3],
    "name": ["John", "Jane", "Mary"],
})

# Join two tables
join_query_dset = query(
    "SELECT order_id, name FROM order_customer_dset INNER JOIN customer_dset ON order_customer_dset.customer_id = customer_dset.customer_id"
)
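
To see what rows the join above should produce, the same SQL can be run against Python's built-in sqlite3 module. This is only a sketch for checking the query logic: it does not use datasets_sql or DuckDB, the two tables are recreated by hand from the dictionaries above, and an ORDER BY clause (not in the original query) is added so that the row order is deterministic.

```python
import sqlite3

# Stand-in for the two example datasets. sqlite3 is used here only because it
# ships with Python; datasets_sql itself runs queries through DuckDB.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE order_customer_dset (order_id INTEGER, customer_id INTEGER)")
conn.execute("CREATE TABLE customer_dset (customer_id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO order_customer_dset VALUES (?, ?)",
    [(10001, 3), (10002, 1), (10003, 2)],
)
conn.executemany(
    "INSERT INTO customer_dset VALUES (?, ?)",
    [(1, "John"), (2, "Jane"), (3, "Mary")],
)

# Same join as in the quick start, with ORDER BY added (an assumption made
# here for a stable row order).
rows = conn.execute(
    "SELECT order_id, name "
    "FROM order_customer_dset "
    "INNER JOIN customer_dset "
    "ON order_customer_dset.customer_id = customer_dset.customer_id "
    "ORDER BY order_id"
).fetchall()
conn.close()

print(rows)  # [(10001, 'Mary'), (10002, 'John'), (10003, 'Jane')]
```

Each order is matched to exactly one customer, so the inner join returns three rows, one per order.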

Download files

Source Distribution

datasets_sql-0.4.0.tar.gz (9.8 kB view details)

Uploaded Source

Built Distribution

datasets_sql-0.4.0-py3-none-any.whl (9.4 kB view details)

Uploaded Python 3

File details

Details for the file datasets_sql-0.4.0.tar.gz.

File metadata

  • Download URL: datasets_sql-0.4.0.tar.gz
  • Upload date:
  • Size: 9.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.8.18

File hashes

Hashes for datasets_sql-0.4.0.tar.gz
Algorithm Hash digest
SHA256 d68fb0f5718d66ce6c9a5249400cefa69be0567ce13ce1514b10f8857036943f
MD5 a99d793bccf514e7618f82b7c1f9a852
BLAKE2b-256 776062f680f0b4aad9ccbbd08f0885d55eb1d214b21e86fd08d319ad76b36246

File details

Details for the file datasets_sql-0.4.0-py3-none-any.whl.

File hashes

Hashes for datasets_sql-0.4.0-py3-none-any.whl
Algorithm Hash digest
SHA256 e039965fa6519cab5d8e93e685f081ee27f62b68c634cf92d9a5ebb13163cc30
MD5 436d534eafd5d45877dc8f793e9a5602
BLAKE2b-256 2f79faea9bde92a8ec329934998a56f6a68d25b60da39ef9af0a1d38619fe772
