datasets_sql is an extension of the 🤗 Datasets package that supports executing arbitrary SQL queries on datasets.
datasets_sql
A 🤗 Datasets extension package that provides support for executing arbitrary SQL queries on HF datasets. It uses DuckDB as a SQL engine and follows its query syntax.
Installation
pip install datasets_sql
Quick Start
from datasets import load_dataset, Dataset
from datasets_sql import query
imdb_dset = load_dataset("imdb", split="train")
# Keep only the rows whose `text` field has more than 1000 characters
imdb_query_dset1 = query("SELECT text FROM imdb_dset WHERE length(text) > 1000")
# Count the number of rows per label
imdb_query_dset2 = query("SELECT label, COUNT(*) as num_rows FROM imdb_dset GROUP BY label")
# Remove duplicate `text` values
imdb_query_dset3 = query("SELECT DISTINCT text FROM imdb_dset")
# Get the average length of the `text` field
imdb_query_dset4 = query("SELECT AVG(length(text)) as avg_text_length FROM imdb_dset")
order_customer_dset = Dataset.from_dict({
    "order_id": [10001, 10002, 10003],
    "customer_id": [3, 1, 2],
})
customer_dset = Dataset.from_dict({
    "customer_id": [1, 2, 3],
    "name": ["John", "Jane", "Mary"],
})
# Join two tables
join_query_dset = query(
    "SELECT order_id, name FROM order_customer_dset INNER JOIN customer_dset ON order_customer_dset.customer_id = customer_dset.customer_id"
)
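To check by hand which rows the join above should produce, the same SQL can be run against an in-memory SQLite database from Python's standard library. This is only a stand-in for inspecting the query logic: datasets_sql itself runs queries through DuckDB, whose dialect differs from SQLite in places, and the `orders`, `customers`, and `joined_rows` names below are illustrative rather than part of the package.

```python
import sqlite3

# Same toy data as the two Dataset.from_dict calls above.
orders = [(10001, 3), (10002, 1), (10003, 2)]
customers = [(1, "John"), (2, "Jane"), (3, "Mary")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE order_customer_dset (order_id INTEGER, customer_id INTEGER)")
conn.execute("CREATE TABLE customer_dset (customer_id INTEGER, name TEXT)")
conn.executemany("INSERT INTO order_customer_dset VALUES (?, ?)", orders)
conn.executemany("INSERT INTO customer_dset VALUES (?, ?)", customers)

# ORDER BY is added here only to make the row order deterministic.
joined_rows = conn.execute(
    "SELECT order_id, name "
    "FROM order_customer_dset "
    "INNER JOIN customer_dset "
    "ON order_customer_dset.customer_id = customer_dset.customer_id "
    "ORDER BY order_id"
).fetchall()
conn.close()

print(joined_rows)  # [(10001, 'Mary'), (10002, 'John'), (10003, 'Jane')]
```

Each order is matched to the customer with the same `customer_id`, so order 10001 (customer 3) pairs with Mary, 10002 with John, and 10003 with Jane.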
Download files
Source Distribution
datasets_sql-0.4.0.tar.gz (9.8 kB)
Built Distribution
datasets_sql-0.4.0-py3-none-any.whl (9.4 kB)
File details
Details for the file datasets_sql-0.4.0.tar.gz.
File metadata
- Download URL: datasets_sql-0.4.0.tar.gz
- Upload date:
- Size: 9.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.8.18
File hashes
Algorithm | Hash digest
---|---
SHA256 | d68fb0f5718d66ce6c9a5249400cefa69be0567ce13ce1514b10f8857036943f
MD5 | a99d793bccf514e7618f82b7c1f9a852
BLAKE2b-256 | 776062f680f0b4aad9ccbbd08f0885d55eb1d214b21e86fd08d319ad76b36246
File details
Details for the file datasets_sql-0.4.0-py3-none-any.whl.
File metadata
- Download URL: datasets_sql-0.4.0-py3-none-any.whl
- Upload date:
- Size: 9.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.8.18
File hashes
Algorithm | Hash digest
---|---
SHA256 | e039965fa6519cab5d8e93e685f081ee27f62b68c634cf92d9a5ebb13163cc30
MD5 | 436d534eafd5d45877dc8f793e9a5602
BLAKE2b-256 | 2f79faea9bde92a8ec329934998a56f6a68d25b60da39ef9af0a1d38619fe772
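The SHA256 digests listed above can be used to check a manually downloaded file before installing it. The following is a minimal sketch using only Python's standard library; `sha256_of_file` and `matches_published` are hypothetical helper names, not part of datasets_sql, and the sketch assumes the file keeps the name shown on this page.

```python
import hashlib
from pathlib import Path

# SHA256 digests copied from the "File hashes" tables above.
PUBLISHED_SHA256 = {
    "datasets_sql-0.4.0.tar.gz":
        "d68fb0f5718d66ce6c9a5249400cefa69be0567ce13ce1514b10f8857036943f",
    "datasets_sql-0.4.0-py3-none-any.whl":
        "e039965fa6519cab5d8e93e685f081ee27f62b68c634cf92d9a5ebb13163cc30",
}


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path):
    """Hypothetical helper: compare a local file with the digest listed for its name."""
    name = Path(path).name
    expected = PUBLISHED_SHA256.get(name)
    if expected is None:
        raise KeyError(f"no published digest for {name}")
    return sha256_of_file(path) == expected
```

Calling `matches_published("datasets_sql-0.4.0.tar.gz")` on a downloaded copy should return `True` only if its contents hash to the digest in the table; a `False` result suggests a corrupted or altered file.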