datasets_sql is an extension package for 🤗 Datasets that supports executing arbitrary SQL queries on datasets.
datasets-sql
A 🤗 Datasets extension package that provides support for executing arbitrary SQL queries on Dataset
objects. It uses DuckDB as a SQL engine and follows its query syntax.
Installation
pip install datasets-sql
Quick Start
from datasets import load_dataset, Dataset
from datasets_sql import query
imdb_dset = load_dataset("imdb", split="train")
# Keep only the rows whose `text` field is longer than 100 characters
imdb_query_dset1 = query("SELECT text FROM imdb_dset WHERE length(text) > 100")
# Count the number of rows per label
imdb_query_dset2 = query("SELECT label, COUNT(*) as num_rows FROM imdb_dset GROUP BY label")
# Keep one row per distinct `text` value (drops duplicates)
imdb_query_dset3 = query("SELECT DISTINCT text FROM imdb_dset")
order_customer_dset = Dataset.from_dict({
    "order_id": [10001, 10002, 10003],
    "customer_id": [3, 1, 2],
})
customer_dset = Dataset.from_dict({
    "customer_id": [1, 2, 3],
    "name": ["John", "Jane", "Mary"],
})
# Join two tables
join_query_dset = query(
    "SELECT order_id, name FROM order_customer_dset "
    "INNER JOIN customer_dset ON order_customer_dset.customer_id = customer_dset.customer_id"
)
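To check what the join in the Quick Start should return, the same SQL can be run on toy tables with the same contents. The sketch below is not how datasets_sql works internally: it uses Python's built-in sqlite3 module as a stand-in engine (an assumption made so that the example needs no extra packages), whereas datasets_sql runs queries on DuckDB. An ORDER BY clause is added so that the row order is deterministic.

```python
import sqlite3

# Toy tables mirroring order_customer_dset and customer_dset from the Quick Start.
# SQLite is only a stand-in engine here (assumption); datasets_sql uses DuckDB.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE order_customer_dset (order_id INTEGER, customer_id INTEGER)")
conn.execute("CREATE TABLE customer_dset (customer_id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO order_customer_dset VALUES (?, ?)",
    [(10001, 3), (10002, 1), (10003, 2)],
)
conn.executemany(
    "INSERT INTO customer_dset VALUES (?, ?)",
    [(1, "John"), (2, "Jane"), (3, "Mary")],
)

# Same join as in the Quick Start, plus ORDER BY for a stable row order.
join_rows = conn.execute(
    "SELECT order_id, name FROM order_customer_dset "
    "INNER JOIN customer_dset "
    "ON order_customer_dset.customer_id = customer_dset.customer_id "
    "ORDER BY order_id"
).fetchall()

print(join_rows)  # [(10001, 'Mary'), (10002, 'John'), (10003, 'Jane')]
```

Each order is matched to the customer with the same customer_id, so order 10001 (customer 3) is paired with Mary, and so on. With datasets_sql, the equivalent rows would come back as a Dataset with order_id and name columns rather than as a list of tuples.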